Title: CASICT Tibetan Word Segmentation System for MLWS2017

Authors: Jiawei Hu, Qun Liu
Abstract: We participated in the Tibetan word segmentation task of MLWS 2017. Our system is trained in an unrestricted way, using a baseline system and our own 76w (roughly 760,000) segmented Tibetan sentences. The baseline system first segments the character sequence into a word sequence. A subword unit model (the BPE algorithm) then splits rare words into subwords, each with its corresponding features, and a neural network classifier tags each subword with a "B", "M", "E", or "S" label. In the decoding step, a simple rule recovers the final word sequence. The candidate system for submission was selected by its F-score on a dev set extracted in advance from the 76w sentences. Experiments show that this method fixes segmentation errors made by the baseline system and yields a significant performance gain.
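
The decoding and evaluation steps mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the "simple rule" or the exact F-score definition, so everything below is an assumption: the function names `decode_bmes`, `word_spans`, and `f_score` are hypothetical, the rule shown is the common convention of closing a word on an "E" or "S" label, and the F-score is the usual word-level measure over character spans. The Latin strings stand in for subword units; they are not Tibetan data.

```python
from typing import List, Set, Tuple


def decode_bmes(units: List[str], labels: List[str]) -> List[str]:
    """Merge labelled units into words (assumed rule, not the paper's).

    A word is closed when an "E" or "S" label is seen. If a "B" or "S"
    arrives while a word is still open (an ill-formed label sequence),
    the open word is closed first so that no unit is lost.
    """
    words: List[str] = []
    buffer = ""
    for unit, label in zip(units, labels):
        if label in ("B", "S") and buffer:
            words.append(buffer)
            buffer = ""
        buffer += unit
        if label in ("E", "S"):
            words.append(buffer)
            buffer = ""
    if buffer:  # flush a word left open at the end of the sequence
        words.append(buffer)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmentation to the set of (start, end) character spans."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def f_score(gold: List[str], predicted: List[str]) -> float:
    """Word-level F-score: a predicted word counts only if its span matches."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Placeholder units and labels, as a classifier might emit them.
    words = decode_bmes(["a", "b", "c", "d"], ["B", "E", "S", "B"])
    print(words)
    print(f_score(words, ["ab", "cd"]))
```

In a setup like the one the abstract describes, candidate systems could be ranked by calling `f_score` on each dev-set sentence pair and aggregating the counts, though how the authors aggregated is not stated.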
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:1710.06112 [cs.CL]
  (or arXiv:1710.06112v1 [cs.CL] for this version)

Submission history

From: Jiawei Hu
[v1] Tue, 17 Oct 2017 06:05:50 GMT (185kb,D)