The best way to reach us is by e-mail to: kovc, meokuro, and rdolan @cfar.umd.edu.
We are indebted to Dr. Dekai Wu of the HKUST, Human Language Technology Center, Department of Computer Science, Hong Kong for the following guidelines on segmentation extracted from his position statement on Chinese segmentation drafted at the Chinese Language Processing Workshop, University of Pennsylvania, Philadelphia, 30 June to 2 July 1998.
I Monotonicity Principle for segmentation: A valid basic segmentation unit (segment or token) is a substring that no processing stage after the segmenter needs to decompose.
II A substring constitutes a valid segment only if no possible application would ever need to decompose it for any reason.
III A sharable general-purpose corpus should only commit to segments that no target application would ever need to decompose for any reason, whether structural or statistical.
IV The criteria for annotating corpus segments should not require presence in a reference dictionary or corpus.
V A general-purpose automatic segmenter should only commit to segments that no target application would ever need to decompose for any reason, whether structural or statistical.
VI An application-specific automatic segmenter should commit to the longest possible segments that will never need to be decomposed by later stages in the target application.
VII Application-specific segmentation is most accurately performed by task-driven segmentation, such as HKUST's parser and translator or BBN's named-entity recognizer.
VIII The Monotonicity Principle allows two approaches: 1) Do not mark "derivational" constructs as segments. 2) Mark "derivational" constructs as segments, subject to the following proviso. For segments not in the dictionary, the segment annotation should include the base form and type of derivational process.
IX Determinism Hypothesis for segmentation: In sentence interpretation, humans employ a fast preprocessing segmentation stage that tokenizes input sentences into substrings that no later processing stage needs to decompose, except for garden paths.
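One way to read the Monotonicity Principle operationally is that later stages may merge adjacent segments but never split one. The Python sketch below checks that property for two segmentations of the same string by comparing cut positions. The names `cut_points` and `is_refinement` and the sample segmentations are illustrative assumptions, not part of Dr. Wu's statement.

```python
from itertools import accumulate


def cut_points(segments):
    """Return the character offsets at which each segment ends."""
    return set(accumulate(len(s) for s in segments))


def is_refinement(fine, coarse):
    """True if `coarse` can be built from `fine` by merging adjacent segments only."""
    if "".join(fine) != "".join(coarse):
        raise ValueError("the two segmentations cover different text")
    # Every cut kept by the coarse segmentation must already exist in the fine one.
    return cut_points(coarse) <= cut_points(fine)


# Hypothetical general-purpose output and two application-specific outputs.
general = ["国际", "合作", "中心"]
merged = ["国际", "合作中心"]       # only merges: no segment was decomposed
resplit = ["国", "际合作", "中心"]  # would require decomposing 国际
```

Under this reading, `merged` respects the principle with respect to `general`, while `resplit` does not, because it places a cut inside a segment the general-purpose segmenter had already committed to.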
II TAG GUIDELINES 2 July 1998
List of Parts of Speech with Corresponding Tag
The Little Grove is a 100-sentence corpus selected from Xinhua (New China News Agency) newswire stories. It is the mini-corpus used for discussion at the Chinese Language Processing Workshop, University of Pennsylvania, Philadelphia, 30 June to 2 July 1998.
Adjective--JJ
Words (词项) that occur exclusively in the attributive
position and cannot be modified by the adverb 很 are tagged
as adjectives (JJ). We borrow this definition from what the PRC Beijing University
part of speech guidelines call a 区别词 and what the Taiwan Academia Sinica
ROCLING part of speech guidelines call a 非谓形容词.
EXAMPLE: 国际/JJ 'international'
大型/JJ 'large scale'
共同/JJ 'common'
男/JJ 'male'
中央/JJ 'central'
Adverb--RB
Words that modify dynamic and stative verbs and other adverbs are
tagged as adverbs (RB). This category includes the temporal adverbs,
intensifiers, and manner-adverbialized stative verbs.
EXAMPLE: 现在/RB 'now'
目前/RB 'at present'
仍然/RB 'still, yet'
很/RB 'very'
最/RB '-est'
大大/RB 'greatly, enormously'
不/RB 'not'
Aspect Particle--AS
Verbal particles that indicate aspect are tagged as aspect particles (AS).
This category includes 了-perfective aspect, 着-durative aspect,
起来-inchoative aspect, 过-indefinite past aspect, and delimitative aspect
with verb reduplication. Also included in this category is the durative
marker (在) when it immediately precedes a dynamic verb.
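The rule for 在 can be sketched as a one-token lookahead over a partially tagged sentence. This is an illustrative Python sketch, not the workshop's tagger. It assumes that "dynamic verb" corresponds to the active verb tag VA, that 在 otherwise falls back to preposition (IN), and that untagged tokens carry `None`; the function name `tag_zai` is hypothetical.

```python
def tag_zai(tokens):
    """Assign AS or IN to each 在 in a list of (word, tag) pairs.

    Assumption: a "dynamic verb" is a token tagged VA; any other
    following tag (or end of sentence) makes 在 a preposition (IN).
    """
    result = []
    for i, (word, tag) in enumerate(tokens):
        if word == "在":
            next_tag = tokens[i + 1][1] if i + 1 < len(tokens) else None
            tag = "AS" if next_tag == "VA" else "IN"
        result.append((word, tag))
    return result


# 在 directly before an active verb versus 在 before a verbal noun.
durative = tag_zai([("他", "PN"), ("在", None), ("学习", "VA")])
locative = tag_zai([("他", "PN"), ("在", None), ("访问", "NV"), ("中", "LC")])
```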
Coordinating conjunction--CC
Words that conjoin a construction that has two or more centers,
each of which has approximately the same function as the whole
construction, are tagged as coordinating conjunctions (CC).
Correlative markers (e.g. 不但...而且 'not only...but also')
are also tagged as CC.
EXAMPLE: 与/CC 'and'
和/CC 'and'
跟/CC 'and'
或/CC 'or'
并且/CC 'moreover'
又/CC...又/CC 'both...and'
也/CC...也/CC 'not only...but also'
越/CC...越/CC 'the more...the more'
一边/CC...一边/CC 'while...Ving...Ving'
Classifier--CL
Bound morphemes that must occur with a number and/or a determiner
are tagged as a classifier (CL).
EXAMPLE: 人/CL 'man'
件/CL 'item, article'
张/CL 'sheet, extension'
门/CL 'branch'
双/CL 'pair'
种/CL 'kind, species'
尺/CL 'Chinese foot'
公里/CL 'kilometer'
刻/CL 'quarter-hour'
Determiner--DT
End-bound forms include 这 and 那, specifying determiners (e.g. 各,别,另,本),
and quantitative determiners (e.g. 几,全,多,整). This category is tagged as
determiner (DT). Note: Numbers (CD) are not included in this category.
Directional complement--DR
Forms which are bound to and follow the verb and express place
whereto are tagged as directional complement (DR). See Appendix for discussion
of potential (e.g. 看/VA 得/RB 见/VA and 看/VA 不/RB 见/VA which can be
generalized to VA+RB+V_ where RB=得 or 不 and V_=VA or VS) and attributive
complements (e.g. 没/RB 关/VA 紧/VS and 弄/VA 完/VS and 洗/VA 干净/VS).
EXAMPLE: 洗/VA 来/DR 了/AS 'washed and brought here'
留/VA 下/DR 'leave behind'
推/VA 进来/DR 'push in'
缩/VA 回来/DR 'shrink back'
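The potential-complement pattern stated in this section, VA+RB+V_ with RB restricted to 得 or 不 and V_ being VA or VS, can be checked with a three-token window. The Python sketch below is only an illustration of that pattern as written here; the function name `find_potential_complements` and the sample sentences are assumptions.

```python
def find_potential_complements(tokens):
    """Return (index, joined_words) for each VA + 得/不 (RB) + VA/VS window."""
    hits = []
    for i in range(len(tokens) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tokens[i:i + 3]
        if t1 == "VA" and t2 == "RB" and w2 in {"得", "不"} and t3 in {"VA", "VS"}:
            hits.append((i, w1 + w2 + w3))
    return hits


# Hypothetical tagged sentences built from the examples in this section.
potential = [("我", "PN"), ("看", "VA"), ("不", "RB"), ("见", "VA")]
attributive = [("没", "RB"), ("关", "VA"), ("紧", "VS")]
```

The second sentence is not matched because its RB precedes the verb instead of standing between the verb and its complement.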
Indefinite/Interrogative Determiners--WD
The words 几 and 哪 are classified as indefinite/interrogative determiners (WD).
Indefinite/Interrogative Adverbs--WR
Words like 甚么地方, 多, 多少, 哪儿, 怎么, 怎么样, 如何, 是否, 何时, and 为甚么
are classified as indefinite/interrogative adverbs (WR).
Indefinite/Interrogative Pronouns--WP
The words 谁 and 甚么 are classified as indefinite/interrogative pronouns (WP).
Localizer--LC
A bound morpheme or morpheme compound that forms with a preceding
locative noun or temporal noun is tagged as a localizer (LC).
EXAMPLE: 前/LC 'before'
后/LC 'after'
内/LC 'within'
里/LC 'in'
外/LC 'outside'
Modal--MD
Words that take a clausal object (that is, they must co-occur
with a verb) but do not take aspect markers, cannot be modified by 很,
cannot be nominalized, cannot occur before the subject, and cannot
take a direct object are tagged as modals (MD).
EXAMPLE: 能/MD 'can'
要/MD 'want'
需要/MD 'must'
应该/MD 'should'
将/MD 'will'
Marker-Adjectival 的 de--MJ
This category captures multiple functions of the particle 的. When
的 marks associative phrases, genitive phrases, or nominalization,
then it is tagged as adjectival marker (MJ).
Marker-Verbal 得 de--MV
This category captures verb complementation for potential and
complex stative constructions. Either use of 得 is tagged
as verbal marker (MV).
Marker-Adverbial 地 de--MR
This category identifies the use of 地 to mark manner and is
tagged as adverbial marker (MR).
Noun common--NN
A word which can be modified by a DT-CL (这个) compound is tagged as
a noun (NN). A noun cannot be modified by adverbs such as
不,也,还,更 and 快快地. This category is distinct from
proper nouns (NR) and verbal nouns (NV).
Noun proper--NR
The name of a particular person, place, time, or entity is tagged as
a proper noun (NR). A proper noun is usually not modified by a DT-CL
compound (e.g. 这个, 那个). For further information, see Multilingual Entity
Task (MET) Guidelines for Chinese.
Noun verbal--NV
In context, a verbal noun may be modified by a DT-CL compound.
This nominal category is tagged as a verbal noun (NV).
We are indebted to Ms. Jin Yang of SYSTRAN for the following working
definition: Frequently verbal nouns are two-character words which are
generally ambiguous between an abstract noun and a verb (which
can be translated as the gerund form of the verb as in 跑路) or
between an abstract noun and a verb (as in 合作).
EXAMPLE: 开放/NV 'Glasnost or opening-up'
访问/NV 'visiting'
获奖/NV 'prize-winning'
LITTLE GROVE CONTEXTUAL EXAMPLES:
-OBJECT OF A PREPOSITION-
08 寄/VA 于/IN 开放/NV 。/。
61 他/PN 在/IN 访问/NV 中/LC 与/CC
-MODIFIED BY A CLASSIFIER-
34 三/CD 位/CL 获奖/NV 科学家/NN
-MODIFIED BY ADJECTIVAL PHRASE, ADJECTIVE, OR NOUN DIRECTLY-
01 政策/NN 的/MJ 协调/NV "/" 问题/NN
88 根据/IN 上述/JJ 协议/NV ,/,
56 俄/NR 海军/NN 教学/NV 中心/NN 。/。
-MODIFIES ANOTHER NOUN DIRECTLY-
87 个/CL 电讯/NN 研究/NV 实验室/NN
-CAN BE THE OBJECT OF AN EXISTENTIAL VERB-
50 有/VE 任何/WD 忽视/NV 。/。
-CAN BE THE OBJECT OF AN ACTION VERB-
05 人员/NN 参加/VA 攻关/NV 。/。
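The contextual examples above use a `word/TAG` notation with a leading sentence number. A minimal Python sketch for reading such lines follows. The helper names `parse_line` and `tags_before` are hypothetical, and the code assumes tokens are whitespace-separated and that the tag follows the last slash in each token.

```python
def parse_line(line):
    """Split a line such as '08 寄/VA 于/IN 开放/NV 。/。' into (sentence_id, tokens)."""
    parts = line.split()
    sentence_id = None
    if parts and "/" not in parts[0]:
        sentence_id = parts.pop(0)
    # rpartition splits on the last slash, so '"/"' becomes ('"', '"').
    tokens = [tuple(p.rpartition("/")[::2]) for p in parts]
    return sentence_id, tokens


def tags_before(tokens, target="NV"):
    """Return the tag immediately preceding each token tagged `target`.

    A target in sentence-initial position has no predecessor and is skipped.
    """
    return [tokens[i - 1][1] for i in range(1, len(tokens)) if tokens[i][1] == target]


sid, toks = parse_line("08 寄/VA 于/IN 开放/NV 。/。")
```

Collecting the tag that precedes each NV in this way gives a quick view of the contexts listed above, such as a verbal noun following a preposition (IN) or the marker 的 (MJ).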
Numeral--CD
Cardinal numbers are tagged as numerals (CD).
Preposition--IN
Also referred to as "coverbs", prepositions are a closed class
of words that cannot take verbal complements or modify nominals.
With a few exceptions, they occur with noun phrases. This
category is tagged as preposition (IN).
EXAMPLE: 在/IN 'at'
把/IN '(direct object marker)'
被/IN '(agent marker)'
给/IN 'for, by'
对/IN 'to'
靠/IN 'depend on'
Pronouns--PN
Words like 我,你,您,他,它,她,(们),自己,别,其它,其, and 大家 are tagged as
pronouns (PN).
Sentence particles--SP
The particles 了, 呢, 吧, 啊, 呀, and 吗 are tagged as sentence-final
particles (SP).
Subordinating Conjunctions--SC
Words that join clauses, one subordinate to another, are tagged as
subordinating conjunctions (SC).
EXAMPLE: 如果/SC 'if' 从而/SC 'thereby'
如/SC 'if' 并/SC 'moreover'
由于/SC 'since' 但/SC 'but'
以来/SC 'since' 而/SC 'and yet'
因为/SC 'since' 就/SC 'then'
尽管/SC 'no matter' 因此/SC 'thus/therefore'
虽然/SC 'although' 而且/SC 'moreover'
Verb active--VA
Words which can occur with 不 or 没 and which cannot occur with adverbs
of extent such as 很__ or 非常__ are tagged as active verbs (VA).
Verb copular--VC
Words like 是 and 为 are tagged as copular verbs (VC).
Verb existential--VE
The word 有 is tagged as an existential verb (VE). It is the high-frequency
verb of presentation or possession as in the frame VE ((DT) (CD) MW) NN:
EXAMPLE: 有/VE (这/DT) 三/CD 个/MW 人/NN '(These) 3 men were (there).'
or 他/PN 有/VE 三/CD 个/MW 'He has three (of them).'
But when followed by a CD MW string which is a temporal expression, this
word 有 functions idiosyncratically more like a type of preposition as in:
EXAMPLE: 他/PN 有/VE (这/DT) 三/CD 年/MW 学习/VA 汉语/NN
'He studied Chinese for three years.'
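The frame VE ((DT) (CD) MW) NN can be matched by stepping through a tag sequence. The Python sketch below copies the frame and the MW tag exactly as written in this section and treats the whole (DT) (CD) MW group as optional; the function name `match_existential_frame` is an assumption made for illustration.

```python
def match_existential_frame(tokens, start=0):
    """Return the index just past a VE ((DT) (CD) MW) NN frame, or None."""
    tags = [tag for _, tag in tokens]
    i = start
    if i >= len(tags) or tags[i] != "VE":
        return None
    i += 1
    # Optional group: an optional DT, an optional CD, then a required MW.
    j = i
    if j < len(tags) and tags[j] == "DT":
        j += 1
    if j < len(tags) and tags[j] == "CD":
        j += 1
    if j < len(tags) and tags[j] == "MW":
        i = j + 1  # the group is present, so consume it
    if i < len(tags) and tags[i] == "NN":
        return i + 1
    return None


presentational = [("有", "VE"), ("这", "DT"), ("三", "CD"), ("个", "MW"), ("人", "NN")]
temporal = [("他", "PN"), ("有", "VE"), ("三", "CD"), ("年", "MW"), ("学习", "VA")]
```

The temporal sentence does not fit the frame because a verb, not a noun, follows the CD MW string, which is consistent with the different behavior of 有 described above.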
Verb stative--VS
Words which can occur with 不 or 没 and which can occur with adverbs of
extent as in 很__ or 非常__ are tagged as stative verbs (VS). These are
linguistically principled distinctions upheld jointly by the PRC's
Beijing University and the Taiwan Academia Sinica's ROCLING
computational linguistics association. In comparison with English, these words appear
similar to the English copular verb plus a predicate adjective when occurring in the
predicative position but appear similar to English adjectives when
occurring in the attributive or complement position.
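The diagnostics given in these guidelines for adjectives (JJ), active verbs (VA), and stative verbs (VS) amount to a small decision procedure. The Python sketch below encodes it under the assumption that the distributional judgments (whether a word takes 不 or 没, whether it takes 很 or 非常, whether it is attributive-only) are supplied by an annotator. The function name `classify_predicate` and the boolean flags are hypothetical.

```python
def classify_predicate(takes_negation, takes_degree_adverb, attributive_only=False):
    """Apply the JJ / VA / VS diagnostics; return None when none applies.

    takes_negation:      the word can occur with 不 or 没
    takes_degree_adverb: the word can occur with 很 or 非常
    attributive_only:    the word occurs exclusively in attributive position
    """
    if attributive_only and not takes_degree_adverb:
        return "JJ"
    if takes_negation:
        return "VS" if takes_degree_adverb else "VA"
    return None


# Judgments below are assumed for illustration; the resulting tags match
# those shown for these words elsewhere in the guidelines.
tag_xi = classify_predicate(takes_negation=True, takes_degree_adverb=False)       # 洗
tag_ganjing = classify_predicate(takes_negation=True, takes_degree_adverb=True)   # 干净
tag_guoji = classify_predicate(False, False, attributive_only=True)               # 国际
```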