Chinese Character Data

In the course of developing Hanzim, a Chinese dictionary and character memorization aid program, I investigated a large number of public domain Chinese dictionary and character data sources available online. The data are listed in the table below.

As some of these have since disappeared from the web, I am making them available here.

Most of the following files are provided in simplified (".gb") format. A traditional version (".b5") may be obtained through the use of a conversion program such as iconv.
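For readers without iconv at hand, here is a minimal sketch of the same kind of re-encoding step using Python's standard codecs. The function name `transcode` and the sample string are hypothetical, chosen only for illustration, and the sketch assumes the input bytes really are GB2312. It changes only the byte encoding: any character that has no Big5 code point is replaced with "?", so the result is worth checking before relying on it.

```python
def transcode(data: bytes, src: str = "gb2312", dst: str = "big5") -> bytes:
    """Decode bytes from the src encoding and re-encode them as dst."""
    text = data.decode(src)
    # Characters with no code point in dst become "?" instead of raising.
    return text.encode(dst, errors="replace")


# U+4E2D U+6587 ("zhong wen"); both characters exist in GB2312 and in Big5.
sample = "\u4e2d\u6587".encode("gb2312")
converted = transcode(sample)
print(converted.hex())
```

To convert a whole file, one would read it in binary mode, pass the bytes through `transcode`, and write the result back out in binary mode.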

Name | Contents | Source | Download
cedict | A Chinese-English dictionary produced through a collaborative internet-based effort. | Here | HTTP
characters.dat | Frequency info on characters and stroke count. | Here | HTTP
compphrase | A list of Chinese phrases (mostly > 2 char). | Here | HTTP
compounds | 2-character compound data: characters, frequency, English definition (originally 'phrases.dat'). | Here | HTTP
ciyu | English-Chinese dictionary with one-word English glosses, containing 1-, 2-, and multicharacter entries; this work is incomplete, as many of the supposed English glosses are nothing but pinyin representations (of, I believe, Taiwanese pronunciations). | Here | HTTP
dict.zip | A much-enhanced version of the ciyu English-Chinese dictionary together with a reasonably extensive Chinese-English dictionary and a Windows program providing an interface to it. | Gone | HTTP
parts | Composition info: character, radical number, remainder. | Here | HTTP
radicals | Radical stroke counts, and "extra" strokes (normally counted as part of residue/remainder). | Here | HTTP
tsi | A list of Chinese characters, words, and phrases with frequency and pronunciation in zhuyin fuhao (bopomofo) format, obtained from the libtabe project 0.2.3 distribution. | Here | HTTP
zidian | List of character, pinyin, English definition. | Here | HTTP


Here is some minimal background on the encodings themselves.

Encoding | Purpose
Guobiao (GB) | Mainland China's official scheme for simplified character encoding
Big5 | A widely used standard in Taiwan and Hong Kong for traditional character encoding
Unicode | A character set standard for representing most of the world's major writing systems; originally designed around two-byte (16-bit) codes, it has since grown beyond that range
UTF-8 | A Unix file-system-safe encoding of the Unicode character set using 1 to 4 bytes per character (all commonly used hanzi take 3 bytes; rarer ones outside the Basic Multilingual Plane take 4)
UTF-7 | A mail-safe, 7-bit ASCII encoding of the Unicode character set that uses more bytes per character
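
To make the byte counts in the table concrete, the following minimal sketch encodes a single hanzi in each scheme with Python's standard codecs and reports how many bytes each one takes. The names `ENCODINGS` and `encoded_lengths` are hypothetical, and "utf-16-be" stands in here for the two-byte form of Unicode.

```python
# Python codec names for the encodings in the table above.
ENCODINGS = ["gb2312", "big5", "utf-16-be", "utf-8", "utf-7"]


def encoded_lengths(char: str) -> dict:
    """Return the number of bytes needed to encode char in each encoding."""
    return {name: len(char.encode(name)) for name in ENCODINGS}


# U+4E2D ("zhong", middle) is present in both GB2312 and Big5.
lengths = encoded_lengths("\u4e2d")
for name, n in lengths.items():
    print(f"{name:10s} {n} bytes")
```

A character missing from GB2312 or Big5 would raise UnicodeEncodeError here, which is itself a quick way to find out whether a given character exists in one of the legacy encodings.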

Note that there are a couple of versions of GB that appear to be slightly incompatible, and a number of versions of Big5 that are apparently even more incompatible with one another. I'm not sure which versions the programs here apply to, but I've used them successfully with gb2312-1980-0, gb2312-1980-1, and the "eten" version of Big5. There is also an encoding called "CNS", which is the government (but not commercial) standard of Taiwan; I have not seen it used very much.


