Chinese Character Data

While developing Hanzim, a Chinese dictionary and character memorization aid, I investigated a large number of public domain Chinese dictionary and character data sources available online. The resulting data files are listed below.

Most of the following files are provided in simplified-character, GB-encoded (".gb") form. A traditional, Big5-encoded (".b5") version may be obtained with a conversion program such as iconv.
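
For readers who would rather work in UTF-8 than in GB or Big5, the re-encoding step can be sketched in a few lines of Python. This is a minimal sketch, not the procedure used for Hanzim: the function name `transcode` and the file names in the usage note are hypothetical, and the source files are assumed to be plain GB2312 text. Note that a codec only changes the byte encoding; it does not map simplified characters to their traditional forms.

```python
from pathlib import Path


def transcode(src: Path, dst: Path, src_encoding: str = "gb2312",
              dst_encoding: str = "utf-8") -> int:
    """Re-encode a text file and return the number of characters written.

    Assumption: the whole file fits in memory and is valid in src_encoding.
    """
    # Decode the raw bytes into Python's internal Unicode string.
    text = src.read_bytes().decode(src_encoding)
    # Encode the same characters with the target encoding.
    dst.write_bytes(text.encode(dst_encoding))
    return len(text)
```

A call such as `transcode(Path("zidian.gb"), Path("zidian.utf8"))` would produce a UTF-8 copy of a hypothetical local file. Passing `dst_encoding="big5"` raises `UnicodeEncodeError` for any character that has no Big5 code.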

| Name | Contents | Source | Download |
|---|---|---|---|
| cedict | A Chinese-English dictionary produced through a collaborative internet-based effort. | Here | HTTP |
| compphrase | A list of Chinese phrases (mostly longer than 2 characters). | Here | HTTP |
| compounds | 2-character compound data: characters, frequency, English definition (originally 'phrases.dat'). | Here | HTTP |
| parts | Composition info: character, radical number, remainder. | Here | FTP |
| radicals | Radical stroke counts, and "extra" strokes (normally counted as part of the residue/remainder). | Here | FTP |
| tsi | A list of Chinese characters, words, and phrases with frequency and pronunciation (in zhuyin fuhao/bopomofo format), obtained from the libtabe project 0.2.3 distribution. | Here | HTTP |
| zidian | A list of characters with pinyin and English definition. | Here | FTP |
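
As an illustration of how such a file might be read, here is a minimal Python sketch for a dictionary line. The exact layout of the files listed above is not documented here, so the layout is an assumption: a headword, an optional second headword (as in current CC-CEDICT, which gives traditional and simplified forms), pinyin in square brackets, and slash-delimited English definitions. The names `Entry` and `parse_line` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    headword: str
    alternate: Optional[str]  # second headword, if the line has one
    pinyin: str
    definitions: List[str]


# Assumed layout: HEADWORD [ALTERNATE] [pin1 yin1] /def 1/def 2/
_LINE = re.compile(r"^(\S+)(?:\s+(\S+))?\s+\[([^\]]*)\]\s+/(.*)/\s*$")


def parse_line(line: str) -> Optional[Entry]:
    """Parse one line; return None for comments and lines that do not match."""
    if line.startswith("#"):
        return None
    match = _LINE.match(line)
    if match is None:
        return None
    headword, alternate, pinyin, defs = match.groups()
    return Entry(headword, alternate, pinyin, defs.split("/"))
```

Lines that do not fit the assumed layout are skipped rather than guessed at, so a different file format shows up as missing entries instead of wrong ones.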

Here is some minimal background on the encodings themselves.

| Encoding | Purpose |
|---|---|
| Guobiao (GB) | Mainland China's official scheme for encoding simplified characters |
| Big5 | A widely used standard in Taiwan and Hong Kong for encoding traditional characters |
| Unicode | A universal character set covering most of the world's major writing systems; its original 16-bit form (UCS-2) stores each character in two bytes |
| UTF-8 | An ASCII-compatible, Unix file-system-safe encoding of the same character set as Unicode, using 1 to 4 bytes per character (hanzi in the Basic Multilingual Plane use 3 bytes; rarer ones in the supplementary planes use 4) |
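
The byte counts can be checked directly with Python's standard codecs. This is a small sketch; the helper name `encoded_lengths` is hypothetical, and `gb2312` and `big5` are the standard-library codec names used here to stand in for Guobiao and Big5.

```python
def encoded_lengths(char: str) -> dict:
    """Return the byte length of `char` in each encoding, or None if unrepresentable."""
    lengths = {}
    for name in ("gb2312", "big5", "utf-16-be", "utf-8"):
        try:
            lengths[name] = len(char.encode(name))
        except UnicodeEncodeError:
            lengths[name] = None  # character has no code in this encoding
    return lengths


# A common hanzi, then a rare one from the supplementary planes (U+20000).
print(encoded_lengths("中"))
print(encoded_lengths("\U00020000"))
```

For "中" this reports 2 bytes in GB2312, Big5, and UTF-16, and 3 bytes in UTF-8. The character U+20000 is not representable in GB2312 or Big5 and takes 4 bytes in both UTF-16 and UTF-8.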

