Chinese Character Data

In the course of developing a Chinese dictionary and character memorization aid program (Hanzim), I investigated a large number of public domain Chinese dictionary and character data sources available online. The data included:

frequency data for characters and multi-character words
composition data for single characters (radical, phonetic component)
mappings from character to pinyin Mandarin pronunciation
mappings from characters and multi-character words to English meaning

Most of the following files are provided in simplified (".gb") format. A traditional version (".b5") may be obtained with a conversion program such as iconv.
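As a rough illustration of the re-encoding step, here is a minimal sketch that uses Python's built-in codecs in place of iconv. The helper name `reencode` and the codec names "gb2312" and "big5" are my own choices for the example, not something taken from the files themselves. Note that a plain re-encoding only succeeds for characters present in both character sets; anything missing from the target set is replaced with "?" here.

```python
def reencode(data: bytes, src: str = "gb2312", dst: str = "big5") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target.

    Characters that have no counterpart in the target encoding are
    replaced with '?' instead of raising an error.
    """
    text = data.decode(src)
    return text.encode(dst, errors="replace")


# Both characters below exist in GB2312 and in Big5, so they survive intact.
gb_bytes = "中文".encode("gb2312")
b5_bytes = reencode(gb_bytes)
print(b5_bytes.decode("big5"))
```

For a whole file, the same function can be applied to the bytes read with `open(path, "rb")`.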

Name	Contents	Source	Download
cedict	A Chinese-English dictionary produced through a collaborative internet-based effort.	Here	HTTP
compphrase	A list of Chinese phrases (mostly longer than 2 characters).	Here	HTTP
compounds	2-character compound data: characters, frequency, English definition (originally 'phrases.dat').	Here	HTTP
parts	Composition info: character, radical number, remainder.	Here	FTP
radicals	Radical stroke counts, and "extra" strokes (normally counted as part of residue/remainder).	Here	FTP
tsi	A list of Chinese characters, words, and phrases with frequency and pronunciation (in zhuyin fuhao/bopomofo format), obtained from the libtabe project 0.2.3 distribution.	Here	HTTP
zidian	List of character, pinyin, English definition.	Here	FTP
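To give an idea of how such a file might be read, here is a small parsing sketch for dictionary lines. It assumes the classic CEDICT line layout of `word [pinyin] /definition 1/definition 2/`, with `#` marking comment lines; the copy linked above may differ, so treat the pattern as an assumption to check against the actual file. The names `Entry` and `parse_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: word, pinyin in square brackets, slash-delimited definitions.
ENTRY_RE = re.compile(r"^(\S+)\s+\[([^\]]*)\]\s+/(.*)/\s*$")


class Entry(NamedTuple):
    word: str
    pinyin: str
    definitions: list


def parse_line(line: str) -> Optional[Entry]:
    """Parse one dictionary line; return None for comments or non-matching lines."""
    if line.startswith("#"):
        return None
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    word, pinyin, defs = match.groups()
    return Entry(word, pinyin, defs.split("/"))


entry = parse_line("中国 [zhong1 guo2] /China/Middle Kingdom/")
print(entry.word, entry.pinyin, entry.definitions)
```

The file should be opened with the encoding it was saved in (for example `open(path, encoding="gb2312")` for a ".gb" file) before lines are passed to `parse_line`.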

Here is some minimal background on the encodings themselves.

Encoding	Purpose
Guobiao	Mainland China's official scheme for simplified character encoding
Big5	A widely used standard in Taiwan and Hong Kong for traditional character encoding
Unicode	A universal character standard for representing most of the world's major writing systems, originally designed around two-byte (16-bit) codes
UTF-8	A Unix file-system-safe encoding of the Unicode character set that uses 1 to 4 bytes per character (common hanzi use 3 bytes; rarer ones outside the Basic Multilingual Plane use 4)
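The byte counts in the table can be checked directly. The sketch below encodes a single character under several codecs and reports the length of each result; the helper name `encoded_lengths` is my own, and "utf-16-be" stands in here for the two-byte form of Unicode.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the encoded byte length of ch per codec, or None if unencodable."""
    lengths = {}
    for codec in ("gb2312", "big5", "utf-16-be", "utf-8"):
        try:
            lengths[codec] = len(ch.encode(codec))
        except UnicodeEncodeError:
            lengths[codec] = None
    return lengths


# A common hanzi: two bytes in GB and Big5, three in UTF-8.
print(encoded_lengths("中"))
# → {'gb2312': 2, 'big5': 2, 'utf-16-be': 2, 'utf-8': 3}

# U+20000, a rare hanzi outside the Basic Multilingual Plane.
print(encoded_lengths("\U00020000"))
# → {'gb2312': None, 'big5': None, 'utf-16-be': 4, 'utf-8': 4}
```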
