Most of the following files are provided in simplified (".gb") format. A traditional version (".b5") may be obtained through the use of a conversion program such as iconv.
| Name | Contents | Source | Download |
|---|---|---|---|
| cedict | A Chinese-English dictionary produced through a collaborative internet-based effort. | Here | HTTP |
| compphrase | A list of Chinese phrases (mostly longer than 2 characters). | Here | HTTP |
| compounds | 2-character compound data: characters, frequency, English definition (originally 'phrases.dat'). | Here | HTTP |
| parts | Composition info: character, radical number, remainder. | Here | FTP |
| radicals | Radical stroke counts, and "extra" strokes (normally counted as part of residue/remainder). | Here | FTP |
| tsi | A list of Chinese characters, words, and phrases with frequency and pronunciation (in zhuyin fuhao/bopomofo format), obtained from the libtabe 0.2.3 distribution. | Here | HTTP |
| zidian | List of character, pinyin, English definition. | Here | FTP |
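The conversion mentioned above the table can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to build these files: the function name `reencode` is hypothetical, and the codec names `gb2312` and `big5` are assumed to match the ".gb" and ".b5" files. Note that this kind of conversion only changes the byte encoding; it does not map simplified character forms to their traditional counterparts, so characters with no equivalent in the target encoding are replaced by "?".

```python
def reencode(data: bytes, src: str = "gb2312", dst: str = "big5") -> bytes:
    """Decode bytes from the src encoding and re-encode them as dst."""
    text = data.decode(src)
    # errors="replace" substitutes "?" for any character dst cannot represent
    return text.encode(dst, errors="replace")


# Both characters below exist in GB2312 and in Big5, so nothing is lost.
sample = "中文".encode("gb2312")
converted = reencode(sample)
print(converted.decode("big5"))
```

For a whole file, the same function could be applied to the bytes read from it in binary mode, with the result written back out in binary mode.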
Here is some minimal background on the encodings themselves.
| Encoding | Purpose |
|---|---|
| Guobiao | Mainland China's official scheme for simplified character encoding |
| Big5 | A widely used standard in Taiwan and Hong Kong for traditional character encoding |
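| Unicode | A character-set standard covering most of the world's major writing systems; in its two-byte form (UCS-2, or UTF-16 for the Basic Multilingual Plane) each common hanzi takes 2 bytes |
| UTF-8 | A Unix file-system-safe encoding of the Unicode character set that uses 1 to 4 bytes per character (hanzi in the Basic Multilingual Plane use 3 bytes; rarer ones in the supplementary planes use 4) |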
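The byte counts in the table can be checked directly. The sketch below uses Python's built-in codecs; the sample character and the choice of `utf-16-be` to stand in for the two-byte Unicode form are assumptions made for illustration.

```python
ch = "中"  # a common hanzi present in GB2312, Big5, and Unicode

# Number of bytes the same character occupies in each encoding
sizes = {
    enc: len(ch.encode(enc))
    for enc in ("gb2312", "big5", "utf-16-be", "utf-8")
}
print(sizes)

# UTF-8 is variable-width: ASCII takes 1 byte, and a hanzi outside the
# Basic Multilingual Plane (here U+20000) takes 4.
ascii_len = len("A".encode("utf-8"))
rare_len = len("\U00020000".encode("utf-8"))
print(ascii_len, rare_len)
```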