As some of these have since disappeared from the web, I am making them available here.
Most of the following files are provided in simplified (".gb") format. A traditional version (".b5") may be obtained through the use of a conversion program such as iconv.
| Name | Contents | Source | Download |
|---|---|---|---|
| cedict | A Chinese-English dictionary produced through a collaborative internet-based effort. | Here | HTTP |
| characters.dat | Frequency info on characters and stroke count. | Here | HTTP |
| compphrase | A list of Chinese phrases (mostly > 2 char). | Here | HTTP |
| compounds | 2-character compound data: characters, frequency, English definition (originally 'phrases.dat'). | Here | HTTP |
| ciyu | English-Chinese dictionary with one-word English glosses, containing 1-, 2-, and multicharacter entries. This work is incomplete, as many of the supposed English glosses are nothing but pinyin representations (of, I believe, Taiwanese pronunciations). | Here | HTTP |
| dict.zip | A much enhanced version of the ciyu English-Chinese dictionary together with a reasonably extensive Chinese-English dictionary and a Windows program providing an interface to it. | Gone | HTTP |
| parts | Composition info: character, radical number, remainder. | Here | HTTP |
| radicals | Radical stroke counts, and "extra" strokes (normally counted as part of residue/remainder). | Here | HTTP |
| tsi | A list of Chinese characters, words, and phrases with frequency and pronunciation (in zhuyin fuhao/bopomofo format), obtained from the libtabe project 0.2.3 distribution. | Here | HTTP |
| zidian | List of character, pinyin, English definition. | Here | HTTP |
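
For readers without iconv at hand, the ".gb" to ".b5" conversion mentioned above can be roughly sketched in Python. This is only a sketch under assumptions: the helper name `gb_to_big5` is made up here, the input is assumed to be GB2312-encoded bytes, and the code merely re-encodes characters. It does not map simplified forms to their traditional equivalents, so any character with no Big5 code point is replaced with `?`.

```python
def gb_to_big5(data: bytes) -> bytes:
    """Re-encode GB2312 bytes as Big5 (hypothetical helper, for illustration only)."""
    # Decode the GB2312 bytes into Unicode text first.
    text = data.decode("gb2312")
    # Re-encode as Big5; characters with no Big5 code point become "?".
    return text.encode("big5", errors="replace")


if __name__ == "__main__":
    # Both sample characters exist in GB2312 and in Big5.
    sample = "中文".encode("gb2312")
    print(gb_to_big5(sample))
```

Going through Unicode as an intermediate step keeps the sketch short, since Python ships codecs for both encodings. A full simplified-to-traditional conversion would additionally need a character mapping table, which is not shown here.
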
Here is some minimal background on the encodings themselves.
| Encoding | Purpose |
|---|---|
| Guobiao | Mainland China's official scheme for simplified character encoding |
| Big5 | A widely used standard in Taiwan and Hong Kong for traditional character encoding |
| Unicode | A character set standard covering most of the world's major writing systems, originally designed around two-byte (16-bit) codes |
| UTF-8 | A Unix file-system-safe encoding of the same character set as Unicode, using 1 to 4 bytes per character (the common hanzi use 3 bytes) |
| UTF-7 | A mail-safe encoding (using only 7-bit ASCII characters) of the same character set as Unicode, at the cost of more bytes per character |
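
To make the table a little more concrete, the following sketch encodes a single hanzi with Python's built-in codecs and reports how many bytes each encoding needs. The sample character, the helper name `encoded_lengths`, and the use of `utf-16-be` to stand in for two-byte Unicode are all assumptions made for illustration.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes `ch` occupies in each encoding (illustrative helper)."""
    codecs = ["gb2312", "big5", "utf-16-be", "utf-8", "utf-7"]
    return {name: len(ch.encode(name)) for name in codecs}


if __name__ == "__main__":
    # U+4E2D is present in GB2312 and Big5, so every codec can encode it.
    for name, size in encoded_lengths("\u4e2d").items():
        print(f"{name}: {size} bytes")
```

For this character, GB, Big5, and UTF-16 each need two bytes and UTF-8 needs three, while UTF-7 needs more than UTF-8.
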
Note that there are a couple of versions of GB that appear to be slightly incompatible, and a number of versions of Big5 that are apparently even more incompatible. I'm not sure which versions the programs here apply to, but I've used them with success for gb2312-1980-0, gb2312-1980-1, and the "eten" version of Big5. There is also an encoding called "CNS", which is the government (but not commercial) standard of Taiwan; I have not seen it used very much.