BoPoMoFo, like PinYin and the other Romanization systems used elsewhere, is the most common method for learning Mandarin Chinese pronunciation in Taiwan. The system consists of 37 phonetic symbols plus 5 tone symbols. Each basic Chinese pronunciation is made up of at most 3 phonetic symbols and exactly one tone symbol. Each of the 37 phonetic symbols can appear at most once in a pronunciation, and not every combination of phonetic symbols is meaningful. The 37 phonetic symbols are partitioned into 3 mutually exclusive groups, and at most one symbol from each group can be used in a pronunciation. The phonetic symbols that make up a pronunciation therefore come from different groups, at most one from each.
Each of the 37 phonetic symbols is assigned a number from 1 to 37, and the 5 tone symbols are assigned numbers from 38 to 42. `0' is used to indicate that there is no symbol in a position. The pronunciation code of the word `electricity' is thus (5, 22, 33, 41).
To make the code easier to process, we designed an encoding that uses 15 bits, i.e., less than 2 bytes, to represent it. The aim is to keep every combination representable while remaining space-efficient. The first group has 21 phonetic symbols, the second group has 3, and the third group has 13, as shown in Table 1. So we use 6 bits for the first group, 2 bits for the second group, and 4 bits for the third group, plus 3 bits for the tone symbol.
| | 1st Group | 2nd Group | 3rd Group | Tone Symbols |
|---|---|---|---|---|
| # of Symbols | 21 | 3 | 13 | 5 |
| # of Bits | 6 | 2 | 4 | 3 |
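The numbering and grouping described above can be checked mechanically. The following is a minimal sketch, not the library's actual code: the names `GROUP_RANGES`, `FIELD_BITS`, and `is_valid_code` are hypothetical, and the check only covers the number ranges, not whether a combination is a meaningful pronunciation.

```python
# Symbol-number ranges per position, as given in the text:
# 1st group = 1..21, 2nd group = 22..24, 3rd group = 25..37, tones = 38..42.
# All names here are hypothetical and used only for illustration.
GROUP_RANGES = [(1, 21), (22, 24), (25, 37), (38, 42)]
FIELD_BITS = [6, 2, 4, 3]  # bits per position, 15 in total


def is_valid_code(code):
    """Return True if `code` is a 4-tuple (g1, g2, g3, tone) within range."""
    if len(code) != 4:
        return False
    for position, (value, (low, high)) in enumerate(zip(code, GROUP_RANGES)):
        if value == 0:
            # 0 means "no symbol"; the text requires exactly one tone symbol,
            # so only the three phonetic positions may be empty.
            if position == 3:
                return False
            continue
        if not low <= value <= high:
            return False
    return True
```

For instance, `is_valid_code((5, 22, 33, 41))` accepts the code for `electricity', while `(22, 5, 33, 41)` is rejected because the symbols are not in group order.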
The value stored for each symbol is its offset within its group plus 1 (0 is reserved for "no symbol"). So the encoding of the example code above is
"0 000101(5) 01(22-21=1) 1001(33-24=9) 100(41-37=4)"
which is 2764 in decimal; the leading 0 is the unused high bit of the 2-byte value. All the pronunciation-related functions use this encoding as their internal pronunciation representation and storage format.
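The packing step can be sketched as a pair of shift-and-mask routines. This is an illustrative sketch under the bit layout stated in the text (6, 2, 4, and 3 bits, most significant field first); the names `GROUP_BASE`, `FIELD_BITS`, `encode`, and `decode` are assumptions and are not taken from the actual implementation.

```python
# Hypothetical helpers illustrating the 15-bit layout described in the text.
GROUP_BASE = (0, 21, 24, 37)  # symbol number just before each group starts
FIELD_BITS = (6, 2, 4, 3)     # field widths, most significant field first


def encode(code):
    """Pack a (g1, g2, g3, tone) symbol-number tuple into a 15-bit integer."""
    packed = 0
    for value, base, bits in zip(code, GROUP_BASE, FIELD_BITS):
        # Store the 1-based position within the group; 0 stays "no symbol".
        field = value - base if value else 0
        packed = (packed << bits) | field
    return packed


def decode(packed):
    """Unpack a 15-bit integer back into a (g1, g2, g3, tone) tuple."""
    code = []
    # Peel fields off from the least significant end (the tone) upward.
    for base, bits in zip(reversed(GROUP_BASE), reversed(FIELD_BITS)):
        field = packed & ((1 << bits) - 1)
        packed >>= bits
        code.append(field + base if field else 0)
    return tuple(reversed(code))


print(encode((5, 22, 33, 41)))  # 2764
```

Decoding 2764 with this sketch returns `(5, 22, 33, 41)`, so the round trip reproduces the example code.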
The encoding system was inspired by the ETen Chinese System's hashing function, which was brought to our attention by Yung-Ching Hsiao.