libtabe - the way to deal with Chinese

Introduction

After its pioneering work on Chinese i18n/l10n issues, the TaBE Project has extended its goal to more general Chinese language processing issues on computer systems.

libtabe, the latest work made available by the Project, is a library that provides useful functions and routines for Chinese, dealing with fundamental elements such as pronunciation (BoPoMoFo), character frequency, word identification, and word frequency. It also comes with a free word database consisting of more than 130,000 words.

More functionality is expected to be merged into the library in the future.

A practical application of libtabe is bims, an intelligent phonetic input method interface. bims accepts input in BoPoMoFo and converts it into meaningful sentences, a task also known as phoneme-to-character resolution.

The bimsphone module of XCIN-2.5 is based directly on libtabe/bims, and more modules may be based on it in the future.
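To give a feel for what phoneme-to-character resolution involves, here is a minimal sketch in Python. It is not the algorithm bims uses and it does not call libtabe. The `CANDIDATES` table, its counts, and the `resolve` and `char_by_char` functions are all hypothetical, made up for illustration. The sketch only shows why looking at whole words helps: one BoPoMoFo syllable maps to several characters, and the most frequent character is not always the right one.

```python
# Hypothetical table: BoPoMoFo syllable sequence -> {candidate text: made-up count}.
CANDIDATES = {
    ("ㄓㄨㄥ",): {"中": 900, "鐘": 120, "忠": 60},
    ("ㄨㄣˊ",): {"文": 700, "聞": 150},
    ("ㄓㄨㄥ", "ㄨㄣˊ"): {"中文": 400},
    ("ㄕㄨ",): {"書": 500, "輸": 200},
    ("ㄖㄨˋ",): {"入": 600},
    ("ㄕㄨ", "ㄖㄨˋ"): {"輸入": 350},
}
MAX_KEY_LEN = max(len(key) for key in CANDIDATES)


def char_by_char(syllables):
    """Pick the most frequent character for each syllable independently."""
    return "".join(
        max(CANDIDATES[(s,)], key=CANDIDATES[(s,)].get) for s in syllables
    )


def resolve(syllables):
    """Greedily match the longest known syllable group, then take its top candidate."""
    out = []
    pos = 0
    while pos < len(syllables):
        for length in range(min(MAX_KEY_LEN, len(syllables) - pos), 0, -1):
            choices = CANDIDATES.get(tuple(syllables[pos:pos + length]))
            if choices:
                out.append(max(choices, key=choices.get))
                pos += length
                break
        else:
            # Unknown syllable: pass it through unchanged.
            out.append(syllables[pos])
            pos += 1
    return "".join(out)


typed = ["ㄓㄨㄥ", "ㄨㄣˊ", "ㄕㄨ", "ㄖㄨˋ"]
print(char_by_char(typed))  # 中文書入 (wrong homophone for ㄕㄨ)
print(resolve(typed))       # 中文輸入 ("Chinese input")
```

A real input method would have to weigh competing word boundaries against each other rather than always taking the longest match. This sketch ignores that on purpose.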

Problems

Unlike English, Chinese is an ideographic language. A written element (character) in Chinese may represent one or more meanings, while a character written in a row with other characters (a word) may represent other meanings.

The word is the basic unit with which people exchange information. A word in Chinese may range from one character to more than 10 characters, and written Chinese does not put spaces between words. Identifying the words in a Chinese sentence is therefore a well-known problem.

To make an application aware of its content, programmers may try to process the content word by word instead of character by character. However, identifying the words in a sentence is hard, and so is parsing its semantic meaning.
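The difficulty can be seen with a naive approach. The following Python sketch is not libtabe code. The `LEXICON` set and the `longest_match` function are hypothetical. It always takes the longest dictionary word at the current position, and it splits 研究生命起源 ("research the origin of life") the wrong way, because 研究生 ("graduate student") happens to be a longer word than 研究 ("research").

```python
# Hypothetical mini-lexicon; a real word list would be far larger.
LEXICON = {"研究", "研究生", "生命", "命", "起源"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def longest_match(sentence):
    """Split a sentence by always taking the longest lexicon word available."""
    words = []
    pos = 0
    while pos < len(sentence):
        for length in range(min(MAX_WORD_LEN, len(sentence) - pos), 0, -1):
            piece = sentence[pos:pos + length]
            # A single character is always accepted so the loop makes progress.
            if length == 1 or piece in LEXICON:
                words.append(piece)
                pos += length
                break
    return words


print(longest_match("研究生命起源"))
# ['研究生', '命', '起源'], not the intended 研究 / 生命 / 起源
```

Every piece of the wrong split is a valid dictionary entry, so a lexicon alone cannot tell the two readings apart.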

Solutions

Some projects try to solve the problem using the syntax and semantics of the Chinese language. However, constructing both databases is expensive and requires many experts to be involved.

This is not a practical solution for most application programmers, especially those working on open-source software.

Our intention is to solve the problem by gathering statistical information and adding some heuristics, to help programmers make their applications content-aware.

Using a large corpus gathered from the Internet, we are able to capture the way people use Chinese, and thus derive rules to identify words in sentences and provide semantic insight into them.
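As a rough illustration of how statistics can drive word identification, the Python sketch below scores each possible split of a sentence by word frequency and keeps the most probable one. It is an assumption-laden toy, not libtabe's actual method or API. The `WORD_COUNTS` table holds made-up numbers standing in for corpus statistics, and `segment` is a hypothetical name.

```python
import math

# Hypothetical word counts, standing in for statistics gathered from a corpus.
WORD_COUNTS = {
    "研究": 500,
    "研究生": 200,
    "生命": 400,
    "生": 80,
    "命": 50,
    "起源": 300,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(word) for word in WORD_COUNTS)
# Assumed penalty for a single character that is not in the table.
UNKNOWN_LOG_PROB = math.log(0.5 / TOTAL)


def segment(sentence):
    """Return the split whose words have the highest total log-probability."""
    n = len(sentence)
    # best[i] = (score, start of last word) for the best split of sentence[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            piece = sentence[start:end]
            if piece in WORD_COUNTS:
                log_prob = math.log(WORD_COUNTS[piece] / TOTAL)
            elif len(piece) == 1:
                log_prob = UNKNOWN_LOG_PROB
            else:
                continue  # unknown multi-character piece: not a candidate
            score = best[start][0] + log_prob
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        words.append(sentence[start:pos])
        pos = start
    return words[::-1]


print(segment("研究生命起源"))  # ['研究', '生命', '起源']
```

With these counts, the split 研究 / 生命 / 起源 ("research / life / origin") beats 研究生 / 命 / 起源, because 命 is rare as a standalone word. Heuristics would come into play where the counts alone are not decisive.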

Availability

http://download.sourceforge.net/libtabe/libtabe-0.2.3.tgz

libtabe is available at the URL above. With help and suggestions from the dozens of people using it, we are working to make it an ideal tool for programmers.

You can also visit libtabe@SourceForge for more development information.

Table of Contents

Introduction

Problems

Solutions

Availability

Bug Reporting

Documentation

Change Log