CanCLID: Cantonese Computational Linguistics Infrastructure

CanCLID

The Cantonese Computational Linguistics Infrastructure Development Workgroup (CanCLID) is a collaborative effort to build open resources and tooling for Cantonese language technology.

Contributions

Corpus development

Collected and curated a high-quality corpus of online text in Guangzhou Cantonese.
Organized text data for later language-resource and NLP use.

Classification and NLP

Trained a Cantonese language model on Guangzhou and Hong Kong text corpora.
Built a Cantonese/Mandarin text classifier using language-model-based classification.
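
The idea of language-model-based classification can be illustrated with a minimal sketch: train one small language model per language, score a sentence under each, and pick the label whose model assigns the higher probability. Everything below is an assumption made for illustration, not the workgroup's actual model: the character-bigram model with add-one smoothing, the names `CharBigramLM` and `classify`, and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character-bigram language model with add-one smoothing (illustrative only)."""

    def __init__(self, sentences):
        self.bigrams = Counter()
        self.unigrams = Counter()
        self.vocab = set()
        for s in sentences:
            padded = "^" + s + "$"  # start and end markers
            self.vocab.update(padded)
            for a, b in zip(padded, padded[1:]):
                self.bigrams[(a, b)] += 1
                self.unigrams[a] += 1

    def log_prob(self, sentence, vocab_size):
        """Smoothed log-probability of the sentence under this model."""
        padded = "^" + sentence + "$"
        total = 0.0
        for a, b in zip(padded, padded[1:]):
            num = self.bigrams[(a, b)] + 1
            den = self.unigrams[a] + vocab_size
            total += math.log(num / den)
        return total


def classify(sentence, models):
    """Return the label whose model gives the sentence the highest log-probability."""
    # Use one shared vocabulary size so the smoothed scores are comparable.
    vocab_size = len(set().union(*(m.vocab for m in models.values())))
    return max(models, key=lambda label: models[label].log_prob(sentence, vocab_size))


# Toy training data, invented for this example.
models = {
    "cantonese": CharBigramLM(["我哋而家去邊度食飯", "佢唔喺度", "你食咗飯未呀", "呢啲嘢係咩嚟㗎"]),
    "mandarin": CharBigramLM(["我們現在去哪裡吃飯", "他不在這裡", "你吃飯了嗎", "這些東西是什麼"]),
}

print(classify("佢哋喺邊度", models))
print(classify("他們在哪裡", models))
```

A real classifier would need far more training text and a stronger model, but the decision rule (compare per-language likelihoods) stays the same.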

Mozilla Common Voice localization

Led Cantonese localization work, including UI translation.
Worked on corpus collection/refinement, audio recording, and recording verification.

Input method resources

Optimized the Cantonese input method (IME) algorithm to improve functionality and prediction accuracy.
Worked on data and usability problems around Cantonese text input.

Methods

Python and pandas for data processing.
Corpus cleaning and filtering.
Git/GitHub collaboration for open language-resource work.
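
As a rough illustration of what corpus cleaning and filtering can involve, the sketch below normalizes each line, strips URLs, drops lines that are too short or contain too few Chinese characters, and removes exact duplicates. The function names (`clean_line`, `cjk_ratio`, `clean_corpus`), the thresholds, and the sample lines are assumptions chosen for the example, not the project's actual pipeline.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
WS_RE = re.compile(r"\s+")
# Basic CJK block plus Extension A and the supplementary planes,
# since some Cantonese-specific characters sit outside the basic block.
CJK_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\U00020000-\U0002ebef]")


def clean_line(text):
    """Normalize Unicode, remove URLs, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


def cjk_ratio(text):
    """Fraction of characters in the line that are CJK ideographs."""
    if not text:
        return 0.0
    return len(CJK_RE.findall(text)) / len(text)


def clean_corpus(lines, min_chars=4, min_cjk_ratio=0.5):
    """Clean each line, filter by length and CJK ratio, and drop exact duplicates."""
    seen = set()
    kept = []
    for raw in lines:
        line = clean_line(raw)
        if len(line) < min_chars or cjk_ratio(line) < min_cjk_ratio:
            continue
        if line in seen:
            continue
        seen.add(line)
        kept.append(line)
    return kept


raw_lines = [
    "  我哋而家去邊度食飯  https://example.com/x ",
    "我哋而家去邊度食飯",
    "lol",
    "hello world 好",
    "佢唔喺度呀",
]
print(clean_corpus(raw_lines))
```

With pandas, the same per-line function could be applied to a text column with `df["text"].map(clean_line)` before filtering and calling `drop_duplicates()`.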

Project Status: Core contributor (2020 - Present)
Organization: Cantonese Computational Linguistics Infrastructure Development Workgroup
Repository: github.com/CanCLID

Zinan Liang / Tsinam Leung (梁梓楠)