CanCLID: Cantonese Computational Linguistics Infrastructure
The Cantonese Computational Linguistics Infrastructure Development Workgroup (CanCLID) is a collaborative effort to build open resources and tooling for Cantonese language technology.
Contributions
Corpus development
- Collected and cleaned Guangzhou Cantonese text from online sources.
- Organized the cleaned text for reuse in language-resource and NLP work.
Classification and NLP
- Prepared Cantonese and Mandarin corpus data for text classification workflows.
- Used language-model-based methods to help distinguish Cantonese text from Mandarin text.
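As an illustration of the language-model idea, here is a minimal sketch that trains one character unigram model per language and labels a sentence by whichever model gives it the higher log-probability. The specific model, the class name `CharUnigramLM`, the `classify` helper, and the toy training sentences are all assumptions made for this example; the project's actual models and data are not described here and are likely more capable.

```python
import math
from collections import Counter

# Toy training sentences (hypothetical, for illustration only).
# Both sets use traditional characters so the script alone does not decide.
TRAIN = {
    "yue": ["我哋今日去邊度食飯呀", "佢話唔知點解係咁嘅", "你食咗飯未呀"],
    "cmn": ["我們今天去哪裡吃飯", "他說不知道為什麼是這樣的", "你吃了飯沒有"],
}


class CharUnigramLM:
    """Character unigram language model with add-one smoothing."""

    def __init__(self, sentences, vocab_size):
        self.counts = Counter(ch for s in sentences for ch in s)
        self.total = sum(self.counts.values())
        self.vocab_size = vocab_size

    def log_prob(self, text):
        # Unseen characters get a count of zero plus the smoothing constant.
        denom = self.total + self.vocab_size
        return sum(math.log((self.counts[ch] + 1) / denom) for ch in text)


# Shared vocabulary across both languages, so smoothing is comparable.
VOCAB = {ch for sents in TRAIN.values() for s in sents for ch in s}
MODELS = {lang: CharUnigramLM(sents, len(VOCAB)) for lang, sents in TRAIN.items()}


def classify(text):
    """Return the language code whose model scores the text highest."""
    return max(MODELS, key=lambda lang: MODELS[lang].log_prob(text))


if __name__ == "__main__":
    for sentence in ["佢哋唔食飯", "他們不吃飯"]:
        print(sentence, classify(sentence))
```

The labels `yue` and `cmn` are the ISO 639-3 codes for Cantonese and Mandarin. A unigram model mostly picks up on distinctive function characters such as 佢, 哋, and 唔; a higher-order n-gram or neural language model would follow the same compare-the-scores pattern.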
Mozilla Common Voice localization
- Supported Cantonese UI translation.
- Helped with text corpus curation, audio recording, and validation.
Input method resources
- Contributed to algorithm and resource development for Cantonese input method editors (IMEs).
- Worked on data and usability problems in Cantonese text input.
Methods
- Python and pandas for data processing.
- Corpus cleaning and filtering.
- Git and GitHub for collaborating on open language resources.
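The cleaning and filtering step can be sketched with pandas roughly as follows. This is an illustrative example under stated assumptions, not the project's actual pipeline: the column name `text`, the function `clean_corpus`, the `min_han` threshold, and the sample rows are all hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; a real corpus would be loaded from files.
raw = pd.DataFrame({"text": [
    "  今日天氣好好呀！  ",
    "今日天氣好好呀！",
    "http://example.com",
    "",
    "你食咗飯未？ http://example.com/a",
]})

URL_PATTERN = r"https?://\S+"
# Basic CJK Unified Ideographs block, written as literal characters.
HAN_PATTERN = "[\u4e00-\u9fff]"


def clean_corpus(df, min_han=2):
    """Strip URLs and extra whitespace, then drop short and duplicate rows."""
    out = df.copy()
    out["text"] = (
        out["text"]
        .str.replace(URL_PATTERN, "", regex=True)  # remove URLs
        .str.replace(r"\s+", " ", regex=True)      # collapse whitespace runs
        .str.strip()
    )
    # Keep rows with at least `min_han` Han characters.
    han_count = out["text"].str.count(HAN_PATTERN)
    out = out[han_count >= min_han]
    # Drop exact duplicates that appear after normalization.
    return out.drop_duplicates(subset="text").reset_index(drop=True)


cleaned = clean_corpus(raw)

if __name__ == "__main__":
    print(cleaned)
```

Normalizing before deduplicating matters here: the first two sample rows differ only in surrounding whitespace and collapse into one row once stripped.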
Project Status: Active contributor (2020 - Present)
Organization: Cantonese Computational Linguistics Infrastructure Development Workgroup
Repository: github.com/CanCLID
