CanCLID: Cantonese Computational Linguistics Infrastructure

CanCLID

The Cantonese Computational Linguistics Infrastructure Development Workgroup (CanCLID) is a collaborative effort to build open resources and tooling for Cantonese language technology.

Contributions

Corpus development

  • Collected and cleaned online Guangzhou Cantonese text data.
  • Organized the cleaned text for downstream language-resource and NLP work.
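
The cleaning step described above might look roughly like the following. This is a minimal sketch, not the workgroup's actual pipeline: the function names `clean_line` and `clean_corpus`, the `min_chars` threshold, and the sample lines are all illustrative assumptions.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_line(line: str) -> str:
    """Normalize one raw line of scraped text (hypothetical helper)."""
    # NFC keeps full-width CJK punctuation intact; NFKC would fold it to ASCII.
    line = unicodedata.normalize("NFC", line)
    # Remove URLs. Limitation: text glued to a URL without a space is lost too.
    line = URL_RE.sub(" ", line)
    # Collapse all whitespace runs (including ideographic spaces) to one space.
    return WHITESPACE_RE.sub(" ", line).strip()


def clean_corpus(lines, min_chars=2):
    """Clean lines, drop very short ones, remove exact duplicates, keep order."""
    seen = set()
    cleaned = []
    for raw in lines:
        text = clean_line(raw)
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


# Made-up sample input, for illustration only.
raw_lines = [
    "你食咗飯未呀？ https://example.com/post/1",
    "你食咗飯未呀？",
    "  ",
    "今日好熱\u3000真係頂唔順",
]
print(clean_corpus(raw_lines))
```

Deduplicating after normalization, rather than before, lets lines that differ only in trailing URLs or spacing collapse into one entry.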

Classification and NLP

  • Worked with Cantonese and Mandarin corpus data for classification workflows.
  • Used language-model-based methods to distinguish Cantonese from Mandarin text.
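
One simple way to picture a language-model-based Cantonese/Mandarin distinction is to train one character bigram model per language and pick the language whose model scores a sentence higher. The sketch below assumes that approach; the source does not say which models were actually used, and the class `CharBigramLM`, the labels `yue` and `cmn`, and the tiny training sentences are hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "^", "$"  # sentence boundary markers


class CharBigramLM:
    """Character bigram language model with add-one smoothing (toy sketch)."""

    def __init__(self, sentences, vocab_size):
        self.vocab_size = vocab_size
        self.bigram = Counter()   # counts of (previous char, current char)
        self.context = Counter()  # counts of the previous char alone
        for sentence in sentences:
            padded = BOS + sentence + EOS
            for prev, cur in zip(padded, padded[1:]):
                self.bigram[(prev, cur)] += 1
                self.context[prev] += 1

    def log_prob(self, text):
        """Sum of smoothed log P(cur | prev) over the padded text."""
        padded = BOS + text + EOS
        return sum(
            math.log((self.bigram[(prev, cur)] + 1)
                     / (self.context[prev] + self.vocab_size))
            for prev, cur in zip(padded, padded[1:])
        )


# Made-up training data; a real setup would use full corpora.
TRAIN = {
    "yue": ["我哋聽日去飲茶", "佢喺邊度呀", "你食咗飯未", "我唔知佢講乜嘢"],
    "cmn": ["我們明天去喝茶", "他在哪裡啊", "你吃了飯沒有", "我不知道他說什麼"],
}

# Shared vocabulary so both models smooth over the same event space.
vocab = {ch for sents in TRAIN.values() for s in sents for ch in s} | {EOS}
MODELS = {label: CharBigramLM(sents, len(vocab)) for label, sents in TRAIN.items()}


def classify(text):
    """Return the label whose model assigns the higher log-probability."""
    return max(MODELS, key=lambda label: MODELS[label].log_prob(text))


print(classify("佢哋喺邊度食飯"))
print(classify("他們在哪裡吃飯"))
```

With so little training data the margin between the two scores is small; add-one smoothing is used here only because it is easy to read, and characters outside the shared vocabulary are not handled rigorously.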

Mozilla Common Voice localization

  • Supported Cantonese UI translation.
  • Helped with text corpus curation, audio recording, and validation.

Input method resources

  • Contributed to Cantonese IME algorithm and resource development.
  • Worked on data and usability problems around Cantonese text input.
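
The source does not describe the IME algorithm itself, so the following is only a generic illustration of one usability problem in Cantonese text input: letting users type Jyutping with or without tone digits and ranking candidates by frequency. The lexicon entries, frequency values, and the function names `strip_tone`, `build_index`, and `candidates` are all assumptions made for this sketch.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon: (character, Jyutping, made-up frequency).
LEXICON = [
    ("食", "sik6", 900),
    ("識", "sik1", 700),
    ("色", "sik1", 300),
    ("返", "faan1", 800),
    ("飯", "faan6", 500),
    ("詩", "si1", 200),
]


def strip_tone(jyutping: str) -> str:
    """Remove Jyutping tone digits (1-6)."""
    return re.sub(r"[1-6]", "", jyutping)


def build_index(lexicon):
    """Group lexicon entries by toneless syllable."""
    index = defaultdict(list)
    for char, jyutping, freq in lexicon:
        index[strip_tone(jyutping)].append((char, jyutping, freq))
    return index


def candidates(index, typed: str, limit: int = 5):
    """Return candidate characters for typed input, most frequent first.

    Input with a tone digit must match a syllable exactly; toneless input
    is treated as a prefix so partially typed syllables still get results.
    """
    typed = typed.lower()
    toneless = strip_tone(typed)
    has_tone = typed != toneless
    hits = []
    for key, entries in index.items():
        if has_tone:
            hits.extend(e for e in entries if e[1] == typed)
        elif key.startswith(toneless):
            hits.extend(entries)
    hits.sort(key=lambda entry: -entry[2])
    return [char for char, _, _ in hits[:limit]]


INDEX = build_index(LEXICON)
print(candidates(INDEX, "sik"))
print(candidates(INDEX, "sik1"))
print(candidates(INDEX, "fa"))
```

A production input method would also need multi-syllable segmentation and context-aware ranking, which this sketch leaves out.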

Methods

  • Python and pandas for data processing.
  • Corpus cleaning and filtering.
  • Git/GitHub collaboration for open language-resource work.
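
As an illustration of how pandas can support corpus filtering, the sketch below keeps rows that are long enough and mostly written in Han characters. The column names, the thresholds, the helper `han_ratio`, and the sample rows are assumptions, not details taken from the project.

```python
import pandas as pd


def han_ratio(text: str) -> float:
    """Share of characters in the basic CJK Unified Ideographs block.

    Limitation: Cantonese characters encoded in CJK extension blocks are
    not counted, so a real filter may need wider ranges.
    """
    if not text:
        return 0.0
    han = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return han / len(text)


# Made-up rows standing in for scraped text.
df = pd.DataFrame(
    {
        "source": ["forum", "blog", "blog", "news"],
        "text": [
            "今晚食乜好呀",
            "Click here to subscribe!",
            "好",
            " 聽日會落雨，記得帶遮 ",
        ],
    }
)

df["text"] = df["text"].str.strip()
df["n_chars"] = df["text"].str.len()
df["han_ratio"] = df["text"].map(han_ratio)

# Hypothetical thresholds: at least 4 characters, at least half of them Han.
mask = (df["n_chars"] >= 4) & (df["han_ratio"] >= 0.5)
filtered = df[mask].reset_index(drop=True)

print(filtered[["source", "text"]])
```

Keeping the computed `n_chars` and `han_ratio` columns in the frame makes it easy to inspect borderline rows before settling on thresholds.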

Project Status: Active contributor (2020 to present)
Organization: Cantonese Computational Linguistics Infrastructure Development Workgroup
Repository: github.com/CanCLID