AI's global village opens wider to more voices
Artificial intelligence engineer Jacky Chan Ho-kit has conflicting feelings about his industry. While he looks forward to a future where AI reaches its pinnacle — possessing humanlike cognitive capabilities — he is deeply concerned that it will only understand English. "Given the language status quo, this is highly likely to be a reality rather than just alarmism," he said.
Chan is the chief technology officer at Votee, a Hong Kong-based AI company. He is also a language enthusiast who in his free time follows language bloggers on social media, absorbing their linguistic insights. Through his research, he has learned that many languages are disappearing. Even though there are around 7,000 languages still in use globally, according to UNESCO's World Atlas of Languages, only 10 boast more than 200 million speakers. UNESCO has said that a language vanishes every two weeks, or roughly 25 each year.
Language Disparity in the Online World
According to a report by W3Techs, a company that conducts global web surveys, English content has dominated the internet over the last decade, accounting for 49.4 percent of websites as of Nov 26. That is more than eight times the share of Spanish, the second-most prevalent online language at 6 percent.
Meanwhile, the proportion of web pages that use Chinese, the second-most spoken language in the physical world with more than 1.1 billion speakers, has plummeted from 4.3 percent in 2013 to 1.2 percent in 2024.
In the realm of AI, prominent large language models, or LLMs, like OpenAI's GPT-4, Google's Gemini and Anthropic's Claude all use English as their main language.
The Challenge of Data Scarcity
The cornerstone of training AI lies in data. A significant hurdle in advancing AI's linguistic prowess is the scarcity of data available in numerous languages, Chan said. Of about 7,000 languages spoken worldwide, nearly 99 percent are considered low-resource languages, as the data available for computational processing and analysis is limited.
Endangered Languages and Preservation Efforts
While commercial demand ensures the survival of languages with a large offline population, those with few speakers, limited commercial interest, and insufficient technological research are at risk of becoming endangered both online and offline, Chan warned.
By this measure, even dialects spoken by substantial populations, like Minnan and Hakka, which are primarily used in southern China, face a fight for survival as fewer young people are learning them.
Preserving Cultural Diversity through Technology
With hundreds of indigenous languages in Africa at risk of extinction, Votee has worked with clients on the continent to assist in language preservation efforts. However, significant challenges stem from Africa's political instability, limited technological proficiency and insufficient technology infrastructure.
Chan proposed that global tech firms, language-focused NGOs, linguists and language enthusiasts collaborate to form communities for mutual support and to encourage the preservation of endangered languages.