Alibaba's DAMO Academy Unveils LLMs Designed For Southeast Asia

The LLMs are optimized to process Southeast Asian languages and can reflect cultural nuances
The models address demand for localized LLMs as an alternative to LLMs trained mainly on English and Latin-based datasets

Alibaba Group's research institute DAMO Academy unveiled on Monday two large language models designed to reflect Southeast Asia's diverse linguistic and cultural landscape.

DAMO Academy released a model called SeaLLM and a conversationally fine-tuned version called SeaLLM-chat.

The models, which both come in two sizes, 13 billion and 7 billion parameters, are capable of processing local languages including Vietnamese, Indonesian, Thai, Malay, Khmer, Lao, Tagalog, and Burmese. Both can perform tasks in ways that better align with local customs, style and legal stipulations.

The initiative comes amid rising demand for more locally relevant LLMs from Southeast Asian countries. Singapore, for example, has created a $52 million AI initiative to develop the Lion City's research and engineering capabilities in multi-modal LLMs.

Alibaba said the launches were designed to create more inclusive and regionally relevant LLMs that reflect the cultural nuances of Southeast Asia. Most LLMs originate from Western countries and are trained on datasets that skew disproportionately toward English and other Latin-based languages.

“This innovation is set to hasten the democratization of AI, empowering communities historically underrepresented in the digital realm,” said Bing Lidong, Director of the Language Technology Lab at Alibaba's DAMO Academy.

DAMO Academy has open-sourced the models on Hugging Face, making them freely available for research and commercial use.

Bridging the Linguistic Divide

Trained on a diverse set of Southeast Asian languages, SeaLLM can interpret and process text in non-Latin languages up to nine times longer than models like ChatGPT can, and has more complex task execution capabilities. It outperforms most open-source LLMs in understanding a wide spectrum of subjects, ranging from science, chemistry and physics to economics, in the region's languages.

The model outperforms other existing models at machine translation between English and low-resource languages, meaning those with limited data for training conversational AI systems, such as Lao and Khmer. It also performs on par with state-of-the-art models in most high-resource languages, meaning those for which many training data sources exist, such as Vietnamese and Indonesian.

Through pre-training enhancements and culturally tailored fine-tuning, the AI assistant powered by SeaLLM-chat can comprehend, respect and accurately reflect the cultural context of the languages in the region, including social norms, linguistic preferences and legal considerations.

“This initiative has the potential to unlock new opportunities for millions who speak languages beyond English and Chinese. Alibaba's efforts in championing inclusive technology have now reached a milestone with SeaLLM's launch,” said Luu Anh Tuan, Assistant Professor in the School of Computer Science and Engineering (SCSE) at Nanyang Technological University, Alibaba's long-term partner in multi-language AI studies.

The culturally attuned LLMs can also help companies that do business in Southeast Asian markets build their own chatbot assistants.

This article was originally published on Alizila, written by Ivy Yu.
