<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://seacrowd.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://seacrowd.org/" rel="alternate" type="text/html" /><updated>2026-02-01T23:55:33+00:00</updated><id>https://seacrowd.org/feed.xml</id><title type="html">SEACrowd: AI for SEA</title><subtitle>AI for Southeast Asia, by Southeast Asians</subtitle><entry><title type="html">From Tagalog to Filipino: Deciphering Language and Identity</title><link href="https://seacrowd.org/posts/from-tagalog-to-filipino.html" rel="alternate" type="text/html" title="From Tagalog to Filipino: Deciphering Language and Identity" /><published>2026-01-27T00:00:00+00:00</published><updated>2026-01-27T00:00:00+00:00</updated><id>https://seacrowd.org/posts/from-tagalog-to-filipino</id><content type="html" xml:base="https://seacrowd.org/posts/from-tagalog-to-filipino.html"><![CDATA[<p>If you’ve ever spent time with Filipinos, you’ve probably heard the words “Tagalog” and “Filipino” used almost interchangeably. For many, the difference feels minor, but behind those two labels is a long history shaped by migration, colonization, nationalism, and modern technology. Understanding the distinction isn’t just about semantics; it’s about how the Philippines defines its national identity and, increasingly, how that identity is represented in digital spaces and an AI-centric world.</p>

<h2 id="from-austronesian-roots-to-a-colonial-transformation">From Austronesian roots to a colonial transformation</h2>

<p>To understand Filipino, we start with its base: Tagalog. As part of the <a href="https://www.britannica.com/topic/Philippine-languages">Austronesian language family</a>, Tagalog is related to Indonesian, Malay, and many languages across Southeast Asia and the Pacific. Its story traces back more than 4,000 years to Proto-Austronesian communities in Taiwan.</p>

<p>Long before the Spaniards arrived in the 16th century, Tagalog was already the main language of Central and Southern Luzon. Continuous trade with Chinese, Indian, and Southeast Asian cultures enriched the vocabulary and ideas embedded in the language. At the time, the Tagalogs used the indigenous Baybayin script for <a href="https://narrastudio.com/blogs/journal/baybayin-the-ancient-filipino-script-lives-on">personal communication, poetry, and short messages</a>, while longer histories and epics were <a href="http://paulmorrow.ca/bayeng1.htm">preserved through a rich oral tradition</a>.</p>

<p><img src="/assets/images/posts/filipino-tagalog/2_baybayin.png" alt="The Baybayin Script" width="500" /></p>

<p><em><a href="https:/64.media.tumblr.com/tumblr_lutfskLZxZ1qkow7co1_r1_1280.png">Fig 1</a>. An example of the pre-colonial Baybayin script.</em></p>

<p>Spanish colonization (1565-1898) systematically sidelined Baybayin, but notably, it displaced the script without replacing the tongue. As the friars prioritized evangelization over assimilation, they used the Latin alphabet to <a href="https://hiphilangsci.net/2013/07/10/early-writing-and-printing-in-the-philippines/">translate religious texts into local languages</a> rather than teaching Spanish to the masses. With official documents and church texts shifting to Latin-based Tagalog—and the Latin script proving more practical for new Spanish-influenced sounds—Baybayin quickly lost relevance and faded from use.</p>

<p><img src="/assets/images/posts/filipino-tagalog/3_thomasite.jpg" alt="The Thomasites" width="500" /></p>

<p><em><a href="https:/blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidMLgbsFg5RsnKUiorIfVb8K4s0p-v8Z-xGuN_OwbTz-z6hZJ_6BsOWBv0kMrhUZDqG8h9BH1dLwrr-aZi8qozAj_EXTpa-dI3oKlm5xrCqoY9N9N9ZFaSV-nkGsN-K_DDIzFpQxvZHEB5/s1600/thomasite.jpg">Fig 2</a>. A female Thomasite teacher instructing Filipino students</em></p>

<p>Following the Spanish-American War in 1898 up until the end of the American occupation in 1946, the United States pursued a radically different strategy: <a href="https://en.wikipedia.org/wiki/Thomasites#:~:text=Our,schools">assimilation through mass education</a>. Unlike the exclusive nature of Spanish rule, the Americans established a public schooling system through <a href="https://elibrary.judiciary.gov.ph/thebookshelf/showdocs/28/15605#:~:text=SEC%2E%2014,teachers">Act No. 74 of 1901</a>, which enforced English as the sole medium of instruction. By making English the new entry point for commerce and bureaucracy, the Americans effectively institutionalized a foreign language in mere decades, reshaping the linguistic landscape far more aggressively than Spain and pushing Tagalog into a new role during <a href="https://ncca.gov.ph/about-culture-and-arts/in-focus/andres-bonifacio-and-the-1896-revolution">the rise of nationalism</a>.</p>

<h2 id="constructing-a-national-language">Constructing a national language</h2>

<p>The dream of a unified Filipino nation has always been deeply tied to the dream of a single national language. In 1937, President Manuel L. Quezon declared Tagalog as the <a href="https://elibrary.judiciary.gov.ph/thebookshelf/showdocs/5/77655#:~:text=the%20selection%20of%20Tagalog%20as%20the%20basis%20of%20the%20national%20language%20as%20it%20has%20been%20found%20to%20be%20used%20and%20accepted%20by%20the%20greatest%20number%20of%20Filipinos">basis for the national language</a>. This decision was driven by a mix of commercial practicality and political dominance. As the language of the capital, Tagalog had long served as the default medium for trade. However, it also reflected the influence of the region’s elite. With the seat of government and the center of the revolution historically situated in the Tagalog regions, the language naturally became the language of power (refer to the <a href="https://en.wikisource.org/wiki/Provisional_Constitution_of_the_Philippines_(1897)#:~:text=Tagalog%20shall%20be%20the%20official%20language%20of%20the%20Republic%2E">revolutionary constitution</a>).</p>

<p>But in a country home to over 175 distinct languages, this decision was met with controversy. Speakers of Cebuano, Ilocano, and other major languages felt excluded, sparking a debate about “<a href="https://www.studocu.com/ph/document/cagayan-national-high-school/filipino/jmmd0190487-possible-referencesource/32948530">Tagalog imperialism</a>”—a dynamic that can be likened to the elevation of the Beijing dialect to Standard Mandarin in <a href="https://www.youtube.com/watch?v=qzr09A6Zb48">unifying 20th century China</a>.</p>

<p>To create a more inclusive national language, the term “Pilipino” was introduced in 1959. But it wasn’t until the 1987 Constitution that “Filipino” became the official national language, mandating that it be <a href="https://lawphil.net/consti/cons1987.html#:~:text=developed%20and%20enriched"><em>developed and enriched</em></a> by all Philippine languages. This mandate represented a big conceptual leap. The 1987 Constitution imagined Filipino as an inclusive national language drawing from all Philippine languages, but the reality is that <a href="https://asianstudies.cornell.edu/research/tagalog-filipino#:~:text=Filipino%20and%20Tagalog%20are%20varieties%20or%20dialects%20of%20the%20same%20language">Filipino remains fundamentally Tagalog</a>; its grammar, syntax, and core vocabulary are still Tagalog at their base, even as it absorbs words from other languages.</p>

<h2 id="does-filipino-represent-all-filipinos">Does ‘Filipino’ represent all Filipinos?</h2>

<p>Today, the linguistic environment is complex. While Filipino is the national language, is it spoken everywhere? Yes and no. Thanks to mass media and the education system, almost all Filipinos understand it. However, in regions like Visayas and Mindanao (two of the country’s three major island groups), it often serves as a <em>lingua franca</em> for speaking with outsiders (e.g., in business), whereas the local vernacular dominates daily life.</p>

<p><img src="/assets/images/posts/filipino-tagalog/4_langmap.png" alt="Linguistic Map of the Philippines" width="500" /></p>

<p><em><a href="https:/www.geocurrents.info/wp-content/uploads/2016/01/Maritime-Linkages-Philippine-Languages-Map.png">Fig 3</a>. A map showing the linguistic complexity of the Philippine archipelago.</em></p>


<p>In fact, resentment toward the centralization of Tagalog persists. In areas where Cebuano (part of the Bisaya languages) is dominant, pushback against “Imperial Manila” remains tangible, with some <a href="https://e-journal.usd.ac.id/index.php/LLT/article/view/2581#:~:text=favored%20English%20to%20be%20the%20language%20used%20for%20formal%20communication">locals preferring to use English</a> rather than Filipino in formal settings. This friction has even migrated to digital spaces, where the viral “Tagalog vs. Bisaya” meme trend frequently resurfaces these long-standing regional rivalries.</p>

<p>This linguistic tension has found new life online, where language itself is evolving faster than ever. Social media has become a laboratory for linguistic evolution: Gen Z slang like <em>charot</em> (just kidding) and <em>sana all</em> (wish it were like that for everyone), among others, spreads virally across platforms. TikTok, Twitter/X, and Facebook comment sections show Filipino absorbing not just English loan words, but internet-native expressions that would make purists wince. These digital spaces are where the “development and enrichment” mandated by the constitution actually happens—not in government committees, but through millions of Filipinos creating language in real time.</p>

<p>Furthermore, the Philippine linguistic ecosystem outside the capital is fragile. While major regional languages (like Ilocano, Hiligaynon, Bikolano, etc.) remain robust, the pressure of learning Filipino and English is impacting smaller languages. Dozens of indigenous languages, such as those spoken by the Itneg and Batak communities, are currently <a href="https://www.facebook.com/photo/?fbid=134899795998833&amp;set=a.116046591217487">classified as dying or endangered</a> as younger generations shift away from them due to the <a href="https://www.sil.org/system/files/reapdata/79/35/83/79358370809751253082780911549103284404/Pascasio.pdf">cultural and economic hegemony of the two national languages</a>.</p>

<h2 id="tagalog-vs-filipino-whats-the-actual-difference">Tagalog vs Filipino: What’s the actual difference?</h2>

<p>In practice, linguists generally agree on this:</p>

<blockquote>
  <p><strong>Filipino</strong> = the standardized, modern, national-language variety of Tagalog.</p>
</blockquote>

<p>Structurally, they share the same grammar. The difference lies in:</p>

<h3 id="1-vocabulary-breadth">1. Vocabulary breadth</h3>

<p>Filipino absorbs words not only from Tagalog, but also Cebuano, Ilocano, Spanish, English, and more.</p>

<ul>
  <li>
    <p><strong>Katarungan</strong> (Justice) - adopted from the Bisaya root <em>tarong</em> (straight/right), used interchangeably with the Spanish-derived <em>hustisya</em></p>
  </li>
  <li>
    <p><strong>Trabaho</strong> (Work) - a loan word from the Spanish word: <em>trabajo</em></p>
  </li>
  <li>
    <p><strong>Kompyuter</strong> (Computer) - a loan word ‘Filipinized’ from the English word: computer</p>
  </li>
</ul>

<h3 id="2-alphabet-expansion">2. Alphabet expansion</h3>

<ul>
  <li>
    <p><strong>Baybayin:</strong> 17 characters</p>
  </li>
  <li>
    <p><strong>Abakada (Tagalog):</strong> 20 letters</p>
  </li>
  <li>
    <p><strong>Modern Filipino:</strong> 28 letters; adding F, J, Ñ, Z, among others, to accommodate a wider range of words</p>
  </li>
</ul>

<h3 id="3-everyday-usage">3. Everyday usage</h3>

<p>Modern Filipino reflects how people actually speak today, full of code-switching between Filipino, English, and regional languages.</p>

<p><strong>Examples:</strong></p>

<ul>
  <li>
    <p><em>Mag-picture tayo.</em> (Let’s take a picture together.) [Filipino+English]</p>
  </li>
  <li>
    <p><em>Na-late ako sa work dahil sa traffic.</em> (I was late for work because of traffic.) [Filipino+English]</p>
  </li>
  <li>
    <p><em>I am going to the market, bibili lang ako ng pagkain, para naay makaon sa balay.</em> (I am going to the market, I will just buy some food, so there is something to eat at home.) [Filipino+English+Cebuano]</p>
  </li>
</ul>

<p>For many Filipinos, especially in urban areas, this trilingual juggling act is the norm. A Cebuano speaker in Manila might switch between Cebuano with family, Filipino with colleagues, English in formal emails, and a fluid mix of all three when texting friends. For NLP, this means our “Filipino” data isn’t just Tagalog with borrowed words, it’s a dynamic ecosystem where boundaries between languages blur constantly.</p>
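<p>To see why such mixed sentences trip up pipelines that assume one language per document, here is a toy per-token language tagger. This is a sketch only: the word lists are tiny, hand-made assumptions for illustration, not a real lexicon or a production language identifier.</p>

```python
# Toy per-token language tagging for code-switched Filipino text.
# The lexicons below are illustrative assumptions, not real resources.
CEBUANO = {"naay", "makaon", "balay"}
FILIPINO = {"bibili", "lang", "ako", "ng", "pagkain", "para", "sa"}

def tag_tokens(sentence):
    """Label each token as 'ceb', 'fil', or 'eng' (crude fallback)."""
    tags = []
    for token in sentence.lower().strip(".").split():
        if token in CEBUANO:
            tags.append((token, "ceb"))
        elif token in FILIPINO:
            tags.append((token, "fil"))
        else:
            tags.append((token, "eng"))  # anything unknown falls back to English
    return tags

tagged = tag_tokens("bibili lang ako ng pagkain para naay makaon sa balay")
print(tagged)
```

<p>Even this crude sketch shows three languages interleaving inside one clause, which is exactly what document-level language labels flatten away.</p>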

<h2 id="why-this-matters-for-nlp">Why this matters for NLP</h2>

<p>For NLP researchers, the distinction between Tagalog and Filipino isn’t about grammar; structurally, the two follow the same system. What matters is knowing what our data actually contains.</p>

<p>Most online Philippine text, whether scraped from <a href="https://github.com/commoncrawl/web-languages/blob/main/living/tagalog.md">Common Crawl</a> or published as a dataset on HuggingFace, uses the ISO code <code class="language-plaintext highlighter-rouge">tl</code> (Tagalog). But the actual content inside is overwhelmingly Filipino, the modern, evolving variety that naturally encompasses code-switching and usage of loan words.</p>

<p><img src="/assets/images/posts/filipino-tagalog/5_datacard.png" alt="Dataset Card Screenshot" width="500" /></p>

<p><em><a href="https:/huggingface.co/datasets/uonlp/CulturaX#:~:text=0%2E00-,69,0%2E00">Fig 4</a>. A standard dataset card showing the language code ‘tl’ (Tagalog) but labeled Filipino</em></p>


<p>In practice, this means we’re already working with Filipino, even though the metadata in <a href="https://huggingface.co/datasets/uonlp/CulturaX#:~:text=0%2E00-,69,0%2E00">data cards</a>, language tags in model configurations, and standard ISO codes are universally labeled “Tagalog” (<code class="language-plaintext highlighter-rouge">tl</code>). We don’t need to change the ISO standard or relabel existing corpora. What’s important is recognizing the distinction so that we understand the linguistic reality behind the data we use.</p>
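<p>Concretely, working with the existing tag just means filtering on it while keeping the caveat in mind. A minimal sketch (the records and the <code class="language-plaintext highlighter-rouge">lang</code> field name are made up for illustration):</p>

```python
# Hypothetical corpus rows; real dataset metadata fields vary by source.
corpus = [
    {"text": "Na-late ako sa work dahil sa traffic.", "lang": "tl"},
    {"text": "Good morning everyone!", "lang": "en"},
    {"text": "Mag-picture tayo.", "lang": "tl"},
]

# Everything tagged "tl" is kept, even though much of it is modern,
# code-switched Filipino rather than "pure" Tagalog.
filipino_like = [row["text"] for row in corpus if row["lang"] == "tl"]
print(len(filipino_like))  # 2
```

<p>The filter itself is trivial; the point is that the two “tl” rows are code-switched Filipino, and any model trained on them inherits that reality regardless of the label.</p>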

<h2 id="filipino-as-a-linguistic-bridge">Filipino as a linguistic bridge</h2>

<p>Just as Indonesian serves as an anchor for regional languages like Javanese or Sundanese, Filipino can serve as the pivot language for Philippine NLP. Because it shares core grammar and cognates with <a href="https://www.researchgate.net/publication/396574502_Morphological_Analysis_of_Tagalog_Hiligaynon_and_Cebuano_Languages">Cebuano, Hiligaynon</a>, Waray, Kapampangan, Bikol, and many others, a strong Filipino model creates a foundation that accelerates progress for every other language in the country.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">English</th>
      <th style="text-align: left">Tagalog (Pivot)</th>
      <th style="text-align: left">Cebuano</th>
      <th style="text-align: left">Ilocano</th>
      <th style="text-align: left">Kapampangan</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Eye</strong></td>
      <td style="text-align: left">Mata</td>
      <td style="text-align: left">Mata</td>
      <td style="text-align: left">Mata</td>
      <td style="text-align: left">Mata</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Two</strong></td>
      <td style="text-align: left">Dalawa</td>
      <td style="text-align: left">Duha</td>
      <td style="text-align: left">Dua</td>
      <td style="text-align: left">Adwa</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>House</strong></td>
      <td style="text-align: left">Bahay</td>
      <td style="text-align: left">Balay</td>
      <td style="text-align: left">Balay</td>
      <td style="text-align: left">Bale</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>New</strong></td>
      <td style="text-align: left">Bago</td>
      <td style="text-align: left">Bag-o</td>
      <td style="text-align: left">Baro</td>
      <td style="text-align: left">Bayu</td>
    </tr>
  </tbody>
</table>

<p><em>Table 1. A lexical comparison showing high similarity (cognates) across major Philippine languages.</em></p>

<p>This allows researchers to utilize <a href="https://www.geeksforgeeks.org/nlp/transfer-learning-in-nlp/">transfer learning</a>, effectively “recycling” what an AI learns from the data-rich Filipino language and applying it to regional languages that have less data. Instead of building models from scratch for every dialect and language, a good Filipino system can accelerate the creation of translation tools and datasets for the rest of the archipelago.</p>
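<p>One intuition behind this transfer is visible right in Table 1: many cognate pairs differ by only a character or two. A small sketch that quantifies this surface similarity with normalized Levenshtein distance (pure Python; the word pairs are taken from Table 1 and are illustrative, not a rigorous measure of transferability):</p>

```python
# Normalized edit-distance similarity between Tagalog words and their
# regional cognates from Table 1 (higher = more similar surface form).
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    return 1 - levenshtein(a, b) / max(len(a), len(b))

pairs = [("mata", "mata"), ("bahay", "balay"), ("bago", "bag-o"), ("dalawa", "duha")]
for tagalog, cebuano in pairs:
    print(f"{tagalog} ~ {cebuano}: {similarity(tagalog, cebuano):.2f}")
```

<p>High lexical overlap like this is one reason embeddings and subword vocabularies learned on Filipino tend to carry over to its sister languages.</p>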

<p>Understanding the Tagalog-Filipino distinction helps us build models that reflect how people actually communicate today, without needing to replace existing standards.</p>

<h2 id="the-double-edged-sword-progress-and-preservation">The double-edged sword: Progress and preservation</h2>

<p>Here’s the cold truth: as we build better Filipino NLP models, we might be accelerating the very <a href="https://www.nature.com/articles/s41599-023-01911-8#:~:text=caused%20a%20language%20shift%20and%20language%20extinction">problem we hoped to solve</a>. With every successful Filipino chatbot, translation tool, or voice assistant, we reinforce Filipino’s dominance, creating economic incentives and dependencies that push communities away from their local languages.</p>

<p>But the same technology could be their lifeline. If Filipino serves as a true pivot language, transfer learning makes it economically feasible to build tools for languages with tiny datasets. Imagine a Filipino to Itneg translation model helping grandparents connect with their <em>apo</em> (grandchildren) who never learned the language, or documentation systems preserving Batak oral traditions before the last fluent speakers pass away.</p>

<p><img src="/assets/images/posts/filipino-tagalog/6_bridge.jpg" alt="The San Juanico Bridge" width="500" /></p>

<p><em><a href="https:/gttp./assets/images/posts/filipino-tagalog/.tshiftcdn.com/270079/x/0/san-juanico-bridge-in-leyte.jpg">Fig 5</a>. The San Juanico Bridge spans the strait between Samar and Leyte—a symbol of connection in an archipelago of over 7,600 islands.</em></p>


<p>The choice isn’t inevitable. It depends on whether researchers treat Filipino as the <em>destination</em> or as the <em>bridge</em>. If we stop at building excellent Filipino models, we’ve simply digitized linguistic hegemony. But if we use Filipino’s advantage to pull smaller languages into the digital age, we might preserve what otherwise would be lost.</p>

<h2 id="preserving-identity-in-the-algorithm">Preserving identity in the algorithm</h2>

<p>The difference between Tagalog and Filipino isn’t just academic trivia—it reflects how the Philippines defines itself as a nation. Acknowledging Filipino as the national, standardized, evolving language means acknowledging our linguistic diversity while recognizing that its structure remains inherently rooted in Tagalog.</p>

<p>For NLP researchers, this awareness shapes everything we build. It helps us interpret our data honestly, design models that reflect how Filipinos actually communicate, and most critically, ensures that as we bring Philippine languages into the age of AI, we’re creating tools that connect rather than divide.</p>

<p>The algorithms we create today will shape which languages survive tomorrow. That’s not just a technical challenge, it’s a responsibility to the 175+ languages that make up who we are.</p>]]></content><author><name>Isaiah Flores</name></author><category term="Society &amp; AI" /><category term="linguistics" /><category term="nlp" /><category term="filipino" /><category term="tagalog" /><category term="history" /><category term="data-culture" /><summary type="html"><![CDATA[Is it Tagalog or Filipino? We explore how a regional language became a national standard, and why this evolution matters for training robust, inclusive NLP models.]]></summary></entry><entry><title type="html">Growing SEA-Native AI Research Talent: Retrospective on the SEACrowd Apprentice Program 2024–25</title><link href="https://seacrowd.org/posts/apprentice-program-retrospection.html" rel="alternate" type="text/html" title="Growing SEA-Native AI Research Talent: Retrospective on the SEACrowd Apprentice Program 2024–25" /><published>2025-12-03T00:00:00+00:00</published><updated>2025-12-03T00:00:00+00:00</updated><id>https://seacrowd.org/posts/apprentice-program-retrospection</id><content type="html" xml:base="https://seacrowd.org/posts/apprentice-program-retrospection.html"><![CDATA[<p>I’m Samuel Cahyawijaya, a Member of Technical Staff at <a href="https://cohere.com/">Cohere</a> and one of the initiators of <a href="/about">SEACrowd</a>.</p>

<p>Over 2024-2025, I served as a lead mentor for the <strong><a href="/apprenticeship">SEACrowd Apprentice Program</a></strong>. I watched early-career researchers across SEA go from “I’m curious about AI research” to co-authoring papers and releasing a multilingual, culturally grounded dialogue dataset.</p>

<p>This post shares what we built, what was hard, what we’re changing, and how you can <a href="/apprenticeship">join Batch 2026</a>.</p>

<p>TL;DR:</p>

<ul>
  <li>We ran our first SEACrowd Apprentice Program cohort focused on real research.</li>
  <li>Along the way, we hit hard problems: uneven commitment, skill gaps inside teams, mentor bandwidth issues, and losing momentum over a long timeline.</li>
  <li>We’re changing the structure for <strong><a href="/apprenticeship">Batch 2026</a></strong> with clearer selection, midterm and end-term milestones, external evaluations, and better-matched teams.</li>
  <li><a href="/apprenticeship">Application is open</a> until Dec 17, 2025 11:59PM UTC-12. Go to our page for more details.</li>
</ul>

<h2 id="about-seacrowd">About SEACrowd</h2>

<p>SEACrowd is a community-run initiative focused on AI research and development for Southeast Asia. Rather than just debating SEA’s lack of representation, we <a href="/resources">build models and datasets</a>, and <a href="/publications">do research</a> that center SEA languages, cultures, and realities.</p>

<p>For this, we bring together researchers, students, and practitioners who care about SEA languages and communities. Read more about us <a href="/about">here</a>.</p>

<h2 id="why-we-started-the-seacrowd-apprentice-program">Why We Started the SEACrowd Apprentice Program</h2>

<p>As SEACrowd grew, we kept bumping into the same problem:</p>

<blockquote>
  <p>Many talented people in Southeast Asia <em>want</em> to do AI research but don’t see a clear path from “interested” to “actually doing research that gets published.”</p>
</blockquote>

<p>The gaps are recurring:</p>

<ol>
  <li><strong>Lack of mentorship.</strong> It’s hard to get regular feedback from experienced researchers who understand both the technical side <em>and</em> the regional context.</li>
  <li><strong>No structured, low-barrier entry point.</strong> Many people in the region see research and postgraduate study as distant or already out of reach, especially when pitted against peers from better-resourced countries and institutions.</li>
  <li><strong>Limited exposure to full projects.</strong> Many only see pieces of the workflow (e.g., coding, reading papers), not the entire journey from idea → experiments → writing → submission.</li>
</ol>

<p>These issues motivated us to create a program that would:</p>

<ol>
  <li>Design around real, scoped, publishable projects</li>
  <li>Pair experienced researchers with early-career talent across Southeast Asia for hands-on learning</li>
</ol>

<p>We wanted this to go far beyond a reading group or workshop: to meaningfully upgrade mentees’ research skills and strengthen their chances for PhD applications and AI research careers. And all of this concretized in our first <strong>SEACrowd Apprentice Program</strong> cohort in 2024-25.</p>

<h2 id="what-we-accomplished-in-batch-202425">What We Accomplished in Batch 2024–25</h2>

<p>Looking back at the first cohort, I’m genuinely proud of what the teams pulled off. In one cycle, mentees contributed to:</p>

<ol>
  <li>A new technique for controlling language use in multilingual LLMs
→ <a href="https://aclanthology.org/2025.mrl-main.30/"><em>Language Surgery in Multilingual Large Language Models</em></a>, published at <a href="https://sigtyp.github.io/ws2025-mrl.html">Multilingual Representation Learning (MRL) Workshop @ EMNLP 2025</a></li>
  <li>A way to recover linguistic typology from language model behavior
→ <a href="https://aclanthology.org/2025.mrl-main.29/"><em>Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations</em></a> at <a href="https://sigtyp.github.io/ws2025-mrl.html">MRL 2025</a></li>
  <li>A culturally grounded dialogue dataset &amp; benchmark for SEA languages
→ <a href="https://arxiv.org/abs/2508.07069"><em>SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages</em></a></li>
</ol>

<p>Click on each project below to have a closer look!</p>

<div class="accordion" id="projects-accordion">
    
      
      
      







<div class="accordion-item">
  <h3 class="accordion-header" id="heading-projects-accordion-1">
    <button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#projects-accordion-1" aria-expanded="false" aria-controls="projects-accordion-1" data-bs-parent="#projects-accordion">
      Project 1: Language Surgery in Multilingual Large Language Models
    </button>
  </h3>
  <div id="projects-accordion-1" class="accordion-collapse collapse " aria-labelledby="heading-projects-accordion-1" data-bs-parent="#projects-accordion">
    <div class="accordion-body">
      <p>👥 <strong>Mentees:</strong> Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong</p>

<p>→ <a href="https://aclanthology.org/2025.mrl-main.30/">Read paper, published at MRL 2025</a></p>

<h4 id="the-problem">The problem</h4>

<p>Multilingual LLMs support many languages, but anyone who’s tried prompting them in under-resourced languages has probably seen this:</p>

<p>You prompt in one language. The model partly responds in another (often English), drifts into mixed-language output, or drops in quality compared to high-resource languages like English or Chinese.</p>

<p>So, we wanted to understand:</p>

<ul>
  <li>How do multilingual LLMs organize languages internally?</li>
  <li>Can we influence which language they “think in” or generate at inference time without retraining the model?</li>
</ul>

<h4 id="the-approach">The approach</h4>

<p>The team studied how multilingual LLMs organize languages in their latent space (the internal continuous representation). Specifically, they looked at how representations shift when the model handles different languages, and how those representations relate across languages.</p>

<p>They developed <strong>Inference-Time Language Control (ITLC)</strong>, a method for nudging models to produce more consistent language outputs from cross-lingual prompts, <em>without</em> retraining.</p>
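<p>To make the intuition concrete, here is a schematic sketch of this style of inference-time steering. This is <em>not</em> the paper’s implementation: the vectors are tiny, made-up stand-ins for real model hidden states, and the steering rule shown (shifting along a difference-of-means “language direction”) is one common formulation of the general idea.</p>

```python
# Schematic sketch of latent-space language steering (illustrative only).
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Hypothetical pooled hidden states for prompts in two languages.
english_states = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
filipino_states = [[0.1, 0.9, 0.3], [0.2, 0.8, 0.5]]

# Direction pointing from the English region toward the Filipino region.
direction = [f - e for e, f in zip(mean(english_states), mean(filipino_states))]

def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the language direction at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

steered = steer([0.85, 0.15, 0.05], direction)
print([round(x, 2) for x in steered])  # → [0.15, 0.85, 0.4]
```

<p>In a real model the same operation would be applied to transformer hidden states via forward hooks, with <code class="language-plaintext highlighter-rouge">alpha</code> controlling how strongly the output language is nudged, and with no retraining required.</p>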

<h4 id="why-this-matters">Why this matters</h4>

<p>Low-resource languages often get 1) less stable outputs, 2) higher rates of language switching, and 3) lower average quality.</p>

<p>ITLC offers a way to get more stable behavior in underrepresented languages using existing models where retraining is expensive or impractical. It also helps us understand how languages are arranged inside these models.</p>

    </div>
  </div>
</div>

    
      
      
      







<div class="accordion-item">
  <h3 class="accordion-header" id="heading-projects-accordion-2">
    <button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#projects-accordion-2" aria-expanded="false" aria-controls="projects-accordion-2" data-bs-parent="#projects-accordion">
      Project 2: Entropy2Vec — Crosslingual Language Modeling Entropy
    </button>
  </h3>
  <div id="projects-accordion-2" class="accordion-collapse collapse " aria-labelledby="heading-projects-accordion-2" data-bs-parent="#projects-accordion">
    <div class="accordion-body">
      <p>👥 <strong>Mentees:</strong> Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady</p>

<p>→ <a href="https://aclanthology.org/2025.mrl-main.29/">Read paper, published at MRL 2025</a></p>

<h4 id="the-problem">The problem</h4>

<p>Linguistic typology—mapping how languages relate to each other—has traditionally come from decades of manual documentation, linguistic expertise, or resources like <a href="https://www.ethnologue.com/">Ethnologue</a> and <a href="https://glottolog.org/">Glottolog</a>.</p>

<p>Modern LLMs, meanwhile, have learned behavior across many languages through large-scale training, acquiring rich internal knowledge about which tokens are likely in which contexts.</p>

<p>Given that LLMs learn multiple languages through this large-scale training, we believe their internal knowledge could provide a fine-grained representation that captures the dynamics across languages as described in linguistic typology. But <strong>can the internal language knowledge of LLMs actually reflect typological relationships similar to those in Ethnologue and Glottolog?</strong></p>

<h4 id="the-approach">The approach</h4>

<p>The team extracted language representations from cross-lingual language modeling <strong>entropy</strong> collected from a diverse set of monolingual language models (LMs). Entropy is a measure of how unexpected the next token is, according to the model; if a model shows similar prediction patterns in two languages, those languages might be structurally similar.</p>

<p>They built <strong>Entropy2Vec</strong>, a language embedding where each language is a continuous vector in a latent space, and distances between vectors match linguistic typology relationships. Instead of relying on low-level handcrafted features like word order, syntax, or morphological complexity, they used the model’s own internal signal (i.e., how it predicts next tokens) to recover typological structure.</p>
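<p>The core quantity here is ordinary Shannon entropy over a next-token distribution. A minimal sketch (the probabilities below are made up; in the actual method they come from monolingual LMs run over text in each language):</p>

```python
# Shannon entropy of a model's next-token distribution, in bits.
import math

def entropy(probs):
    """H(p) = -sum p_i * log2(p_i), skipping zero-probability tokens."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A confident prediction (low entropy) vs. an uncertain one (high entropy).
confident = [0.9, 0.05, 0.03, 0.02]
uncertain = [0.25, 0.25, 0.25, 0.25]

print(round(entropy(confident), 3))
print(round(entropy(uncertain), 3))  # uniform over 4 tokens → 2.0 bits
```

<p>Collecting such entropies for many models over many languages yields, per language, a profile of “how surprised” each model is, and these profiles are what get embedded as language vectors.</p>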

<h4 id="why-this-matters">Why this matters</h4>

<p>The learned language vectors mimic typological structure (i.e., languages that are close in the typology tree end up close in the embedding space) and can regularize fine-tuning to improve adaptability to unseen languages.</p>

<p>For multilingual research in SEA, where many languages are under-documented, this line of work opens the door to:</p>

<ul>
  <li>reusing existing models to infer structural relationships</li>
  <li>informing better transfer learning strategies</li>
  <li>helping models generalize better to languages with very little data</li>
</ul>

    </div>
  </div>
</div>

    
      
      
      







<div class="accordion-item">
  <h3 class="accordion-header" id="heading-projects-accordion-3">
    <button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#projects-accordion-3" aria-expanded="false" aria-controls="projects-accordion-3" data-bs-parent="#projects-accordion">
      Project 3: SEADialogues — Culturally Grounded Dialogue Dataset
    </button>
  </h3>
  <div id="projects-accordion-3" class="accordion-collapse collapse " aria-labelledby="heading-projects-accordion-3" data-bs-parent="#projects-accordion">
    <div class="accordion-body">
      <p>👥 <strong>Mentees:</strong> Muhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim, Maxalmina Satria Kahfi</p>

<p>→ <a href="https://huggingface.co/datasets/SEACrowd/SEADialogues">View dataset on HuggingFace</a> &amp; <a href="https://arxiv.org/abs/2508.07069">Read paper</a></p>

<h4 id="the-problem">The problem</h4>

<p>Even as LLMs improve, they perform worse in underrepresented languages and frequently fail to align with local norms, values, and expectations in multi-turn conversations.</p>

<h4 id="the-approach">The approach</h4>

<p>The team built a culturally grounded, multi-turn dialogue dataset for SEA languages. They collected, curated, and annotated conversations that reflect real cultural practices—local forms of politeness, family structures, community dynamics. The resulting dataset is suitable for both training and evaluating value-aware conversational models.</p>

<h4 id="why-this-matters">Why this matters</h4>

<p>SEADialogues provides a targeted resource for improving conversational AI in SEA languages, a benchmark for evaluating value alignment in culturally rich settings, and a starting point for future work on dialogue safety and social norms in SEA contexts.</p>

    </div>
  </div>
</div>

    
  </div>

<h2 id="what-we-learned-and-what-were-changing">What We Learned (and What We’re Changing)</h2>

<p>Running the first cohort was encouraging and humbling. Beyond the successes, we hit challenges around time, skill differences, mentor bandwidth, and sustaining momentum. Here’s what we observed and what we’re changing:</p>

<h3 id="1-commitment-and-time-constraints">1. Commitment and Time Constraints</h3>

<p>Mentees juggled full-time obligations, different experience levels, and life events. Even when motivation was high, sustained commitment across several months on a full research project proved challenging.</p>

<p><strong>What we’re changing:</strong></p>

<ul>
  <li>Refined selection criteria assessing prior exposure to coding and machine learning (ML), time availability, and clarity of motivation.</li>
  <li>Clearer upfront communication about timeline and commitment expectations, especially where the heavy lifts are, and what level of commitment is realistic.</li>
</ul>

<h3 id="2-skill-gaps-within-teams">2. Skill Gaps Within Teams</h3>

<p>Mentees had varying experience: some with prior NLP research, others new but determined. Diversity is a strength, but when one or two people carry most of the technical load, they burn out. When others feel they’re slowing the team, they disengage.</p>

<p><strong>What we’re changing:</strong></p>

<ul>
  <li>More thoughtful team formation with shared baselines and lead mentees for task distribution.</li>
  <li>Projects structured so that different contributions matter—design, implementation, analysis, writing. The goal is to make it easier for mentees with different strengths to stay engaged and grow while still being involved in most of the steps.</li>
</ul>

<h3 id="3-mentor-availability-and-consistency">3. Mentor Availability and Consistency</h3>

<p>Mentors are busy professionals and researchers who volunteer their time to train SEA talent.</p>

<p>We also faced challenges with scheduling across time zones, supporting mentees when mentors got busy, and ensuring consistent feedback and guidance when a team had more than one mentor.</p>

<p><strong>What we’re changing:</strong></p>

<ul>
  <li>More explicit mentor roles within each team</li>
  <li>Coordinated mentor expectations to avoid conflicting directions</li>
  <li>Built-in redundancy so teams aren’t stranded if one mentor gets overloaded</li>
  <li>Shared guidelines for mentors about program goals and balancing ambition with feasibility</li>
</ul>

<p>We hope to make the mentee experience more coherent, regardless of which team they’re on.</p>

<h3 id="4-long-project-timeline-and-momentum">4. Long Project Timeline and Momentum</h3>

<p>Initially, we planned for the program to end whenever the write-up was ready for submission, but that open-ended timeline ended up hurting motivation and discipline.</p>

<p>On top of that, progress naturally came in waves—exciting starts, dragging middles, intense finals. Keeping everyone motivated through the uncertain middle was one of the toughest parts, especially when early experiments failed or teams weren’t sure their direction was still viable.</p>

<p><strong>What we’re changing:</strong></p>

<ul>
  <li>Fixed four-month window with clear checkpoints.</li>
  <li>Midterm presentations where teams present their progress to a broader group, get feedback from other mentors, and adjust scope and direction if needed.</li>
  <li>Clear end-of-term assessment with cross-team evaluation considering both research quality and team process.</li>
</ul>

<h2 id="so-was-it-worth-it">So Was It Worth It?</h2>

<p>The first SEACrowd Apprentice Program started as an experiment:</p>

<blockquote>
  <p>Could we meaningfully support early-career SEA researchers to produce real, impactful work in a relatively short time, across countries and time zones, online?</p>
</blockquote>

<p>The answer is: <strong>Yes, absolutely!</strong></p>

<p>Batch 2024-25 delivered two accepted workshop papers at MRL 2025, one new multilingual dialogue dataset, and a group of mentees who now have hands-on research experience, co-authorship in respected venues, and a stronger network across SEA and beyond.</p>

<p>We’re making structural changes for Batch 2026 because we now have a clearer picture of where early-career SEA researchers struggle, what support is most valuable, and how to design a program that respects constraints while demanding real work.</p>

<h2 id="how-to-join-batch-2026">How to Join Batch 2026</h2>

<p>If you’ve read this far, there’s a good chance you want to try this. Consider applying to the <strong><a href="/apprenticeship">SEACrowd 2026 Apprentice Program</a></strong> if you:</p>

<ul>
  <li>Are based in or connected to Southeast Asia</li>
  <li>Have basic Python coding and AI/ML/NLP familiarity</li>
  <li>Can commit consistent time over several months</li>
  <li>Are curious about research beyond using APIs or off-the-shelf models</li>
  <li>Want a path toward PhD programs or AI research roles</li>
</ul>

<p>We can’t guarantee a paper for every team, but we offer a structured environment, mentors who care about your growth, and an honest path from “I’m curious about AI research” to “I’ve actually done it.”</p>

<ul>
  <li>🗓️ Mentee application deadline: Dec 17, 2025 (UTC-12).</li>
  <li>🌐 For full details, requirements, and the application form, see our <a href="/apprenticeship">Apprentice Program page</a>.</li>
</ul>

<p>If you’d like to help empower the next generation of SEA researchers with SEACrowd, you can support us at <a href="/donate">this link</a>.</p>]]></content><author><name>Samuel Cahyawijaya</name></author><category term="Apprenticeship" /><category term="apprenticeship" /><category term="restropection" /><summary type="html"><![CDATA[What we learned from running the first SEACrowd Apprentice Program cohort, and how we're improving for Batch 2026.]]></summary></entry><entry><title type="html">Mind the Language Gap: Building an Inclusive AI Future for Southeast Asia</title><link href="https://seacrowd.org/posts/tfgi-mind-language-gap.html" rel="alternate" type="text/html" title="Mind the Language Gap: Building an Inclusive AI Future for Southeast Asia" /><published>2025-09-22T00:00:00+00:00</published><updated>2025-09-22T00:00:00+00:00</updated><id>https://seacrowd.org/posts/tfgi-mind-language-gap</id><content type="html" xml:base="https://seacrowd.org/posts/tfgi-mind-language-gap.html"><![CDATA[<p><em>Originally posted as a perspectives piece on <a href="https://techforgoodinstitute.org/blog/perspectives/mind-the-language-gap-building-an-inclusive-ai-future-for-southeast-asia/">Tech for Good Institute</a></em></p>

<p>Southeast Asia’s (SEA) digital ambitions are accelerating. Governments across the region are launching national AI strategies, digitizing public services, and investing in infrastructure to drive economic growth and social development. Initiatives like the <a href="https://academic.oup.com/oodh/article/doi/10.1093/oodh/oqae011/7611743">ATIPAN project</a> in the Philippines and <a href="https://www.equity.tech/work/mit-solve">MediBot</a> in Timor-Leste—bringing AI-powered healthcare to remote communities—demonstrate how transformative these technologies can be. Yet amid this momentum lies a quiet but urgent gap: the AI systems shaping SEA’s digital future often fail to represent its languages, cultures, or lived realities.</p>

<p>With over 100 ethnic groups who speak over a thousand living languages and dialects, SEA is one of the most linguistically diverse regions on Earth. Yet <a href="https://aclanthology.org/2024.findings-emnlp.766.pdf">modern AI systems are largely trained on English and a few other global languages</a>, leaving most SEA communities (speakers of Javanese, Tagalog, Burmese, and many others) underrepresented or <a href="https://www.techpolicy.press/how-multilingual-ai-can-protect-language-and-improve-global-technology/">invisible</a> in AI development.</p>

<p>Why does this matter? This language gap isn’t just technical; it’s a barrier to equitable digital inclusion,
as <a href="https://techforgoodinstitute.org/blog/perspectives/the-power-of-language-diversity-in-the-ai-era/">language is deeply tied to identity, trust, and nuance</a>. When AI systems misinterpret what users say—or fail to speak in ways that feel natural or respectful—they risk delivering harmful advice, misclassifying inputs, or simply being ignored. A major <a href="https://aclanthology.org/2024.emnlp-main.296/">2024 study published at EMNLP</a> confirmed this: <a href="https://techforgoodinstitute.org/research/research-commentary/the-rise-of-regional-language-models-in-southeast-asia/">large language models struggle with low-resource SEA languages</a>, leading to mistranslations, cultural misinterpretations, and even harmful outputs. For example, AI might misread a patient’s expression of fatigue as benign when, in cultural context, it hints at something more serious like vertigo. It might then offer non-localized recommendations, such as an urgent clinic visit, which may be helpful in Singapore but unrealistic or distressing in rural Indonesia, where the nearest doctor may be hours away.</p>

<p>For example, when <a href="https://www.simpol.ph/living-ai-bias-in-southeast-asia/">asked</a> to define the Filipino word <em>“sapot”</em>, ChatGPT interpreted it as “mental confusion” or “being tangled.” In context, it actually refers to a state of psychological unease—like a web of emotions—best understood through lived experience. Such misinterpretations aren’t trivial: they erode trust, miss nuance, and limit the emotional accuracy essential in mental health tools. While AI is scaling and improving mental health care across the English-speaking world, speakers of local languages are left out of those benefits.</p>

<p>As AI becomes more embedded in commerce, education, and governance, how can the region ensure that the future of AI truly reflects its people, languages, and lived realities?</p>

<h2 id="challenges--barriers">Challenges &amp; Barriers</h2>

<p>Southeast Asia’s push for inclusive AI faces four interconnected challenges: data scarcity, fragmented development, limited market incentives, and gaps in trust.</p>

<div class="accordion accordion-flush" id="challenges-accordion">
    
      
      
      







<div class="accordion-item">
  <h3 class="accordion-header" id="heading-challenges-accordion-1">
    <button class="accordion-button " type="button" data-bs-toggle="collapse" data-bs-target="#challenges-accordion-1" aria-expanded="true" aria-controls="challenges-accordion-1" data-bs-parent="#challenges-accordion">
      Data Scarcity
    </button>
  </h3>
  <div id="challenges-accordion-1" class="accordion-collapse collapse show" aria-labelledby="heading-challenges-accordion-1" data-bs-parent="#challenges-accordion">
    <div class="accordion-body">
      <p>Most SEA languages lack the large, high-quality datasets needed to train robust models. Where data exists, it is often scattered across informal sources and hard to standardise. The problem is worse for languages with strong oral traditions, which may have little or no digital footprint. Building quality datasets requires more than literal translation, which risks producing awkward “translationese”; it demands deep cultural grounding. A 2024 study by SEACrowd showed that popular global models underperform on SEA language tasks, particularly in generating natural-sounding text. Even when technically included, model performance for languages with limited digital presence falls behind, mirroring the hierarchy of data availability. Small language groups, already excluded from services, risk further marginalisation when AI tools bypass them.</p>

<p><img src="/assets/images/posts/tfgi-language-gap/culturally-unaware-ai.png" alt="Culturally-unaware image generation of 'people eating chicken pastel', a popular dish in the Philippines, using Stable Diffusion 3.5 Large" /></p>

<p><em>Culturally-unaware image generation of “people eating chicken pastel”, a popular dish in the Philippines, using stability.ai’s AI model <a href="https://huggingface.co/stabilityai/stable-diffusion-3.5-large">Stable Diffusion 3.5 Large</a>.</em></p>

    </div>
  </div>
</div>

    
      
      
      







<div class="accordion-item">
  <h3 class="accordion-header" id="heading-challenges-accordion-2">
    <button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#challenges-accordion-2" aria-expanded="false" aria-controls="challenges-accordion-2" data-bs-parent="#challenges-accordion">
      Fragmented Development
    </button>
  </h3>
  <div id="challenges-accordion-2" class="accordion-collapse collapse " aria-labelledby="heading-challenges-accordion-2" data-bs-parent="#challenges-accordion">
    <div class="accordion-body">
      <p>National AI strategies often prioritise infrastructure, data governance, and economic competitiveness, sidelining linguistic inclusion. Policy approaches vary widely between countries, and without regional coordination or data-sharing frameworks (e.g., common formats, ethical standards, pooled compute resources), efforts remain siloed. Some promising local initiatives are beginning to emerge. Thailand’s <a href="https://opentyphoon.ai">Typhoon</a> model, an accessible Thai-centric Large Language Model (LLM), was also trained on informal language to capture stylistic nuances that global models often overlook. Indonesia’s <a href="https://aclanthology.org/2023.findings-acl.868/">NusaCrowd</a> curated high-quality open datasets for low-resource languages, including widely spoken Javanese and Sundanese, as well as endangered tongues like Lampung and Buginese, capturing the breadth of linguistic diversity and cultural contexts such as code-switching and shifting levels of formality. Yet, without sustained investment and alignment with broader ASEAN strategies, their long-term support and interoperability remain limited. Regional collaboration is especially crucial in Southeast Asia, where many languages—like Malay, Khmer, and Hmong—cross national borders, and individual countries may lack the capacity to build full-stack AI pipelines independently.</p>

    </div>
  </div>
</div>

    
      
      
      







<div class="accordion-item">
  <h3 class="accordion-header" id="heading-challenges-accordion-3">
    <button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#challenges-accordion-3" aria-expanded="false" aria-controls="challenges-accordion-3" data-bs-parent="#challenges-accordion">
      Limited Market Incentives
    </button>
  </h3>
  <div id="challenges-accordion-3" class="accordion-collapse collapse " aria-labelledby="heading-challenges-accordion-3" data-bs-parent="#challenges-accordion">
    <div class="accordion-body">
      <p>Because big tech companies prioritise mainstream languages with existing commercial value, indigenous and low-resource languages are rarely incorporated into their models or business strategies. Meanwhile, local startups, academic labs, and grassroots groups often lack the computing power and funding needed to build language-specific tools. The region also faces a shortage of skilled NLP researchers and data engineers experienced in low-resource AI development, leaving the ecosystem under-resourced.</p>

<p>For perspective, <a href="https://arxiv.org/abs/2504.05747">SEA-LION, Southeast Asia’s flagship open-source LLM project</a>, was built by <strong>31 authors</strong>—compared to <strong>199</strong> for <a href="https://arxiv.org/abs/2501.12948">China’s DeepSeek-R1</a>, and a staggering <strong>3,295</strong> contributors behind <a href="https://arxiv.org/abs/2507.06261">Google’s latest Gemini model</a>.</p>

    </div>
  </div>
</div>

    
      
      
      







<div class="accordion-item">
  <h3 class="accordion-header" id="heading-challenges-accordion-4">
    <button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#challenges-accordion-4" aria-expanded="false" aria-controls="challenges-accordion-4" data-bs-parent="#challenges-accordion">
      Gaps in Accessibility and Trust
    </button>
  </h3>
  <div id="challenges-accordion-4" class="accordion-collapse collapse " aria-labelledby="heading-challenges-accordion-4" data-bs-parent="#challenges-accordion">
    <div class="accordion-body">
      <p>In much of Southeast Asia, AI adoption is constrained by foundational infrastructure challenges: limited connectivity, unreliable power, costly or low-spec devices, and insufficient digital literacy. Even when tools are available, widespread usage is not guaranteed. Poor localisation—beyond mere translation—can result in awkward tone, cultural mismatches, or unfamiliar interfaces. In the region, this may manifest as overly formal language, failure to interpret code-switching (the blending of languages), or disregard for indirect communication norms. When tools feel extractive or culturally alien, they risk eroding user trust.</p>

    </div>
  </div>
</div>

    
  </div>

<h2 id="opportunities-and-solutions-building-inclusive-ai-from-the-ground-up">Opportunities and Solutions: Building Inclusive AI from the Ground Up</h2>

<p>Despite the barriers, SEA has a unique opportunity to lead in creating AI
that is truly inclusive and culturally grounded.
The region can chart its own path—treating linguistic and cultural diversity as assets, not obstacles.
With its deep traditions of multilingualism, code-switching, oral storytelling, and cultural hybridity,
SEA is well-placed to pioneer flexible, context-aware AI systems that handle code-switching,
shifting levels of formality, and socially complex communication.<br />
AI attuned to SEA’s complexity could enable trust-sensitive applications,
from health promotion in conservative areas to crisis communication across multiple languages and dialects.</p>

<h3 id="1-local-innovation-and-homegrown-solutions">1. Local Innovation and Homegrown Solutions</h3>

<p>A growing ecosystem of regional initiatives is tackling SEA’s unique linguistic challenges, blending grassroots energy with institutional support. Community-led efforts like <a href="https://seacrowd.github.io">SEACrowd</a> are making a significant impact—<a href="https://seacrowd.github.io/seacrowd-catalogue/">curating hundreds of corpora</a> covering nearly 1,000 languages, building performance benchmarks in 38 Southeast Asian languages (for comparison, OpenAI’s latest model only benchmarked performance on five SEA languages), while <a href="/apprenticeship">nurturing local AI talent</a>.
SEACrowd also collaborates with global open source initiatives such as
<a href="https://mlcommons.org">ML Commons</a>, <a href="https://commoncrawl.org">Common Crawl</a>, and <a href="https://www.masakhane.io">Masakhane</a>
to share lessons and enable the global shift toward community-led, <a href="https://www.taipeitimes.com/News/editorials/archives/2025/06/29/2003839418">inclusive</a> AI development. The Singapore-based <a href="https://techforgoodinstitute.org/blog/perspectives/increasing-southeast-asian-representation-in-ai/">SEA-LION initiative is creating open-source LLMs trained on 11 Southeast Asian languages to capture cultural nuances</a>, while Thailand’s Typhoon model and Indonesia’s <a href="https://arxiv.org/pdf/2309.10661">NusaWrites</a> are building datasets and models rooted in local context. Together, these efforts offer a powerful alternative to global models that often overlook the region’s linguistic diversity.</p>

<p><img src="/assets/images/posts/tfgi-language-gap/emnlp-2023-launch.png" alt="Launch of SEACrowd at EMNLP 2023" /></p>

<p><em>Launch of <a href="/">SEACrowd</a> at <a href="https://2023.emnlp.org/">EMNLP 2023</a>.</em></p>

<p>Beyond technology, these initiatives play a vital preservation role. <a href="https://www.unesco.org/en/days/mother-language">UNESCO</a> warns that nearly 40% of the world’s languages—many in SEA—are <a href="https://seatongue.com/blog/the-asian-languages-most-at-risk-of-disappearing-in-the-ai-era">endangered</a>. By creating a <a href="https://arxiv.org/abs/2502.18148">digital footprint</a>, these initiatives help safeguard not only languages but also the cultural knowledge embedded within them.</p>

<h3 id="2-regional-coordination-and-shared-infrastructure">2. Regional Coordination and Shared Infrastructure</h3>

<p>To break silos, ASEAN—working alongside universities, community groups, international organisations such as UNESCO, and global open-source initiatives—should support interoperable data frameworks and shared standards. Projects like <a href="https://aclanthology.org/2025.acl-long.916/">SEA-VL</a>—pairing over a million culturally relevant images with local-language captions—show both the value and complexity of cross-border collaboration. A Southeast Asian NLP Commons could standardise benchmarks, ethics, and governance, especially for indigenous and low-resource languages. India’s <a href="https://ai4bharat.iitm.ac.in">AI4Bharat</a> offers a model, funding open datasets in over 20 Indian languages with government, academic, and civil society support.</p>

<h3 id="3-enabling-ecosystems-through-policy-and-incentives">3. Enabling Ecosystems through Policy and Incentives</h3>

<p>Governments can treat linguistic datasets as public digital goods and fund open-source AI for regional languages. Procurement policies, tax incentives, and grants can spur business investment in inclusion. Policymakers are starting to take notice—<a href="https://asean.org/wp-content/uploads/2024/02/ASEAN-Guide-on-AI-Governance-and-Ethics_beautified_201223_v2.pdf">ASEAN’s Guide on AI Governance and Ethics</a> and <a href="https://aiverifyfoundation.sg/resources/mgf-gen-ai/">Singapore’s IMDA</a> emphasise inclusive data practices. However, unless language equity becomes a core pillar of digital transformation, SEA risks developing AI that speaks over its people.</p>

<h3 id="4-trust-transparency-inclusion">4. Trust, Transparency, Inclusion</h3>

<p>Language inclusion must be <a href="https://arxiv.org/pdf/2506.07506">participatory</a>. Co-governance models—where contributors shape data practices and evaluation—build awareness, trust, and ownership. Investing in mentorship, transparency, and shared control ensures SEA’s digital future reflects its full diversity.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Southeast Asia’s digital future depends on closing the language gap in AI. The region’s linguistic diversity is a strategic asset, yet current systems exclude many communities. This article outlined the barriers—like missing datasets and weak incentives—and showcased efforts like SEACrowd, Typhoon, and Nusa Collection that chart a new path. Policymakers must treat local language data as essential infrastructure, while industry and communities build AI that genuinely reflects SEA voices. Inclusive AI is not optional—it is a strategic imperative to ensure digital transformation benefits everyone. By investing in linguistic inclusion, SEA doesn’t just close a gap; it leads the world in building AI that truly belongs to everyone.</p>

<h2 id="about-the-authors">About the Authors</h2>

<p>Onno Kampman is an AI Scientist at Singapore’s MOH Office for Healthcare Transformation (MOHT) and a Visiting Scientist at the University of Cambridge. He leads pioneering projects that apply AI to mental health care transformation, and contributes to SEACrowd’s mission to boost Southeast Asian AI capabilities.</p>

<p>Holy Lovenia is the Lead of SEACrowd. Based in London and affiliated with AI Singapore, she drives SEACrowd’s strategy to unify and scale AI resources across Southeast Asia—most recently through initiatives like SEACrowd’s multilingual benchmarks and the SEA‑VL vision‑language dataset.</p>

<h2 id="about-the-organisation">About the Organisation</h2>

<p><a href="/about">SEACrowd</a> is a research community advancing Southeast Asia-focused AI and empowering the next generation of AI researchers in the region. The organisation envisions a future where Southeast Asia’s AI ecosystem is mature, globally competitive, and grounded in the region’s diverse linguistic and cultural contexts.</p>

<p>SEACrowd’s initiatives include leading data collection and model development efforts tailored to Southeast Asia, building and connecting a regional research network, and supporting early-career talent through mentorship and hands-on experience via the <a href="/apprenticeship">SEACrowd Apprentice Program</a>.</p>]]></content><author><name>Onno P. Kampman</name></author><category term="perspectives" /><category term="cross-posting" /><summary type="html"><![CDATA[Southeast Asia's AI future depends on closing the language gap, ensuring that the region's rich linguistic and cultural diversity is reflected in the data and models driving technological progress. Grassroots initiatives like SEACrowd show that inclusive, community-driven AI is not only possible but essential for equitable digital development.]]></summary></entry><entry><title type="html">SEACrowd forms legal entity in Indonesia</title><link href="https://seacrowd.org/posts/incorporation.html" rel="alternate" type="text/html" title="SEACrowd forms legal entity in Indonesia" /><published>2024-08-22T00:00:00+00:00</published><updated>2024-08-22T00:00:00+00:00</updated><id>https://seacrowd.org/posts/incorporation</id><content type="html" xml:base="https://seacrowd.org/posts/incorporation.html"><![CDATA[<p>Today, SEACrowd has formed a legal entity in Indonesia. 🇮🇩</p>

<p>This marks a significant milestone in our growth and commitment to
fostering local collaborations and innovation across Southeast Asia.
We look forward to strengthening our presence and
contributing to AI in SEA even more.</p>

<div class="row g-3 my-4">
  <div class="col-sm-4">
    <img src="/assets/images/posts/sign-holy.jpg" alt="Holy signing documents" class="img-fluid" />
  </div>
  <div class="col-sm-4">
    <img src="/assets/images/posts/sign-aji.jpg" alt="Aji signing documents" class="img-fluid" />
  </div>
  <div class="col-sm-4">
    <img src="/assets/images/posts/sign-sabil.jpg" alt="Sabil signing documents" class="img-fluid" />
  </div>
</div>

<p>The structure of our legal entity is as follows:</p>

<ul>
  <li>Board of Trustees
    <ul>
      <li>Alham Fikri Aji (<em>Chief</em>)</li>
      <li>Samuel Cahyawijaya</li>
    </ul>
  </li>
  <li>Executive Board
    <ul>
      <li>Holy Lovenia (<em>Chief</em>)</li>
      <li>Salsabil Maulana Akbar (<em>Secretary</em>)</li>
      <li>Frederikus Hudi (<em>Treasury</em>)</li>
    </ul>
  </li>
  <li>Supervisory Board
    <ul>
      <li>Genta Indra Winata (<em>Chief</em>)</li>
      <li>Fajri Koto</li>
      <li>Jennifer Santoso</li>
    </ul>
  </li>
</ul>]]></content><author><name></name></author><category term="info" /><summary type="html"><![CDATA[Today, SEACrowd has formed a legal entity in Indonesia. 🇮🇩]]></summary></entry></feed>