iseeaswell꩜bʂky
437 posts
@iseeaswell
low resource MT, plants, insects, music+sangeetham. Join TUSL, the Low Resource NLP Discord: discord.gg/z3ya9EUS2U
Joined September 2019
143 Following · 745 Followers
  • iseeaswell꩜bʂky
    @iseeaswell
    May 17, 2022
    How many languages can we support with Machine Translation? We train a translation model on 1000+ languages, using it to launch 24 new languages on Google Translate without any parallel data for these languages. arxiv.org/abs/2205.03983 Technical 🧵 below: 1/18
  • iseeaswell꩜bʂky
    @iseeaswell
    Oct 29, 2020
    What do we need to scale NLP research to 1000 languages? We started off with a goal to build a monolingual corpus in 1000 languages by mining data from the web. Here’s our work documenting our struggles with Language Identification (LangID): arxiv.org/abs/2010.14571 1/8
  • iseeaswell꩜bʂky
    @iseeaswell
    Jun 27, 2024
    Excited to announce that 110 languages got added to Google Translate today! Time for context on these languages, especially the communities who helped a lot over the past few years, including Cantonese, NKo, and Faroese volunteers. Also, a 110-language youtube playlist. 🧵
  • iseeaswell꩜bʂky
    @iseeaswell
    Mar 23, 2021
    Does the data used for multilingual modeling really contain content in the languages it says it does? Short answer: sometimes 🙁 arxiv.org/abs/2103.12028 1/n
  • iseeaswell꩜bʂky
    @iseeaswell
    May 11, 2022
    Happy to finally be public about my main project over the last few years: adding more languages to Translate!
    Ankur Bapna
    @ankurbpn
    May 11, 2022
    Excited to share some real world results from our effort on building machine translation models for long tail languages. Here's the research paper that describes the approach in more detail: arxiv.org/abs/2205.03983 Tweet 🧵 coming soon :)
  • iseeaswell꩜bʂky
    @iseeaswell
    Sep 25, 2023
    Have you ever wanted a LangID model that works on 1500+ languages? Check out FUN-LangID: github.com/google-researc… !
  • iseeaswell꩜bʂky
    @iseeaswell
    Mar 28, 2023
    I'm excited to open-source GATITOS, a new multilingual lexicon in 26 long-tail languages! arxiv.org/pdf/2303.15265… shows how to use it for an average ChrF boost of +7.0 to +10 over baseline. Open-sourced data here: github.com/google-researc… (1/10)
  • iseeaswell꩜bʂky
    @iseeaswell
    Jun 5, 2022
    Do you want your language to be supported by NLP (like Google Translate) -- or left alone? Please fill this form if you have thoughts you'd like to share with me :) docs.google.com/forms/d/e/1FAI… 1/4
    docs.google.com
    Do you speak a language not on Google Translate?
    We are looking for people from various communities to talk to and understand whether they would want their language more supported by technology -- e.g. Google Translate -- or not. You may also fill...
  • iseeaswell꩜bʂky
    @iseeaswell
    May 9, 2023
    Just added 84 more languages to GATITOS! It now has a total of 113 languages, many of which have no other public resources 😊
    github.com
    GitHub - google-research/url-nlp
    Contribute to google-research/url-nlp development by creating an account on GitHub.
  • iseeaswell꩜bʂky
    @iseeaswell
    Oct 29, 2020
    Replying to @iseeaswell
    As a closing note: PLEASE LOOK AT ANY DATA YOU CRAWL OR TRAIN ON. Publicly available LangID and web corpora also have these issues for lower-resource languages. 7/8
  • iseeaswell꩜bʂky
    @iseeaswell
    Jul 1, 2024
    Do you want to help improve translation for the 110 new Google Translate languages? One way is to help correct GATITOS 😼🧵
  • iseeaswell꩜bʂky
    @iseeaswell
    Jun 6, 2024
    Replying to @xkcd
    The diagram is great, but from a linguist's perspective, I don't think the bottom statement is related. [t] -> [ʔ] is just the allophone of /t/ used word-finally in America, and is not inherently more efficient. (And this is not done by native speakers in Ireland etc.)
  • iseeaswell꩜bʂky
    @iseeaswell
    Nov 15, 2023
    Announcing BREAD, a new benchmark for noisy text detection, and CRED, the scoring functions we open-source to solve the problem!
    arxiv.org
    Separating the Wheat from the Chaff with BREAD: An open-source...
    Data quality is a problem that perpetually resurfaces throughout the field of NLP, regardless of task, domain, or architecture, and remains especially severe for lower-resource languages. A...
  • iseeaswell꩜bʂky
    @iseeaswell
    Oct 29, 2020
    Replying to @iseeaswell
    For example, did you know how much ᏋᏁᎶᏝᎥᏕᏂ is written in Cherokee syllabics online? Or that because of the common Oromo 4-gram “essa”, a majority of web-crawled “Oromo” may actually be English sentences containing “essay” multiple times? 5/8
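
The "essa" point in the post above can be illustrated with a small sketch. This is not the LangID system from the paper: the function names and the sample sentence are hypothetical, and the code only shows how one frequent character 4-gram can fire several times on text from a different language.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 4) -> Counter:
    """Count overlapping character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_hits(text: str, ngram: str) -> int:
    """Count case-insensitive, overlapping occurrences of `ngram` in `text`."""
    return char_ngrams(text, len(ngram))[ngram.lower()]


# Hypothetical English sentence of the kind the post describes:
# every "essay" contains the 4-gram "essa".
english = "My essay is due today, so I rewrote the essay and sent the essay in."
print(ngram_hits(english, "essa"))
```

A classifier that leans heavily on such a 4-gram as evidence for Oromo would see repeated evidence in this English sentence, which is one plausible way the mislabeling described in the post could arise.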