Digitally-disadvantaged languages

Digitally-disadvantaged languages face multiple inequities in the digital sphere including gaps in digital support that obstruct access for speakers, poorly-designed digital tools that negatively affect the integrity of languages and writing systems, and unique vulnerabilities to surveillance harms for speaker communities. This term captures the acutely uneven digital playing field for speakers of the world’s 7000+ languages.

1. The Unicode Standard is a character coding system designed to support interoperable exchange and consistent representation of text in the world's writing systems on digital devices, providing a foundation for a multilingual digital sphere.
2. A language is a shared means of communication, while a script is the collection of written characters used to write a language. A language's writing system incorporates a script and a set of rules regarding its use. Languages and scripts do not have a one-to-one or static relationship. Some languages, such as Kazakh, Mongolian, and Urdu, are written in multiple scripts. Many languages share a script, although the rules of their writing systems may differ. More than 1000 languages are written in the Latin script, including English, French, Czech, Kazakh, Nahuatl, Tagalog, Vietnamese, and Igbo; Hindi, Nepali, Marathi, Bodo, and Konkani are among languages written in the Devanagari script.
[...] to harm through digital surveillance and under-moderation of language content.
The term digitally-disadvantaged languages overlaps with and extends adjacent terms used in geopolitics and computational linguistics, i.e., natural language processing (NLP). While the category of digitally-disadvantaged languages includes many if not all minoritised languages, Indigenous languages, oral languages, signed languages, and endangered languages, it also includes many national and widely-spoken languages that enjoy robust intergenerational transmission. 3 There is no sharp line that delineates whether a language is digitally-disadvantaged. Rather, the term captures a relative degree of disadvantage as compared to the handful of languages that enjoy the most comprehensive digital support and wider political advantages. That said, there are stark differences between the levels of support for languages such as English, Chinese, Spanish, and Arabic and those for even widely-spoken national and regional languages such as Amharic, Bulgarian, Tamil, Swahili, or Cebuano. However, digitally-disadvantaged is not a static state; it is possible for a language to "digitally ascend" (Kornai, 2013) [...]
Illustratively, the QWERTY Latin character layout remains the default keyboard all over the world, leading many to write even well-supported languages like Arabic in a transliterated Latin form such as "Arabizi" (Zaugg, 2019a). The global spread of digital tools and systems including QWERTY keyboards, ASCII, 4 ICANN oversight of the originally Latin character-only domain name system, 5 and default English auto-correct have all contributed to the "logic" that English is the global lingua franca, and the Latin alphabet the most modern, rational, and universal script. 6 This "logic" in turn builds upon US and UK imperial power, which laid the groundwork for the "digital revolution" and first brought English and the Latin script to far-flung corners of the globe.
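The Latin-centric design of ASCII mentioned above can be made concrete with a minimal sketch. ASCII defines only 128 code points, so text in most of the world's scripts cannot be represented in it at all, whereas a Unicode encoding such as UTF-8 can carry it. The sample words and the helper name `fits_in_ascii` are illustrative assumptions and do not come from the sources cited here.

```python
# Illustrative sample strings (assumed for this sketch, not taken from the cited sources).
samples = {
    "English": "hello",
    "Amharic": "ሰላም",
    "Arabic": "مرحبا",
    "Arabizi": "mar7aba",  # a Latin-transliterated spelling of the Arabic word above
}


def fits_in_ascii(text: str) -> bool:
    """Return True if every character has a code point below 128 (the ASCII range)."""
    return all(ord(ch) < 128 for ch in text)


for label, text in samples.items():
    # UTF-8 can encode every sample; ASCII can encode only the Latin-script ones.
    utf8_length = len(text.encode("utf-8"))
    print(f"{label}: fits in ASCII = {fits_in_ascii(text)}, UTF-8 bytes = {utf8_length}")
```

Under these assumptions, only the Latin-script strings pass the ASCII check, which is one technical reason transliterated forms such as Arabizi were workable on systems that handled nothing beyond ASCII.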
Digital advantage for English and the Latin script, and to a lesser degree other dominant languages and scripts, has created a paradigm in which many bilingual or multilingual speakers of digitally-disadvantaged languages become habituated to consuming and sharing content in a dominant "bully" language or script. 7 Many digitally-disadvantaged language speakers do not imagine that the digital sphere could be equally hospitable to their mother tongue and native script as it is to English and Latin (Benjamin, 2016). Unfortunately, gaps in digital support and use may be contributing to many of these languages' extinction as speakers increasingly use "bully" languages on and offline. Shockingly, 50-90% of language diversity is slated to be lost this century (Romaine, 2015); inequities in the digital sphere appear to be a factor in this shift (Kornai, 2013; Zaugg, 2017; Zaugg, 2019a; Zaugg, 2020).
The route out of digitally-disadvantaged status is "full-stack support" 8 (Loomis, Pandey, and Zaugg, 2017) [...]
4. The American Standard Code for Information Interchange, widely known as ASCII, assigned the Latin letters, numbers, and other characters common to American English to the 128 slots available in its 7-bit code. ASCII was the predominant character encoding standard pre-Unicode and is still used by many websites and devices today.
5. ICANN, or the Internet Corporation for Assigned Names and Numbers, is a U.S. nonprofit and multistakeholder group that maintains the central repository for IP addresses and helps coordinate their supply while also managing the domain name system.
6. This digital "logic" perpetuates supremacist theories such as Jean-Jacques Rousseau's hypothesis in On the Origin of Language that "the depicting of objects is appropriate to a savage people; signs of words and of propositions, to a barbaric people; and the alphabet to civilised people" (1966, p. 17, as quoted in Lydia Liu, 2015).
7. Poet Bob Holman calls dominant languages that push out mother tongues "bully" languages (Grubin, 2015).
8. "Full-stack support" is similar to Kornai's (2013) definition of "digital vitality," but the difference is that Kornai's definition encompasses both digital support and digital use. This is an important distinction because digital support does not necessarily lead to digital use of a language; long-standing lack of digital support may in fact incentivize bilingual/multilingual speakers to utilise a dominant, well-supported language for digital communication, such that these habits may be irreversible even if digital supports for their mother tongue later exist. In this context, it is possible for a language to be digitally-disadvantaged while also being well-supported.

EQUITABLE ACCESS
Equity, versus equality, acknowledges that each language community has unique circumstances and requires an allocation of resources and efforts to match, potentially including the refusal of digital support. Issues with equitable access can fall anywhere on the "stack," from fonts to support on popular social media platforms. For example, while Indic scripts are encoded within the Unicode Standard, disproportionately few Indic fonts exist, due in part to the technical difficulty of engineering such fonts and the historically low commercial interest in Indian markets. Support by major software vendors has also followed political and commercial interests, from prioritising national and "commercially-viable" scripts in early editions of the Unicode Standard (Zaugg, 2017), to the targeting by software localization vendors of Europe and Japan through the late 20th century (Oo, 2018).
Even for languages where typographic access is not a barrier, a major issue is a lack of integration methods [...]
[...] through a "digital re-colonization" supposedly driven [...]
10. Users on the popular streaming platform Twitch complained, for example, about the lack of Indigenous language tags available to help them find other members of their language communities, e.g. Basque and Gaelic (Sinclair, 2021). One example of successful lobbying is Apple's attempts to support the nastaʿlīq script used to write Urdu (Kohari, 2021).
[...] workaround had to be implemented to enable Sami keyboard access, 11 with no mechanism for enabling proofing tools. iOS and Android require manual maintenance of separate keyboard apps, with limited operating system integration. It is presently not possible to provide a high-quality user experience for digitally-disadvantaged language speakers on these platforms. Many digitally-disadvantaged language communities include passionate advocates who have led grassroots efforts to develop fonts, keyboards, and word processing software for their languages and scripts (Zaugg, 2017; Zaugg, 2019a; Zaugg, 2020; Zaugg, forthcoming; Scannell, 2008; Bansal, 2021; Coffey, 2021; Kohari, 2021; Rosenberg, 2011; Wadell, 2016).
11. The workaround was to add the keyboard as a variant under the majority language, and to write the operating system extension needed to implement the actual keyboard functionality (i.e., the ability for a key press to produce the intended input).

LANGUAGE AND SCRIPT INTEGRITY
While some efforts to support digitally-disadvantaged languages are well-grounded, others are based on superficial knowledge of languages and writing systems (Zaugg, forthcoming). A virtual keyboard is only useful if it includes all the characters a language uses, and it should ideally have a layout optimised for the most commonly used characters. A well-designed font that incorporates calligraphic traditions can elevate a script's readability and status; a poorly designed font can signal its devaluation compared to font-rich scripts such as Latin (Leddy, 2018). Tools such as auto-correct, spell-check, and predictive typing can speed input, but can also degrade a language's orthography, honorifics, and patterns of respectful address if developed without appropriate care.
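The point about keyboard coverage can be checked mechanically. The following is a minimal sketch that compares the characters a layout can produce against the characters an orthography requires. The layout, the function names, and the letter sample are hypothetical: the Igbo letters listed are an illustrative subset, not a complete or authoritative inventory, and a real audit would need to be led by the language community.

```python
import string
import unicodedata


def nfc_set(chars):
    """Normalise each character to NFC so composed and decomposed forms compare equal."""
    return {unicodedata.normalize("NFC", ch) for ch in chars}


def missing_characters(required, layout):
    """Return the required characters that the layout cannot produce, sorted by code point."""
    return sorted(nfc_set(required) - nfc_set(layout))


# Hypothetical layout: only the lowercase letters on a plain QWERTY keyboard.
qwerty_letters = set(string.ascii_lowercase)

# Illustrative subset of letters used in Igbo orthography (assumed sample, not a full inventory).
igbo_sample = {"a", "b", "i", "o", "u", "n", "ị", "ọ", "ụ", "ṅ"}

gaps = missing_characters(igbo_sample, qwerty_letters)
print("Characters missing from the layout:", gaps)
```

Normalising to NFC first matters because the same letter can be stored either as one precomposed code point or as a base letter plus a combining mark. Without that step, a layout could be reported as missing a character it actually supports.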
A significant trend within NLP is reliance on "big data" approaches to solve language access issues, such as generating text-to-speech engines or automatic translation. This exacerbates the disadvantage of low-resource languages: dominant languages receive better-quality tools because the bulk of cultural discourse already exists in these languages. Optimistically, new approaches such as "transfer learning" may allow using higher-resourced languages to train models for lower-resourced languages. However, to avoid building linguistically-damaging or unwanted tools, computational linguists should commit to "decolonizing NLP" by only developing tools in partnership with and led by the interests of language communities (Bird, 2020).

SURVEILLANCE VULNERABILITIES
Even when digitally-disadvantaged languages achieve a baseline of digital support, knock-on challenges remain. For example, social media platforms do not adequately moderate content in these languages (Zaugg, 2019b; Fick & Dave, 2019; Martin & Sinpeng, 2021; Marinescu, 2021). Facebook in particular has failed to moderate hate speech and fake news in digitally-disadvantaged languages, leading to real-world harms across the globe (Adegoke & BBC Africa Eye, 2018; Stevenson, 2018; Taye & Pallero, 2020).
Given that digitally-disadvantaged languages have a smaller mass of digitised content, data mining puts these communities at higher risk relative to dominant languages. The smaller the corpus, the higher the chance that individual privacy of community members will be invaded. Finding the balance between technological solutions and social responsibility is challenging. Ensuring that users are not surveilled, while simultaneously improving language tool quality, requires consent-based measures significantly beyond those provided by laws and regulations like GDPR. Privacy protections are critical for digitally-disadvantaged language communities; surveillance capitalism will likely lead to disproportionately negative outcomes in these communities, as many are uniquely vulnerable to state, NGO, and corporate harms (Zaugg, 2019b). For example, digital tools have been used to surveil the Rohingya in Myanmar and Bangladesh (Aziz, 2021; Ortega, 2021), while U.S. Customs and Border Protection surreptitiously collects migrants' cell phone conversations and social media posts, using them to inform asylum decisions at the US-Mexico border (Korkmaz, 2020).
Some digitally-disadvantaged languages are of "strategic interest" to governments, and tools such as machine translation are built through military-intelligence funding to aid surveillance. Amandalynne Paullada (2021, n.p.) reminds us that a push for militarised surveillance is "precisely what fostered the development of machine translation technology in the mid-20th century" and its deployment today extends this tradition of "exerting power over subordinate groups." Efforts towards digital justice for digitally-disadvantaged language communities must balance the fact that increased digital support for a language also increases its speaker community's legibility to surveilling actors, benevolent or malevolent. These languages require design solutions that maintain data privacy, sovereignty, 14 and safety within the digital sphere.

CONCLUSION
Digitally-disadvantaged languages face multiple inequities in the digital sphere, including gaps in digital support that obstruct access for speakers, poorly-designed digital tools that negatively affect the integrity of languages and writing systems, and unique vulnerabilities to surveillance harms for speaker communities. The term can bridge the work of a wide range of stakeholders who seek to study, discuss, and address language equity in the digital sphere, including scholars, NLP researchers, technologists, speaker communities, and language advocates.
14. For example, the Māori non-profit Te Hiku Media is working to build language tools for their community while keeping their annotated audio data, which can be used to develop automatic speech recognition and speech-to-text tools, out of the hands of corporate actors (Coffey, 2021).