language space standardization

  • These should be used for filtering tags

    • The list mainly covers the official or dominant languages of European and East Asian countries.
  • For Romance Languages

    • French
      • Uses the 26-letter Latin alphabet with many diacritics
      • É [\u00C9\u00E9]
      • ÀÈÙ [\u00C0\u00E0\u00C8\u00E8\u00D9\u00F9]
      • ÂÊÎÔÛ [\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB]
      • ËÏÜŸ [\u00CB\u00EB\u00CF\u00EF\u00DC\u00FC\u0178\u00FF]
      • Ç [\u00C7\u00E7]
      • ligatures ÆŒ [\u00C6\u00E6\u0152\u0153]
    • Italian
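      • Uses the 21 letters ABCDEFGHILMNOPQRSTUVZ (JKWXY appear only in loanwords)
      • ÀÈÉÌÒÙ [\u00C0\u00E0\u00C8\u00E8\u00C9\u00E9\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9]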
    • Portuguese
      • Uses the 26-letter Latin alphabet with many diacritics
      • ÁÉÍÓÚ [\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA]
      • ÂÊÔ [\u00C2\u00E2\u00CA\u00EA\u00D4\u00F4]
      • ÃÕ [\u00C3\u00E3\u00D5\u00F5]
      • Ç [\u00C7\u00E7]
    • Spanish
      • Uses the 26-letter Latin alphabet plus Ñ [\u00D1\u00F1]
      • Uses acute accents ÁÉÍÓÚ [\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA] and the diaeresis Ü [\u00DC\u00FC]
    • Romanian (aka Moldovan)
      • Uses the 26-letter Latin alphabet with several diacritics
      • ĂÂÎ [\u0102\u0103\u00C2\u00E2\u00CE\u00EE]
      • ȘȚ [\u0218\u0219\u021A\u021B]
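The per-language letter lists above can be turned into a filter by wrapping them in a regex character class. The following is a minimal sketch for French only, assuming Python's `re` module; the names `FRENCH_EXTRA`, `make_word_pattern`, and `is_french_word` are illustrative and not part of the original list. Input is normalized to NFC first so that a base letter followed by a combining accent matches the precomposed code points listed above.

```python
import re
import unicodedata

# Extra French letters, copied from the escapes listed above.
FRENCH_EXTRA = (
    "\u00C9\u00E9"                                                  # É
    "\u00C0\u00E0\u00C8\u00E8\u00D9\u00F9"                          # ÀÈÙ
    "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB"  # ÂÊÎÔÛ
    "\u00CB\u00EB\u00CF\u00EF\u00DC\u00FC\u00FF\u0178"              # ËÏÜŸ
    "\u00C7\u00E7"                                                  # Ç
    "\u00C6\u00E6\u0152\u0153"                                      # ÆŒ
)


def make_word_pattern(extra):
    """Compile a pattern accepting basic Latin letters plus the given extras."""
    return re.compile(f"[A-Za-z{extra}]+")


FRENCH_WORD = make_word_pattern(FRENCH_EXTRA)


def is_french_word(word):
    """Return True if every character of the word is in the French letter set."""
    composed = unicodedata.normalize("NFC", word)  # fold combining accents
    return FRENCH_WORD.fullmatch(composed) is not None


print(is_french_word("garçon"))  # True
print(is_french_word("Straße"))  # False: ß is not in the French set
```

The same helper could be reused for the other languages by passing their escape lists instead. Note that a match only means the letters are compatible with the language, not that the text is written in it.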
  • For Germanic Languages:

    • German
      • Uses the 26-letter Latin alphabet plus umlauts and ß
      • ÄÖÜ [\u00C4\u00E4\u00D6\u00F6\u00DC\u00FC]
      • ß [\u1E9E\u00DF]
    • Dutch
      • Uses the 26-letter Latin alphabet; some words carry an acute accent or diaeresis on a vowel (for example é, ë, ï)
    • Danish/Norwegian
      • Uses the 26-letter Latin alphabet plus three extra letters
      • Æ [\u00C6\u00E6]
      • Ø [\u00D8\u00F8]
      • Å [\u00C5\u00E5]
    • Swedish
      • Uses the 26-letter Latin alphabet plus three extra letters
      • ÄÖ [\u00C4\u00E4\u00D6\u00F6]
      • Å [\u00C5\u00E5]
    • Icelandic
      • Uses ABDEFGHIJKLMNOPRSTUVXY plus several extra letters
      • ÁÉÍÓÚÝ [\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD]
      • Æ [\u00C6\u00E6]
      • Ö [\u00D6\u00F6]
      • ÐÞ [\u00D0\u00F0\u00DE\u00FE]
  • For Uralic languages

    • Finnish
      • Uses the 26-letter Latin alphabet plus three extra letters
      • ÄÖ [\u00C4\u00E4\u00D6\u00F6]
      • Å [\u00C5\u00E5]
    • Estonian
      • Uses the Latin letters ABDEFGHIJKLMNOPRSZTUV with diacritics
      • (additional letters exist for loanwords)
      • Õ [\u00D5\u00F5]
      • ÄÖÜ [\u00C4\u00E4\u00D6\u00F6\u00DC\u00FC]
      • ŠŽ [\u0160\u0161\u017D\u017E]
    • Hungarian
      • Uses all 26 Latin letters (in the extended alphabet, which adds Q, W, X, Y) with diacritics
      • ÁÉÍÓÚ [\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA]
      • ÖÜ [\u00D6\u00F6\u00DC\u00FC]
      • ŐŰ [\u0150\u0151\u0170\u0171]
  • For Chinese (Mainland, Taiwan and Hong Kong) and Japanese:

    • CJK Ideograph [\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF\U00020000-\U0002A6DF\U0002A700-\U0002EBEF\U0002F800-\U0002FA1F] (is a must for Chinese and Japanese)
    • Katakana and Hiragana [\u3040-\u30FF\u31F0-\u31FF\U0001B130-\U0001B16F] (is a must for Japanese)
    • Zhuyin Bopomofo [\u3100-\u312F\u31A0-\u31BF] (is a must for Taiwanese)
    • CJK Symbols and Punctuation [\u3000-\u303F\uFE30-\uFE4F] (is a must for sentences in Chinese and Japanese)
    • Japanese Halfwidth and Fullwidth Forms [\uFF00-\uFFEF] (includes fullwidth copies of ASCII letters, digits, and punctuation, so beware)
    • Hentaigana (alternates) [\U0001B000-\U0001B12F]
    • CJK special symbols [\u3200-\u33FF\U0001F200-\U0001F2FF] (used in Japanese similar to emojis)
    • Kangxi Radicals for dictionaries [\u2E80-\u2FDF]
    • CJK Stroke Characters [\u31C0-\u31EF]
    • Ideographic Description Characters (combiners) [\u2FF0-\u2FFF]
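Since Chinese and Japanese share the ideograph ranges, one way to tell them apart is to look for kana or Bopomofo first. The sketch below assumes Python's `re` and `unicodedata` modules; the function name `guess_cjk` and its return labels are hypothetical. It applies NFKC normalization up front, which folds fullwidth ASCII and halfwidth katakana from the `\uFF00-\uFFEF` block into their ordinary forms before matching.

```python
import re
import unicodedata

# Character classes copied from the list above.
HAN = re.compile(
    "[\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF"
    "\U00020000-\U0002A6DF\U0002A700-\U0002EBEF\U0002F800-\U0002FA1F]"
)
KANA = re.compile("[\u3040-\u30FF\u31F0-\u31FF\U0001B130-\U0001B16F]")
BOPOMOFO = re.compile("[\u3100-\u312F\u31A0-\u31BF]")


def guess_cjk(text):
    """Return a rough label based on which CJK ranges occur in the text."""
    # NFKC turns fullwidth ASCII into ASCII and halfwidth kana into kana.
    text = unicodedata.normalize("NFKC", text)
    if KANA.search(text):
        return "japanese"
    if BOPOMOFO.search(text):
        return "chinese-taiwan"
    if HAN.search(text):
        return "chinese-or-japanese"  # ideographs alone are ambiguous
    return "other"


print(guess_cjk("東京タワー"))  # japanese
print(guess_cjk("北京"))        # chinese-or-japanese
print(guess_cjk("ＡＢＣ"))      # other: fullwidth ASCII folds to plain ASCII
```

A tag made only of ideographs stays ambiguous here; deciding between Chinese and Japanese in that case would need more than range checks.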
  • For Cyrillic: [\u0400-\u052F\u2DE0-\u2DFF\uA640-\uA69F\u1C80-\u1C8F\u1D2B\u1D78\uFE2E\uFE2F]

    • Slavic Alphabet [\u0400-\u04FF] (is a must for most Slavic languages and Abkhazian)
    • Caucasian Alphabet supplement [\u0500-\u052F] (will be needed for Abkhazian)
    • Old Cyrillic Alphabet [\u2DE0-\u2DFF\uA640-\uA69F\u1C80-\u1C8F]
    • Phonetic Supplement [\u1D2B\u1D78]
    • Combining Half Marks [\uFE2E\uFE2F]
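A quick sanity check for a combined class like the Cyrillic one is to expand it into code points and compare it with the union of its sub-ranges. This is a sketch under the assumption that Python's `re` is the target engine; `expand`, `COMBINED`, and `PARTS` are illustrative names. Ranges must be written with an ASCII hyphen, because a typographic dash inside a class is treated as a literal character and the class then matches only the endpoints.

```python
import re

# Combined Cyrillic class and its parts, written with ASCII hyphens.
COMBINED = (
    "\u0400-\u052F\u2DE0-\u2DFF\uA640-\uA69F\u1C80-\u1C8F"
    "\u1D2B\u1D78\uFE2E\uFE2F"
)
PARTS = [
    "\u0400-\u04FF",                            # Slavic alphabet
    "\u0500-\u052F",                            # Caucasian supplement
    "\u2DE0-\u2DFF\uA640-\uA69F\u1C80-\u1C8F",  # Old Cyrillic
    "\u1D2B\u1D78",                             # Phonetic supplement
    "\uFE2E\uFE2F",                             # Combining half marks
]


def expand(char_class):
    """Return the set of BMP code points matched by a character class body."""
    pattern = re.compile(f"[{char_class}]")
    return {cp for cp in range(0x10000) if pattern.fullmatch(chr(cp))}


combined = expand(COMBINED)
union = set().union(*(expand(part) for part in PARTS))
print(combined == union)  # True
print(len(combined))      # 452

# An en dash (U+2013) is a literal, so only three characters match.
print(len(expand("\u0400\u2013\u052F")))  # 3
```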
  • For Greek and Coptic: [\u0370-\u03FF]

  • For Korean (Hangul): [\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uD7B0-\uD7FF]

    • Korean Syllables [\uAC00-\uD7AF] (is a must for Korean)
    • Modern Jamo/alphabet [\u1100-\u11FF\u3130-\u318F]
    • Archaic Jamo/alphabet [\uA960-\uA97F\uD7B0-\uD7FF]
  • For most Southeast Asian countries:

    • For Thai: [\u0E00-\u0E7F]
    • For Lao: [\u0E80-\u0EFF]
    • For Burmese (Myanmar): [\u1000-\u109F\uAA60-\uAA7F\uA9E0-\uA9FF]
    • For Khmer (Cambodian): [\u1780-\u17FF\u19E0-\u19FF]
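Because these four scripts each occupy their own blocks, a lookup table is enough to label a tag. The sketch below assumes Python's `re`; the dictionary keys and the helper `scripts_in` are hypothetical names chosen for illustration.

```python
import re

# Character class bodies copied from the list above.
SCRIPTS = {
    "thai": "\u0E00-\u0E7F",
    "lao": "\u0E80-\u0EFF",
    "burmese": "\u1000-\u109F\uAA60-\uAA7F\uA9E0-\uA9FF",
    "khmer": "\u1780-\u17FF\u19E0-\u19FF",
}
PATTERNS = {name: re.compile(f"[{body}]") for name, body in SCRIPTS.items()}


def scripts_in(text):
    """Return the sorted names of listed scripts that occur in the text."""
    return sorted(name for name, pat in PATTERNS.items() if pat.search(text))


print(scripts_in("สวัสดี"))  # ['thai']
print(scripts_in("hello"))   # []
```

Returning a list rather than a single label keeps mixed-script tags visible instead of silently picking one script.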
  • Vietnamese, Tagalog, and Malay/Indonesian use the Latin script, so they are harder to separate from English text

    • Vietnamese uses ABCDEGHIKLMNOPQRSTUVXY and other extra glyphs
      • circumflexes Â [\u00C2\u00E2] Ê [\u00CA\u00EA] Ô [\u00D4\u00F4]
      • language specific Ă [\u0102\u0103] Đ [\u0110\u0111] Ơ [\u01A0\u01A1] Ư [\u01AF\u01B0]
      • tone marks include [\u0301\u0300\u0309\u0303\u0323] and the deprecated [\u0341\u0340]
      • precomposed characters in [\u00c0\u00c1\u00c3\u00c8\u00c9\u00cc\u00cd\u00d2\u00d3\u00d5\u00d9\u00da\u00dd\u00e0\u00e1\u00e3\u00e8\u00e9\u00ec\u00ed\u00f2\u00f3\u00f5\u00f9\u00fa\u00fd\u0128\u0129\u0168\u0169\u1EA0-\u1EF9]
    • Tagalog (Filipino) uses all 26 Latin letters plus the Spanish Ñ [\u00D1\u00F1]
      • The Abakada variant has only 20 letters, ABKDEGHILMNOPRSTUWY plus the digraph NG, and no extra glyphs
    • Malay/Indonesian uses all 26 Latin letters without extra glyphs
      • The Malay Za'aba alphabet included Ă [\u0102\u0103] and Ï [\u00CF\u00EF] alongside Latin
      • The Indonesian Soewandi alphabet included É [\u00C9\u00E9] alongside Latin
      • The Jawi alphabet is Arabic-based and Right-To-Left, so JUST NO
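Vietnamese text can arrive either as precomposed characters or as base letters followed by combining tone marks, so normalizing before matching avoids handling both forms. The sketch below assumes Python's `unicodedata` and `re`; `is_vietnamese_word` and the constant names are illustrative. NFC composes base letters with their marks, and it also maps the tone-mark code points U+0340 and U+0341 to their canonical equivalents U+0300 and U+0301, so only the precomposed list needs to be matched.

```python
import re
import unicodedata

BASE = "ABCDEGHIKLMNOPQRSTUVXY"  # letter inventory listed above
EXTRA = (
    "\u00C2\u00E2\u00CA\u00EA\u00D4\u00F4"                  # Â Ê Ô
    "\u0102\u0103\u0110\u0111\u01A0\u01A1\u01AF\u01B0"      # Ă Đ Ơ Ư
    "\u00C0\u00C1\u00C3\u00C8\u00C9\u00CC\u00CD\u00D2\u00D3\u00D5\u00D9\u00DA\u00DD"
    "\u00E0\u00E1\u00E3\u00E8\u00E9\u00EC\u00ED\u00F2\u00F3\u00F5\u00F9\u00FA\u00FD"
    "\u0128\u0129\u0168\u0169\u1EA0-\u1EF9"                 # precomposed forms
)
VIETNAMESE_WORD = re.compile(f"[{BASE}{BASE.lower()}{EXTRA}]+")


def is_vietnamese_word(word):
    """Return True if the NFC form uses only Vietnamese letters."""
    composed = unicodedata.normalize("NFC", word)
    return VIETNAMESE_WORD.fullmatch(composed) is not None


print(is_vietnamese_word("Việt"))              # True
print(is_vietnamese_word("Vie\u0302\u0323t"))  # True: composes to the same word
print(is_vietnamese_word("jazz"))              # False: J and Z are not in the set
```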
  • For South Asian countries:

    • For Hindi (India) and Nepali Devanagari: [\u0900-\u097F\uA8E0-\uA8FF\u1CD0-\u1CFF]
    • For Bengali (Bangladesh): [\u0980-\u09FF]
    • For Tibetan and Dzongkha (Bhutan): [\u0F00-\u0FFF]
    • For Sri Lanka:
      • Sinhala [\u0D80-\u0DFF\U000111E0-\U000111FF]
      • Tamil [\u0B80-\u0BFF\U00011FC0-\U00011FFF]
    • For Farsi (Iran) and Tajik:
      • The Latin alphabet uses "ABCDEFGHIJKLMNOPQRSTUVXZ" and some special characters
        • Çç [\u00c7\u00e7] Şş [\u015e\u015f]
        • Īī [\u012a\u012b] Ūū [\u016a\u016b]
        • Ƣƣ [\u01a2\u01a3] Ƶƶ [\u01b5\u01b6]
        • Apostrophe for glottal stop
      • They have a Cyrillic alphabet set, see Cyrillic section
      • The Persian alphabet and the Bukhori Jewish script are both right-to-left, so NO
    • Urdu (Pakistan) and Maldivian are both RTL, so NO
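The right-to-left scripts excluded above can be detected without listing their ranges, by asking for each character's bidirectional class. This is a minimal sketch assuming Python's `unicodedata` module; `contains_rtl` and `RTL_CLASSES` are hypothetical names. Classes `R` and `AL` cover Hebrew, Arabic-based alphabets such as Persian, Urdu, and Jawi, and Thaana (Maldivian).

```python
import unicodedata

# Bidirectional classes for strong right-to-left characters.
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text):
    """Return True if any character has a strong right-to-left class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


print(contains_rtl("فارسی"))   # True
print(contains_rtl("Тоҷикӣ"))  # False: Cyrillic is left-to-right
```

Tags flagged this way could be dropped before any of the per-language classes are applied.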
Edited by lastpass