Unicode (ISO 10646):
| characters | bits | byte 1 | byte 2 | byte 3 |
| U+0000 - U+007F (ASCII) | 7 | 0_b6_b5_b4_b3_b2_b1_b0 | | |
| U+0080 - U+07FF (Latin-1 through Arabic) | 11 | 1_1_0_b10_b9_b8_b7_b6 | 1_0_b5_b4_b3_b2_b1_b0 | |
| U+0800 - U+FFFD (remainder of BMP) | 16 | 1_1_1_0_b15_b14_b13_b12 | 1_0_b11_b10_b9_b8_b7_b6 | 1_0_b5_b4_b3_b2_b1_b0 |
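The bit layout in the table can be sketched as a small encoder. This is only an illustration: the function name utf8_encode_bmp is hypothetical, it covers just the one- to three-byte forms listed here, and rejecting the surrogate range U+D800..U+DFFF is an added assumption that the table itself does not state. U+FFFE and U+FFFF are not rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point in U+0000..U+FFFF as UTF-8 (hypothetical helper).
 * Writes 1 to 3 bytes to out and returns the number of bytes written.
 * Returns 0 for values above U+FFFF and, as an added assumption,
 * for the surrogate range U+D800..U+DFFF. */
size_t utf8_encode_bmp(uint32_t cp, unsigned char out[3])
{
    if (cp <= 0x7F) {                       /* 0bbbbbbb */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110bbbbb 10bbbbbb */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110bbbb 10bbbbbb 10bbbbbb */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;                               /* outside the range covered here */
}
```

For example, U+00F6 (ö) should come out as the two bytes 0xC3 0xB6, and U+20AC (€) as 0xE2 0x82 0xAC.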
chars range          block
----- -------------- ------------------------------------
128 U+0000..U+007F Basic Latin
128 U+0080..U+00FF Latin-1 Supplement
128 U+0100..U+017F Latin Extended-A
156 U+0180..U+024F Latin Extended-B
89 U+0250..U+02AF IPA Extensions
57 U+02B0..U+02FF Spacing Modifier Letters
72 U+0300..U+036F Combining Diacritical Marks
105 U+0370..U+03FF Greek
230 U+0400..U+04FF Cyrillic
...
85 U+0530..U+058F Armenian
82 U+0590..U+05FF Hebrew
62 U+0600..U+06FF Arabic
...
87 U+0E00..U+0E7F Thai
65 U+0E80..U+0EFF Lao
...
40 U+10A0..U+10FF Georgian
...
348 U+1200..U+137F Ethiopic
...
128 U+1800..U+187F Mongolian cursive script
...
246 U+1E00..U+1EFF Latin Extended Additional
233 U+1F00..U+1FFF Greek Extended
77 U+2000..U+206F General Punctuation
28 U+2070..U+209F Superscripts and Subscripts
14 U+20A0..U+20CF Currency Symbols
20 U+20D0..U+20FF Combining Marks for Symbols
57 U+2100..U+214F Letterlike Symbols
48 U+2150..U+218F Number Forms
91 U+2190..U+21FF Arrows
242 U+2200..U+22FF Mathematical Operators
123 U+2300..U+23FF Miscellaneous Technical
37 U+2400..U+243F Control Pictures
11 U+2440..U+245F Optical Character Recognition
139 U+2460..U+24FF Enclosed Alphanumerics
128 U+2500..U+257F Box Drawing
22 U+2580..U+259F Block Elements
80 U+25A0..U+25FF Geometric Shapes
106 U+2600..U+26FF Miscellaneous Symbols
73 U+2700..U+27BF Dingbats
...
256 U+2800..U+28FF Braille Pattern Symbols
...
35 U+3000..U+303F CJK Symbols and Punctuation
87 U+3040..U+309F Hiragana
90 U+30A0..U+30FF Katakana
37 U+3100..U+312F Bopomofo
94 U+3130..U+318F Hangul Compatibility Jamo
...
69 U+3200..U+32FF Enclosed CJK Letters and Months
84 U+3300..U+33FF CJK Compatibility
...
18174 U+4E00..U+9FFF CJK Unified Ideographs
...
11172 U+AC00..U+D7A3 Hangul Syllables
...
270 U+F900..U+FAFF CJK Compatibility Ideographs
57 U+FB00..U+FB4F Alphabetic Presentation Forms
12 U+FB50..U+FDFF Arabic Presentation Forms-A
...
4 U+FE20..U+FE2F Combining Half Marks
28 U+FE30..U+FE4F CJK Compatibility Forms
26 U+FE50..U+FE6F Small Form Variants
140 U+FE70..U+FEFF Arabic Presentation Forms-B
171 U+FF00..U+FFEF Halfwidth and Fullwidth Forms
2 U+FFF0..U+FFFF Specials
UTF-8
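first byte of a multibyte UTF-8 sequence = 0xC0 to 0xFD; it indicates the sequence length (RFC 3629 later narrowed valid first bytes to 0xC2 to 0xF4)
subsequent (continuation) bytes = 0x80 to 0xBF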
bytes 0xFE and 0xFF are never used
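The byte ranges above are enough to classify any byte without looking at its neighbours. The following is a minimal sketch using the first-byte range 0xC0 to 0xFD from these notes (so lengths up to 6); a decoder following RFC 3629 would accept only 0xC2 to 0xF4 as first bytes. The function names are hypothetical.

```c
#include <stddef.h>

/* Sequence length implied by a first byte, or 0 if the byte cannot
 * start a sequence (continuation byte, 0xFE or 0xFF). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* ASCII, single byte */
    if (b < 0xC0) return 0;   /* 0x80..0xBF: continuation byte */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF8) return 4;   /* 11110xxx */
    if (b < 0xFC) return 5;   /* 111110xx (not allowed by RFC 3629) */
    if (b < 0xFE) return 6;   /* 1111110x (not allowed by RFC 3629) */
    return 0;                 /* 0xFE and 0xFF are never used */
}

/* Nonzero if b is a continuation byte (10xxxxxx). */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Count characters in a NUL-terminated UTF-8 string by counting every
 * byte that is not a continuation byte. Assumes well-formed input. */
size_t utf8_count_chars(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (!utf8_is_continuation((unsigned char)*s))
            n++;
    return n;
}
```

Because continuation bytes never overlap with first bytes or ASCII, a program can resynchronise at any position in a byte stream by skipping bytes for which utf8_is_continuation is true.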
C: wchar_t and the wide-character functions, e.g. wprintf(L"Schöne Grüße!\n");
environment variable LC_CTYPE selects the character encoding (e.g. en_US.UTF-8 or de_DE.ISO_8859-1)
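#include <stdlib.h>   /* getenv */
#include <string.h>   /* strstr */

char *s;
int utf8_mode = 0;
/* the first variable that is set and non-empty wins: LC_ALL, LC_CTYPE, LANG */
if (((s = getenv("LC_ALL"))   && *s) ||
    ((s = getenv("LC_CTYPE")) && *s) ||
    ((s = getenv("LANG"))     && *s)) {
    if (strstr(s, "UTF-8"))
        utf8_mode = 1;
}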
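The environment-variable check above depends on how the locale name is spelled. On POSIX systems an alternative is to let the C library resolve the locale and then ask it for the codeset name. The sketch below assumes setlocale and nl_langinfo(CODESET) from <locale.h> and <langinfo.h> are available; the helper names are hypothetical, and accepting the spelling "utf8" alongside "UTF-8" is an assumption about what some systems may report.

```c
#include <langinfo.h>  /* nl_langinfo, CODESET (POSIX) */
#include <locale.h>    /* setlocale, LC_CTYPE */
#include <string.h>    /* strcmp, NULL */

/* Returns 1 if the codeset name denotes UTF-8, else 0. */
int codeset_is_utf8(const char *codeset)
{
    return codeset != NULL &&
           (strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0);
}

/* Returns 1 if the locale selected by the environment uses UTF-8. */
int locale_is_utf8(void)
{
    /* "" means: take the locale from LC_ALL, LC_CTYPE or LANG */
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;      /* requested locale is not installed */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

A program would typically call locale_is_utf8() once at startup and store the result in a flag such as utf8_mode.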