
The SemWare Editor

Code Pages

  Index


Intro

Single-byte character encodings

Multi-byte character encodings

Querying and setting a code page

Tested system locales and their code page

Code pages side by side




  Intro


Disclaimer:
This document is limited in its viewpoints by focussing on computer files, Windows, languages of Western European origin, and relevance to TSE. It might have no relevance to peripherals, like printers, and especially has no relevance to keyboards.

 

People use characters to read and write. Computers use numbers. Therefore people assigned numbers to characters so that computers could use "characters".

A character set is a defined group of characters.

A character encoding is a character set for which each character has a defined number.

Historically, countries and companies each created their own character encodings, by the hundreds.
We still live with the remnants of that legacy, and that is a problem.

The following examples are huge simplifications. The rest of this document goes into the details that make the example behaviors happen.

Example:
Suppose an American writes a "√" (square root) character in a file using the Console variant of TSE.
If a non-American opens that file using the Console variant of TSE, then they see the "¹" (superscript one) character instead.
If anyone opens that file using the GUI or Linux variant of TSE, then they see the "û" (small letter u with circumflex) character instead.
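
The example above can be reproduced outside TSE. The following is a minimal Python sketch, assuming that Python's built-in "cp437", "cp850" and "cp1252" codecs match the code pages that the respective TSE variants use. The variable names are illustrative only.

```python
# One stored byte, three code pages: the same byte shows as three characters.
stored = "\u221a".encode("cp437")      # the square root sign, written via code page 437

as_cp437 = stored.decode("cp437")      # what the American Console TSE shows
as_cp850 = stored.decode("cp850")      # what a non-American Console TSE shows
as_cp1252 = stored.decode("cp1252")    # what GUI TSE shows

# Print the byte and the code point each code page assigns to it.
print(stored.hex(), [f"U+{ord(c):04X}" for c in (as_cp437, as_cp850, as_cp1252)])
```

The file itself never changes: it holds one byte with value 251, and only the interpretation of that byte differs.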

Fortunately, over time some character encodings emerged as more commonly used, and nowadays the Unicode character encodings are a world standard.

Unfortunately, TSE does not natively support Unicode.

Example:
If you copy a file or content from the internet, or open a file that was created with a modern editor, then the odds are that the file was created using one of the Unicode encodings. This makes TSE show 2 to 4 inappropriate characters for some of the file's original characters.
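
As an illustration of that effect, here is a small Python sketch that encodes two characters as UTF-8 and then misreads the bytes as code page 1252. It assumes that Python's "cp1252" codec is a fair stand-in for what a single-byte editor would display.

```python
# UTF-8 bytes misread through a single-byte code page.
original = "\u00e9 \u20ac"              # "e with acute", a space, and the euro sign
utf8_bytes = original.encode("utf-8")   # 2 bytes + 1 byte + 3 bytes
misread = utf8_bytes.decode("cp1252")   # one displayed character per byte

print(len(original), "characters became", len(misread))
```

Here 3 original characters are shown as 6: the "é" turns into "Ã©" and the "€" turns into "â‚¬".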

In conclusion:
If TSE shows some weird or unexpected characters in a text, then read the rest of this document.




  Single-byte character encodings


Because computer memory was extremely expensive, the initial trend was to use the smallest possible number per character. Such a number is called a byte. Not coincidentally, a byte is also the smallest addressable chunk of computer memory.

A byte size of 8 bits emerged as the most-used standard. Formally such a byte is called an "octet" after the Latin word for 8, but in English everybody calls it a "byte".

Bits map one-to-one onto the digits of a binary number. An 8-bit binary number can express the decimal numbers 0 through 255. Byte-sized character numbers therefore impose a limit of 256 characters.
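
The arithmetic behind that limit, as a short Python sketch (the names are illustrative):

```python
BITS_PER_BYTE = 8

distinct_values = 2 ** BITS_PER_BYTE       # every bit doubles the number of values
largest_value = distinct_values - 1        # counting starts at 0
as_binary = format(largest_value, "08b")   # the largest value, written as 8 bits

print(distinct_values, largest_value, as_binary)
```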

For languages that started as Western European, this limit could contain all characters necessary for ordinary reading and writing.

The first 128 characters of each such character encoding were standardized under the name "ASCII", and are the same for all such languages.
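
A quick way to see the shared ASCII part is to decode the same bytes with several code pages. This Python sketch assumes that Python's "cp437", "cp850" and "cp1252" codecs are representative, and it checks only the printable byte values 32 through 126.

```python
CODE_PAGES = ("cp437", "cp850", "cp1252")

# The printable ASCII byte values decode identically in all three code pages.
printable_ascii = bytes(range(32, 127))
decoded = {cp: printable_ascii.decode(cp) for cp in CODE_PAGES}
ascii_part_is_shared = len(set(decoded.values())) == 1

# Above 127 the code pages go their own way, for example at byte value 128.
byte_128 = bytes([128])
upper_part_differs = byte_128.decode("cp437") != byte_128.decode("cp1252")

print(ascii_part_is_shared, upper_part_differs)
```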

Initially each country and company created its own single-byte character encoding, in which the last 128 characters differed per country and language. This led to a huge number of character encodings.

Nowadays, in Windows, in practice, that huge number of single-byte character encodings has been reduced to 3 that are mainly used.

Microsoft refers to each of its own character encodings as a code page, and identifies each code page by a code page number.

(There are other companies with their own "code pages" and their own terminology.)

Microsoft's mainly used single-byte code pages are 437, 850 and 1252, which are compared in "Code pages side by side".

Editing a file that uses a single-byte character encoding will always show a character for each byte-value, but it might be a wrong character if the editor assumes a wrong character encoding.

A file's single-byte character encoding is not stored as a property of that file. An editor opening a file with single-byte character-encoding has no reliable way to determine which code page the file's creator used.

Sometimes the user can make a best guess based on context. Other editors then let the user change, on the fly, the code page they use for showing the file.
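
Making such a best guess can be supported by showing the same bytes under each candidate code page and letting a human pick the reading that makes sense. The following Python sketch shows the idea. The function name "candidate_readings" and the sample bytes are hypothetical, and Python's codecs are assumed to match the Windows code pages.

```python
def candidate_readings(raw, code_pages=("cp437", "cp850", "cp1252")):
    """Return {code page: text} for every code page that can decode raw."""
    readings = {}
    for cp in code_pages:
        try:
            readings[cp] = raw.decode(cp)
        except UnicodeDecodeError:
            # Python's cp1252 codec rejects the byte values that the
            # code page leaves unused, so that candidate is skipped.
            continue
    return readings


sample = b"caf\x82"   # hypothetical file content with one non-ASCII byte
for code_page, text in candidate_readings(sample).items():
    print(code_page, ascii(text))
```

For this sample, code pages 437 and 850 both give "café", while code page 1252 gives "caf‚", so context suggests that the file was not written with code page 1252.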

With TSE, if the user's guessed code page implements a single-byte character encoding, then the user can close the editor, type "chcp <guessed code page>" on the command prompt, and (re)start the Console variant of TSE to edit the file.

TSE natively only supports single-byte character encodings.

Using TSE to edit a file that uses a multi-byte character encoding will show multiple wrong characters for each multi-byte character.

TSE and single-byte code pages:

Be aware of this both helpful and misleading Windows command prompt feature:




  Multi-byte character encodings


Nowadays single-byte character encodings still exist for historical reasons, but modern technologies and applications by default use one of Unicode's multi-byte character encodings.

For example:

"UTF-…" are names for Unicode character encodings.
Unicode has 1 character set and 5 main character encodings for that character set.
The character set is updated once a year in September.

A code point is Unicode's administrative number for a character.

Because the code point itself is not a character encoding, a code point is ubiquitously useful for uniquely identifying a character across character encodings, environments, and other contexts.
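
To make the difference between a code point and an encoding concrete, this Python sketch takes one character and shows its single code point next to the different bytes that several encodings use for it. The choice of character and encodings is just an example.

```python
char = "\u221a"                            # the square root sign

# The code point is one number, independent of any encoding.
code_point = f"U+{ord(char):04X}"

# The bytes that store the character depend on the chosen encoding.
encoded_as = {
    name: char.encode(name).hex()
    for name in ("cp437", "utf-8", "utf-16-le", "utf-32-le")
}

print(code_point, encoded_as)
```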

Unicode "UTF-…" character encodings can also be referred to by a Microsoft code page number:

Unicode name   Code page number   Bytes per character (*)
UTF-8          65001              1 to 4
UTF-16LE       1200               2 or 4
UTF-16BE       1201               2 or 4
UTF-32LE       12000              4
UTF-32BE       12001              4
(*) depending on the character.
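
The "bytes per character" column can be checked with a few sample characters. This is a minimal Python sketch; the four sample characters are an arbitrary choice that happens to cover every size.

```python
# "A", "e with acute", the euro sign, and an emoji outside the 16-bit range.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

bytes_per_character = {
    encoding: [len(ch.encode(encoding)) for ch in samples]
    for encoding in ("utf-8", "utf-16-le", "utf-32-le")
}

for encoding, sizes in bytes_per_character.items():
    print(encoding, sizes)
```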

GUI TSE and Linux TSE can use extensions to handle "Unicode files".

Unicode files are files in which the characters are encoded using one of the Unicode encodings.

Unicode files can optionally contain an invisible leading character (a "byte order mark", or "BOM"), that indicates which Unicode encoding was used.

Fortunately, even if no byte order mark was used, it is usually possible to deduce automatically, with very high reliability, that a file is Unicode encoded and which Unicode encoding was used.
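
A simplified sketch of such a deduction in Python follows. It is not how any TSE extension does it. It is a heuristic under stated assumptions: it first looks for a byte order mark, and otherwise only recognizes UTF-8 by checking that the bytes contain non-ASCII values and still form valid UTF-8. The function name "guess_unicode_encoding" is hypothetical.

```python
import codecs

# Check the 4-byte marks first: the UTF-32LE mark begins with the UTF-16LE mark.
BYTE_ORDER_MARKS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def guess_unicode_encoding(raw):
    """Return a Unicode encoding name, or None if raw does not look like Unicode."""
    for mark, name in BYTE_ORDER_MARKS:
        if raw.startswith(mark):
            return name
    # No byte order mark. Pure ASCII proves nothing, so require non-ASCII bytes.
    if not any(byte >= 0x80 for byte in raw):
        return None
    try:
        raw.decode("utf-8")        # strict decoding fails on invalid UTF-8
    except UnicodeDecodeError:
        return None
    return "UTF-8"


print(guess_unicode_encoding("caf\u00e9".encode("utf-8")))    # valid UTF-8, no mark
print(guess_unicode_encoding("caf\u00e9".encode("cp1252")))   # single-byte file
```

The UTF-8 check works well because text from a single-byte code page rarely forms valid UTF-8 byte sequences by accident. Detecting UTF-16 or UTF-32 without a byte order mark needs other heuristics that this sketch leaves out.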




  Querying and setting a Windows code page


In Windows you can set a code page at 2 levels:

Local code page:

A Windows command prompt by default uses the global code page, but can set a local code page.

You can query a command prompt's code page with the command "chcp" without parameters:
      chcp

You change a command prompt's code page with the command "chcp <code page number>":
      chcp 1252
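
If a script needs the command prompt's code page, it can run "chcp" and pick the number out of the text that the command prints. The following Python sketch makes some assumptions: it only works on Windows, it assumes that the printed text contains the code page number as its only number (the wording itself depends on the Windows display language), and both function names are hypothetical.

```python
import re
import subprocess
import sys


def parse_chcp_output(text):
    """Extract the code page number from the text that chcp prints."""
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"no code page number found in {text!r}")
    return int(match.group())


def current_console_code_page():
    """Run chcp in a Windows command prompt and return its code page number."""
    if sys.platform != "win32":
        raise OSError("chcp is only available on Windows")
    result = subprocess.run("chcp", shell=True, capture_output=True,
                            text=True, check=True)
    return parse_chcp_output(result.stdout)


# On an English Windows, chcp prints a line like this one:
print(parse_chcp_output("Active code page: 437"))
```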

Note that a large number of prompt commands and programs ignore the local code page and always use the global one.

Console TSE will use the command prompt's code page.

GUI TSE will normally use code page 1252, whatever the global or local code page is.

You cannot change the code page that TSE uses from within TSE.

Exception to the previous two points:
GUI TSE will use the global code page if GUI TSE's font is "Terminal".

When you start a .bat or .cmd file or a command from TSE, then you have the option to begin the command (file) with a "chcp <code page number>" command to set the desired character encoding for its output.

Again, be aware that lots of commands and programs ignore the local code page and use the global code page instead.

Global code page:

You cannot change Windows' global code page directly.

You have to change the Windows "system locale" setting, which has a "Language (Country)" format, and then Windows secretly sets its global code page based on that.

In Windows 10 you can change the system locale with Settings → Time & Language → Language → Administrative language settings → Change system locale.

Warning:
If you check-mark the "Beta: Use Unicode UTF-8 for worldwide language support" setting, then this overrules the system locale's code page, and sets the global code page to 65001 (Unicode's UTF-8 encoding).
This is an interesting option, but be aware of its impact on all command prompts and spawned command files.

Because many system locales share the same code page, often nothing happens when you change the system locale. When the new system locale also changes the code page, you will be asked to restart Windows to activate the change.

In the GUI and Console variants of TSE you can query the global code page with the TSE macro statement "Query(CodePage)".
Do not set this internal TSE variable to a different value: doing so does not affect which code page TSE uses, and it deprives macros of a convenient way to identify the global code page.




  Tested system locales and their code page


System Locale              Code Page
Danish (Denmark)           850
Dutch (Belgium)            850
Dutch (Netherlands)        850
English (Australia)        850
English (Canada)           850
English (New Zealand)      850
English (United Kingdom)   850
English (United States)    437
French (Belgium)           850
French (Canada)            850
French (France)            850
German (Austria)           850
German (Germany)           850
German (Liechtenstein)     850
German (Luxembourg)        850
German (Switzerland)       850
Norwegian (Bokmål)         850
Norwegian (Nynorsk)        850
Portuguese (Brazil)        850
Portuguese (Portugal)      850
Spanish (Chile)            850
Spanish (Latin America)    850
Spanish (Mexican)          850
Spanish (Spain)            850
Spanish (United States)    850



  Code pages side by side


Click a code page's header to view its additional info on Wikipedia.

 

Byte number Code Page 437     Code Page 850     Code Page 1252
Dec   Hex   Char  Code Point  Char  Code Point  Char  Code Point
0     0     NUL   0000        NUL   0000        NUL   0000
1     1     ☺     263A        ☺     263A        SOH   0001
2     2     ☻     263B        ☻     263B        STX   0002
3     3     ♥     2665        ♥     2665        ETX   0003
4     4     ♦     2666        ♦     2666        EOT   0004
5     5     ♣     2663        ♣     2663        ENQ   0005
6     6     ♠     2660        ♠     2660        ACK   0006
7     7     •     2022        •     2022        BEL   0007
8     8     ◘     25D8        ◘     25D8        BS    0008
9     9     ○     25CB        ○     25CB        HT    0009
10    A     ◙     25D9        ◙     25D9        LF    000A
11    B     ♂     2642        ♂     2642        VT    000B
12    C     ♀     2640        ♀     2640        FF    000C
13    D     ♪     266A        ♪     266A        CR    000D
14    E     ♫     266B        ♫     266B        SO    000E
15    F     ☼     263C        ☼     263C        SI    000F
16    10    ►     25BA        ►     25BA        DLE   0010
17    11    ◄     25C4        ◄     25C4        DC1   0011
18    12    ↕     2195        ↕     2195        DC2   0012
19    13    ‼     203C        ‼     203C        DC3   0013
20    14    ¶     00B6        ¶     00B6        DC4   0014
21    15    §     00A7        §     00A7        NAK   0015
22    16    ▬     25AC        ▬     25AC        SYN   0016
23    17    ↨     21A8        ↨     21A8        ETB   0017
24    18    ↑     2191        ↑     2191        CAN   0018
25    19    ↓     2193        ↓     2193        EM    0019
26    1A    →     2192        →     2192        SUB   001A
27    1B    ←     2190        ←     2190        ESC   001B
28    1C    ∟     221F        ∟     221F        FS    001C
29    1D    ↔     2194        ↔     2194        GS    001D
30    1E    ▲     25B2        ▲     25B2        RS    001E
31    1F    ▼     25BC        ▼     25BC        US    001F
32    20    SP    0020        SP    0020        SP    0020
33    21    !     0021        !     0021        !     0021
34    22    "     0022        "     0022        "     0022
35    23    #     0023        #     0023        #     0023
36    24    $     0024        $     0024        $     0024
37    25    %     0025        %     0025        %     0025
38    26    &     0026        &     0026        &     0026
39    27    '     0027        '     0027        '     0027
40    28    (     0028        (     0028        (     0028
41    29    )     0029        )     0029        )     0029
42    2A    *     002A        *     002A        *     002A
43    2B    +     002B        +     002B        +     002B
44    2C    ,     002C        ,     002C        ,     002C
45    2D    -     002D        -     002D        -     002D
46    2E    .     002E        .     002E        .     002E
47    2F    /     002F        /     002F        /     002F
48    30    0     0030        0     0030        0     0030
49    31    1     0031        1     0031        1     0031
50    32    2     0032        2     0032        2     0032
51    33    3     0033        3     0033        3     0033
52    34    4     0034        4     0034        4     0034
53    35    5     0035        5     0035        5     0035
54    36    6     0036        6     0036        6     0036
55    37    7     0037        7     0037        7     0037
56    38    8     0038        8     0038        8     0038
57    39    9     0039        9     0039        9     0039
58    3A    :     003A        :     003A        :     003A
59    3B    ;     003B        ;     003B        ;     003B
60    3C    <     003C        <     003C        <     003C
61    3D    =     003D        =     003D        =     003D
62    3E    >     003E        >     003E        >     003E
63    3F    ?     003F        ?     003F        ?     003F
64    40    @     0040        @     0040        @     0040
65    41    A     0041        A     0041        A     0041
66    42    B     0042        B     0042        B     0042
67    43    C     0043        C     0043        C     0043
68    44    D     0044        D     0044        D     0044
69    45    E     0045        E     0045        E     0045
70    46    F     0046        F     0046        F     0046
71    47    G     0047        G     0047        G     0047
72    48    H     0048        H     0048        H     0048
73    49    I     0049        I     0049        I     0049
74    4A    J     004A        J     004A        J     004A
75    4B    K     004B        K     004B        K     004B
76    4C    L     004C        L     004C        L     004C
77    4D    M     004D        M     004D        M     004D
78    4E    N     004E        N     004E        N     004E
79    4F    O     004F        O     004F        O     004F
80    50    P     0050        P     0050        P     0050
81    51    Q     0051        Q     0051        Q     0051
82    52    R     0052        R     0052        R     0052
83    53    S     0053        S     0053        S     0053
84    54    T     0054        T     0054        T     0054
85    55    U     0055        U     0055        U     0055
86    56    V     0056        V     0056        V     0056
87    57    W     0057        W     0057        W     0057
88    58    X     0058        X     0058        X     0058
89    59    Y     0059        Y     0059        Y     0059
90    5A    Z     005A        Z     005A        Z     005A
91    5B    [     005B        [     005B        [     005B
92    5C    \     005C        \     005C        \     005C
93    5D    ]     005D        ]     005D        ]     005D
94    5E    ^     005E        ^     005E        ^     005E
95    5F    _     005F        _     005F        _     005F
96    60    `     0060        `     0060        `     0060
97    61    a     0061        a     0061        a     0061
98    62    b     0062        b     0062        b     0062
99    63    c     0063        c     0063        c     0063
100   64    d     0064        d     0064        d     0064
101   65    e     0065        e     0065        e     0065
102   66    f     0066        f     0066        f     0066
103   67    g     0067        g     0067        g     0067
104   68    h     0068        h     0068        h     0068
105   69    i     0069        i     0069        i     0069
106   6A    j     006A        j     006A        j     006A
107   6B    k     006B        k     006B        k     006B
108   6C    l     006C        l     006C        l     006C
109   6D    m     006D        m     006D        m     006D
110   6E    n     006E        n     006E        n     006E
111   6F    o     006F        o     006F        o     006F
112   70    p     0070        p     0070        p     0070
113   71    q     0071        q     0071        q     0071
114   72    r     0072        r     0072        r     0072
115   73    s     0073        s     0073        s     0073
116   74    t     0074        t     0074        t     0074
117   75    u     0075        u     0075        u     0075
118   76    v     0076        v     0076        v     0076
119   77    w     0077        w     0077        w     0077
120   78    x     0078        x     0078        x     0078
121   79    y     0079        y     0079        y     0079
122   7A    z     007A        z     007A        z     007A
123   7B    {     007B        {     007B        {     007B
124   7C    |     007C        |     007C        |     007C
125   7D    }     007D        }     007D        }     007D
126   7E    ~     007E        ~     007E        ~     007E
127   7F    ⌂     2302        ⌂     2302        DEL   007F
128   80    Ç     00C7        Ç     00C7        €     20AC
129   81    ü     00FC        ü     00FC        UNUSED
130   82    é     00E9        é     00E9        ‚     201A
131   83    â     00E2        â     00E2        ƒ     0192
132   84    ä     00E4        ä     00E4        „     201E
133   85    à     00E0        à     00E0        …     2026
134   86    å     00E5        å     00E5        †     2020
135   87    ç     00E7        ç     00E7        ‡     2021
136   88    ê     00EA        ê     00EA        ˆ     02C6
137   89    ë     00EB        ë     00EB        ‰     2030
138   8A    è     00E8        è     00E8        Š     0160
139   8B    ï     00EF        ï     00EF        ‹     2039
140   8C    î     00EE        î     00EE        Œ     0152
141   8D    ì     00EC        ì     00EC        UNUSED
142   8E    Ä     00C4        Ä     00C4        Ž     017D
143   8F    Å     00C5        Å     00C5        UNUSED
144   90    É     00C9        É     00C9        UNUSED
145   91    æ     00E6        æ     00E6        ‘     2018
146   92    Æ     00C6        Æ     00C6        ’     2019
147   93    ô     00F4        ô     00F4        “     201C
148   94    ö     00F6        ö     00F6        ”     201D
149   95    ò     00F2        ò     00F2        •     2022
150   96    û     00FB        û     00FB        –     2013
151   97    ù     00F9        ù     00F9        —     2014
152   98    ÿ     00FF        ÿ     00FF        ˜     02DC
153   99    Ö     00D6        Ö     00D6        ™     2122
154   9A    Ü     00DC        Ü     00DC        š     0161
155   9B    ¢     00A2        ø     00F8        ›     203A
156   9C    £     00A3        £     00A3        œ     0153
157   9D    ¥     00A5        Ø     00D8        UNUSED
158   9E    ₧     20A7        ×     00D7        ž     017E
159   9F    ƒ     0192        ƒ     0192        Ÿ     0178
160   A0    á     00E1        á     00E1        NBSP  00A0
161   A1    í     00ED        í     00ED        ¡     00A1
162   A2    ó     00F3        ó     00F3        ¢     00A2
163   A3    ú     00FA        ú     00FA        £     00A3
164   A4    ñ     00F1        ñ     00F1        ¤     00A4
165   A5    Ñ     00D1        Ñ     00D1        ¥     00A5
166   A6    ª     00AA        ª     00AA        ¦     00A6
167   A7    º     00BA        º     00BA        §     00A7
168   A8    ¿     00BF        ¿     00BF        ¨     00A8
169   A9    ⌐     2310        ®     00AE        ©     00A9
170   AA    ¬     00AC        ¬     00AC        ª     00AA
171   AB    ½     00BD        ½     00BD        «     00AB
172   AC    ¼     00BC        ¼     00BC        ¬     00AC
173   AD    ¡     00A1        ¡     00A1        SHY   00AD
174   AE    «     00AB        «     00AB        ®     00AE
175   AF    »     00BB        »     00BB        ¯     00AF
176   B0    ░     2591        ░     2591        °     00B0
177   B1    ▒     2592        ▒     2592        ±     00B1
178   B2    ▓     2593        ▓     2593        ²     00B2
179   B3    │     2502        │     2502        ³     00B3
180   B4    ┤     2524        ┤     2524        ´     00B4
181   B5    ╡     2561        Á     00C1        µ     00B5
182   B6    ╢     2562        Â     00C2        ¶     00B6
183   B7    ╖     2556        À     00C0        ·     00B7
184   B8    ╕     2555        ©     00A9        ¸     00B8
185   B9    ╣     2563        ╣     2563        ¹     00B9
186   BA    ║     2551        ║     2551        º     00BA
187   BB    ╗     2557        ╗     2557        »     00BB
188   BC    ╝     255D        ╝     255D        ¼     00BC
189   BD    ╜     255C        ¢     00A2        ½     00BD
190   BE    ╛     255B        ¥     00A5        ¾     00BE
191   BF    ┐     2510        ┐     2510        ¿     00BF
192   C0    └     2514        └     2514        À     00C0
193   C1    ┴     2534        ┴     2534        Á     00C1
194   C2    ┬     252C        ┬     252C        Â     00C2
195   C3    ├     251C        ├     251C        Ã     00C3
196   C4    ─     2500        ─     2500        Ä     00C4
197   C5    ┼     253C        ┼     253C        Å     00C5
198   C6    ╞     255E        ã     00E3        Æ     00C6
199   C7    ╟     255F        Ã     00C3        Ç     00C7
200   C8    ╚     255A        ╚     255A        È     00C8
201   C9    ╔     2554        ╔     2554        É     00C9
202   CA    ╩     2569        ╩     2569        Ê     00CA
203   CB    ╦     2566        ╦     2566        Ë     00CB
204   CC    ╠     2560        ╠     2560        Ì     00CC
205   CD    ═     2550        ═     2550        Í     00CD
206   CE    ╬     256C        ╬     256C        Î     00CE
207   CF    ╧     2567        ¤     00A4        Ï     00CF
208   D0    ╨     2568        ð     00F0        Ð     00D0
209   D1    ╤     2564        Ð     00D0        Ñ     00D1
210   D2    ╥     2565        Ê     00CA        Ò     00D2
211   D3    ╙     2559        Ë     00CB        Ó     00D3
212   D4    ╘     2558        È     00C8        Ô     00D4
213   D5    ╒     2552        ı     0131        Õ     00D5
214   D6    ╓     2553        Í     00CD        Ö     00D6
215   D7    ╫     256B        Î     00CE        ×     00D7
216   D8    ╪     256A        Ï     00CF        Ø     00D8
217   D9    ┘     2518        ┘     2518        Ù     00D9
218   DA    ┌     250C        ┌     250C        Ú     00DA
219   DB    █     2588        █     2588        Û     00DB
220   DC    ▄     2584        ▄     2584        Ü     00DC
221   DD    ▌     258C        ¦     00A6        Ý     00DD
222   DE    ▐     2590        Ì     00CC        Þ     00DE
223   DF    ▀     2580        ▀     2580        ß     00DF
224   E0    α     03B1        Ó     00D3        à     00E0
225   E1    ß     00DF        ß     00DF        á     00E1
226   E2    Γ     0393        Ô     00D4        â     00E2
227   E3    π     03C0        Ò     00D2        ã     00E3
228   E4    Σ     03A3        õ     00F5        ä     00E4
229   E5    σ     03C3        Õ     00D5        å     00E5
230   E6    µ     00B5        µ     00B5        æ     00E6
231   E7    τ     03C4        þ     00FE        ç     00E7
232   E8    Φ     03A6        Þ     00DE        è     00E8
233   E9    Θ     0398        Ú     00DA        é     00E9
234   EA    Ω     03A9        Û     00DB        ê     00EA
235   EB    δ     03B4        Ù     00D9        ë     00EB
236   EC    ∞     221E        ý     00FD        ì     00EC
237   ED    φ     03C6        Ý     00DD        í     00ED
238   EE    ε     03B5        ¯     00AF        î     00EE
239   EF    ∩     2229        ´     00B4        ï     00EF
240   F0    ≡     2261        SHY   00AD        ð     00F0
241   F1    ±     00B1        ±     00B1        ñ     00F1
242   F2    ≥     2265        ‗     2017        ò     00F2
243   F3    ≤     2264        ¾     00BE        ó     00F3
244   F4    ⌠     2320        ¶     00B6        ô     00F4
245   F5    ⌡     2321        §     00A7        õ     00F5
246   F6    ÷     00F7        ÷     00F7        ö     00F6
247   F7    ≈     2248        ¸     00B8        ÷     00F7
248   F8    °     00B0        °     00B0        ø     00F8
249   F9    ∙     2219        ¨     00A8        ù     00F9
250   FA    ·     00B7        ·     00B7        ú     00FA
251   FB    √     221A        ¹     00B9        û     00FB
252   FC    ⁿ     207F        ³     00B3        ü     00FC
253   FD    ²     00B2        ²     00B2        ý     00FD
254   FE    ■     25A0        ■     25A0        þ     00FE
255   FF    NBSP  00A0        NBSP  00A0        ÿ     00FF
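
Rows like those in the table above can be regenerated for byte values 128 through 255 with a short Python sketch. It assumes that Python's "cp437", "cp850" and "cp1252" codecs match the Windows code pages. It skips byte values 0 through 127, because for the control byte values Python's codecs return control codes rather than the glyphs that code pages 437 and 850 display. The function name "table_row" is hypothetical.

```python
CODE_PAGES = ("cp437", "cp850", "cp1252")


def table_row(byte_value):
    """Return (decimal, hexadecimal, one 'character code-point' cell per code page)."""
    raw = bytes([byte_value])
    cells = []
    for cp in CODE_PAGES:
        try:
            character = raw.decode(cp)
            cells.append(f"{character} {ord(character):04X}")
        except UnicodeDecodeError:
            cells.append("UNUSED")     # byte value not assigned in this code page
    return (byte_value, f"{byte_value:X}", *cells)


rows = [table_row(value) for value in range(128, 256)]

# Show only the code points of one row, to keep the output ASCII-safe.
print(table_row(251)[:2], [cell.split()[-1] for cell in table_row(251)[2:]])
```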

 

NBSP is the "no break space" character.
SHY is the "soft hyphen" character.
"₧" is one character, signifying the "pesetas" currency.


These webpages are created and maintained with The SemWare Editor Professional