Code Pages

Index

Tested system locales and their global code page

Intro

Disclaimer:
This document is limited in its viewpoints by focussing on computer files, Windows, languages of Western European origin, and relevance to TSE. It might have no relevance to peripherals, like printers, and especially has no relevance to keyboards.

People use characters to read and write. Computers use numbers. Therefore people assigned numbers to characters so that computers could use "characters".

A character set is a defined group of characters.

A character encoding is a character set for which each character has a defined number.

Historically countries and companies started out creating their own character encodings by the hundreds.
We still live with the remnants of that legacy, and that is a problem.

The following examples are huge simplifications. The rest of this document goes into the details that make the example behaviors happen.

Example:
Suppose an American writes a "√" (square root) character in a file using the Console variant of TSE.
If a non-American opens that file using the Console variant of TSE, then they see the "¹" (superscript one) character instead.
If anyone opens that file using the GUI or Linux variant of TSE, then they see the "û" (small letter u with circumflex) character instead.
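
The example can be reproduced outside TSE. The following is a minimal sketch, assuming Python 3 and its bundled "cp437", "cp850" and "cp1252" codecs as stand-ins for the three code pages. It is only an illustration of the principle, not of how TSE itself works.

```python
# Byte 251 (hexadecimal FB) is the byte that code page 437 uses for "√".
raw = bytes([0xFB])

# Interpret that one byte under each of the three code pages.
decoded = {cp: raw.decode(cp) for cp in ("cp437", "cp850", "cp1252")}

# decoded == {"cp437": "√", "cp850": "¹", "cp1252": "û"}
```

The byte in the file never changes. Only the code page that is used to interpret it differs.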

Fortunately, over time some character encodings emerged as more commonly used, and nowadays the Unicode character encodings are a world standard.

Unfortunately, TSE does not natively support Unicode.

Example:
If you copy a file or content from the internet, or open a file that was created with a modern editor, then the odds are that the file was created using one of the Unicode encodings. This makes TSE show 2 to 4 inappropriate characters for some of the file's original characters.
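
A minimal sketch of this effect, assuming Python 3 and assuming code page 1252 as the single-byte interpretation. The function name is hypothetical and only serves the illustration.

```python
def view_utf8_as_cp1252(text):
    """Encode text as UTF-8, then misread those bytes as code page 1252."""
    # errors="replace" substitutes U+FFFD for the few byte values that
    # code page 1252 leaves undefined, so the function never raises.
    return text.encode("utf-8").decode("cp1252", errors="replace")

two = view_utf8_as_cp1252("é")     # 2 UTF-8 bytes -> "Ã©"
three = view_utf8_as_cp1252("€")   # 3 UTF-8 bytes -> "â‚¬"
four = view_utf8_as_cp1252("😀")   # 4 UTF-8 bytes -> 4 characters
```

Plain ASCII characters pass through unchanged, which is why only some characters of such a file look wrong.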

In conclusion:
If TSE shows some weird or unexpected characters in a text, then read the rest of this document.

Single-byte character encodings

Because computer memory was extremely expensive, the initial trend was to use the smallest possible number per character. Such a number is called a byte. Not coincidentally, a byte is also the smallest addressable chunk of computer memory.

A byte size of 8 bits emerged as the most-used standard. Formally such a byte is called an "octet" after the Latin word for 8, but in English everybody calls it a "byte".

Bits map one-to-one onto the digits of a binary number. An 8-bit binary number can express the decimal numbers 0 through 255. Byte-sized character numbers therefore enforce a limit of 256 characters.

For languages that started as Western European, this limit could contain all characters necessary for ordinary reading and writing.

The first 128 characters of each such character encoding were standardized under the name "ASCII", and are the same for all such languages.
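
The shared ASCII part can be checked with a short sketch, assuming Python 3 and its bundled codecs. It only compares the displayable ASCII range 32 through 126, because code pages differ in how they present the control values.

```python
# The displayable ASCII byte values 32 through 126.
printable = bytes(range(32, 127))

# Decode the same bytes under plain ASCII and the three code pages.
decodings = {
    cp: printable.decode(cp)
    for cp in ("ascii", "cp437", "cp850", "cp1252")
}

# True when every code page produced exactly the same text.
all_same = len(set(decodings.values())) == 1
```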

Initially each country and company created its own single-byte character encoding, in which the last 128 characters differed per country and language. This led to a huge number of character encodings.

Nowadays, in practice, Windows mainly uses just 3 single-byte character encodings out of that huge number.

Microsoft refers to each of its own character encodings as a code page, and identifies each code page by a code page number.

(There are other companies with their own "code pages" and their own terminology.)

Microsoft's mainly used single-byte code pages are:

437 English (United States)
Typically used for the Windows command prompt, .bat and .cmd scripts, and the Console version of TSE.
Without the need for accented characters, this character encoding has lots of (line) drawing-characters.
850 Other languages with a Western European origin, including English outside the United States.
Typically used for the Windows command prompt, .bat and .cmd scripts, and the Console version of TSE.
This character encoding gave up a part of the (line) drawing-characters to be able to provide accented characters for other languages.
1252 A Windows standard for languages with a Western European origin.
Typically used for Windows GUI applications that still use a single-byte character encoding.
This might be the single-byte character encoding that supports the most languages.
Its character encoding gave up all (line) drawing-characters to support even more languages and to add some more commonly used not-alphabetic characters.
According to Wikipedia, Windows-1252 is the most-used single-byte character encoding in the world.

Editing a file that uses a single-byte character encoding will always show a character for each byte-value, but it might be a wrong character if the editor assumes a wrong character encoding.
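
A small sketch, assuming Python 3 and its bundled codecs, of why a wrong guess goes unnoticed: almost every byte value is valid in every single-byte code page, so no error signals the mistake. The helper name is hypothetical. Note that Python's "cp1252" codec is strict about the five byte values that code page 1252 leaves unused, whereas an editor typically still displays something for them.

```python
def undefined_bytes(code_page):
    """Return the byte values that the given codec refuses to decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(code_page)
        except UnicodeDecodeError:
            missing.append(value)
    return missing

gaps_437 = undefined_bytes("cp437")    # []
gaps_850 = undefined_bytes("cp850")    # []
gaps_1252 = undefined_bytes("cp1252")  # [129, 141, 143, 144, 157]
```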

A file's single-byte character encoding is not stored as a property of that file. An editor opening a file with single-byte character-encoding has no reliable way to determine which code page the file's creator used.

Sometimes the user can make a best guess based on context. Some other editors then let the user change, on the fly, the code page they use for showing the file.

With TSE, if the user's guessed code page implements a single-byte character encoding, then the user can close the editor, type "chcp <guessed code page>" on the command prompt, and (re)start the Console variant of TSE to edit the file.

TSE natively only supports single-byte character encodings.

Using TSE to edit a file that uses a multi-byte character encoding will show multiple wrong characters for each multi-byte character.

TSE and single-byte code pages:

The Console variant of TSE uses the local code page of its command prompt.
If the local code page is a multi-byte one, then TSE only shows displayable ASCII characters (32 - 126) and a replacement character for each non-ASCII byte.
The Linux variant of TSE uses code page 1252.
The GUI variant of TSE uses code page 1252 for its default "Courier New" font and most other fonts.
The GUI variant of TSE with font "Terminal" will use Windows' global code page if that is a single-byte code page.
If the global code page is a multi-byte one, then TSE shows displayable ASCII characters (32 - 126) and a replacement character for each non-ASCII byte.

Be aware of this both helpful and misleading Windows command prompt feature:

If its code page limits a command prompt to a single-byte character set, a Windows prompt command like "dir" will still show names with multi-byte characters correctly when its output goes directly to the screen.
However, if that same output is redirected, then multi-byte characters in a name will be converted to a "?" (question mark).
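
The loss of information can be imitated with a sketch, assuming Python 3 and assuming code page 850 as the command prompt's code page. This mimics the visible effect only. It is not a claim about how Windows performs the conversion internally, and the function name is hypothetical.

```python
def to_single_byte(name, code_page="cp850"):
    """Squeeze a name into a single-byte code page and read it back."""
    # errors="replace" turns every character that the code page lacks
    # into a question mark, similar to what redirected output shows.
    return name.encode(code_page, errors="replace").decode(code_page)

converted = to_single_byte("café_€.txt")
# converted == "café_?.txt": "é" exists in code page 850, "€" does not.
```

Once a character has become a "?", the original character cannot be recovered from the redirected output.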

Multi-byte character encodings

Single-byte character encodings still exist for historical reasons, but modern technologies and applications use one of Unicode's multi-byte character encodings by default.

For example:

Windows nowadays uses UTF-16LE for its APIs.
The Windows "system locale" setting has a beta checkbox for Unicode, which makes "cmd" command prompts use UTF-8.
Microsoft Office documents internally use UTF-8.
The web mostly uses UTF-8.

"UTF-…" are names for Unicode character encodings.
Unicode has 1 character set and 5 main character encodings for that character set.
The character set is typically updated about once a year, usually in September.

A code point is Unicode's administrative number for a character.

Because the code point itself is not a character encoding, a code point is ubiquitously useful for uniquely identifying a character across character encodings, environments, and other contexts.
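
A minimal sketch, assuming Python 3, that shows one code point with several different byte representations. The helper name is hypothetical.

```python
def code_point(char):
    """Format a character's Unicode code point as U+XXXX."""
    return f"U+{ord(char):04X}"

root = "√"
label = code_point(root)               # "U+221A"
as_cp437 = root.encode("cp437")        # b"\xfb"
as_utf8 = root.encode("utf-8")         # b"\xe2\x88\x9a"
as_utf16le = root.encode("utf-16-le")  # b"\x1a\x22"
```

The code point stays the same in every case, while the stored bytes depend on the character encoding.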

Unicode "UTF-…" character encodings can also be referred to by a Microsoft code page number:

Unicode name	Code page number	Bytes per character (*)
UTF-8	65001	1 to 4
UTF-16LE	1200	2 or 4
UTF-16BE	1201	2 or 4
UTF-32LE	12000	4
UTF-32BE	12001	4

(*) depending on the character.
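
The "bytes per character" column can be checked with a short sketch, assuming Python 3. The sample characters are arbitrary choices for the illustration.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def byte_lengths(char):
    """Return how many bytes one character needs in each encoding."""
    return {enc: len(char.encode(enc)) for enc in ENCODINGS}

lengths = {char: byte_lengths(char) for char in ("A", "é", "€", "😀")}
# "A":  utf-8 1, utf-16-le 2, utf-32-le 4
# "é":  utf-8 2, utf-16-le 2, utf-32-le 4
# "€":  utf-8 3, utf-16-le 2, utf-32-le 4
# "😀": utf-8 4, utf-16-le 4, utf-32-le 4
```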

GUI TSE and Linux TSE can use extensions to handle "Unicode files".

Unicode files are files in which the characters are encoded using one of the Unicode encodings.

Unicode files can optionally contain an invisible leading character, namely a "byte order mark", aka "BOM", that indicates which Unicode encoding is used.

Fortunately, even if no byte order mark is used, it is usually easy to automatically deduce if a file is Unicode encoded and which Unicode encoding was used with a very high reliability.
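
One possible way to make such a deduction is sketched below, assuming Python 3. It is a deliberately simple illustration and not the method that any TSE extension uses: it checks for a byte order mark first, and otherwise only tests whether the bytes are valid UTF-8. It does not try to recognize UTF-16 or UTF-32 files that lack a byte order mark. The function name is hypothetical.

```python
import codecs

# Longer marks first: the UTF-32LE mark starts with the UTF-16LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def guess_unicode_encoding(data):
    """Return a Unicode encoding name for the bytes, or None if unsure."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not valid UTF-8, so probably a single-byte file
    # Pure ASCII is valid in every encoding discussed here, so only
    # report UTF-8 when at least one non-ASCII byte is present.
    return "utf-8" if any(byte > 127 for byte in data) else None
```

Text in a single-byte code page rarely happens to form valid UTF-8 sequences, which is why the fallback check is usually reliable for files that contain non-ASCII characters.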

Querying and setting a Windows code page

In Windows you can set a code page at 2 levels:

A global code page for Windows, which makes it the default for all future command prompts and for the large number of programs that ignore a local code page.
A local code page for the current command prompt.

Local code page:

A Windows command prompt uses the global code page by default, but can set a local code page.

You can query a command prompt's code page with the command "chcp" without parameters:
chcp

You change a command prompt's code page with the command "chcp <code page number>":
chcp 1252
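
A script can query the code page the same way. The following is a sketch, assuming Python 3 on Windows and assuming that "chcp" prints a line such as "Active code page: 850". Because the wording of that line depends on the Windows display language, the sketch simply takes the first number it finds. Both function names are hypothetical.

```python
import re
import subprocess

def parse_chcp_output(text):
    """Extract the code page number from the output of "chcp"."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

def query_console_code_page():
    """Run "chcp" without parameters (Windows only) and parse the result."""
    result = subprocess.run("chcp", shell=True, capture_output=True, text=True)
    return parse_chcp_output(result.stdout)

example = parse_chcp_output("Active code page: 850")  # 850
```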

Note that there is a huge number of prompt commands and programs that ignore the local code page and always use the global one.

Console TSE will use the command prompt's code page.

GUI TSE will normally use code page 1252, whatever the global or local code page is.

You cannot change the code page that TSE uses from within TSE.

Exception to the previous two points:
GUI TSE will use the global code page if GUI TSE's font is "Terminal".

Exception to the exception:
GUI TSE will use code page 437 if GUI TSE's font is "Terminal" with font size 5.

When you start a .bat or .cmd file or a command from TSE, then you have the option to begin the command (file) with a "chcp <code page number>" command to set the desired character encoding for its output.

Again, be aware that lots of commands and programs ignore the local code page and use the global code page instead.

Global code page:

You cannot change Windows' global code page directly.

You have to change the Windows "system locale" setting, which has a "Language (Country)" format, and then Windows secretly sets its global code page based on that.

In Windows 10 you can change the system locale with Settings → Time & Language → Language → Administrative language settings → Change system locale.

Warning:
If you check the "Beta: Use Unicode UTF-8 for worldwide language support" setting, then this overrules the system locale's code page, and sets the global code page to 65001 (Unicode's UTF-8 encoding).
This is an interesting option, but be aware of its impact on all command prompts and spawned command files.

Because many system locales share the same code page, changing the system locale often changes nothing else. When the new system locale does change the code page, you will be asked to restart Windows to activate the change.

In the GUI and Console variants of TSE you can query the global code page with the TSE macro statement "Query(CodePage)".
Do not set this internal TSE variable to a different value: doing so does not affect which code page TSE uses, and it deprives macros of a convenient way to identify the global code page.

Tested system locales and their global code page

System Locale	Code Page
Danish (Denmark)	850
Dutch (Belgium)	850
Dutch (Netherlands)	850
English (Australia)	850
English (Canada)	850
English (New Zealand)	850
English (United Kingdom)	850
English (United States)	437
French (Belgium)	850
French (Canada)	850
French (France)	850
German (Austria)	850
German (Germany)	850
German (Liechtenstein)	850
German (Luxembourg)	850
German (Switzerland)	850
Norwegian (Bokmål)	850
Norwegian (Nynorsk)	850
Portuguese (Brazil)	850
Portuguese (Portugal)	850
Spanish (Chile)	850
Spanish (Latin America)	850
Spanish (Mexico)	850
Spanish (Spain)	850
Spanish (United States)	850

Code pages side by side


Byte (Decimal)	Byte (Hexadecimal)	CP 437 Character	CP 437 Code Point	CP 850 Character	CP 850 Code Point	CP 1252 Character	CP 1252 Code Point
0	0	NUL	0000	NUL	0000	NUL	0000
1	1	☺︎	263A	☺︎	263A	SOH	0001
2	2	☻	263B	☻	263B	STX	0002
3	3	♥︎	2665	♥︎	2665	ETX	0003
4	4	♦︎	2666	♦︎	2666	EOT	0004
5	5	♣︎	2663	♣︎	2663	ENQ	0005
6	6	♠︎	2660	♠︎	2660	ACK	0006
7	7	•	2022	•	2022	BEL	0007
8	8	◘	25D8	◘	25D8	BS	0008
9	9	○	25CB	○	25CB	HT	0009
10	A	◙	25D9	◙	25D9	LF	000A
11	B	♂︎	2642	♂︎	2642	VT	000B
12	C	♀︎	2640	♀︎	2640	FF	000C
13	D	♪	266A	♪	266A	CR	000D
14	E	♫	266B	♫	266B	SO	000E
15	F	☼	263C	☼	263C	SI	000F
16	10	►	25BA	►	25BA	DLE	0010
17	11	◄	25C4	◄	25C4	DC1	0011
18	12	↕︎	2195	↕︎	2195	DC2	0012
19	13	‼︎	203C	‼︎	203C	DC3	0013
20	14	¶	00B6	¶	00B6	DC4	0014
21	15	§	00A7	§	00A7	NAK	0015
22	16	▬	25AC	▬	25AC	SYN	0016
23	17	↨	21A8	↨	21A8	ETB	0017
24	18	↑	2191	↑	2191	CAN	0018
25	19	↓	2193	↓	2193	EM	0019
26	1A	→	2192	→	2192	SUB	001A
27	1B	←	2190	←	2190	ESC	001B
28	1C	∟	221F	∟	221F	FS	001C
29	1D	↔︎	2194	↔︎	2194	GS	001D
30	1E	▲	25B2	▲	25B2	RS	001E
31	1F	▼	25BC	▼	25BC	US	001F
32	20	SP	0020	SP	0020	SP	0020
33	21	!	0021	!	0021	!	0021
34	22	"	0022	"	0022	"	0022
35	23	#	0023	#	0023	#	0023
36	24	$	0024	$	0024	$	0024
37	25	%	0025	%	0025	%	0025
38	26	&	0026	&	0026	&	0026
39	27	'	0027	'	0027	'	0027
40	28	(	0028	(	0028	(	0028
41	29	)	0029	)	0029	)	0029
42	2A	*	002A	*	002A	*	002A
43	2B	+	002B	+	002B	+	002B
44	2C	,	002C	,	002C	,	002C
45	2D	-	002D	-	002D	-	002D
46	2E	.	002E	.	002E	.	002E
47	2F	/	002F	/	002F	/	002F
48	30	0	0030	0	0030	0	0030
49	31	1	0031	1	0031	1	0031
50	32	2	0032	2	0032	2	0032
51	33	3	0033	3	0033	3	0033
52	34	4	0034	4	0034	4	0034
53	35	5	0035	5	0035	5	0035
54	36	6	0036	6	0036	6	0036
55	37	7	0037	7	0037	7	0037
56	38	8	0038	8	0038	8	0038
57	39	9	0039	9	0039	9	0039
58	3A	:	003A	:	003A	:	003A
59	3B	;	003B	;	003B	;	003B
60	3C	<	003C	<	003C	<	003C
61	3D	=	003D	=	003D	=	003D
62	3E	>	003E	>	003E	>	003E
63	3F	?	003F	?	003F	?	003F
64	40	@	0040	@	0040	@	0040
65	41	A	0041	A	0041	A	0041
66	42	B	0042	B	0042	B	0042
67	43	C	0043	C	0043	C	0043
68	44	D	0044	D	0044	D	0044
69	45	E	0045	E	0045	E	0045
70	46	F	0046	F	0046	F	0046
71	47	G	0047	G	0047	G	0047
72	48	H	0048	H	0048	H	0048
73	49	I	0049	I	0049	I	0049
74	4A	J	004A	J	004A	J	004A
75	4B	K	004B	K	004B	K	004B
76	4C	L	004C	L	004C	L	004C
77	4D	M	004D	M	004D	M	004D
78	4E	N	004E	N	004E	N	004E
79	4F	O	004F	O	004F	O	004F
80	50	P	0050	P	0050	P	0050
81	51	Q	0051	Q	0051	Q	0051
82	52	R	0052	R	0052	R	0052
83	53	S	0053	S	0053	S	0053
84	54	T	0054	T	0054	T	0054
85	55	U	0055	U	0055	U	0055
86	56	V	0056	V	0056	V	0056
87	57	W	0057	W	0057	W	0057
88	58	X	0058	X	0058	X	0058
89	59	Y	0059	Y	0059	Y	0059
90	5A	Z	005A	Z	005A	Z	005A
91	5B	[	005B	[	005B	[	005B
92	5C	\	005C	\	005C	\	005C
93	5D	]	005D	]	005D	]	005D
94	5E	^	005E	^	005E	^	005E
95	5F	_	005F	_	005F	_	005F
96	60	`	0060	`	0060	`	0060
97	61	a	0061	a	0061	a	0061
98	62	b	0062	b	0062	b	0062
99	63	c	0063	c	0063	c	0063
100	64	d	0064	d	0064	d	0064
101	65	e	0065	e	0065	e	0065
102	66	f	0066	f	0066	f	0066
103	67	g	0067	g	0067	g	0067
104	68	h	0068	h	0068	h	0068
105	69	i	0069	i	0069	i	0069
106	6A	j	006A	j	006A	j	006A
107	6B	k	006B	k	006B	k	006B
108	6C	l	006C	l	006C	l	006C
109	6D	m	006D	m	006D	m	006D
110	6E	n	006E	n	006E	n	006E
111	6F	o	006F	o	006F	o	006F
112	70	p	0070	p	0070	p	0070
113	71	q	0071	q	0071	q	0071
114	72	r	0072	r	0072	r	0072
115	73	s	0073	s	0073	s	0073
116	74	t	0074	t	0074	t	0074
117	75	u	0075	u	0075	u	0075
118	76	v	0076	v	0076	v	0076
119	77	w	0077	w	0077	w	0077
120	78	x	0078	x	0078	x	0078
121	79	y	0079	y	0079	y	0079
122	7A	z	007A	z	007A	z	007A
123	7B	{	007B	{	007B	{	007B
124	7C	|	007C	|	007C	|	007C
125	7D	}	007D	}	007D	}	007D
126	7E	~	007E	~	007E	~	007E
127	7F	⌂	2302	⌂	2302	DEL	007F
128	80	Ç	00C7	Ç	00C7	€	20AC
129	81	ü	00FC	ü	00FC	UNUSED
130	82	é	00E9	é	00E9	‚	201A
131	83	â	00E2	â	00E2	ƒ	0192
132	84	ä	00E4	ä	00E4	„	201E
133	85	à	00E0	à	00E0	…	2026
134	86	å	00E5	å	00E5	†	2020
135	87	ç	00E7	ç	00E7	‡	2021
136	88	ê	00EA	ê	00EA	ˆ	02C6
137	89	ë	00EB	ë	00EB	‰	2030
138	8A	è	00E8	è	00E8	Š	0160
139	8B	ï	00EF	ï	00EF	‹	2039
140	8C	î	00EE	î	00EE	Œ	0152
141	8D	ì	00EC	ì	00EC	UNUSED
142	8E	Ä	00C4	Ä	00C4	Ž	017D
143	8F	Å	00C5	Å	00C5	UNUSED
144	90	É	00C9	É	00C9	UNUSED
145	91	æ	00E6	æ	00E6	‘	2018
146	92	Æ	00C6	Æ	00C6	’	2019
147	93	ô	00F4	ô	00F4	“	201C
148	94	ö	00F6	ö	00F6	”	201D
149	95	ò	00F2	ò	00F2	•	2022
150	96	û	00FB	û	00FB	–	2013
151	97	ù	00F9	ù	00F9	—	2014
152	98	ÿ	00FF	ÿ	00FF	˜	02DC
153	99	Ö	00D6	Ö	00D6	™	2122
154	9A	Ü	00DC	Ü	00DC	š	0161
155	9B	¢	00A2	ø	00F8	›	203A
156	9C	£	00A3	£	00A3	œ	0153
157	9D	¥	00A5	Ø	00D8	UNUSED
158	9E	₧	20A7	×	00D7	ž	017E
159	9F	ƒ	0192	ƒ	0192	Ÿ	0178
160	A0	á	00E1	á	00E1	NBSP	00A0
161	A1	í	00ED	í	00ED	¡	00A1
162	A2	ó	00F3	ó	00F3	¢	00A2
163	A3	ú	00FA	ú	00FA	£	00A3
164	A4	ñ	00F1	ñ	00F1	¤	00A4
165	A5	Ñ	00D1	Ñ	00D1	¥	00A5
166	A6	ª	00AA	ª	00AA	¦	00A6
167	A7	º	00BA	º	00BA	§	00A7
168	A8	¿	00BF	¿	00BF	¨	00A8
169	A9	⌐	2310	®	00AE	©	00A9
170	AA	¬	00AC	¬	00AC	ª	00AA
171	AB	½	00BD	½	00BD	«	00AB
172	AC	¼	00BC	¼	00BC	¬	00AC
173	AD	¡	00A1	¡	00A1	SHY	00AD
174	AE	«	00AB	«	00AB	®	00AE
175	AF	»	00BB	»	00BB	¯	00AF
176	B0	░	2591	░	2591	°	00B0
177	B1	▒	2592	▒	2592	±	00B1
178	B2	▓	2593	▓	2593	²	00B2
179	B3	│	2502	│	2502	³	00B3
180	B4	┤	2524	┤	2524	´	00B4
181	B5	╡	2561	Á	00C1	µ	00B5
182	B6	╢	2562	Â	00C2	¶	00B6
183	B7	╖	2556	À	00C0	·	00B7
184	B8	╕	2555	©	00A9	¸	00B8
185	B9	╣	2563	╣	2563	¹	00B9
186	BA	║	2551	║	2551	º	00BA
187	BB	╗	2557	╗	2557	»	00BB
188	BC	╝	255D	╝	255D	¼	00BC
189	BD	╜	255C	¢	00A2	½	00BD
190	BE	╛	255B	¥	00A5	¾	00BE
191	BF	┐	2510	┐	2510	¿	00BF
192	C0	└	2514	└	2514	À	00C0
193	C1	┴	2534	┴	2534	Á	00C1
194	C2	┬	252C	┬	252C	Â	00C2
195	C3	├	251C	├	251C	Ã	00C3
196	C4	─	2500	─	2500	Ä	00C4
197	C5	┼	253C	┼	253C	Å	00C5
198	C6	╞	255E	ã	00E3	Æ	00C6
199	C7	╟	255F	Ã	00C3	Ç	00C7
200	C8	╚	255A	╚	255A	È	00C8
201	C9	╔	2554	╔	2554	É	00C9
202	CA	╩	2569	╩	2569	Ê	00CA
203	CB	╦	2566	╦	2566	Ë	00CB
204	CC	╠	2560	╠	2560	Ì	00CC
205	CD	═	2550	═	2550	Í	00CD
206	CE	╬	256C	╬	256C	Î	00CE
207	CF	╧	2567	¤	00A4	Ï	00CF
208	D0	╨	2568	ð	00F0	Ð	00D0
209	D1	╤	2564	Ð	00D0	Ñ	00D1
210	D2	╥	2565	Ê	00CA	Ò	00D2
211	D3	╙	2559	Ë	00CB	Ó	00D3
212	D4	╘	2558	È	00C8	Ô	00D4
213	D5	╒	2552	ı	0131	Õ	00D5
214	D6	╓	2553	Í	00CD	Ö	00D6
215	D7	╫	256B	Î	00CE	×	00D7
216	D8	╪	256A	Ï	00CF	Ø	00D8
217	D9	┘	2518	┘	2518	Ù	00D9
218	DA	┌	250C	┌	250C	Ú	00DA
219	DB	█	2588	█	2588	Û	00DB
220	DC	▄	2584	▄	2584	Ü	00DC
221	DD	▌	258C	¦	00A6	Ý	00DD
222	DE	▐	2590	Ì	00CC	Þ	00DE
223	DF	▀	2580	▀	2580	ß	00DF
224	E0	α	03B1	Ó	00D3	à	00E0
225	E1	ß	00DF	ß	00DF	á	00E1
226	E2	Γ	0393	Ô	00D4	â	00E2
227	E3	π	03C0	Ò	00D2	ã	00E3
228	E4	Σ	03A3	õ	00F5	ä	00E4
229	E5	σ	03C3	Õ	00D5	å	00E5
230	E6	µ	00B5	µ	00B5	æ	00E6
231	E7	τ	03C4	þ	00FE	ç	00E7
232	E8	Φ	03A6	Þ	00DE	è	00E8
233	E9	Θ	0398	Ú	00DA	é	00E9
234	EA	Ω	03A9	Û	00DB	ê	00EA
235	EB	δ	03B4	Ù	00D9	ë	00EB
236	EC	∞	221E	ý	00FD	ì	00EC
237	ED	φ	03C6	Ý	00DD	í	00ED
238	EE	ε	03B5	¯	00AF	î	00EE
239	EF	∩	2229	´	00B4	ï	00EF
240	F0	≡	2261	SHY	00AD	ð	00F0
241	F1	±	00B1	±	00B1	ñ	00F1
242	F2	≥	2265	‗	2017	ò	00F2
243	F3	≤	2264	¾	00BE	ó	00F3
244	F4	⌠	2320	¶	00B6	ô	00F4
245	F5	⌡	2321	§	00A7	õ	00F5
246	F6	÷	00F7	÷	00F7	ö	00F6
247	F7	≈	2248	¸	00B8	÷	00F7
248	F8	°	00B0	°	00B0	ø	00F8
249	F9	∙	2219	¨	00A8	ù	00F9
250	FA	·	00B7	·	00B7	ú	00FA
251	FB	√	221A	¹	00B9	û	00FB
252	FC	ⁿ	207F	³	00B3	ü	00FC
253	FD	²	00B2	²	00B2	ý	00FD
254	FE	■	25A0	■	25A0	þ	00FE
255	FF	NBSP	00A0	NBSP	00A0	ÿ	00FF
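
Most rows of the table above can be regenerated with a short sketch, assuming Python 3 and its bundled codecs. The function name is hypothetical. One known difference: for byte values 0 through 31 and 127, Python's "cp437" and "cp850" codecs return the control characters rather than the graphical symbols shown in the table, so those rows will not match.

```python
def table_row(byte_value):
    """Build one tab-separated row: decimal, hex, then char and code point per code page."""
    cells = [str(byte_value), format(byte_value, "X")]
    for code_page in ("cp437", "cp850", "cp1252"):
        try:
            char = bytes([byte_value]).decode(code_page)
            cells += [char, format(ord(char), "04X")]
        except UnicodeDecodeError:
            # The code page defines no character for this byte value.
            cells.append("UNUSED")
    return "\t".join(cells)

row_251 = table_row(251)  # "251\tFB\t√\t221A\t¹\t00B9\tû\t00FB"
row_129 = table_row(129)  # "129\t81\tü\t00FC\tü\t00FC\tUNUSED"
```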

SP is the space character.
NBSP is the "no break space" character.
SHY is the "soft hyphen" character.
"₧" is one character, signifying the "pesetas" currency.