Understanding Character Encoding: ASCII, Unicode, and UTF-8

If you have ever seen text display as ??? or â€™ instead of an apostrophe, you have encountered a character encoding mismatch. Character encoding is the system that maps characters (letters, digits, symbols, emoji) to the binary values that computers store and transmit. When the program that writes text and the program that reads it disagree on which encoding is in use, the result is garbled text. Understanding the history and mechanics of encoding makes this class of bugs much easier to diagnose and prevent.

ASCII: The Foundation

ASCII (American Standard Code for Information Interchange), first published in 1963 and revised in 1967 to add lowercase letters, maps 128 characters (uppercase and lowercase English letters, digits, punctuation, and control characters) to 7-bit values (0–127). The letter A is 65, a is 97, and the digit 0 is 48. ASCII is the foundation of nearly every encoding that came after it: Unicode and almost all encodings in common use today keep these 128 assignments.
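The mapping described above can be checked directly. The following is a minimal Python sketch; the helper name ascii_info is hypothetical and exists only for illustration.

```python
def ascii_info(ch: str) -> tuple[int, str]:
    """Return the ASCII code of a character and its 7-bit binary form."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is outside the ASCII range")
    return code, format(code, "07b")


for ch in "Aa0":
    code, bits = ascii_info(ch)
    print(f"{ch!r}: {code:3d}  {bits}")

# Encoding ASCII text produces exactly one byte per character.
assert "A0".encode("ascii") == bytes([65, 48])
```

Every value fits in seven bits, which is why the eighth bit of each byte was left free for later standards to claim.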

The Problem With Extended ASCII

The 8th bit in a byte gave room for 128 more characters (128–255), and many competing standards emerged to use that space: ISO-8859-1 (Latin-1) for Western European languages, Windows-1252 for Western European versions of Windows, KOI8-R for Russian. These 'code pages' were incompatible: byte 0x91 is a left curly quote in Windows-1252 but an unprintable control code in ISO-8859-1. Documents exchanged between systems displayed the wrong characters unless both systems agreed on which code page to use. Asian languages needed far more characters than a single byte could hold and required multi-byte encodings such as Shift-JIS and GB2312, which had their own incompatibilities.
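To make the incompatibility concrete, the sketch below decodes the same byte under different code pages using Python's built-in codecs. The helper decode_under is a hypothetical name, and byte 0xE1 is an extra example chosen here, not one from the text above.

```python
def decode_under(raw: bytes, code_pages: list[str]) -> dict[str, str]:
    """Decode the same bytes under several code pages."""
    return {name: raw.decode(name) for name in code_pages}


# Byte 0x91: a curly quote in Windows-1252, a control code in Latin-1.
quote_byte = decode_under(b"\x91", ["cp1252", "latin-1"])

# Byte 0xE1: a Latin letter in Latin-1, a Cyrillic letter in KOI8-R.
letter_byte = decode_under(b"\xe1", ["latin-1", "koi8_r"])

for label, results in [("0x91", quote_byte), ("0xE1", letter_byte)]:
    for name, ch in results.items():
        print(f"{label} as {name}: U+{ord(ch):04X}")
```

The bytes never change; only the interpretation does, which is why a document with no declared code page could not be read reliably.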

Unicode: One Universal Standard

Unicode aims to assign a unique number (a code point) to every character in every writing system, plus emoji, mathematical symbols, and more. The standard defines well over 140,000 characters, and the count grows with each version. Code points are written as U+ followed by a hexadecimal number: A is U+0041, the Euro sign is U+20AC, and the snowman emoji is U+2603. A code point is an abstract number, not a byte sequence; how it is stored as bytes is the job of an encoding form such as UTF-8, UTF-16, or UTF-32.
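The code points quoted above can be looked up with Python's standard unicodedata module. This is a small illustrative sketch; the function name describe is hypothetical.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str]:
    """Return the U+XXXX label and the official Unicode name of a character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


for ch in ["A", "€", "☃"]:
    label, name = describe(ch)
    print(f"{ch}  {label}  {name}")
```

Nothing here involves bytes yet: ord returns the abstract code point, independent of how the character is later encoded.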

UTF-8: The Dominant Encoding

UTF-8 is a variable-length encoding of Unicode. Characters in the ASCII range (U+0000 to U+007F) are stored as a single byte, identical to ASCII. Characters from U+0080 to U+07FF use two bytes. Characters from U+0800 to U+FFFF use three bytes. Characters from U+10000 to U+10FFFF (including most emoji) use four bytes. Because of this design, any ASCII text is already valid UTF-8, which allowed UTF-8 to be adopted incrementally alongside legacy ASCII systems.
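The four length tiers can be sketched as a hand-written encoder and compared against Python's built-in codec. The function utf8_encode is a hypothetical teaching helper, not a replacement for str.encode. The bit layouts in the comments come from the UTF-8 definition rather than from the text above.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative only)."""
    if 0xD800 <= cp <= 0xDFFF or not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"U+{cp:04X} cannot be encoded as UTF-8")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for ch in ["A", "é", "€", "😀"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")  # agrees with the built-in codec
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```

Every byte of a multi-byte sequence has its high bit set, so none of them can be mistaken for an ASCII character. This is part of what makes mixing UTF-8 with ASCII-only tools safe.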

UTF-8 is now the dominant encoding on the web — over 98% of websites use it. Always declare your encoding explicitly: in HTML with <meta charset='UTF-8'>, in HTTP responses with Content-Type: text/html; charset=UTF-8, and in database connections with the appropriate character set setting. When the encoding is not declared, parsers guess, and they sometimes guess wrong.
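The same rule applies to file I/O in application code. As a minimal sketch, assuming Python and a throwaway temporary file (the file name sample.txt is arbitrary), passing encoding explicitly avoids relying on a platform-dependent default:

```python
import locale
import tempfile
from pathlib import Path

text = "naïve café ☃"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"

    # Declare the encoding on both the write and the read.
    path.write_text(text, encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")

assert round_trip == text

# Without an explicit argument, the default depends on the platform and
# Python configuration; this shows what the locale would suggest here.
print("locale default:", locale.getpreferredencoding(False))
```

Declaring the encoding on both sides means the result does not change when the code moves to a machine with a different locale.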

Diagnosing Encoding Problems

The pattern â€™ appearing in place of a typographic apostrophe (’, U+2019) is a classic encoding mismatch: the character was stored as UTF-8 (three bytes: 0xE2 0x80 0x99) and then decoded as Windows-1252, which maps those three bytes to â, €, and ™. A plain ASCII apostrophe (') is a single byte in both encodings and is not affected. If you see this pattern, find where the text crosses a system boundary (file read, database query, HTTP response, email) without an explicit encoding declaration, and add one. The lasting fix is to declare UTF-8 consistently at every layer: source file, database, HTTP header, and HTML meta tag. Text that was already saved in its garbled form has to be repaired separately.
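The mismatch can be reproduced, and in simple cases reversed, in a few lines. This is a sketch under the assumption that the text was mis-decoded exactly once as Windows-1252 and that no bytes were lost along the way; the variable names are illustrative.

```python
original = "It’s"  # contains U+2019, the typographic apostrophe

# Step 1: the text is stored as UTF-8.
stored = original.encode("utf-8")
print(stored.hex(" "))  # the apostrophe becomes e2 80 99

# Step 2: a reader wrongly decodes those bytes as Windows-1252.
garbled = stored.decode("cp1252")
print(garbled)

# Repair: undo the wrong decode, then apply the right one.
repaired = garbled.encode("cp1252").decode("utf-8")
assert repaired == original
```

The repair step only works while the garbled text is still intact. If a later step replaced unknown characters with ?, the original bytes are gone and cannot be recovered this way.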