ENCODING AND DECODING Experiencing one (or more) bytes out of your A’s
Overview
• It’s not your father’s character set
  – 8-bit characters
  – ASCII
  – The rest of the world wakes up to computers
• Unicode
  – Character codes
  – Different flavors
• Encoding and Decoding classes
• Example
The Good Old Days
• Focus on unaccented English letters
• Every letter, number, capital, etc. represented by codes 0-127
  – Space, 32; “A”, 65; “a”, 97
• Used 7 bits, leaving one bit free on most computers
  – WordStar and the 8th bit
• Below 32: control codes
  – 7, beep; 12, form feed
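The codes quoted on this slide can be checked directly. A minimal sketch, assuming a C# console program; the class and method names are illustrative only:

```csharp
using System;

public static class AsciiCodes
{
    // Returns the numeric code of a character. For characters in the
    // 0-127 range this is the same number as the ASCII code.
    public static int CodeOf(char c) => (int)c;

    public static void Main()
    {
        Console.WriteLine(CodeOf(' '));   // 32
        Console.WriteLine(CodeOf('A'));   // 65
        Console.WriteLine(CodeOf('a'));   // 97
        Console.WriteLine(CodeOf('\a'));  // 7  (bell / beep)
        Console.WriteLine(CodeOf('\f'));  // 12 (form feed)
    }
}
```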
8th bit, values 128-255
• Everybody had their own ideas
• OEM character sets
• IBM PC -> graphics (horizontal bars, vertical bars, bars with dangles, etc.)
• Outside the U.S., different languages
  – Code 130: é in the US, gimel (ג) in Israel
  – Difficult to exchange documents
• Code pages: regional definitions of byte values 128-255
  – Israel: code page 862
  – Greek: code page 737
  – ISO/ANSI code pages
• Asia
  – Alphabets had thousands of characters
  – No way to store in one byte (8 bits)
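The "code 130" problem can be reproduced by decoding the same byte under two code pages. This is a sketch, assuming .NET 5 or later, where legacy code pages have to be registered through CodePagesEncodingProvider before use; the class and method names are made up for illustration:

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    // Decodes a single byte using the given code page number.
    public static string DecodeByte(byte value, int codePage)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code
        // pages are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetString(new[] { value });
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(DecodeByte(130, 437)); // é (U+00E9) in the US OEM code page
        Console.WriteLine(DecodeByte(130, 862)); // ג (U+05D2) in the Hebrew code page
    }
}
```

The byte on disk is identical in both cases; only the table used to interpret it changes.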
Unicode
• Not a 16-bit code
• A new way of thinking about characters
• Old way:
  – Character “A” maps to memory or disk bits
  – A -> 0100 0001
• Unicode way:
  – Each letter in every alphabet maps to a “code point”
  – Abstract concept: “A” is a Platonic “form” that just floats out there
  – A -> U+0041 code point
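A short sketch of looking up code points in C#, including one above U+FFFF to show why Unicode is not a 16-bit code. The helper name CodePointOf is hypothetical; char.ConvertToUtf32 is the framework call doing the work:

```csharp
using System;

public static class CodePoints
{
    // Formats the code point of the first character in s as U+XXXX.
    public static string CodePointOf(string s) =>
        $"U+{char.ConvertToUtf32(s, 0):X4}";

    public static void Main()
    {
        Console.WriteLine(CodePointOf("A"));       // U+0041
        Console.WriteLine(CodePointOf("\u05D2"));  // U+05D2 (Hebrew gimel)

        // A code point above U+FFFF needs two 16-bit units in a .NET string.
        string emoji = "\U0001F600";
        Console.WriteLine(CodePointOf(emoji));     // U+1F600
        Console.WriteLine(emoji.Length);           // 2
    }
}
```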
Unicode
• Hello -> U+0048 U+0065 U+006C U+006C U+006F
• Storing in 2 bytes each:
  – 0048 0065 006C 006C 006F (big endian)
  – Or 4800 6500 6C00 6C00 6F00 (little endian)
• Need a Byte Order Mark (BOM) at the beginning of the stream to tell which order was used
• UTF-8 coding system
  – Stores Unicode code points (magic numbers) as 8-bit bytes
  – Values 0-127 go into a single byte
  – Values 128 and above go into 2, 3, or 4 bytes
  – For characters up to 127, UTF-8 looks just like ASCII
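The byte layouts on this slide can be printed with the built-in encodings. A minimal sketch; the Hex helper is an illustrative name, while the Encoding properties are standard .NET:

```csharp
using System;
using System.Text;

public static class HelloBytes
{
    // Renders a byte array as hyphen-separated hex, e.g. "00-48".
    public static string Hex(byte[] bytes) => BitConverter.ToString(bytes);

    public static void Main()
    {
        // UTF-16, big endian and little endian
        Console.WriteLine(Hex(Encoding.BigEndianUnicode.GetBytes("Hello"))); // 00-48-00-65-00-6C-00-6C-00-6F
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes("Hello")));          // 48-00-65-00-6C-00-6C-00-6F-00

        // Byte order marks that identify each layout
        Console.WriteLine(Hex(Encoding.BigEndianUnicode.GetPreamble()));     // FE-FF
        Console.WriteLine(Hex(Encoding.Unicode.GetPreamble()));              // FF-FE

        // UTF-8: ASCII-range text is byte-for-byte ASCII, é takes two bytes
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes("Hello")));             // 48-65-6C-6C-6F
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes("\u00E9")));            // C3-A9
    }
}
```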
Unicode Encodings
• UTF-8
• UTF-16
  – Characters stored in 2-byte, 16-bit (halfword) units
  – Closely related to the older fixed-width UCS-2
• UTF-32
  – Characters stored in 4-byte, 32-bit units
• UTF-7
  – Forces a zero in the high-order bit (for firewalls)
• ASCII encoding
  – Everything above 7 bits is lost
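To compare the encodings side by side, the sketch below counts the bytes each one needs for the same text and shows what the ASCII encoding does with a non-ASCII character. The class and method names are illustrative; the replacement with "?" is the behavior of Encoding.ASCII in current .NET versions:

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Returns byte counts for s in UTF-8, UTF-16 and UTF-32, in that order.
    public static int[] Sizes(string s) => new[]
    {
        Encoding.UTF8.GetByteCount(s),
        Encoding.Unicode.GetByteCount(s),
        Encoding.UTF32.GetByteCount(s)
    };

    // Encodes s as ASCII and decodes it again, showing what is lost.
    public static string ThroughAscii(string s) =>
        Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));

    public static void Main()
    {
        Console.WriteLine(string.Join(",", Sizes("A")));          // 1,2,4
        Console.WriteLine(string.Join(",", Sizes("\u00E9")));     // 2,2,4 (é)
        Console.WriteLine(string.Join(",", Sizes("\u20AC")));     // 3,2,4 (euro sign)
        Console.WriteLine(string.Join(",", Sizes("\U0001F600"))); // 4,4,4 (emoji)
        Console.WriteLine(ThroughAscii("caf\u00E9"));             // caf?
    }
}
```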
Definitions
• .NET uses UTF-16 encoding internally to store text
• Encoding:
  – Transforms a set of Unicode characters into a sequence of bytes
  – Used when sending a string to a file or a network stream
• Decoding:
  – Transforms a sequence of bytes into a set of Unicode characters
  – Used when reading a string from a file or a network stream
• StreamReader and StreamWriter default to UTF-8
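Encoding and decoding are inverse operations only when both sides use the same encoding. A sketch of a round trip, plus what happens when UTF-8 bytes are decoded as ISO-8859-1 by mistake; the class and method names are hypothetical:

```csharp
using System;
using System.Text;

public static class RoundTrip
{
    // Encodes text with one encoding and decodes the bytes with another.
    public static string Recode(string text, Encoding writeAs, Encoding readAs)
    {
        byte[] bytes = writeAs.GetBytes(text);   // encoding: characters -> bytes
        return readAs.GetString(bytes);          // decoding: bytes -> characters
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Same encoding on both sides: the text survives.
        Console.WriteLine(Recode("caf\u00E9", Encoding.UTF8, Encoding.UTF8)); // café

        // Mismatched encodings: the two UTF-8 bytes of é become two characters.
        Console.WriteLine(Recode("caf\u00E9", Encoding.UTF8, latin1));        // cafÃ©
    }
}
```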
Encoding/Decoding Classes
• UTF32Encoding class
  – Converts characters to and from UTF-32 encoding
• UnicodeEncoding class
  – Converts characters to and from UTF-16 encoding
• UTF8Encoding class
  – Converts characters to and from UTF-8 encoding
  – 1, 2, 3, or 4 bytes per character
• ASCIIEncoding class
  – Converts characters to and from ASCII encoding
  – Drops all values > 127
• System.Text.Encoding supports a wide range of ANSI/ISO encodings
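These classes can also be instantiated directly, which exposes options such as byte order and whether a BOM is emitted. A minimal sketch using the standard constructors; the wrapper class and the Hex helper are illustrative:

```csharp
using System;
using System.Text;

public static class EncodingClasses
{
    // Renders a byte array as hyphen-separated hex.
    public static string Hex(byte[] bytes) => BitConverter.ToString(bytes);

    public static void Main()
    {
        var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
        var utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
        var utf16BigEndian = new UnicodeEncoding(bigEndian: true, byteOrderMark: true);
        var utf32 = new UTF32Encoding();   // little endian by default
        var ascii = new ASCIIEncoding();

        Console.WriteLine(Hex(utf8WithBom.GetPreamble()));     // EF-BB-BF
        Console.WriteLine(utf8NoBom.GetPreamble().Length);     // 0
        Console.WriteLine(Hex(utf16BigEndian.GetBytes("A")));  // 00-41
        Console.WriteLine(Hex(utf32.GetBytes("A")));           // 41-00-00-00
        Console.WriteLine(Hex(ascii.GetBytes("A")));           // 41
    }
}
```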
Convert a string into a stream of encoded bytes
1. Get an encoding object
   Encoding e = Encoding.GetEncoding("Korean");
2. Use the encoding object’s GetBytes() method to convert a string into its byte representation
   byte[] encoded;
   encoded = e.GetBytes("I'm gonna be Korean!");
Demo: D:\_Framework 2.0 Training Kits\70-536\Chapter 03\EncodingDemo
Write a file in encoded form
   FileStream fs = new FileStream("text.txt", FileMode.OpenOrCreate);
   ...
   StreamWriter t = new StreamWriter(fs, Encoding.UTF8);
   t.Write("This is in UTF8");
   t.Close(); // flushes the buffered text and closes the underlying file
Read an encoded file
   FileStream fs = new FileStream("text.txt", FileMode.Open);
   ...
   StreamReader t = new StreamReader(fs, Encoding.UTF8);
   String s = t.ReadLine();
   t.Close(); // releases the file
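The two snippets can be combined into one runnable round trip. This is a sketch rather than the demo from the training kit: it uses the path-based StreamWriter and StreamReader constructors with using blocks, and writes to a temporary file chosen for illustration:

```csharp
using System;
using System.IO;
using System.Text;

public static class FileRoundTrip
{
    // Writes one line of text as UTF-8, then reads it back as UTF-8.
    public static string WriteThenRead(string path, string text)
    {
        // 'using' flushes and closes the writer even if an exception occurs.
        using (var writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.WriteLine(text);
        }

        using (var reader = new StreamReader(path, Encoding.UTF8))
        {
            return reader.ReadLine();
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();   // illustrative location
        try
        {
            Console.WriteLine(WriteThenRead(path, "This is in UTF8"));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```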
Summary
• ASCII is one of the oldest encoding standards.
• Unicode provides multilingual support.
• System.Text.Encoding has static methods for encoding and decoding text.
• Use an overloaded StreamWriter constructor that accepts an encoding object when writing a file.
• It is not necessary to specify an Encoding object when reading; StreamReader defaults to UTF-8.
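The last point relies on StreamReader checking for a byte order mark when no encoding is given. A sketch, assuming the file was written with a BOM (as File.WriteAllText does for Encoding.Unicode); the class and method names are illustrative:

```csharp
using System;
using System.IO;
using System.Text;

public static class DefaultReader
{
    // Writes text as UTF-16 with a BOM, then reads it back
    // without naming an encoding on the reader.
    public static string ReadWithDefault(string path, string text)
    {
        File.WriteAllText(path, text, Encoding.Unicode);

        // StreamReader(path) starts from UTF-8 but switches
        // when it finds a byte order mark at the start of the file.
        using (var reader = new StreamReader(path))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();   // illustrative location
        try
        {
            Console.WriteLine(ReadWithDefault(path, "Hello"));   // Hello
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

A file in a legacy code page has no BOM to detect, so in that case the encoding still has to be passed explicitly.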