+ 2
Utf
What exactly does UTF mean in python and what is its utility?
2 Answers
+ 8
https://www.fileformat.info/info/unicode/utf8.htm
UTF-8 is a compromise character encoding that can be as compact as ASCII (if the file is just plain English text) but can also contain any unicode characters (with some increase in file size).
UTF stands for Unicode Transformation Format. The '8' means it uses 8-bit blocks to represent a character. The number of blocks needed to represent a character varies from 1 to 4.
One of the really nice features of UTF-8 is that it is compatible with nul-terminated strings. No character will have a nul (0) byte when encoded. This means that C code that deals with char[] will "just work".
+ 3
UTF-8 is one of the most commonly used encodings. UTF stands for âUnicode Transformation Formatâ, and the â8â means that 8-bit numbers are used in the encoding. (Thereâs also a UTF-16 encoding, but itâs less frequently used than UTF-8.) UTF-8 uses the following rules:
-> If the code point is <128, itâs represented by the corresponding byte value.
-> If the code point is between 128 and 0x7ff, itâs turned into two byte values between 128 and 255.
-> Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.
UTF-8 has several convenient properties:
-> It can handle any Unicode code point.
-> A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that canât handle zero bytes.
-> A string of ASCII text is also valid UTF-8 text.
-> UTF-8-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
-> If bytes are corrupted or lost, itâs possible to determine the start of the next UTF-8-encoded code point and resynchronize. Itâs also unlikely that random 8-bit data will look like valid UTF-8.
Usually this is implemented by converting the Unicode string into some encoding that varies depending on the system. For example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on Windows, Python uses the name âmbcsâ to refer to whatever the currently configured encoding is.