+ 2

UTF

What exactly does UTF mean in Python, and what is it used for?

9th Feb 2018, 4:09 PM

Owolawi Kehinde

2 Answers

+ 8

https://www.fileformat.info/info/unicode/utf8.htm UTF-8 is a compromise character encoding that can be as compact as ASCII (if the file is just plain English text) but can also represent any Unicode character (with some increase in file size). UTF stands for Unicode Transformation Format. The '8' means it uses 8-bit blocks (bytes) to represent a character, and the number of bytes needed per character varies from 1 to 4. One of the really nice features of UTF-8 is that it is compatible with nul-terminated strings: no character other than NUL itself (U+0000) produces a nul (0) byte when encoded. This means that C code that deals with nul-terminated char[] strings will mostly "just work", with the caveat that functions like strlen() count bytes, not characters.
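
To make the "1 to 4 bytes" point concrete, here is a small Python 3 sketch. The sample characters and the variable names (`samples`, `byte_lengths`, and so on) are just illustrative choices, not part of any API; only `str.encode()` and `bytes.hex()` come from Python itself.

```python
# Sample characters chosen to need 1, 2, 3 and 4 bytes in UTF-8.
# Written as escapes so the script does not depend on the editor's encoding.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

# Map each character to the number of bytes its UTF-8 encoding uses.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    # Print the code point and the encoded bytes in hex (not the character
    # itself, so this also works on terminals that cannot display it).
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-8').hex()} ({n} byte(s))")

# Plain English text is byte-for-byte identical in ASCII and UTF-8.
text = "plain English text"
ascii_matches_utf8 = text.encode("ascii") == text.encode("utf-8")

# Text without U+0000 never produces a zero byte when encoded.
has_nul = b"\x00" in "h\u00e9llo \u20ac".encode("utf-8")

print(ascii_matches_utf8, has_nul)
```

In Python 3, `str` holds Unicode text and `bytes` holds the encoded form, so `encode("utf-8")` and `decode("utf-8")` are the usual way to move between the two.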

9th Feb 2018, 4:11 PM

Fata1 Err0r

+ 3

UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding. (There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.) UTF-8 uses the following rules:

-> If the code point is < 128, it’s represented by the corresponding byte value.
-> If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
-> Code points > 0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

-> It can handle any Unicode code point.
-> A Unicode string is turned into a string of bytes containing no embedded zero bytes (unless the string itself contains the null character, U+0000). This means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
-> UTF-8 is byte-oriented, so it has no byte-ordering (endianness) issues.
-> A string of ASCII text is also valid UTF-8 text.
-> UTF-8 is fairly compact; values less than 128 occupy only a single byte, and code points up to 0x7ff (which covers most European and Middle Eastern alphabets) take two.
-> If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.

When Python passes Unicode filenames to the operating system, this is usually implemented by converting the Unicode string into some encoding that varies depending on the system. For example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on Windows, Python uses the name “mbcs” to refer to whatever the currently configured encoding is.
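
The encoding rules and the resynchronization property described above can be sketched in Python 3 roughly like this. Both `utf8_encode_code_point` and `next_char_start` are hypothetical helper names made up for this illustration; in real code you would simply call `str.encode("utf-8")`, which the sketch uses as a cross-check.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoder for one code point (illustration only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp!r}")
    if cp < 0x80:
        # < 128: the byte value is the code point itself.
        return bytes([cp])
    if cp <= 0x7FF:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def next_char_start(data: bytes, pos: int) -> int:
    """Skip continuation bytes (10xxxxxx) to reach the next lead byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# Cross-check the hand-rolled encoder against Python's built-in codec.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode_code_point(cp) == chr(cp).encode("utf-8")

# Simulate losing the first two bytes of "a" + euro sign + "b".
damaged = "a\u20acb".encode("utf-8")[2:]   # starts mid-character
start = next_char_start(damaged, 0)        # index of the next lead byte
recovered = damaged[start:].decode("utf-8")
print(start, recovered)
```

Resynchronization works because continuation bytes always have the bit pattern 10xxxxxx, so a decoder that lands in the middle of a character can tell it is not at a character boundary and skip forward.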

9th Feb 2018, 4:44 PM

Diwakar