Python: UTF-8 Encoding \x99\x99 and \u9999...?
What are the differences between the UTF-8 encoding \x99\x99 and \u9999? And how do we convert UTF-8 strings to the latter format, e.g. \u0420\u043e\u0441\u0441\u0438\u044f?
Python fully supports Unicode strings – see https://docs.python.org/3/howto/unicode.html. We can see this by printing out Australia in different languages:
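A minimal sketch of such a printout. The choice of languages and the dictionary name `australia` are assumptions made for illustration, not the original listing:

```python
# Python 3 source files are UTF-8 by default, so non-ASCII text can be
# written directly inside string literals.
australia = {
    "English": "Australia",
    "Russian": "Австралия",
    "Japanese": "オーストラリア",
    "Simplified Chinese": "澳大利亚",
}

for language, name in australia.items():
    # len() counts Unicode code points, not bytes.
    print(f"{language}: {name} ({len(name)} code points)")
```

Whether the text displays correctly depends on the terminal's encoding and font; the strings themselves are plain Python `str` objects either way.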
The method str.encode( 'utf-8' ) ( https://docs.python.org/3/library/stdtypes.html#str.encode ) is seen frequently in forums and posts. Let’s try it:
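A short sketch of that call, assuming the Russian word for Australia as sample text (the variable names are illustrative):

```python
text = "Австралия"

# str.encode() returns a bytes object; each Cyrillic letter takes two bytes.
encoded = text.encode("utf-8")

print(encoded)
# b'\xd0\x90\xd0\xb2\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd0\xbb\xd0\xb8\xd1\x8f'
print(len(text), len(encoded))  # 9 18
```

The `\xNN` pairs in the output are individual bytes of the UTF-8 encoding, written as hexadecimal escapes.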
Please note that the byte sequence in:
is copied from the output of:
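The original listings are not shown here. As a sketch of the round trip being described, assuming Австралия as the sample text: the bytes literal below is what str.encode( 'utf-8' ) prints for that word, and bytes.decode( 'utf-8' ) turns it back into text.

```python
# Byte sequence as printed by "Австралия".encode("utf-8")
raw = b"\xd0\x90\xd0\xb2\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd0\xbb\xd0\xb8\xd1\x8f"

# Decoding the bytes as UTF-8 recovers the original text.
decoded = raw.decode("utf-8")

print(decoded)                         # Австралия
print(decoded.encode("utf-8") == raw)  # True
```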
If we search Google for utf-8 online encoder, we will find several sites. Two of them, https://mothereff.in/utf-8 and https://www.browserling.com/tools/utf8-encode, produce the same output as str.encode( 'utf-8' ) ( https://docs.python.org/3/library/stdtypes.html#str.encode ).
In Python, apart from byte escapes such as \xd0\x90, we also see references to the \u9999 form, e.g. \u0410. Take an example from this post https://stackoverflow.com/questions/10569438/how-to-print-unicode-character-in-python: printing the string literal '\u0420\u043e\u0441\u0441\u0438\u044f' will produce:
Россия
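A small sketch of the same idea, with illustrative variable names: a string literal written with \u escapes is the same string as the one typed directly.

```python
escaped = "\u0420\u043e\u0441\u0441\u0438\u044f"

print(escaped)              # Россия
print(escaped == "Россия")  # True
print(len(escaped))         # 6

# Each \uXXXX escape names one code point in hexadecimal.
print([f"U+{ord(ch):04X}" for ch in escaped])
# ['U+0420', 'U+043E', 'U+0441', 'U+0441', 'U+0438', 'U+044F']
```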
This post https://stackoverflow.com/questions/55737130/relationship-between-x-and-unicode-codepoints refers to https://www.fileformat.info/info/unicode/utf8.htm, which explains the differences between the two.
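In short, \u0410 names a code point, while \xd0\x90 are the two bytes that UTF-8 uses to store that code point. The sketch below encodes a code point by hand to show where the bytes come from. The helper `utf8_bytes` is hypothetical and written for illustration only; it does not reject surrogates or out-of-range values the way the built-in codec does.

```python
def utf8_bytes(code_point: int) -> bytes:
    """Encode a single code point to UTF-8 by hand (illustrative only)."""
    if code_point < 0x80:            # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_bytes(0x0410))                              # b'\xd0\x90'
print(utf8_bytes(0x0410) == "\u0410".encode("utf-8"))  # True
```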
Still on the results of the utf-8 online encoder search: this site https://checkserp.com/encode/utf8/ produces a Unicode code point string. That is, if we use it to “UTF-8 Encode” Россия ( the above example ), we get back \u0420\u043e\u0441\u0441\u0438\u044f, which matches what we have above.
Naturally, we would want to convert UTF-8 text to a Unicode code point string for our own understanding; i.e. convert Россия to \u0420\u043e\u0441\u0441\u0438\u044f.
I could not find any example of how to do that. In this post, https://stackoverflow.com/questions/2269827/how-to-convert-an-int-to-a-hex-string, user Chengcheng Zhang discusses how to get the hex equivalents of characters’ integer Unicode codes:
See also:
- https://docs.python.org/3/library/functions.html#ord -- ord( c )
- https://docs.python.org/3/library/stdtypes.html -- int.to_bytes( ... )
- https://docs.python.org/3/library/functions.html#hex -- hex( x )
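A brief sketch of how the three functions listed above relate to one another. It uses the letter я as an assumed sample and is not the code from the referenced post:

```python
ch = "я"

code = ord(ch)           # integer code point of the character
print(code)              # 1103
print(hex(code))         # 0x44f
print(f"\\u{code:04x}")  # \u044f

# int.to_bytes() gives the raw big-endian bytes of the integer itself,
# which is NOT the same as the character's UTF-8 encoding.
print(code.to_bytes(2, "big").hex())  # 044f
print(ch.encode("utf-8").hex())       # d18f
```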
The following example is my own attempt at going about this:
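The original listing is not shown here. One possible way to do the conversion is sketched below; the function name `to_unicode_escapes` is hypothetical, and the sketch may differ from the original attempt.

```python
def to_unicode_escapes(text: str) -> str:
    """Return text written as \\uXXXX (or \\UXXXXXXXX) escape sequences."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code <= 0xFFFF:
            parts.append(f"\\u{code:04x}")   # four hex digits
        else:
            parts.append(f"\\U{code:08x}")   # eight hex digits beyond U+FFFF
    return "".join(parts)


print(to_unicode_escapes("Россия"))
# \u0420\u043e\u0441\u0441\u0438\u044f
```

For text like this, the built-in codec gives the same result: `"Россия".encode("unicode_escape").decode("ascii")`. The difference is that the codec leaves ASCII characters unescaped and writes code points from U+0080 to U+00FF as \xNN, whereas the function above escapes every character.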
For:
I was expecting to see the Russian text for Australia, since, as we’ve seen above, printing a hard-coded Unicode code point string prints the natural text equivalent. Instead, it prints the Unicode code point string itself. The last four print statements are debugging statements to understand why I did not get the natural text. The cause is how escapes are processed: in a hard-coded ( non-raw ) string literal, Python converts each \uXXXX escape into a single character when it parses the source code, whereas a string built at run time contains a literal backslash followed by u and four hex digits, much like a raw string literal. The code should be self-documenting.
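A minimal sketch of the difference, with illustrative variable names. It also shows one way to turn a run-time escape string back into text, using the standard `unicode_escape` codec:

```python
literal = "\u0410"      # escape resolved by the parser: one character
built = "\\u" + "0410"  # assembled at run time: backslash, u, four digits

print(len(literal), len(built))  # 1 6
print(built)                     # \u0410

# Decoding the ASCII escape text with unicode_escape yields the character.
restored = built.encode("ascii").decode("unicode_escape")
print(restored == literal)       # True
```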
There is still a lot to this subject. I was just seeking to answer my own, single question. I hope you find this post useful, and thank you for visiting.