Python: UTF-8 Encoding \x99\x99 and \u9999...?
What are the differences between the UTF-8 byte escapes \x99\x99 and the code point escapes \u9999? And how do we convert a string to the latter format, e.g. \u0420\u043e\u0441\u0441\u0438\u044f?
Python fully supports Unicode strings – see https://docs.python.org/3/howto/unicode.html. We can see this by printing out Australia in different languages:
# Chinese ( Simplified ).
print( '澳大利亚' )
# Chinese ( Traditional ).
print( '澳大利亞' )
# Japanese.
print( 'オーストラリア' )
# Khmer.
print( 'អូស្ត្រាលី' )
# Korean: -- I think this is Australia continent.
print( '호주' )
# Russian.
print( 'Австралия' )
# Vietnamese -- Australia continent
print( 'Úc Châu' )
# Vietnamese -- Long form of Australia country.
print( 'Úc Đại Lợi' )
The method str.encode( 'utf-8' ) ( https://docs.python.org/3/library/stdtypes.html#str.encode ) can be seen frequently in forums and posts. Let’s try it:
australia_in_russian = 'Австралия'
encoded_bytes = australia_in_russian.encode( 'utf-8' )
print( '1. ', encoded_bytes )
print( '2. ', b'\xd0\x90\xd0\xb2\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd0\xbb\xd0\xb8\xd1\x8f' )
print( '3. ', encoded_bytes.decode('utf-8') )
Please note that the byte sequence in:
print( '2. ', b'\xd0\x90\xd0\xb2\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd0\xbb\xd0\xb8\xd1\x8f' )
is copied from the output of:
print( '1. ', encoded_bytes )
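To make the difference between the text and its encoded form more concrete, here is a minimal sketch that compares lengths and checks the round trip. The variable names text and data are my own, chosen only for illustration:

```python
# Compare the text (a sequence of code points) with its UTF-8 bytes.
text = 'Австралия'
data = text.encode('utf-8')

print(len(text))   # number of characters (code points) in the string
print(len(data))   # number of bytes after encoding

# Each Cyrillic letter here takes two bytes in UTF-8.
first_two = [hex(b) for b in data[:2]]
print(first_two)   # the two bytes that store the first letter

# Decoding the bytes gives back the original text.
print(data.decode('utf-8') == text)
```

The \xd0\x90 pairs in the printed bytes are therefore single bytes written in hexadecimal, not characters.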
If we search Google for utf-8 online encoder, we will find several sites. Two of them, https://mothereff.in/utf-8 and https://www.browserling.com/tools/utf8-encode, produce the same output as str.encode( 'utf-8' ) ( https://docs.python.org/3/library/stdtypes.html#str.encode ).
In Python, apart from byte escapes such as \xd0\x90, we also see references to \u9999-style escapes, e.g. \u0410. Take an example from this post https://stackoverflow.com/questions/10569438/how-to-print-unicode-character-in-python:
print('\u0420\u043e\u0441\u0441\u0438\u044f')
which will produce:
Россия
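This post https://stackoverflow.com/questions/55737130/relationship-between-x-and-unicode-codepoints refers to https://www.fileformat.info/info/unicode/utf8.htm, which explains the differences between the two. In short, \u0410 names a Unicode code point, while \xd0\x90 are the two bytes that UTF-8 uses to store that code point.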
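As a worked example of how a code point relates to its UTF-8 bytes, the sketch below encodes by hand a code point in the two-byte range ( U+0080 to U+07FF, which covers the Cyrillic letters used here ). The function name utf8_two_byte is hypothetical, and the sketch deliberately ignores the one-, three- and four-byte cases:

```python
def utf8_two_byte(code_point):
    # Hand-rolled UTF-8 encoding for U+0080..U+07FF only (illustrative).
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError('only the two-byte range is handled in this sketch')
    first = 0b11000000 | (code_point >> 6)         # 110xxxxx: upper 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10xxxxxx: lower 6 bits
    return bytes([first, second])

code_point = ord('А')            # Cyrillic capital A, U+0410
manual = utf8_two_byte(code_point)

print(hex(code_point))           # the code point in hexadecimal
print(manual)                    # the bytes built by hand
print(manual == 'А'.encode('utf-8'))
```

So \u0410 and \xd0\x90 describe the same letter at two different levels: the first is the code point, the second is how UTF-8 stores it.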
Another result from the utf-8 online encoder search, https://checkserp.com/encode/utf8/, produces a Unicode code point string instead. I.e., if we use it to “UTF-8 Encode” Россия ( the above example ), we will get back \u0420\u043e\u0441\u0441\u0438\u044f – which matches what we have above.
Naturally, we would want to convert text to its Unicode code point string for our own understanding; i.e. convert Россия to \u0420\u043e\u0441\u0441\u0438\u044f.
I could not find any example on how to do that. In this post, https://stackoverflow.com/questions/2269827/how-to-convert-an-int-to-a-hex-string, user Chengcheng Zhang discusses how to get the hex equivalent of an integer, which we can apply to characters’ integer Unicode codes:
(434).to_bytes(4, byteorder='big').hex()
See also:
- https://docs.python.org/3/library/functions.html#ord -- ord( c )
- https://docs.python.org/3/library/stdtypes.html -- int.to_bytes( ... )
- https://docs.python.org/3/library/functions.html#hex -- hex( x )
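The three functions listed above can be tried on a single character to see how they relate. This is only a small sketch; the names c, n and the others are mine, and format( n, '04x' ) is shown as one possible alternative to int.to_bytes( ... ).hex():

```python
c = 'А'                    # Cyrillic capital A
n = ord(c)                 # integer Unicode code point of the character

as_hex = hex(n)                                  # hex string with '0x' prefix
as_bytes_hex = n.to_bytes(2, byteorder='big').hex()  # fixed width, 2 bytes
as_format = format(n, '04x')                     # zero-padded to 4 hex digits

print(n, as_hex, as_bytes_hex, as_format)
print(chr(n) == c)         # chr() is the inverse of ord()
```

Note that to_bytes( 2, ... ) raises OverflowError for code points above U+FFFF ( e.g. most emoji ), since those do not fit in two bytes.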
The following example is my own attempt at going about this:
australia_in_russian = 'Австралия'
unicode_point_hex_str = ''
for c in australia_in_russian:
    # ord(c) is the code point; to_bytes(2, ...) only fits code points up to U+FFFF.
    unicode_point_hex_char = '\\u' + ord(c).to_bytes(2, byteorder='big').hex()
    # Append this character's escape to the running result.
    unicode_point_hex_str += unicode_point_hex_char
    print( 'c: {}, ord(c): {}, encode: {}, code point: {}'.format(
        c, ord(c), c.encode('utf-8'), unicode_point_hex_char ) )
print( '1. ', unicode_point_hex_str )
"""
unicode_point_hex_str should be a string.
"""
print( '2. ', type(unicode_point_hex_str) )
"""
The literal string in 3. and 4. is the print out from: print( '1. ', unicode_point_hex_str )
"""
print( '3. ', type('\u0410\u0432\u0441\u0442\u0440\u0430\u043b\u0438\u044f') )
print( '4. ', '\u0410\u0432\u0441\u0442\u0440\u0430\u043b\u0438\u044f' )
"""
See https://stackoverflow.com/questions/24242433/how-to-convert-a-raw-string-into-a-normal-string
Also "codecs — Codec registry and base classes"
https://docs.python.org/3/library/codecs.html#text-encodings
unicode_escape is explained under "Text Encodings".
"""
import codecs
# Use a name other than str, so the built-in str type is not shadowed.
decoded_str = codecs.decode( unicode_point_hex_str, 'unicode_escape' )
print( '5. ', decoded_str )
For:
print( '1. ', unicode_point_hex_str )
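I was expecting to see the Russian text for Australia because, as we’ve seen above, printing a hard-coded Unicode code point string prints the natural text equivalent; instead, it prints out the Unicode code point string. The last four ( 4 ) print statements are debugging statements to understand why I did not get the natural text. The reason is that Python translates \uXXXX escapes only when it parses a string literal in the source code. The string built in the loop starts each piece with '\\u', i.e. a literal backslash followed by u, so at run time it is just the characters \, u and four hex digits, much like a raw string. Decoding it with unicode_escape performs the translation that the parser does for hard-coded literals. The code should be self-documenting.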
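The difference between a hard-coded escape and one assembled at run time can be seen directly by comparing lengths. The sketch below also wraps the conversion in two directions; to_escapes and from_escapes are hypothetical helper names of my own, not library functions, and the last line cross-checks the result against Python's built-in unicode_escape codec:

```python
literal = '\u0410'       # escape translated by the parser: one character
built = '\\u' + '0410'   # assembled at run time: backslash, u, four digits
print(len(literal), len(built))

def to_escapes(text):
    # \uXXXX for code points up to U+FFFF, \UXXXXXXXX above that.
    return ''.join(
        f'\\u{ord(ch):04x}' if ord(ch) <= 0xFFFF else f'\\U{ord(ch):08x}'
        for ch in text
    )

def from_escapes(escaped):
    # The escaped text is plain ASCII, so it can be turned into bytes
    # and decoded with the unicode_escape codec.
    return escaped.encode('ascii').decode('unicode_escape')

escaped = to_escapes('Россия')
print(escaped)
print(from_escapes(escaped))

# For these Cyrillic letters the built-in codec gives the same escapes.
print('Россия'.encode('unicode_escape').decode('ascii') == escaped)
```

One design note: the built-in unicode_escape codec leaves plain ASCII characters as they are and uses \xNN for other code points below U+0100, whereas the helper above always writes the \u form, which is why I kept the helper separate.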
There is still a lot to this subject. I was just seeking to answer my own, one question. I hope you find this post useful, and thank you for visiting.