What are the differences between the \x99\x99 form produced by UTF-8 encoding and the \u9999 form? And how do we convert a string to the latter format, e.g. \u0420\u043e\u0441\u0441\u0438\u044f?

Python: UTF-8 Encoding \x99\x99 and \u9999…?

Python fully supports Unicode strings – see https://docs.python.org/3/howto/unicode.html. We can see this by printing out Australia in different languages:

# Chinese ( Simplified ).
print( '澳大利亚' )
# Chinese ( Traditional ).
print( '澳大利亞' )
# Japanese.
print( 'オーストラリア' )
# Khmer.
print( 'អូស្ត្រាលី' )
# Korean: -- I think this is Australia continent.
print( '호주' ) 
# Russian.
print( 'Австралия' )
# Vietnamese -- Australia continent
print( 'Úc Châu' )
# Vietnamese -- Long form of Australia country.
print( 'Úc Đại Lợi' )

The method str.encode( 'utf-8' ) ( https://docs.python.org/3/library/stdtypes.html#str.encode ) appears frequently in forums and posts. Let’s try it:

australia_in_russian = 'Австралия'

encoded_bytes = australia_in_russian.encode( 'utf-8' )

print( '1. ', encoded_bytes )

print( '2. ', b'\xd0\x90\xd0\xb2\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd0\xbb\xd0\xb8\xd1\x8f' )

print( '3. ', encoded_bytes.decode('utf-8') )

Please note that the byte sequence in:

print( '2. ', b'\xd0\x90\xd0\xb2\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd0\xbb\xd0\xb8\xd1\x8f' )

is copied from the output of:

print( '1. ', encoded_bytes )

If we search Google for utf-8 online encoder, we will find several sites. Two of them, https://mothereff.in/utf-8 and https://www.browserling.com/tools/utf8-encode, produce the same output as str.encode( 'utf-8' ) ( https://docs.python.org/3/library/stdtypes.html#str.encode ).
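The same check can be made locally, without an online tool. The snippet below is only a small sketch: it compares the encoded bytes with the literal copied from the output above, and shows that each of the nine Cyrillic characters takes two bytes in UTF-8.

```python
australia_in_russian = 'Австралия'
encoded_bytes = australia_in_russian.encode('utf-8')

# The literal copied from the printed output is the same bytes object.
copied = b'\xd0\x90\xd0\xb2\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd0\xbb\xd0\xb8\xd1\x8f'
print(encoded_bytes == copied)

# Nine characters, but eighteen bytes: two bytes per Cyrillic letter.
print(len(australia_in_russian), len(encoded_bytes))

# The \x.. notation is only how Python displays bytes; these are the
# same values written as plain hex pairs.
hex_pairs = ' '.join('{:02x}'.format(b) for b in encoded_bytes)
print(hex_pairs)
```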

In Python, apart from byte escapes such as \xd0\x90, we also see escapes of the form \u9999, e.g. \u0410. Take an example from this post https://stackoverflow.com/questions/10569438/how-to-print-unicode-character-in-python:

print('\u0420\u043e\u0441\u0441\u0438\u044f')

which will produce:

Россия

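This post, https://stackoverflow.com/questions/55737130/relationship-between-x-and-unicode-codepoints, refers to https://www.fileformat.info/info/unicode/utf8.htm, which explains the differences between the two. In short, \u0410 names a Unicode code point, the number assigned to a character, while \xd0\x90 are the two bytes that UTF-8 uses to store that code point.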
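To make the relationship concrete, here is a minimal sketch that hand-encodes one code point into its UTF-8 bytes. The helper name utf8_two_byte is my own, for illustration only, and it covers just the two-byte range ( U+0080 to U+07FF ), which is where the Cyrillic letters in this post fall.

```python
def utf8_two_byte(code_point):
    """Hand-encode a code point in the range U+0080..U+07FF as UTF-8."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError('this sketch only handles two-byte code points')
    # First byte: the marker bits 110 followed by the top 5 bits.
    first = 0b11000000 | (code_point >> 6)
    # Second byte: the marker bits 10 followed by the low 6 bits.
    second = 0b10000000 | (code_point & 0b111111)
    return bytes([first, second])

code_point = ord('\u0410')          # Cyrillic capital letter A
print(hex(code_point))              # 0x410
print(utf8_two_byte(code_point))    # b'\xd0\x90'

# The built-in encoder gives the same two bytes.
print('\u0410'.encode('utf-8'))     # b'\xd0\x90'
```

So \u0410 and \xd0\x90 describe the same character at two different levels: the first is its number, the second is how UTF-8 lays that number out in bytes.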

Still among the results of the utf-8 online encoder search, the site https://checkserp.com/encode/utf8/ produces a Unicode code point string. I.e., if we use it to “UTF-8 Encode” Россия ( the above example ), we get back \u0420\u043e\u0441\u0441\u0438\u044f, which matches what we have above.

Naturally, we would want to convert a string to its Unicode code point string ourselves, for our own understanding; i.e. convert Россия to \u0420\u043e\u0441\u0441\u0438\u044f.

I could not find any example of how to do that. In this post, https://stackoverflow.com/questions/2269827/how-to-convert-an-int-to-a-hex-string, user Chengcheng Zhang discusses how to get the hex string equivalent of an integer, which we can apply to a character’s integer Unicode code:

(434).to_bytes(4, byteorder='big').hex()
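As a quick illustration of what that expression returns, the sketch below runs it and compares it with format(), which is another way to get zero-padded hex. The variable names are mine.

```python
code = 434  # the integer from the Stack Overflow example, 0x1b2

# Four bytes, big-endian, written as eight hex digits.
padded_hex = code.to_bytes(4, byteorder='big').hex()
print(padded_hex)        # 000001b2

# format() with '04x' pads to four hex digits instead.
short_hex = format(code, '04x')
print(short_hex)         # 01b2

# Applied to a character's code point.
letter_hex = format(ord('\u0410'), '04x')
print(letter_hex)        # 0410
```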


The following example is my own attempt at going about this:

australia_in_russian = 'Австралия'

unicode_point_hex_str = ''
for c in australia_in_russian:

    # Two bytes are enough only for code points up to U+FFFF; a larger
    # code point makes to_bytes() raise OverflowError.
    unicode_point_hex_char = '\\u' + ord(c).to_bytes(2, byteorder='big').hex()

    # Append this character's escape to the running result.
    unicode_point_hex_str += unicode_point_hex_char

    print( 'c: {}, ord(c): {}, encode: {}, code point: {}'.format(
        c, ord(c), c.encode('utf-8'), unicode_point_hex_char ) )

print( '1. ', unicode_point_hex_str )

"""
unicode_point_hex_str should be a string.
"""
print( '2. ', type(unicode_point_hex_str) )

"""
The literal string in 3. and 4. is the print out from: print( '1. ', unicode_point_hex_str )
"""
print( '3. ', type('\u0410\u0432\u0441\u0442\u0440\u0430\u043b\u0438\u044f') )
print( '4. ', '\u0410\u0432\u0441\u0442\u0440\u0430\u043b\u0438\u044f' )

"""
See https://stackoverflow.com/questions/24242433/how-to-convert-a-raw-string-into-a-normal-string

Also "codecs — Codec registry and base classes"
    https://docs.python.org/3/library/codecs.html#text-encodings

    unicode_escape is explained under "Text Encodings".
"""
import codecs
# Decode the escape sequences into the characters they stand for.
decoded_text = codecs.decode( unicode_point_hex_str, 'unicode_escape' )
print( '5. ', decoded_text )

For:

print( '1. ', unicode_point_hex_str )

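I was expecting to see the Russian text for Australia since, as we’ve seen above, printing a hard-coded Unicode code point string prints the natural text; instead, it prints the Unicode code point string itself. The last four ( 4 ) print statements are debugging statements to understand why I did not get the natural text. The reason is that Python converts \uXXXX escapes to characters only when it reads a string literal in the source code. The string we build at run time starts each group with '\\u', so it holds a literal backslash, the letter u and four hex digits, much like a raw string does, and nothing converts them until we decode with unicode_escape in 5. The code should be self-documenting.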
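For comparison, the attempt above can also be written with string formatting, which avoids the two-byte limit of to_bytes(). This is a sketch only: the helper name to_code_point_string is my own, and it uses the \UXXXXXXXX form for code points above U+FFFF. The built-in unicode_escape codec is shown as well; note that it leaves plain ASCII characters unescaped, so the two approaches agree only for text such as Россия.

```python
def to_code_point_string(text):
    """Write every character of text as a \\uXXXX or \\UXXXXXXXX escape."""
    parts = []
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0xFFFF:
            parts.append('\\u{:04x}'.format(code_point))
        else:
            # Code points above U+FFFF need the eight-digit \U form.
            parts.append('\\U{:08x}'.format(code_point))
    return ''.join(parts)

escaped = to_code_point_string('Россия')
print(escaped)      # \u0420\u043e\u0441\u0441\u0438\u044f

# The escaped string is plain ASCII, so it can be turned back into text.
restored = escaped.encode('ascii').decode('unicode_escape')
print(restored)     # Россия

# The built-in codec produces the same escapes for this text.
builtin_escaped = 'Россия'.encode('unicode_escape').decode('ascii')
print(builtin_escaped == escaped)
```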

There is still a lot to this subject. I was just seeking to answer my own, one question. I hope you find this post useful, and thank you for visiting.