A new question popped into my head recently, triggered by working with ReportLab. When we declare an HTML page as UTF-8, we can display all kinds of human languages using the “font-family that we specify in the CSS” ( my erroneous assumption! ), whereas with PDF tools, we need to select appropriate fonts for the target human languages that we want to display. The question, therefore, is: how do browsers manage that?

022-feature-image.png
How UTF-8 Gets Displayed by Browsers and PDF creation tools?

Environments

  1. Windows 10 Pro, version 21H2, OS build 19044.1706.
  2. Firefox 100.0.2 (64-bit).
  3. Python 3.10.1.
  4. ReportLab 3.6.9.
  5. ReportLab User Guide version 3.5.56, “Document generated on 2020/12/02 11:31:59”; henceforth “User Guide”, downloadable from https://www.reportlab.com/docs/reportlab-userguide.pdf

The following HTML page illustrates the UTF-8 declaration mentioned in the introduction. I purposely specify only the Arial font for the page:

<!doctype html>
<html lang="en">
<head>
    <title>Test UTF-8</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

    <style>
        body { font-family: Arial; }
        div { margin: 50px 0 0 50px; }
    </style>
</head>

<body>
    <div>
        <p>“Australia” in some other languages:</p>
        <p>Chinese ( Simplified ): 澳大利亚</p>
        <p>Chinese ( Traditional ): 澳大利亞</p>
        <p>Japanese: オーストラリア</p>
        <p>Khmer: អូស្ត្រាលី</p>
        <p>Korean: 호주 -- I think this is Australia continent.</p>
        <p>Russian: Австралия</p>
        <p>Vietnamese: Úc Châu -- Australia continent.</p>
        <p>Vietnamese: Úc Đại Lợi -- Long form of Australia country.</p>
    </div>
</body>

</html>

We understand that, within Windows, available fonts are in the “Fonts” folder ( directory ), directly under the Windows installation directory. In my case, it is:

C:\Windows\Fonts

-- I had always assumed that the Arial fonts shipped with Windows are capable of displaying every human language covered by the Unicode Standard, and hence every character that can be encoded as UTF-8!
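A minimal sketch of why the encoding alone cannot guarantee display: UTF-8 only defines how a code point is stored as bytes, and says nothing about whether a particular font has a glyph for it. The helper name `describe` is my own, for illustration only.

```python
def describe(ch):
    """Return (code point, UTF-8 bytes in hex) for a single character."""
    return f"U+{ord(ch):04X}", ch.encode("utf-8").hex(" ")


# Every character encodes without error, whether or not a font can draw it.
for ch in "AĐ澳":
    code_point, utf8_hex = describe(ch)
    print(ch, code_point, utf8_hex)
```

Encoding always succeeds here; whether something readable shows up on screen or in a PDF depends entirely on the font that is asked to draw each code point.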

ReportLab User Guide section 3.6 Asian Font Support, page 53, spells out clearly that we need to load appropriate fonts for the languages that we want to work with. Under the above assumption, I thought loading the Windows Arial font would give me PDF text similar to the HTML above:

import os

import reportlab.rl_config
reportlab.rl_config.warnOnMissingFontGlyphs = 0
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas
from reportlab.lib.colors import tan, green, yellow, red

def get_font_height( font_name, font_size ):
    # Line height: ascent minus the ( negative ) descent, plus half the ascent as extra spacing.
    ascent, descent = pdfmetrics.getAscentDescent( font_name, font_size )
    return ( ascent - descent ) + ( ascent / 2 )

font_name = 'ArialMT'
font_file = 'arial.ttf'
font_size = 12

font_path = 'C:\\Windows\\Fonts\\' # os.path.join( os.path.dirname(__file__), 'fonts\\' )
pdfmetrics.registerFont( TTFont(font_name, font_path + font_file) )

font_height = get_font_height( font_name, font_size )

canvas = canvas.Canvas( '022-reportlab-utf8-a.pdf' )

canvas.setFont( font_name, font_size )

y = 800
canvas.drawString( 100, y, '“Australia” in some other languages:' )

y -= ( font_height * 2 )
canvas.drawString( 100, y, 'Chinese ( Simplified ): 澳大利亚' )

y -= font_height
canvas.drawString( 100, y, 'Chinese ( Traditional ): 澳大利亞' )

y -= font_height
canvas.drawString( 100, y, 'Japanese: オーストラリア' )

y -= font_height
canvas.drawString( 100, y, 'Khmer: អូស្ត្រាលី' )

y -= font_height
canvas.drawString( 100, y, 'Korean: 호주 -- I think this is Australia continent.' )

y -= font_height
canvas.drawString( 100, y, 'Russian: Австралия' )

# Vietnamese -- Australia continent
y -= font_height
canvas.drawString( 100, y, 'Vietnamese: Úc Châu -- Australia continent.' )

y -= font_height
canvas.drawString( 100, y, 'Vietnamese: Úc Đại Lợi -- Long form of Australia country.' )

canvas.save()

-- Note: to work out the file name arial.ttf, I copied the “Arial icon” from my C:\Windows\Fonts folder to the Python script's “fonts” sub-directory, which then lists several ttf files. Double-clicking any one of them brings up the Windows font sample dialog, which shows the name of the font. I repeated this process for the other fonts.
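As an alternative to copying the icon, the file names could probably be listed straight from Python. This is only a sketch: `find_font_files` is a hypothetical helper of mine, and it reports file names only, not the font names shown in the sample dialog.

```python
from pathlib import Path


def find_font_files(folder, pattern="arial*"):
    """Return names of .ttf / .ttc files in `folder` whose names match `pattern`."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(
        p.name
        for p in folder.glob(pattern)
        if p.suffix.lower() in {".ttf", ".ttc"}
    )


# On my machine this would be called as:
# find_font_files(r"C:\Windows\Fonts", "arial*")
```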

The result was not what I assumed:

022-01.png

In Acrobat Reader, the File | Properties… | Fonts tab lists all fonts embedded in the document: ArialMT is the font that was loaded and used.
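The missing text suggests that Arial simply has no glyphs for those characters. A rough way to check this is sketched below. `missing_chars` is my own helper; `load_char_to_glyph` assumes that ReportLab's `TTFont` object exposes the parsed character map as `face.charToGlyph`, which I have not seen described in the User Guide, so treat it as an assumption.

```python
def missing_chars(text, char_to_glyph):
    """Return the distinct non-space characters of `text` with no glyph entry."""
    missing = []
    for ch in text:
        if ch.isspace() or ch in missing:
            continue
        if ord(ch) not in char_to_glyph:
            missing.append(ch)
    return missing


def load_char_to_glyph(font_name, font_file):
    """Read a font's code point -> glyph map via ReportLab.

    Assumption: TTFont(...).face.charToGlyph is a dict keyed by code point.
    """
    from reportlab.pdfbase.ttfonts import TTFont
    return TTFont(font_name, font_file).face.charToGlyph
```

For example, `missing_chars( 'Khmer: អូស្ត្រាលី', load_char_to_glyph( 'ArialMT', 'C:\\Windows\\Fonts\\arial.ttf' ) )` should list whichever characters of that line Arial cannot draw.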

Why is Firefox able to display all these languages correctly? After some searching, I came across this post https://stackoverflow.com/questions/884177/how-can-i-determine-what-font-a-browser-is-actually-using-to-render-some-text; the answer provided by user Arjan led me to inspect my page:

022-02.png

-- Firefox loads other fonts of its own accord, as necessary, to display the content correctly. I do the same:

import os

import reportlab.rl_config
reportlab.rl_config.warnOnMissingFontGlyphs = 0
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas
from reportlab.lib.colors import tan, green, yellow, red

def get_font_height( font_name, font_size ):
    # Line height: ascent minus the ( negative ) descent, plus half the ascent as extra spacing.
    ascent, descent = pdfmetrics.getAscentDescent( font_name, font_size )
    return ( ascent - descent ) + ( ascent / 2 )

arial_font_name = 'ArialMT'
arial_font_file = 'arial.ttf'
leelawadee_ui_font_name = 'Leelawadee UI'
leelawadee_ui_font_file = 'LeelawUI.ttf'
malgun_gothic_font_name = 'Malgun Gothic'
malgun_gothic_font_file = 'malgun.ttf'
microsoft_yahei_font_name = 'Microsoft YaHei'
microsoft_yahei_font_file = 'MicrosoftYaHei-01.ttf'

font_size = 12

font_path = 'C:\\Windows\\Fonts\\' # os.path.join( os.path.dirname(__file__), 'fonts\\' )
pdfmetrics.registerFont( TTFont(arial_font_name, font_path + arial_font_file) )
pdfmetrics.registerFont( TTFont(leelawadee_ui_font_name, font_path + leelawadee_ui_font_file) )
pdfmetrics.registerFont( TTFont(malgun_gothic_font_name, font_path + malgun_gothic_font_file) )

font_path = os.path.join( os.path.dirname(__file__), 'fonts\\' )
pdfmetrics.registerFont( TTFont(microsoft_yahei_font_name, font_path + microsoft_yahei_font_file) )

font_height = get_font_height( arial_font_name, font_size )

canvas = canvas.Canvas( '022-reportlab-utf8-b.pdf' )

# Only the most recent setFont call is in effect, so a single call suffices here.
# The heading below is drawn in Microsoft YaHei, which also has Latin glyphs.
canvas.setFont( microsoft_yahei_font_name, font_size )

y = 800
canvas.drawString( 100, y, '“Australia” in some other languages:' )

"""
Chinese Simplified and Traditional font.
"""
canvas.setFont( microsoft_yahei_font_name, font_size )

y -= ( font_height * 2 )
canvas.drawString( 100, y, 'Chinese ( Simplified ): 澳大利亚' )

y -= font_height
canvas.drawString( 100, y, 'Chinese ( Traditional ): 澳大利亞' )

"""
Japanese font.
"""
canvas.setFont( malgun_gothic_font_name, font_size )
y -= font_height
canvas.drawString( 100, y, 'Japanese: オーストラリア' )

"""
Khmer font.
"""
canvas.setFont( leelawadee_ui_font_name, font_size )
y -= font_height
canvas.drawString( 100, y, 'Khmer: អូស្ត្រាលី' )

"""
Korean font.
"""
canvas.setFont( malgun_gothic_font_name, font_size )
y -= font_height
canvas.drawString( 100, y, 'Korean: 호주 -- I think this is Australia continent.' )

"""
Russian and Vietnamese font.
"""
canvas.setFont( arial_font_name, font_size )

y -= font_height
canvas.drawString( 100, y, 'Russian: Австралия' )

y -= font_height
canvas.drawString( 100, y, 'Vietnamese: Úc Châu -- Australia continent.' )

y -= font_height
canvas.drawString( 100, y, 'Vietnamese: Úc Đại Lợi -- Long form of Australia country.' )

canvas.save()

Microsoft YaHei font files have the ttc extension. I used https://transfonter.org/ to convert msyhl.ttc to ttf, and stored the resulting files in the Python script’s “fonts” sub-directory. This time, the result is what I anticipated:

022-03.png
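In the script above I pick the font for each line by hand. A browser-like fallback could be roughly imitated by choosing a font per character and drawing each run separately. The sketch below is only that: the range table, `font_for`, `split_runs` and `draw_mixed_string` are my own illustrative names, the font names must already be registered as in the script above, and no complex text shaping is attempted.

```python
# Hypothetical mapping from Unicode ranges to the fonts registered earlier.
FONT_RANGES = [
    (0x1780, 0x17FF, "Leelawadee UI"),    # Khmer
    (0x30A0, 0x30FF, "Malgun Gothic"),    # Katakana
    (0xAC00, 0xD7A3, "Malgun Gothic"),    # Hangul syllables
    (0x4E00, 0x9FFF, "Microsoft YaHei"),  # CJK unified ideographs
]
DEFAULT_FONT = "ArialMT"


def font_for(ch):
    """Pick a font name for one character, falling back to Arial."""
    code_point = ord(ch)
    for first, last, font_name in FONT_RANGES:
        if first <= code_point <= last:
            return font_name
    return DEFAULT_FONT


def split_runs(text):
    """Group consecutive characters that share a font into (font_name, run) pairs."""
    runs = []
    for ch in text:
        font_name = font_for(ch)
        if runs and runs[-1][0] == font_name:
            runs[-1] = (font_name, runs[-1][1] + ch)
        else:
            runs.append((font_name, ch))
    return runs


def draw_mixed_string(pdf, x, y, text, font_size):
    """Draw `text` on a ReportLab canvas `pdf`, switching fonts per run."""
    for font_name, run in split_runs(text):
        pdf.setFont(font_name, font_size)
        pdf.drawString(x, y, run)
        # Advance by the width of the run just drawn.
        x += pdf.stringWidth(run, font_name, font_size)
```

With this, a call such as `draw_mixed_string( canvas, 100, y, 'Korean: 호주', font_size )` would draw the Latin part in Arial and the Hangul part in Malgun Gothic, without an explicit `setFont` before each line.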

On fonts, I found this article https://css-tricks.com/understanding-web-fonts-getting/ very informative.

Fonts are a very large subject. I was just trying to answer my own question, and I am happy with what I have found. I hope you find this post useful, and thank you for visiting.