Loosely speaking, font subsetting involves extracting only the characters we need from a font program, such as a TrueType .ttf file. The Arial Unicode MS font program is around 20MB. If we need only a few Vietnamese characters, we can extract and use those, resulting in a font subset of just a few kilobytes.

This article focuses on font subsetting on Windows and Ubuntu as a standalone process. We begin by installing a few standalone font tools on Windows, then explore the font subsetting workflow using the HarfBuzz library.

🦀 Index of the Complete Series.

152-feature-image.png
Rust FFI Font Subsetting Using the HarfBuzz Text Shaping Engine

🚀 The code for this post is in the following GitHub repository: harfbuzz_font_subset.

This article continues from the first article. It also presents a brand new standalone program, whose boilerplate code is based on the program that accompanied that article.

Install the BirdFont GUI Application

You can download the installer from BirdFont. It is free software, and donations to the developers are welcome.

Originally, ChatGPT recommended FontForge, which I did install and use for a bit. However, it’s quite old, and the UI feels a bit awkward.

Install the Python fontTools Package

This is its repository: https://github.com/fonttools/fonttools. We can use it to convert a font program (such as a TrueType font file) into XML to inspect internal information that would otherwise be hidden.

💡 If you’re unfamiliar with Python Virtual Environments or how to set one up, please see this post: Python: Virtual Environment virtualenv for multiple Python versions, which I wrote a fair while back.

I already have a generic Python development area under F:\pydev\. After changing to the F:\pydev\ directory, activate the virtual environment venv, then install the fonttools package using the following commands:

F:\pydev>venv\Scripts\activate
(venv) F:\pydev>pip install fonttools[ufo,lxml,woff,unicode]

Once completed, we can verify the installation with the following command:

(venv) F:\pydev>ttx --version

My installation reports 4.60.1.

🐧 On Ubuntu, the installation we performed in the first article also installed this tool: ttx --version reports 4.46.0.

🪟 On Windows, I can run it as follows:

(venv) F:\pydev>ttx path\to\font\program\font_file.ttf

🐧 On Ubuntu, the CLI syntax is the same.

The resulting path\to\font\program\font_file.ttx is an XML file that lists the internal structure of the .ttf font program.

Optionally Install HarfBuzz CLI Tools

Recall that the build we performed in the first article also produced some CLI tools for Windows and Ubuntu.

If desired, we can download a prebuilt version of these CLI tools from https://sourceforge.net/projects/harfbuzz.mirror/. I downloaded harfbuzz-win64-12.1.0.zip and extracted its contents to C:\PF\harfbuzz-win64\. I believe this build is provided by the author of HarfBuzz.

Font Programs Used In This Article

🪟 On Windows, we use the standard Arial Unicode MS font, located at C:/Windows/Fonts/arialuni.ttf. 🐧 On Ubuntu, I downloaded Noto_Sans_SC, Noto_Sans_TC,Noto_Serif_TC.zip from Google Fonts, and extracted the contents to /home/behai/Noto_Sans_TC. The font program we are using is NotoSansTC-Regular.ttf.

To keep things simple, we will refer to these fonts using absolute paths.

The hb-shape and hb-subset CLIs

💡 Please note: based on the build and installation process discussed in the first article, 🪟 on Windows, before running these CLIs we need to set the library paths once per terminal session:

set PATH=C:\PF\harfbuzz\build\src\;%PATH%
set PATH=C:\PF\vcpkg\installed\x64-windows\bin\;%PATH%

A glyph ID is an unsigned integer used by a font program to represent a specific visual shape of a character. The same character can have different glyph IDs in different font programs.

The text we are using for font subsetting throughout this article is “Kỷ độ Long Tuyền đới nguyệt ma. 幾度龍泉戴月磨。” — a single verse in Vietnamese and Chinese from a poem by General Đặng Dung of the Later Trần Dynasty, shortly before the general took his own life in 1413, rather than be executed by his Ming-Chinese captors. This verse means “Countless times I have sharpened my battle sword under the moonlight.”

The hb-shape CLI

The hb-shape CLI is a shaping diagnostics tool. Among its many options, we can use it to obtain the glyph IDs for a given text and font program.

🪟 – In its simplest form:

C:\PF\harfbuzz\build\util\hb-shape.exe ^
C:\Windows\Fonts\arialuni.ttf ^
"Kỷ độ Long Tuyền đới nguyệt ma. 幾度龍泉戴月磨。"

The output looks like:

[gid46=0+1366|gid2985=1+1024|gid3=2+569|gid211=3+1139^
|gid2955=4+1139|gid3=5+569|gid47=6+1139|gid82=7+1139|^
gid81=8+1139|gid74=9+1139|gid3=10+569|gid55=11+1251|^
gid88=12+1139|gid92=13+1024|gid2931=14+1139|gid81=^
15+1139|gid3=16+569|gid211=17+1139|gid2957=18+1139|^
gid76=19+455|gid3=20+569|gid81=21+1139|gid74=22+1139|^
gid88=23+1139|gid92=24+1024|gid2937=25+1139|gid87=^
26+569|gid3=27+569|gid80=28+1706|gid68=29+1139|gid17^
=30+569|gid3=31+569|gid12557=32+2048|gid12597=33+2048^
|gid29212=34+2048|gid16216=35+2048|gid13507=36+2048|^
gid14743=37+2048|gid19319=38+2048|gid4589=39+2048]

💡 Note: if we run the above command multiple times, we are not guaranteed to get the glyph list in the same order each time.

We can feed the unique gids (46, 2985, … 4589) into hb-subset to create a font subset if we choose to.

To get CLI options for hb-shape, we can run:

C:\PF\harfbuzz\build\util\hb-shape.exe --help

To get just the glyph IDs for the text:

C:\PF\harfbuzz\build\util\hb-shape.exe --no-glyph-names ^
--no-positions --no-clusters C:\Windows\Fonts\arialuni.ttf ^
"Kỷ độ Long Tuyền đới nguyệt ma. 幾度龍泉戴月磨。"

The output is now simplified to:

[46|2985|3|211|2955|3|47|82|81|74|3|55|88|92|2931|^
81|3|211|2957|76|3|81|74|88|92|2937|87|3|80|68|17|^
3|12557|12597|29212|16216|13507|14743|19319|4589]
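The step from this output to a unique glyph ID list can be sketched in Rust. This is a minimal illustration, not part of the repository: the function names unique_glyph_ids and to_comma_list are hypothetical, and it assumes the output is passed in as the tool prints it, on a single line, with glyph names of the form gidN (or bare numbers) as seen in the listings above.

```rust
use std::collections::BTreeSet;

/// Parse hb-shape output such as `[46|2985|3]` or
/// `[gid46=0+1366|gid3=2+569]` into a sorted set of unique glyph IDs.
/// Entries that are not a number or `gidN` are skipped (an assumption
/// of this sketch; fonts with real glyph names would need a lookup).
fn unique_glyph_ids(shape_output: &str) -> BTreeSet<u32> {
    shape_output
        .trim()
        .trim_start_matches('[')
        .trim_end_matches(']')
        .split('|')
        .filter_map(|entry| {
            // Keep the part before `=` (if any), then drop an optional `gid` prefix.
            let name = entry.split('=').next()?.trim();
            name.strip_prefix("gid").unwrap_or(name).parse::<u32>().ok()
        })
        .collect()
}

/// Join the IDs with commas, the list shape used on the command line.
fn to_comma_list(ids: &BTreeSet<u32>) -> String {
    ids.iter()
        .map(|id| id.to_string())
        .collect::<Vec<_>>()
        .join(",")
}
```

Calling to_comma_list(&unique_glyph_ids(output)) on the simplified output yields each glyph ID once, in ascending order. A BTreeSet is used here only so that the result is deduplicated and stable.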

🐧 – Using a different font program, we get different glyph IDs:

$ hb-shape /home/behai/Noto_Sans_TC/NotoSansTC-Regular.ttf \
"Kỷ độ Long Tuyền đới nguyệt ma. 幾度龍泉戴月磨。"

Output:

[gid44=0+621|gid460=1+521|gid1=2+224|gid194=3+620|^
gid430=4+606|gid1=5+224|gid45=6+527|gid80=7+606|gid79^
=8+610|gid72=9+564|gid1=10+224|gid53=11+544|gid86=^
12+607|gid90=13+514|gid406=14+554|gid79=15+610|gid1^
=16+224|gid194=17+620|gid432=18+606|gid74=19+275|^
gid1=20+224|gid79=21+610|gid72=22+564|gid86=23+607|^
gid90=24+514|gid412=25+540|gid85=26+377|gid1=27+224|^
gid78=28+926|gid66=29+563|gid15=30+278|gid1=31+224|^
gid5794=32+1000|gid5822=33+1000|gid17998=34+1000|^
gid8575=35+1000|gid6506=36+1000|gid7442=37+1000|^
gid11001=38+1000|gid20341=39+1000]

And:

$ hb-shape /home/behai/Noto_Sans_TC/NotoSansTC-Regular.ttf \
--no-glyph-names --no-positions --no-clusters \
"Kỷ độ Long Tuyền đới nguyệt ma. 幾度龍泉戴月磨。"

gives:

[44|460|1|194|430|1|45|80|79|72|1|53|86|90|406|79|^
1|194|432|74|1|79|72|86|90|412|85|1|78|66|15|1|5794|^
5822|17998|8575|6506|7442|11001|20341]

The hb-subset CLI

The hb-subset CLI extracts only the characters we need from a font program. We can either provide the text directly or supply a unique list of glyph IDs that represent the text.

🪟 – Let’s take a look at glyph ID input. I’m using the list above, which contains duplicates:

C:\PF\harfbuzz\build\util\hb-subset.exe --glyphs=46,^
2985,3,211,2955,3,47,82,81,74,3,55,88,92,2931,81,3,^
211,2957,76,3,81,74,88,92,2937,87,3,80,68,17,3,^
12557,12597,29212,16216,13507,14743,19319,4589 ^
C:/Windows/Fonts/arialuni.ttf --output-file=subset.ttf

The screenshot below shows subset.ttf as viewed in BirdFont:

152-windows-font-subset.png
Windows

🐧 – Now let’s look at text input:

$ hb-subset /home/behai/Noto_Sans_TC/NotoSansTC-Regular.ttf \
--text="Kỷ độ Long Tuyền đới nguyệt ma. 幾度龍泉戴月磨。" \
--output-file=subset.ttf

The screenshot below shows subset.ttf as viewed in BirdFont:

152-ubuntu-font-subset.png
Ubuntu
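Text input ultimately comes down to a set of Unicode code points. As a minimal sketch of that idea (the function names unique_code_points and format_code_points are hypothetical, and this is not code from the repository), the distinct code points of a piece of text can be collected in Rust like this:

```rust
use std::collections::BTreeSet;

/// Collect the distinct Unicode scalar values in `text`, in ascending order.
fn unique_code_points(text: &str) -> BTreeSet<u32> {
    text.chars().map(|c| c as u32).collect()
}

/// Format code points in the usual `U+XXXX` notation for inspection.
fn format_code_points(points: &BTreeSet<u32>) -> Vec<String> {
    points.iter().map(|cp| format!("U+{:04X}", cp)).collect()
}
```

For example, unique_code_points("ma ma.") contains four values: the space, the full stop, "a" and "m". Note that the result depends on how the text is encoded: a Vietnamese letter stored in decomposed form (base letter plus combining marks) produces different code points from the same letter stored as a single precomposed character.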

Rust FFI Font Subsetting Using HarfBuzz’s Functions

We are replicating the subsetting functionality of the hb-subset CLI using Rust FFI to call into HarfBuzz’s functions. The structure of this program is quite similar to the hb_version_string() program discussed in the first article; in this one, we’re simply calling more of HarfBuzz’s functions. 🦀 This program is also a one-off standalone; we’ll use the illustrated code to develop future programs, but we won’t be making changes to it.

We note the following:

⓵ In harfbuzz_font_subset/build.rs: Added hb-subset.h. The compiler will generate a new bindings.rs module, which will include the required HarfBuzz subset functions.

⓶ In harfbuzz_font_subset/src/main.rs:

Lines 11–23: We import the functions needed for subsetting.

Lines 53–55: We use character Unicode values, effectively subsetting based on the actual text. The text is hardcoded for simplicity.

● Note the lack of error handling — this is intentional to keep the code simple for illustration purposes.

Lines 65–72:

    let result = slice.to_vec();

    unsafe { hb_blob_destroy(blob) };
    unsafe { hb_face_destroy(subset_face) };
    unsafe { hb_subset_input_destroy(input) };
    unsafe { hb_face_destroy(face) };

    fs::write(output_font_file, result).unwrap();

🙏 Note the order. The call to slice.to_vec() copies the subset bytes into a Vec owned by Rust while the HarfBuzz blob is still alive. Only then do I free all HarfBuzz-allocated memory, and I purposely access result last, so that it never refers to memory still managed by HarfBuzz.
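The copy-before-free pattern can be tried without HarfBuzz at all. The following is a minimal sketch under stated assumptions: foreign_alloc and foreign_free are hypothetical stand-ins for a library that owns a buffer and its matching destroy call (they are not HarfBuzz functions), and copy_then_free mirrors the order used in the listing above.

```rust
use std::ptr;
use std::slice;

/// Stand-in for a foreign allocation, such as the data behind a blob.
/// Hypothetical helper for this sketch only.
fn foreign_alloc(bytes: &[u8]) -> (*mut u8, usize) {
    let boxed: Box<[u8]> = bytes.to_vec().into_boxed_slice();
    let len = boxed.len();
    (Box::into_raw(boxed) as *mut u8, len)
}

/// Stand-in for the matching foreign `destroy` call.
/// Safety: `data` and `len` must come from `foreign_alloc`, freed once.
unsafe fn foreign_free(data: *mut u8, len: usize) {
    // Rebuild the Box so the allocation is released exactly once.
    drop(unsafe { Box::from_raw(ptr::slice_from_raw_parts_mut(data, len)) });
}

/// Copy the bytes into a Rust-owned Vec first, then free the foreign buffer.
/// Safety: same requirements as `foreign_free`.
unsafe fn copy_then_free(data: *mut u8, len: usize) -> Vec<u8> {
    // Borrow the foreign memory only for as long as the copy takes.
    let owned = unsafe { slice::from_raw_parts(data, len) }.to_vec();
    unsafe { foreign_free(data, len) };
    // `owned` has its own allocation, so it stays valid after the free.
    owned
}
```

Reversing the two steps inside copy_then_free would read freed memory, which is exactly what the ordering in main.rs avoids.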

🐧 On Ubuntu, all required libraries are globally recognized. 🪟 On Windows, I haven’t added the paths for harfbuzz.dll, harfbuzz-subset.dll, and their dependencies to the PATH environment variable. So in each new Windows terminal session, I run the following once:

set PATH=C:\PF\harfbuzz\build\src\;%PATH%
set PATH=C:\PF\vcpkg\installed\x64-windows\bin\;%PATH%

After that, cargo run works as expected.

Viewing win_subset.ttf and linux_subset.ttf in BirdFont should show the same results as the font subset files created by the hb-subset CLI.

💡 Subset Using Glyph IDs

For an illustration of how to subset using glyph IDs, please refer to harfbuzz_font_subset/src/main_glyph.rs — rename it to main.rs to build and run. It uses the harfbuzz_font_subset/src/glyph.rs module.

Other Crates

I’m aware of other Rust crates related to HarfBuzz, notably the hb-subset crate. In fact, my initial attempt at subsetting was through this crate. However, it is outdated, so I chose the FFI route rather than patching the crate.

What’s Next

That wraps up the font subsetting illustration process. There are still more than 1,300 warnings, but I’m not too worried about them. In the next instalment, we’ll extend the code in the harfbuzz_font_subset/src/main.rs module into a generic function:

pub fn get_font_subset(input_font_file: &str, 
    face_index: u32, 
    text: &str
) -> Result<Vec<u8>, String>

to perform font subsetting generically, as part of a polyglot (multilingual) PDF creation workflow using the lopdf crate.

Thanks for reading! I hope this post helps others on the same journey.
As always—stay curious, stay safe 🦊

✿✿✿
