A “multilingual” PDF is simply a PDF document that supports displaying, copying, and pasting text in any Unicode-supported language.

I’ve used several third-party libraries to generate multilingual PDF documents in previous jobs. I was under the impression that as long as we used the correct fonts, the text would render automatically—and so would copy and paste. But when I tried to create a multilingual PDF with Rust, I soon realised it’s not that simple!

With help from ChatGPT and Copilot, we discovered that quite a bit of work is needed to get everything working. It’s not as simple as just embedding a font program in the PDF document.

In this article, we’ll take an introductory look at creating a multilingual PDF document that includes both Chinese and Vietnamese text. The Rust code we’re writing runs on both Windows and Ubuntu.

🦀 Index of the Complete Series.

153-feature-image.png
Rust: Multilingual PDFs — an Introductory Study

🚀 The code for this post is in the following GitHub repository: pdf_01.

This post requires the HarfBuzz text shaping engine, as discussed in the
first article. This engine is required for font subsetting, a process we explored in the second article.

The lopdf Crate and Some Referenced Documentation

⓵ After studying several crates, I decided on lopdf, based on the assessments of both Copilot and ChatGPT. In summary:

  • Strengths
    • Low-level building blocks for the PDF spec.
    • Excellent for parsing, editing, and merging.
    • Fine-grained access to PDF objects, dictionaries, and streams.
    • Can create and also parse/modify existing PDFs.
    • Actively used in other Rust PDF projects.
  • Limitations
    • Not ideal for generating new documents from scratch.
    • Steeper learning curve; requires familiarity with the PDF spec.
    • Very low level — essentially working with the PDF object model (COS objects: a PDF file is structured as a tree of low-level objects).
  • Best Use Case — PDF post-processing: for users who want full control over PDF internals (merging, splitting, editing metadata, repairing malformed files, etc.).

⓶ Based on the lopdf documentation:

A useful reference for understanding the PDF file format and the eventual usage of this library is the PDF 1.7 Reference Document. The PDF 2.0 specification is available here.

The PDF 1.7 Reference Document is a punishing read… I’ve gone over chapter 9 Text several times. The rest I haven’t covered. In addition to this reference, Adobe has published perhaps thousands of technical documents. I’ve read the following two:

  1. Adobe Technical Note #5014, Adobe CMap and CIDFont Files Specification — Explains embedding and subsetting in practice.
  2. Adobe Technical Note #5411, ToUnicode Mapping File Tutorial — Explains how copy and paste works.

I’ve also read the following six articles by Mr. Jay Berkenbilt, which I found very helpful. The first is a standalone article; the next five form a series:

  1. The Structure of a PDF File
  2. Text in PDF: Introduction
  3. Text in PDF: Basic Operators
  4. Text in PDF: Unicode
  5. Text in PDF: Fonts and Spacing
  6. Text in PDF: Non-Latin Alphabets

In one of the articles, the author demonstrates that we can actually handcraft a PDF document using just a normal text editor!
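To get a sense of how approachable that is, here is a sketch in the same spirit: Rust code that hand-assembles a minimal one-page “Hello” PDF as plain text, computing the cross-reference (xref) byte offsets as it goes. The object layout and numbering are my own minimal choices, not taken from the article:

```rust
// Hand-assemble a minimal one-page PDF showing "Hello" in Helvetica.
// The xref table needs the byte offset of every object, so we record
// each offset as we append the object to the output string.
fn build_minimal_pdf() -> String {
    let objects = [
        "1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        "2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n",
        "3 0 obj\n<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] \
         /Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>\nendobj\n",
        "4 0 obj\n<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>\nendobj\n",
        // /Length 36 is the byte count of the content between the
        // stream/endstream keywords (excluding the final newline).
        "5 0 obj\n<< /Length 36 >>\nstream\nBT /F1 24 Tf 72 720 Td (Hello) Tj ET\nendstream\nendobj\n",
    ];
    let mut pdf = String::from("%PDF-1.4\n");
    let mut offsets = Vec::new();
    for obj in &objects {
        offsets.push(pdf.len()); // byte offset of this object
        pdf.push_str(obj);
    }
    let xref_start = pdf.len();
    // xref: one free entry (object 0) plus one entry per object,
    // each exactly 20 bytes: 10-digit offset, 5-digit generation.
    pdf.push_str(&format!("xref\n0 {}\n", objects.len() + 1));
    pdf.push_str("0000000000 65535 f \n");
    for off in &offsets {
        pdf.push_str(&format!("{off:010} 00000 n \n"));
    }
    pdf.push_str(&format!(
        "trailer\n<< /Size {} /Root 1 0 R >>\nstartxref\n{}\n%%EOF\n",
        objects.len() + 1,
        xref_start
    ));
    pdf
}

fn main() {
    print!("{}", build_minimal_pdf());
}
```

Saving the output as a `.pdf` file should open in most viewers. Of course, this only works for simple Latin text with a standard font — the whole point of this series is that multilingual text needs embedded fonts, CIDs, and ToUnicode maps on top of this skeleton.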

The ttf-parser Crate

This crate seems to be very popular—at the time of writing this article, it already has 40,306,200 downloads. I understand that it is written entirely in Rust, and therefore does not require any external libraries. We can add this crate via cargo add ttf-parser.

There are two methods used in this code that I’d like to mention.

⓵ The first is pub fn units_per_em(&self) -> u16 — This u16 value defines the internal unit system of the font program. We use this value to prepare font programs to be device-independent, so they can be scaled and rendered consistently across different environments like screens, printers, and PDFs.

We can try out the units_per_em() method using the Rust script below. Replace C:/Windows/Fonts/arialuni.ttf with your own font path:

Content of pdf_01/src/main_ttf_parser_units_per_em.rs:

use std::{fs, process};
use ttf_parser::Face;

fn main() {
    // Load font file
    let font_data = match fs::read("C:/Windows/Fonts/arialuni.ttf") {
        Ok(res) => res,
        Err(err) => {
            println!("Error: {}", err);
            process::exit(1);
        }
    };

    // Parse the font face with ttf-parser
    let face = Face::parse(&font_data, 0).expect("TTF parse");
    // The font's internal design-grid resolution (commonly 1000 or 2048)
    let units_per_em = face.units_per_em() as f32;

    println!("units_per_em: {}", units_per_em);
}
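Why this value matters in practice: PDF glyph metrics — for example the widths in a CIDFont’s /W array — are expressed in 1000ths of the text size, while TrueType fonts commonly use 1000 or 2048 design units per em. A minimal sketch of the conversion (the function `scale_to_pdf` is my own illustration, not part of ttf-parser):

```rust
// Convert an advance width from font design units to PDF glyph-space
// units (1000 units per em), as used in a CIDFont's /W array.
fn scale_to_pdf(advance: u16, units_per_em: u16) -> f32 {
    advance as f32 * 1000.0 / units_per_em as f32
}

fn main() {
    // A glyph advance of 1229 design units in a 2048-units-per-em font
    // scales to roughly 600 units in PDF glyph space.
    let w = scale_to_pdf(1229, 2048);
    println!("PDF width: {w:.1}");
}
```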

⓶ The second method is pub fn glyph_index(&self, code_point: char) -> Option<GlyphId> — Loosely speaking, this GlyphId is the font program’s (arialuni.ttf) internal representation of the character code_point. PDF readers use this GlyphId to render the character.

Content of pdf_01/src/main_ttf_parser_glyph_index.rs:

use std::{fs, process};
use ttf_parser::Face;

fn main() {
    // Load font file
    let font_data = match fs::read("C:/Windows/Fonts/arialuni.ttf") {
        Ok(res) => res,
        Err(err) => {
            println!("Error: {}", err);
            process::exit(1);
        }
    };

    // Parse the font face with ttf-parser
    let face = Face::parse(&font_data, 0).expect("TTF parse");

    let strs = vec![
        "Kỷ độ Long Tuyền đới nguyệt ma.",
        "幾度龍泉戴月磨。",
    ];

    for str in strs {
        let mut glyphs_for_text = Vec::<u16>::new();

        for ch in str.chars() {
            if let Some(gid) = face.glyph_index(ch) {
                glyphs_for_text.push(gid.0);
            } else {
                println!("Glyph ID not found for {ch}.");
            }
        }

        println!("Text: [{str}].");
        println!("glyphs_for_text: {:?}", glyphs_for_text);
    }
}

🙏 Can you see that the contents of glyphs_for_text match the output of 🪟 hb-shape that we discussed in the second article?

The PDFXplorer Windows GUI Application

This application enables us to inspect the internal structure of PDF documents. It also allows us to export certain data. It’s a very powerful tool—at least in my opinion.

We can download the installer from https://pdfxplorer.dev/. It is free software. The screenshot below shows a portion of the structure of a PDF generated by the Rust code in this article:

153-01-pdfxplorer-sample.png

We can see internal objects discussed in the reference documents listed above.

The Rust Code

💡 Please note: on both Windows and Ubuntu, I’m running Rust version rustc 1.90.0 (1159e78c4 2025-09-14).

This is once again a one-off project—I don’t plan to update it in future development. I’d like to keep a log of progress exactly as it took place. Future code may copy this and make changes to it. I’ve placed the project under the pdf_01 directory. The structure is simple:

├── build.rs
├── Cargo.toml
└── src
    ├── main.rs
    ├── pdf_font_info.rs
    ├── pdf_gen.rs
    ├── pdf_page.rs
    └── subset_builder.rs

💡 As mentioned at the outset, this code requires the HarfBuzz text shaping library—it makes FFI calls into this library. 🐧 On Ubuntu, all required libraries are globally recognized. 🪟 On Windows, I haven’t added the paths for harfbuzz.dll, harfbuzz-subset.dll, and their dependencies to the PATH environment variable. So in each new Windows terminal session, I run the following once:

set PATH=C:\PF\harfbuzz\dist\bin\;%PATH%
set PATH=C:\PF\vcpkg\installed\x64-windows\bin\;%PATH%

After that, cargo run works as expected.

🦀 Again, to keep things simple, we use absolute paths for everything in the code, mostly font program files. Chances are you’ll need to modify the code to work with your system.

⓵ The pdf_01/build.rs module: This is an exact copy of the harfbuzz_font_subset/build.rs discussed in the second article.

⓶ The pdf_01/Cargo.toml: Similar to the previous two, with two new [dependencies]: rand and ttf-parser.

⓷ The pdf_01/src/subset_builder.rs module: At the conclusion of the second article, we mentioned that we’d turn the code in the harfbuzz_font_subset/src/main.rs module into a generic function:

pub fn get_font_subset(
    input_font_file: &str,
    face_index: u32,
    text: &str,
) -> Result<Vec<u8>, String>

This module is that function — with error handling.

⓸ The pdf_01/src/pdf_font_info.rs module: A helper module. It’s only 81 lines including comments and should be self-explanatory.

⓹ The pdf_01/src/pdf_page.rs module: Attempts to encapsulate input PDF pages. At this initial stage, pages contain only text. Each page also has a vector of u8, representing the text’s character glyph IDs (CIDs) as big-endian u16 bytes. We described extracting glyph IDs in a previous section.

The page collection struct PdfPages defines used_cids — the unique glyph IDs for the text across all pages. The method prepare_used_cids_glyph_bytes() must be called explicitly to prepare the text glyph IDs for further processing: this data enables characters beyond English to be recognised and rendered by PDF readers.
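To illustrate the idea only — the names `PdfPage`, `glyph_ids`, and the helper functions below are my own simplified sketch, not the actual module’s API — each page keeps its glyph IDs, the collection gathers the unique set, and each ID is serialised as big-endian u16 bytes:

```rust
use std::collections::BTreeSet;

// A simplified stand-in for a page: just its text's glyph IDs.
struct PdfPage {
    glyph_ids: Vec<u16>,
}

// Gather the unique glyph IDs (CIDs) used across all pages.
fn collect_used_cids(pages: &[PdfPage]) -> BTreeSet<u16> {
    pages.iter().flat_map(|p| p.glyph_ids.iter().copied()).collect()
}

// Serialise a page's glyph IDs as big-endian u16 bytes, the form
// the PDF text operators expect under Identity-H encoding.
fn glyph_bytes(glyph_ids: &[u16]) -> Vec<u8> {
    glyph_ids.iter().flat_map(|gid| gid.to_be_bytes()).collect()
}

fn main() {
    // Invented glyph IDs; a real run would come from Face::glyph_index().
    let pages = vec![
        PdfPage { glyph_ids: vec![70, 258, 1024] },
        PdfPage { glyph_ids: vec![258, 3] },
    ];
    println!("used CIDs: {:?}", collect_used_cids(&pages));
    println!("page 1 bytes: {:?}", glyph_bytes(&pages[0].glyph_ids));
}
```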

⓺ The pdf_01/src/pdf_gen.rs module: Responsible for generating the PDF document. The call sequence is listed below — the main() function calls the public function generate_pdf(), which then calls PdfPages::prepare_used_cids_glyph_bytes() as described, then calls prepare_pdf_doc(), and so on:

generate_pdf()
    |
    PdfPages::prepare_used_cids_glyph_bytes()
    |
    prepare_pdf_doc()
    |       |
    |       create_font_stream()
    |       create_font_descriptor()
    |
    prepare_shared_font()
    |       |
    |       get_width_maps()
    |       build_w_array()
    |       create_cid_font_type2()
    |       make_to_unicode_cmap()
    |       create_font_referencing_descendant()
    |       create_font_resources_id()
    |
    prepare_page_content() ➜ repeat for each page
    |
    finalise PDF creation

We won’t go into detail for each method — they implement internal objects covered in the reference literature listed above. Notable points:

● Function make_to_unicode_cmap(): Implements copy and paste. To disable it, comment out line 215 in create_font_referencing_descendant(): "ToUnicode" => tounicode_id,. Without this, copying text from the generated PDF will result in tofu or rectangle boxes.

Please note: this function is incomplete and only works for a small amount of text, as noted in the inline comments. It’s a work in progress.
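To give a feel for what such a CMap contains — this is a hand-rolled sketch, not the actual make_to_unicode_cmap() implementation — the heart of a ToUnicode CMap is a bfchar section mapping each CID (as written by Tj) back to its Unicode code point, which is what makes copy and paste produce real text:

```rust
// Build the bfchar section of a minimal ToUnicode CMap: each line maps
// a CID to the Unicode character it represents, both as big-endian hex.
fn bfchar_section(pairs: &[(u16, char)]) -> String {
    let mut out = format!("{} beginbfchar\n", pairs.len());
    for (cid, ch) in pairs {
        // Characters outside the BMP would need a UTF-16 surrogate
        // pair on the right-hand side (not handled in this sketch).
        out.push_str(&format!("<{:04X}> <{:04X}>\n", cid, *ch as u32));
    }
    out.push_str("endbfchar");
    out
}

fn main() {
    // Hypothetical CID-to-character pairs for "độ" (glyph IDs invented).
    let pairs = [(0x0123u16, 'đ'), (0x0456, 'ộ')];
    println!("{}", bfchar_section(&pairs));
}
```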

● In prepare_page_content(), we take the page’s glyph bytes (big-endian u16 values) and feed them to the PDF Tj operator to render the page text content.
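As a sketch of what that looks like in the content stream — the helper name below is my own, not the module’s — the glyph IDs are written as a hex string between angle brackets and handed to Tj:

```rust
// Encode glyph IDs as a PDF hex string for the Tj operator.
// Under Identity-H each glyph ID is written as a big-endian u16,
// i.e. four uppercase hex digits per glyph.
fn tj_hex_string(glyph_ids: &[u16]) -> String {
    let hex: String = glyph_ids.iter().map(|gid| format!("{gid:04X}")).collect();
    format!("<{hex}> Tj")
}

fn main() {
    // Three invented glyph IDs; a real run would come from glyph_index().
    println!("{}", tj_hex_string(&[0x0046, 0x0102, 0x0400]));
}
```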

● The Identity-H encoding: Specifies a direct 1:1 mapping between character codes and CIDs (glyph IDs) in the font. Code 0 maps to CID 0, code 1 to CID 1, and so on. It tells the PDF viewer not to apply any predefined character collection or re-encoding—just use the embedded font’s glyphs as-is. When using Identity-H, text data is typically written as big-endian u16 values to the Tj or TJ operators.

● /Registry (Adobe) /Ordering (Identity) /Supplement 0: This defines the character collection for the CIDFont. The “Identity” collection means no external mapping is used — glyph selection is entirely determined by the CIDs within the embedded font. The Registry and Supplement entries are formal identifiers required by the PDF spec, but in this case they simply indicate that the font uses its own internal identity mapping.

Together, Identity-H and (Adobe, Identity, 0) tell the PDF reader: “This is a Unicode-style CIDFont where the character codes map directly to glyph IDs inside the embedded font.”

⓻ The pdf_01/src/main.rs module: Calls functions from other modules to create a two-page PDF document. Please note that the text used in this document is the same text used in the second article. The first page contains the single Vietnamese verse; the second page contains the single Chinese verse. This module should be self-explanatory.

Examine Generated PDFs with PDFXplorer

Let’s take another look at the PDF generated on 🐧 Ubuntu using PDFXplorer. The screenshot below shows the font name defined by the get_font_info() function:

153-02-pdfxplorer-ubuntu.png

The screenshot below shows page 1 — the /Kids[0] object:

153-03-pdfxplorer-ubuntu.png

The long sequence of hexadecimal digits between angle brackets, i.e. <000...002>, represents the glyph bytes — the big-endian u16 byte encoding of the text "Kỷ độ Long Tuyền đới nguyệt ma.".

Navigating through the document structure, we can see the objects we programmed in the code. We also observe PDF operators such as Tf, Td, Tj, etc., as described in the reference literature, particularly in the PDF 1.7 Reference Document, chapter 9 Text.

Examine Generated PDFs with pdffonts

The HarfBuzz 🐧 Ubuntu installation we performed in the first article also installs the pdffonts CLI, which we can use to list the fonts used in a PDF. The screenshot below shows pdffonts in use:

153-04-pdffonts-ubuntu.png

We can see that there’s no issue with the font, and the font name matches the one shown by PDFXplorer — as expected, since this is the same PDF document.

What’s Next

The Rust code presented in this article is incomplete, as noted, and not of much practical value beyond illustrating how to create a multilingual PDF. There’s still a lot to do.

Although I’ve used the lopdf crate in this article, I’m still not very familiar with it. I’ll need to spend more time studying this crate.

I have a lot of ideas to explore. I’m not yet sure what the next article in this series will be. As I continue to explore this subject, I’ll document it whenever there’s something worthwhile to share.

Thanks for reading! I hope this post helps others who are looking to deepen their understanding of PDF technology. As always—stay curious, stay safe 🦊

✿✿✿

Feature image sources:

🦀 Index of the Complete Series.