Rust: PDFs — Basic Text Layout | behai-nguyen software development learnings and documentation

In the last article we created a two-page PDF in which each page contained only a short Chinese and a Vietnamese sentence. In this article, we look at some basic text layout: how to fit a line of text within a given page width, and how many lines can fit within a given page height. We then create a simple PDF document with more than 70 pages of only Vietnamese text, using only a single font program and font size.

🦀 Index of the Complete Series.


Rust: PDFs — Basic Text Layout

🚀 The code for this post is in the following GitHub repository: pdf_02.

❶ Limitations and Objectives

⓵ Limitations—We focus on basic text layout. To keep the task simple, we limit the scope of this article to:

Using only one font program and one font size for the entire document.
The text contains only a single language — in this case, Vietnamese.
Punctuation marks and brackets are treated as part of the immediate “words.” (In Vietnamese, sequences of letters separated by spaces are morphemes rather than words.)
To further clarify this limitation, consider the note above (In Vietnamese, ... words.). The tokens "(In", "Vietnamese," and "words.)" are each treated as a single unit for width calculation.
PDF paragraphs are not justified; they are left-aligned with a ragged right edge.

Although the text contains natural headers, we treat them simply as normal paragraphs.

⓶ Objective—We aim to understand the following essentials of text layout:

Given a page width in PostScript points, a font program, and a font size, how to break paragraphs into lines that fit the page width.
Then, given the page height (also in PostScript points), how many of those lines can be written to the page.

❷ Text Layout Overview

● Break each input text paragraph into individual tokens based on spaces.

● Shape each token and calculate its width using the current font program and font size. Store each (token, PostScript width) pair in a vector.

● Iterate through the token–width vector and build lines according to the page width.
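The three steps above can be sketched as a greedy, first-fit loop. This is a minimal sketch and not the repository's code: measure() is a hypothetical stand-in for shaping a token with HarfBuzz (here it simply assumes every character is 6 pt wide), and break_lines() and tokenize() are illustrative names.

```rust
/// Greedy line breaking: pack (token, width) pairs into lines no wider than `max_width`.
fn break_lines(tokens: &[(String, f32)], space_width: f32, max_width: f32) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new();
    let mut current_width = 0.0_f32;

    for (token, width) in tokens {
        if current.is_empty() {
            // The first token on a line is always accepted, even if it is too wide.
            current.push_str(token);
            current_width = *width;
        } else if current_width + space_width + width > max_width {
            // The token does not fit: close the current line and start a new one.
            lines.push(std::mem::take(&mut current));
            current.push_str(token);
            current_width = *width;
        } else {
            // The token fits: add a separating space, then the token.
            current.push(' ');
            current.push_str(token);
            current_width += space_width + width;
        }
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}

/// Hypothetical stand-in for text shaping: assumes every character is 6 pt wide.
fn measure(token: &str) -> f32 {
    token.chars().count() as f32 * 6.0
}

/// Split a paragraph on whitespace and pair each token with its width.
fn tokenize(paragraph: &str) -> Vec<(String, f32)> {
    paragraph
        .split_whitespace()
        .map(|t| (t.to_string(), measure(t)))
        .collect()
}

fn main() {
    let tokens = tokenize("Kỷ độ Long Tuyền đới nguyệt ma.");
    for line in break_lines(&tokens, 6.0, 80.0) {
        println!("{line}");
    }
}
```

A first-fit loop like this never looks ahead, so it cannot balance line lengths across a paragraph.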

This implementation is very rudimentary. Text layout, especially line breaking, has a long and well-established foundation. Among many approaches, one of the most well-known is the Knuth–Plass line-breaking algorithm.

The link to "Breaking Paragraphs into Lines", the original paper by Knuth and Plass (http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf), as cited by Wikipedia, no longer works. After some searching, I was able to locate a copy and took the liberty of uploading it to my own Google Drive, since it is a publicly available document.

I have read both Dr. Plass’ thesis and the paper by Prof. Knuth and Dr. Plass. The former is very difficult to read, as it is highly math-centric. The latter is somewhat easier. Implementing the algorithm as described by Prof. Knuth and Dr. Plass would require a significant amount of work.

❸ A4 Page Geometry

⓵ Page size: We are working with A4 size. I have not been able to locate any official Adobe documentation on paper sizes. Searching with phrases such as "pdf A4 size in postscript point" returns "International standard paper sizes in PostScript and PDF", which lists ISO 216 paper format dimensions in PostScript points (1 pt = 25.4/72 mm), rounded to the nearest integer value.

In PDF, we work with PostScript points, where 1 PostScript point = 1/72 inch exactly. 1 inch equals 25.4 mm. The A4 width and height in PostScript points are calculated as:

● Width: (210 ÷ 25.4) × 72 ≈ 595.2756 PostScript points.

● Height: (297 ÷ 25.4) × 72 ≈ 841.8898 PostScript points.
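The same arithmetic as a small Rust sketch. The function name mm_to_pt() and the two constants are illustrative and are not taken from the repository.

```rust
/// Millimetres per inch and PostScript points per inch.
const MM_PER_INCH: f64 = 25.4;
const PT_PER_INCH: f64 = 72.0;

/// Convert a length in millimetres to PostScript points.
fn mm_to_pt(mm: f64) -> f64 {
    mm / MM_PER_INCH * PT_PER_INCH
}

fn main() {
    // A4 is 210 mm x 297 mm (ISO 216).
    println!("A4 width:  {:.4} pt", mm_to_pt(210.0));
    println!("A4 height: {:.4} pt", mm_to_pt(297.0));
}
```

Rounding these two values to the nearest integer gives 595 and 842.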

In the previous post, we were hardcoding 595 and 842:

"MediaBox" => vec![0.into(), 0.into(), 595.into(), 842.into()],

Viewing the PDF 1.7 Reference Document in PDFXplorer shows that the MediaBox is defined as (0, 0, 595.22, 842):

Please note:

● 0, 0 — the X and Y coordinates of the bottom-left corner.

● 595.22, 842 — the X and Y coordinates of the top-right corner.

⓶ Page margin: A Google search suggests that there is no standard A4 page margin, which makes sense. A 20 mm margin seems reasonable.

A4 page geometry is defined in the pdf_02/src/page_geometry.rs module, which we will discuss further in a later section.

❹ Repository Layout

💡 Please note: on both Windows and Ubuntu, I’m running Rust version rustc 1.90.0 (1159e78c4 2025-09-14).

This is once again a one-off project—I don’t plan to update it in future development. I want to keep a log of progress exactly as it occurred. Future code may copy this and make changes to it. I’ve placed the project under the pdf_02 directory. The structure is:

├── build.rs
├── Cargo.toml
├── set_env.bat
├── src
│   ├── main_rustybuzz.rs
│   ├── main_harfbuzz_text_shape.rs
│   ├── main_line_width.rs
│   ├── main_text_layout.rs
│   ├── main.rs
│   ├── page_geometry.rs
│   ├── pdf_font_info.rs
│   ├── pdf_gen.rs
│   ├── pdf_text.rs
│   ├── subset_builder.rs
│   └── text_layout.rs
└── text
    └── essay.txt

The first four modules under src/, the main_*.rs files, are self-contained Rust programs that I wrote in the listed order to help me understand text shaping. We discuss these in the Text Shaping Investigative Code section. The text/essay.txt file is the Vietnamese input text, which the article’s main code converts into a PDF document. We discuss this code in The Article Main Code section.

❺ Text Shaping Investigative Code

These are the four self-contained modules under src/ whose names start with main_, as previously described. To run one of them, manually point the [[bin]] path in pdf_02/Cargo.toml at it. For main_rustybuzz.rs, which also needs the rustybuzz crate:

...
[[bin]]
name = "pdf_02"

# path = "src/main.rs"

path = "src/main_rustybuzz.rs"
# path = "src/main_harfbuzz_text_shape.rs"
# path = "src/main_line_width.rs"
# path = "src/main_text_layout.rs"

[dependencies]
...
rustybuzz = "0.20.1"
...

The next three modules do not require the rustybuzz crate:

...
[[bin]]
name = "pdf_02"

# path = "src/main.rs"

# path = "src/main_rustybuzz.rs"
path = "src/main_harfbuzz_text_shape.rs"
# path = "src/main_line_width.rs"
# path = "src/main_text_layout.rs"

[dependencies]
...
# rustybuzz = "0.20.1"
...

Recall that the primary objective is to break paragraphs into lines that fit a given page width, for a specific font program and font size.

⓵ pdf_02/src/main_rustybuzz.rs—To calculate the total width in PostScript points of a word (or a Vietnamese morpheme), we need to know the width of individual characters—or more precisely, the width of each glyph, which is the visual representation of a character. This process is called text shaping. The rustybuzz crate is the native Rust implementation of the HarfBuzz library’s text shaping algorithm.

This module should be self-explanatory if you have read the last two articles in this series. We have already covered units per em in a previous article. Font size in PostScript points ÷ units per em gives a scaling factor that converts from the font’s internal design units to physical units. Then glyph’s x_advance × scaling factor expresses the advance in PostScript points, which is the unit PDF uses for text layout.
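The scaling described above can be sketched without any shaping library. The function name advances_to_pt() is hypothetical, and the glyph advances and the 2048 units-per-em value below are illustrative inputs only; in the real module these values come from the font and the shaper.

```rust
/// Convert a run of glyph advances (in font design units) into PostScript points.
fn advances_to_pt(x_advances: &[i32], units_per_em: u16, font_size_pt: f32) -> f32 {
    // Scaling factor: font size in points divided by units per em.
    let scale = font_size_pt / units_per_em as f32;
    x_advances.iter().map(|&adv| adv as f32 * scale).sum()
}

fn main() {
    // Illustrative values: two glyph advances in a font assumed to use 2048 units per em.
    let advances = [1366, 1024];
    let width = advances_to_pt(&advances, 2048, 12.0);
    println!("{width:.2} pt");
}
```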

🪟 Windows output:

"Kỷ độ Long Tuyền đới nguyệt ma." in 12 pt C:/Windows/Fonts/arialuni.ttf is 179.45 pt wide

🐧 Ubuntu output:

"Kỷ độ Long Tuyền đới nguyệt ma." in 12 pt /home/behai/Noto_Sans_TC/NotoSansTC-Regular.ttf is 183.94 pt wide

We study the rustybuzz crate only as a point of interest and do not use it in the main code: we still rely on the HarfBuzz library, which already provides text shaping. What matters here is understanding how to obtain the glyphs’ x_advance values.

⓶ pdf_02/src/main_harfbuzz_text_shape.rs—I did not write this module entirely by myself. I performed a Google search for HarfBuzz text shaping example, and Google AI Overview provided a sample in C that included the glyph’s x_advance field. I converted the given C example into Rust, and Copilot suggested two helper functions: get_glyph_info() and get_glyph_pos().

The code in this module uses FFI, which we have already covered in earlier articles of this series. It should not be too difficult to follow.

🪟 Windows output:

Shaped text glyph information:
Glyph ID: 46, Cluster: 0, X Advance: 1366, Y Advance: 0, X Offset: 0, Y Offset: 0
Glyph ID: 2985, Cluster: 1, X Advance: 1024, Y Advance: 0, X Offset: 0, Y Offset: 0
...omitted 27 entries...
Glyph ID: 68, Cluster: 41, X Advance: 1139, Y Advance: 0, X Offset: 0, Y Offset: 0
Glyph ID: 17, Cluster: 42, X Advance: 569, Y Advance: 0, X Offset: 0, Y Offset: 0

🐧 Ubuntu output:

Shaped text glyph information:
Glyph ID: 44, Cluster: 0, X Advance: 621, Y Advance: 0, X Offset: 0, Y Offset: 0
Glyph ID: 460, Cluster: 1, X Advance: 521, Y Advance: 0, X Offset: 0, Y Offset: 0
...omitted 27 entries...
Glyph ID: 66, Cluster: 41, X Advance: 563, Y Advance: 0, X Offset: 0, Y Offset: 0
Glyph ID: 15, Cluster: 42, X Advance: 278, Y Advance: 0, X Offset: 0, Y Offset: 0

This exploration of HarfBuzz shaping prepares us for the next step: measuring line widths and understanding how shaped glyph advances translate into text layout.

⓷ pdf_02/src/main_line_width.rs—This module is a refactored version of main_harfbuzz_text_shape.rs, incorporating the total width calculation implemented in main_rustybuzz.rs.

🪟 Windows output:

"Kỷ độ Long Tuyền đới nguyệt ma." in 12 pt C:/Windows/Fonts/arialuni.ttf is 179.45 pt wide

🐧 Ubuntu output:

"Kỷ độ Long Tuyền đới nguyệt ma." in 12 pt /home/behai/Noto_Sans_TC/NotoSansTC-Regular.ttf is 183.94 pt wide

🙏 Please note that the output exactly matches that of main_rustybuzz.rs.

This confirms that our HarfBuzz-based implementation produces results consistent with the rustybuzz example, and it sets the stage for the next module, where we move from measuring line widths to laying out entire lines of text.

⓸ pdf_02/src/main_text_layout.rs—We extend the code discussed in main_line_width.rs to implement the simple line breaking algorithm described in the Text Layout Overview section. We note the following:

⑴ The helper function width_in_point() is responsible for calculating the space required, in PostScript points, to draw all glyphs for words (morphemes) in the text.

⑵ The main() function is responsible for two main tasks:

Constructing the (token, PostScript width) vector discussed in the Text Layout Overview section. Please refer to lines 80 to 133.
Assembling the token entries in the (token, PostScript width) vector into lines that fit the given page width, also discussed in the Text Layout Overview section. This is from lines 135 to 159.

👉 Please note that the test text is only a single paragraph, and in the code we implicitly assumed this: we do not break the text into individual paragraphs using the \n newline character first.

⑶ In the main() function, please also note the variable space_width_in_pt in lines 115 to 116, and later its usage in line 144—if current_width + width + space_width_in_pt > a4_width {, and then in lines 150 to 151: current_line.push(' '); and current_width += width + space_width_in_pt;.

This makes sense: words (morphemes) are separated by spaces, and a space occupies width as well. We must therefore allocate horizontal width for them.

⑷ The two variables margin and a4_width in lines 136 to 137 are simplified hardcoded literal values of the geometries discussed in the A4 Page Geometry section.

🪟 Windows output:

Lịch sử Việt Nam từ năm 1945 đến nay, còn nhiều bí ẩn chưa được giải tỏa. Người bàng
...11 lines are omitted...
công cuộc phát triển cách mạng của họ sẽ dẫn đến 2 trường hợp:

🐧 Ubuntu output:

Lịch sử Việt Nam từ năm 1945 đến nay, còn nhiều bí ẩn chưa được giải tỏa. Người bàng
...12 lines are omitted...
hợp:

The differences in the two outputs are expected, since two different font programs are in use. The space requirements for glyphs differ accordingly.

This module demonstrates how shaped glyph widths can be assembled into full lines, bringing us closer to complete page layout in the next stage.

❻ The Article Main Code

💡 It should be clear that this code requires the HarfBuzz library.
🐧 On Ubuntu, all required libraries are globally recognized. 🪟 On Windows, I haven’t added the paths for harfbuzz.dll, harfbuzz-subset.dll, and their dependencies to the PATH environment variable. In each new Windows terminal session, I run the following once:

set PATH=C:\PF\harfbuzz\dist\bin\;%PATH%
set PATH=C:\PF\vcpkg\installed\x64-windows\bin\;%PATH%

Alternatively, you can simply run set_env.bat.
After that, cargo run works as expected.

🦀 To keep things simple, we use absolute paths for the font programs. You will likely need to adjust the code to match your own system configuration.

⓵ The pdf_02/build.rs module: This is a copy of the code from the last article.

⓶ The pdf_02/Cargo.toml: This is a copy of the one from the last article. We also discuss it briefly in the Text Shaping Investigative Code section.

⓷ The pdf_02/src/subset_builder.rs: This is also a copy from the last article.

⓸ The pdf_02/src/pdf_font_info.rs: This module also comes from the last article. There are some minor refactorings: fields are now private, and public getters have been added.

⓹ The new pdf_02/src/page_geometry.rs: This module implements the discussion in the A4 Page Geometry section. We define four margins. Since there are no standard margins, it is reasonable to assume that margins can be independent of one another; having four provides this flexibility. Increasing the value of any of these four margins should result in a PDF with more pages.
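As a rough illustration of why larger margins produce more pages, here is a minimal geometry sketch. The Margins and Page types and the helper names are hypothetical; they are not the definitions in page_geometry.rs.

```rust
/// Four independent page margins, in PostScript points.
#[derive(Debug, Clone, Copy)]
struct Margins {
    top: f32,
    right: f32,
    bottom: f32,
    left: f32,
}

/// A page with its size and margins, all in PostScript points.
#[derive(Debug, Clone, Copy)]
struct Page {
    width: f32,
    height: f32,
    margins: Margins,
}

impl Page {
    /// Horizontal space available for a line of text.
    fn content_width(&self) -> f32 {
        self.width - self.margins.left - self.margins.right
    }

    /// Vertical space available for lines of text.
    fn content_height(&self) -> f32 {
        self.height - self.margins.top - self.margins.bottom
    }
}

fn mm_to_pt(mm: f32) -> f32 {
    mm / 25.4 * 72.0
}

/// An A4 page with the same margin, given in millimetres, on all four sides.
fn a4_with_uniform_margin(margin_mm: f32) -> Page {
    let m = mm_to_pt(margin_mm);
    Page {
        width: mm_to_pt(210.0),
        height: mm_to_pt(297.0),
        margins: Margins { top: m, right: m, bottom: m, left: m },
    }
}

fn main() {
    let page = a4_with_uniform_margin(20.0);
    println!("content width:  {:.2} pt", page.content_width());
    println!("content height: {:.2} pt", page.content_height());
}
```

A smaller content width means more line breaks, and a smaller content height means fewer lines per page; both increase the page count.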

⓺ The new pdf_02/src/text_layout.rs: This is a refactored version of the main_text_layout.rs module as discussed. It exposes the public text_to_lines() function.

👉 We will refer to the return value of the text_to_lines() function as shaped lines.

💥 Please note: in the study module main_text_layout.rs we worked with a single line, which was essentially one paragraph. In this module, the input text contains multiple paragraphs, i.e., multiple lines separated by newline characters. We first break the entire input text into a vector of strings, lines_vec, using the \n delimiter, and keep the \n delimiters as entries in lines_vec. When iterating over lines_vec, if an entry is either \n or \r\n, we simply push a single space (ASCII character 32) into the returned shaped lines vector and continue with the next lines_vec entry. These space-only shaped lines are rendered as blank lines in the final PDF. As a consequence, when we select and copy text from the PDF, blank lines included, the pasted text does not contain any blank lines: all newline characters from the input text are lost. Most PDF documents I have examined behave in a similar manner.
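One possible reading of this description, as a minimal sketch. The names split_keep_newlines() and shaped_lines() are hypothetical, and the wrap closure stands in for the real line breaker; the actual text_to_lines() may differ in detail.

```rust
/// Split text on '\n', keeping each newline delimiter ("\n" or "\r\n") as its own entry.
fn split_keep_newlines(text: &str) -> Vec<&str> {
    let mut entries = Vec::new();
    for piece in text.split_inclusive('\n') {
        // Separate the paragraph body from its trailing newline delimiter.
        let body = piece.trim_end_matches(|c: char| c == '\r' || c == '\n');
        if !body.is_empty() {
            entries.push(body);
        }
        let delimiter = &piece[body.len()..];
        if !delimiter.is_empty() {
            entries.push(delimiter);
        }
    }
    entries
}

/// Turn text into shaped lines: newline entries become a single space,
/// and every other entry is passed to `wrap`, a stand-in for the line breaker.
fn shaped_lines(text: &str, wrap: impl Fn(&str) -> Vec<String>) -> Vec<String> {
    let mut lines = Vec::new();
    for entry in split_keep_newlines(text) {
        if entry == "\n" || entry == "\r\n" {
            lines.push(" ".to_string());
        } else {
            lines.extend(wrap(entry));
        }
    }
    lines
}

fn main() {
    // Stand-in wrapper: treats each paragraph as one line.
    let lines = shaped_lines("First paragraph.\nSecond paragraph.", |p| vec![p.to_string()]);
    for line in &lines {
        println!("[{line}]");
    }
}
```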

Together, these modules establish the foundation for assembling text into properly sized and margined PDF pages.

⓻ The new pdf_02/src/pdf_text.rs: This module is responsible for reading the input text file and preparing the text-related data for PDF output. The public API is the prepare() method. The PDF-ready data are:

⑴ PdfTextContent::font_subset: Vec<u8>: Defining the content of this vector is the responsibility of the subset_builder.rs module as discussed. We have also examined this task in detail in a previous article in this series.

The helper method text_font_subset() is responsible for generating the content for this PdfTextContent::font_subset: Vec<u8> field.

⑵ PdfTextContent::used_cids: Vec<u16>: We covered this collection in the last article. In this article, we simply refactor the previous implementation into a new module and a new struct, but the process remains the same.

The helper method text_used_cids_glyph_bytes() is responsible for generating the content for this PdfTextContent::used_cids: Vec<u16> field.

⑶ PdfTextContent::lines_glyph_bytes: Vec<Vec<u8>>: The text_layout.rs module has already broken the input text into individual lines that fit the given page width as described. Each Vec<u8> in this vector is the glyph bytes representation of a shaped line. We also encountered glyph bytes in the last article.

The vector of shaped lines is not directly useful for the PDF generation process. We discard it after generating glyph bytes for each line and storing them in the lines_glyph_bytes vector. This becomes the final text content written to the PDF document.

The helper method text_used_cids_glyph_bytes() is responsible for generating the content for this PdfTextContent::lines_glyph_bytes: Vec<Vec<u8>> field.
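As a reminder of what glyph bytes look like, here is a minimal sketch. It assumes a composite font with Identity-H encoding, where each glyph is addressed by a two-byte big-endian code; glyph_bytes() and hex_string() are hypothetical helper names, not the repository's functions, and the glyph IDs are illustrative.

```rust
/// Encode glyph IDs as two-byte big-endian codes (assumes Identity-H encoding).
fn glyph_bytes(glyph_ids: &[u16]) -> Vec<u8> {
    glyph_ids.iter().flat_map(|gid| gid.to_be_bytes()).collect()
}

/// Render bytes as a PDF hexadecimal string, e.g. <002E0BA9>.
fn hex_string(bytes: &[u8]) -> String {
    let hex: String = bytes.iter().map(|b| format!("{b:02X}")).collect();
    format!("<{hex}>")
}

fn main() {
    // Illustrative glyph IDs for one very short shaped line.
    let line = glyph_bytes(&[46, 2985]);
    println!("{} Tj", hex_string(&line));
}
```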

⑷ PdfTextContent::copy_paste_unicodes: Vec<u16>: We also implemented this in the last article. Please see the description of the make_to_unicode_cmap() function. Here, we simply moved the data generation process into this module.

The helper method text_copy_paste_unicodes() is responsible for generating the content for this PdfTextContent::copy_paste_unicodes: Vec<u16> field.

⓼ The existing pdf_02/src/pdf_gen.rs: “Existing” here means it is a copy of the code from the last article, with changes:

⑴ The new PdfTextContent parameter replaces the previous PdfPages.

⑵ In the ToUnicode map, the beginbfchar...endbfchar blocks now contain at most 100 entries, as specified in the PDF 1.7 Reference Document. This is accomplished via a new helper function tounicode_mapping().
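The 100-entry limit can be respected by chunking the mappings. The following is a minimal sketch and not the repository's tounicode_mapping(); bfchar_blocks() is a hypothetical name, and the sketch assumes each mapping is a (CID, UTF-16 code unit) pair.

```rust
/// Emit beginbfchar...endbfchar blocks with at most 100 entries each.
/// Each pair is assumed to be (CID, UTF-16 code unit).
fn bfchar_blocks(pairs: &[(u16, u16)]) -> String {
    let mut out = String::new();
    for chunk in pairs.chunks(100) {
        // The block header states how many entries follow.
        out.push_str(&format!("{} beginbfchar\n", chunk.len()));
        for (cid, unicode) in chunk {
            out.push_str(&format!("<{:04X}> <{:04X}>\n", cid, unicode));
        }
        out.push_str("endbfchar\n");
    }
    out
}

fn main() {
    // Illustrative mappings: 250 consecutive CIDs mapped to consecutive code units.
    let pairs: Vec<(u16, u16)> = (0..250u16).map(|i| (i, 0x41 + i)).collect();
    print!("{}", bfchar_blocks(&pairs));
}
```

With 250 mappings this produces three blocks of 100, 100 and 50 entries.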

⑶ The function prepare_page_content() was completely refactored. It takes the PDF-ready glyph bytes from the PdfTextContent::lines_glyph_bytes vector and generates PDF pages.

The new logic should be self-explanatory. 💥 It is important to understand how the PDF text operator Td behaves. Td tx ty moves the text cursor relative to its current position by (tx, ty). It does not set an absolute position on the page. The very first Td after BT starts relative to the origin:

fn new_page(font_size_pt: f32) -> Vec<Operation> {
    vec![
        Operation::new("BT", vec![]),
        // Set font F1 at the requested size (font_size_pt)
        Operation::new("Tf", vec!["F1".into(), font_size_pt.into()]),
        Operation::new("Td", vec![A4_DEFAULT_MARGINS.left.into(), 
            a4_default_content_height().into()]), // start position
    ]
}

That is, for each new page, we position the text cursor at the top-left corner of the content area, inside the margins. Then we move down the page by line_height_pt PostScript points, and write the line (or rather, its glyph bytes). We repeat this process until we reach the bottom of the page: if current_y - line_height_pt <= A4_DEFAULT_MARGINS.bottom. At that point, we flush the current PDF page and start a new one.
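The page-filling loop can be sketched independently of any PDF library. This is a simplified illustration, not prepare_page_content() itself: lines_per_page() is a hypothetical name, and the values 785, 57 and 14.4 used in main() are illustrative. Each placed line corresponds to one relative move of the form "0 -line_height Td".

```rust
/// Distribute `line_count` lines over pages and return the number of lines on each page.
/// `top_y` is the starting Y position of a page and `bottom_margin` the lowest allowed Y.
fn lines_per_page(line_count: usize, top_y: f64, bottom_margin: f64, line_height: f64) -> Vec<usize> {
    let mut pages = Vec::new();
    let mut on_page = 0usize;
    let mut current_y = top_y;

    for _ in 0..line_count {
        // If the next line would reach the bottom margin, flush the page and start a new one.
        // The `on_page > 0` guard ensures at least one line is placed on every page.
        if on_page > 0 && current_y - line_height <= bottom_margin {
            pages.push(on_page);
            on_page = 0;
            current_y = top_y;
        }
        // Move down by one line height (a relative Td), then place the line.
        current_y -= line_height;
        on_page += 1;
    }
    if on_page > 0 {
        pages.push(on_page);
    }
    pages
}

fn main() {
    // Illustrative values: start at y = 785, stop above y = 57, 14.4 pt per line.
    let pages = lines_per_page(120, 785.0, 57.0, 14.4);
    for (i, n) in pages.iter().enumerate() {
        println!("page {}: {} lines", i + 1, n);
    }
}
```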

⑷ Other functions also take the new PdfTextContent parameter instead of PdfPages, but their logic remains the same.

Together, these changes ensure that the PDF generation process integrates seamlessly with the new text content structures, paving the way for a complete end-to-end workflow from input text to final PDF output.

⓽ pdf_02/src/main.rs: This module is brief and should be self-explanatory.

❼ Examine Generated PDFs with PDFXplorer

The screenshot below shows the content of the first PDF page on Windows:

The following screenshot shows the content of the first PDF page on Ubuntu:

We observe the following: the first text line starts at (dx=57, dy=785), with units in PostScript points. Each subsequent line then begins at (dx=0, dy=-14.400001), relative to the previous text position.

❽ What’s Next

This is just basic text layout, and I don’t consider the final result acceptable for production use. The code serves primarily as a learning exercise. Moving forward, we will focus more deeply on layout.
🪟 On Windows, I have successfully built and installed Pango, along with its two associated libraries: GNU FriBidi and CairoGraphics. I plan to use Pango for text layout in future work.

For the time being, however, I am focusing on exploring additional text features such as bold, italic, mixed font sizes, and multiple font programs. There is still much to learn.

Thanks for reading! I hope this post helps others who are looking to deepen their understanding of PDF technology. As always—stay curious, stay safe 🦊

✿✿✿

Feature image sources: