Rust: PDFs — Basic Text Layout
In the last article we created a two-page PDF in which each page contained only a short Chinese sentence and a Vietnamese sentence. In this article, we look at some basic text layout: how to fit a line of text within a given page width, and how many lines can fit within a given page height. We then create a simple PDF document with more than 70 pages of only Vietnamese text, using only a single font program and font size.
🦀 Index of the Complete Series.
🚀 The code for this post is in the following GitHub repository: pdf_02.
⓵ Limitations—We focus on basic text layout. To keep the task simple, we limit the scope of this article to:
- Using only one font program and one font size for the entire document.
- The text contains only a single language — in this case, Vietnamese.
- Punctuation marks and brackets are treated as part of the immediate “words.” (In Vietnamese, sequences of letters separated by spaces are morphemes rather than words.) To further clarify this limitation, consider the note above, (In Vietnamese, ... words.): the tokens “(In”, “Vietnamese,”, and “words.)” are each treated as single units for width calculation.
- PDF paragraphs are not right-justified; they are right-ragged.
Although the text contains natural headers, we treat them simply as normal paragraphs.
⓶ Objective—We aim to understand the following essentials of text layout:
- Given a page width in PostScript points, a font program, and a font size, how to break paragraphs into lines that fit the page width.
- Then, given the page height (also in PostScript points), how many of those lines can be written to the page.
● Break each input text paragraph into individual tokens based on spaces.
● Shape each token and calculate its width using the current font program and
font size. Store each (token, PostScript width) pair in a vector.
● Iterate through the token–width vector and build lines according to the page width.
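The first two steps above can be sketched as follows. This is only a minimal illustration, not the article's code: measure_tokens() and fixed_advance_width() are hypothetical names, and the fixed-advance measurement is a stand-in for real shaping, where every glyph has its own advance. Note also that split_whitespace() splits on any whitespace, not only on the space character.

```rust
/// Split a paragraph into tokens on whitespace and pair each token with
/// its width in PostScript points.
/// `measure` is a stand-in for the real shaping step (hypothetical here).
fn measure_tokens<F>(paragraph: &str, measure: F) -> Vec<(String, f32)>
where
    F: Fn(&str) -> f32,
{
    paragraph
        .split_whitespace()
        .map(|token| (token.to_string(), measure(token)))
        .collect()
}

/// Hypothetical measurement: every character is `advance_pt` points wide.
/// A real font has per-glyph advances, so this only shows the data flow.
fn fixed_advance_width(token: &str, advance_pt: f32) -> f32 {
    token.chars().count() as f32 * advance_pt
}
```

Calling measure_tokens("(In Vietnamese, words.)", |t| fixed_advance_width(t, 6.0)) yields three pairs, with the brackets and punctuation kept inside their tokens.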
This implementation is very rudimentary. Text layout, especially line breaking, has a long and well-established foundation. Among many approaches, one of the most well-known is the Knuth–Plass line-breaking algorithm.
The link to Breaking Paragraphs into Lines, the original paper by Knuth and Plass, as cited by Wikipedia (http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf), no longer works. After some searching, I was able to locate a copy and took the liberty of uploading it to my own Google Drive, since it is a publicly available document.
I have read both Dr. Plass’ thesis and the paper by Prof. Knuth and Dr. Plass. The former is very difficult to read, as it is highly math-centric. The latter is somewhat easier. Implementing the algorithm as described by Prof. Knuth and Dr. Plass would require a significant amount of work.
⓵ Page size:
We are working with A4 size. I have not been able to locate any official Adobe
documentation on paper sizes. Searching with phrases such as
pdf A4 size in postscript point
returns
International standard paper sizes in PostScript and PDF,
which lists ISO 216 paper format dimensions in PostScript points (1 pt = 25.4/72 mm),
rounded to the nearest integer value.
In PDF, we work with PostScript points, where 1 PostScript point = 1/72 inch
exactly. 1 inch equals 25.4 mm. The A4 width and height in
PostScript points are calculated as:
● Width: (210 ÷ 25.4) × 72 ≈ 595.2756 PostScript points.
● Height: (297 ÷ 25.4) × 72 ≈ 841.8898 PostScript points.
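The conversion above is easy to check in code. The sketch below is not taken from the repository; mm_to_pt() and the two constants are names I made up for illustration.

```rust
/// Millimetres per inch and PostScript points per inch.
const MM_PER_INCH: f64 = 25.4;
const PT_PER_INCH: f64 = 72.0;

/// Convert a length in millimetres to PostScript points.
fn mm_to_pt(mm: f64) -> f64 {
    mm / MM_PER_INCH * PT_PER_INCH
}
```

Rounding mm_to_pt(210.0) and mm_to_pt(297.0) to the nearest integer gives 595 and 842 respectively.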
In the previous post, we were hardcoding 595 and 842:
"MediaBox" => vec![0.into(), 0.into(), 595.into(), 842.into()],
Viewing the
PDF 1.7 Reference Document
in PDFXplorer
shows that the MediaBox is defined as (0, 0, 595.22, 842):

Please note:
● 0, 0 — the X and Y coordinates of the bottom-left corner.
● 595.22, 842 — the X and Y coordinates of the top-right corner.
⓶ Page margin: Google search suggests that there is no standard
A4 page margin, which makes sense. A 20mm margin seems reasonable.
A4 page geometry is defined in the
pdf_02/src/page_geometry.rs module, which we will discuss
further in a later section.
💡 Please note: on both Windows and Ubuntu, I’m running Rust version
rustc 1.90.0 (1159e78c4 2025-09-14).
This is once again a one-off project—I don’t plan to update it in future development.
I want to keep a log of progress exactly as it occurred. Future code may
copy this and make changes to it. I’ve placed the project under the
pdf_02 directory. The structure is:
├── build.rs
├── Cargo.toml
├── set_env.bat
├── src
│   ├── main_rustybuzz.rs
│   ├── main_harfbuzz_text_shape.rs
│   ├── main_line_width.rs
│   ├── main_text_layout.rs
│   ├── main.rs
│   ├── page_geometry.rs
│   ├── pdf_font_info.rs
│   ├── pdf_gen.rs
│   ├── pdf_text.rs
│   ├── subset_builder.rs
│   └── text_layout.rs
└── text
    └── essay.txt
The first four modules under src/—main_*.rs—are self-contained
Rust programs that I wrote in the listed order to help me understand
text shaping. We discuss these in the
Text Shaping Investigative Code section.
The text/essay.txt file is the Vietnamese input text, which the article’s
main code converts into a PDF document. We discuss this code in the
The Article Main Code section.
❺ Text Shaping Investigative Code
These are the four self-contained modules under src/ prefixed with
main_*.rs, as previously described.
To activate these modules, manually update the
pdf_02/Cargo.toml
file as follows:
...
[[bin]]
name = "pdf_02"
# path = "src/main.rs"
path = "src/main_rustybuzz.rs"
# path = "src/main_harfbuzz_text_shape.rs"
# path = "src/main_line_width.rs"
# path = "src/main_text_layout.rs"
[dependencies]
...
rustybuzz = "0.20.1"
...
The next three modules do not require the rustybuzz crate:
...
[[bin]]
name = "pdf_02"
# path = "src/main.rs"
# path = "src/main_rustybuzz.rs"
path = "src/main_harfbuzz_text_shape.rs"
# path = "src/main_line_width.rs"
# path = "src/main_text_layout.rs"
[dependencies]
...
# rustybuzz = "0.20.1"
...
Recall that the primary objective is to break paragraphs into lines that fit a given page width, for a specific font program and font size.
⓵
pdf_02/src/main_rustybuzz.rs—To calculate the total width in
PostScript points of a word (or a Vietnamese morpheme), we need to know the width of
individual characters—or more precisely, the width of each glyph, which is the visual
representation of a character. This process is called text shaping.
The rustybuzz crate is the native Rust implementation of the
HarfBuzz library’s text shaping algorithm.
This module should be self-explanatory if you have read the last two articles in this
series. We have already covered units per em in a
previous article.
Font size in PostScript points ÷ units per em gives a scaling factor
that converts from the font’s internal design units to physical units. Then
glyph’s x_advance × scaling factor expresses the advance in PostScript points,
which is the unit PDF uses for text layout.
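The scaling described above can be written as a small helper. This is a sketch under my own assumptions: advances_to_pt() is a hypothetical name, and the advances are passed in as plain integers rather than read from a shaping buffer.

```rust
/// Convert glyph advances, given in font design units, into a total
/// width in PostScript points.
fn advances_to_pt(x_advances: &[i32], units_per_em: u16, font_size_pt: f32) -> f32 {
    // Scaling factor from design units to PostScript points.
    let scale = font_size_pt / units_per_em as f32;
    let total_units: i32 = x_advances.iter().sum();
    total_units as f32 * scale
}
```

For example, with a hypothetical font of 1000 units per em, two glyphs with advances 500 and 250 at 12 pt occupy 750 × 12 ÷ 1000 = 9 PostScript points.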
🪟 Windows output:
"Kỷ độ Long Tuyền đới nguyệt ma." in 12 pt C:/Windows/Fonts/arialuni.ttf is 179.45 pt wide
🐧 Ubuntu output:
"Kỷ độ Long Tuyền đới nguyệt ma." in 12 pt /home/behai/Noto_Sans_TC/NotoSansTC-Regular.ttf is 183.94 pt wide
We study the
rustybuzz crate as a point of interest, but we are not going to use it.
Since we still need to rely on the
HarfBuzz library, which already provides this functionality, our focus
is on understanding how to determine the glyphs’ x_advance values.
⓶
pdf_02/src/main_harfbuzz_text_shape.rs—I did not write this
module entirely by myself. I performed a Google search for
HarfBuzz text shaping example, and Google AI Overview provided a
sample in C that included the glyph’s x_advance field. I converted the given
C example into Rust, and Copilot suggested two helper functions:
get_glyph_info() and
get_glyph_pos().
The code in this module uses FFI, which we have already covered in earlier articles of this series. It should not be too difficult to follow.
🪟 Windows output:
Shaped text glyph information:
Glyph ID: 46, Cluster: 0, X Advance: 1366, Y Advance: 0, X Offset: 0, Y Offset: 0
Glyph ID: 2985, Cluster: 1, X Advance: 1024, Y Advance: 0, X Offset: 0, Y Offset: 0
...omitted 27 entries...
Glyph ID: 68, Cluster: 41, X Advance: 1139, Y Advance: 0, X Offset: 0, Y Offset: 0
Glyph ID: 17, Cluster: 42, X Advance: 569, Y Advance: 0, X Offset: 0, Y Offset: 0
🐧 Ubuntu output:
Shaped text glyph information:
Glyph ID: 44, Cluster: 0, X Advance: 621, Y Advance: 0, X Offset: 0, Y Offset: 0
Glyph ID: 460, Cluster: 1, X Advance: 521, Y Advance: 0, X Offset: 0, Y Offset: 0
...omitted 27 entries...
Glyph ID: 66, Cluster: 41, X Advance: 563, Y Advance: 0, X Offset: 0, Y Offset: 0
Glyph ID: 15, Cluster: 42, X Advance: 278, Y Advance: 0, X Offset: 0, Y Offset: 0
This exploration of HarfBuzz shaping prepares us for the next step: measuring line widths and understanding how shaped glyph advances translate into text layout.
⓷
pdf_02/src/main_line_width.rs—This module is a refactored
version of
main_harfbuzz_text_shape.rs,
incorporating the total width calculation implemented in
main_rustybuzz.rs.
🪟 Windows output:
"Kỷ độ Long Tuyền đới nguyệt ma." in 12 pt C:/Windows/Fonts/arialuni.ttf is 179.45 pt wide
🐧 Ubuntu output:
"Kỷ độ Long Tuyền đới nguyệt ma." in 12 pt /home/behai/Noto_Sans_TC/NotoSansTC-Regular.ttf is 183.94 pt wide
🙏 Please note that the output matches exactly that of main_rustybuzz.rs.
This alignment confirms that our HarfBuzz-based implementation produces results consistent with the rustybuzz example, and it sets the stage for the next module, where we move from measuring line widths to laying out entire lines of text.
⓸
pdf_02/src/main_text_layout.rs—We extend the code discussed
in main_line_width.rs to implement
the simple line breaking algorithm described in the
Text Layout Overview section. We note the following:
⑴ The helper function
width_in_point() is responsible for calculating the space
required, in PostScript points, to draw all glyphs for words (morphemes) in the text.
⑵ The main()
function is responsible for two main tasks:
- Constructing the (token, PostScript width) vector discussed in the Text Layout Overview section. Please refer to lines 80 to 133.
- Assembling the token entries in the (token, PostScript width) vector into lines that fit the given page width, also discussed in the Text Layout Overview section. This is from lines 135 to 159.
👉 Please note that the test text is only a single paragraph, and in the code we
implicitly assumed this: we do not break the text into individual paragraphs using
the \n newline character first.
⑶
In the main() function, please also note the variable
space_width_in_pt in
lines 115 to 116, and later its usage in
line 144—if current_width + width + space_width_in_pt > a4_width {,
and then in
lines 150 to 151: current_line.push(' '); and
current_width += width + space_width_in_pt;.
This makes sense: words (morphemes) are separated by spaces, and a space occupies width as well. We must therefore allocate horizontal width for them.
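To make the role of the space width concrete, here is a minimal greedy line-breaking sketch. It is not the code in main_text_layout.rs: break_into_lines() and its parameter names are my own, and unlike the snippet quoted above it only charges for a space between two tokens, not before the first token of a line.

```rust
/// Greedy line breaking: keep appending tokens to the current line until
/// the next token, plus one separating space, would exceed `max_width_pt`.
fn break_into_lines(
    tokens: &[(String, f32)],
    space_width_pt: f32,
    max_width_pt: f32,
) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current_line = String::new();
    let mut current_width = 0.0_f32;

    for (token, width) in tokens {
        if current_line.is_empty() {
            // First token on a line: no leading space is needed.
            current_line.push_str(token);
            current_width = *width;
        } else if current_width + space_width_pt + width > max_width_pt {
            // The token does not fit: close this line, start a new one.
            lines.push(std::mem::take(&mut current_line));
            current_line.push_str(token);
            current_width = *width;
        } else {
            // The token fits: account for the space and the token itself.
            current_line.push(' ');
            current_line.push_str(token);
            current_width += space_width_pt + width;
        }
    }
    if !current_line.is_empty() {
        lines.push(current_line);
    }
    lines
}
```

A token that is wider than the page width on its own is still placed on a line by itself. A more careful implementation would need to decide how to split or hyphenate it.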
⑷
The two variables margin and a4_width in
lines 136 to 137 are simplified hardcoded literal values of the
geometries discussed in the A4 Page Geometry section.
🪟 Windows output:
Lịch sử Việt Nam từ năm 1945 đến nay, còn nhiều bí ẩn chưa được giải tỏa. Người bàng
...11 lines are omitted...
công cuộc phát triển cách mạng của họ sẽ dẫn đến 2 trường hợp:
🐧 Ubuntu output:
Lịch sử Việt Nam từ năm 1945 đến nay, còn nhiều bí ẩn chưa được giải tỏa. Người bàng
...12 lines are omitted...
hợp:
The differences in the two outputs are expected, since two different font programs are in use. The space requirements for glyphs differ accordingly.
This module demonstrates how shaped glyph widths can be assembled into full lines, bringing us closer to complete page layout in the next stage.
💡 It should be clear that this code requires the HarfBuzz library.
🐧 On Ubuntu, all required libraries are globally recognized. 🪟 On Windows, I haven’t
added the paths for harfbuzz.dll, harfbuzz-subset.dll, and
their dependencies to the PATH environment variable. In each new Windows
terminal session, I run the following once:
set PATH=C:\PF\harfbuzz\dist\bin\;%PATH%
set PATH=C:\PF\vcpkg\installed\x64-windows\bin\;%PATH%
Alternatively, you can simply run
set_env.bat.
After that, cargo run works as expected.
🦀 To keep things simple, we use absolute paths for the font programs.
You will likely need to adjust the code to match your own system configuration.
⓵ The
pdf_02/build.rs module: This is a copy of the code from the
last article.
⓶ The
pdf_02/Cargo.toml:
This is a copy of the file from the
last article.
We also discuss it briefly in the
Text Shaping Investigative Code section.
⓷ The
pdf_02/src/subset_builder.rs:
This is also a copy from the
last article.
⓸ The
pdf_02/src/pdf_font_info.rs:
This module also comes from the
last article.
There are some minor refactorings: fields are now private, and public getters have been added.
⓹ The new
pdf_02/src/page_geometry.rs:
This module implements the discussion in the
A4 Page Geometry section. We define four margins.
Since there are no standard margins, it is reasonable to assume that margins can be
independent of one another; having four provides this flexibility. Increasing the value
of any of these four margins should result in a PDF with more pages.
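As an illustration of why four independent margins are useful, here is a small sketch of a page geometry type. The names Margins, content_width(), and content_height() are hypothetical and are not the definitions in page_geometry.rs. The page size uses the rounded A4 values 595 and 842.

```rust
/// Rounded A4 page size in PostScript points.
const A4_WIDTH_PT: f32 = 595.0;
const A4_HEIGHT_PT: f32 = 842.0;

/// Four independent page margins, in PostScript points.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Margins {
    top: f32,
    right: f32,
    bottom: f32,
    left: f32,
}

impl Margins {
    /// Convenience constructor: the same margin on all four sides.
    fn uniform(value: f32) -> Self {
        Margins { top: value, right: value, bottom: value, left: value }
    }
}

/// Horizontal space available for a line of text.
fn content_width(m: &Margins) -> f32 {
    A4_WIDTH_PT - m.left - m.right
}

/// Vertical space available for lines of text.
fn content_height(m: &Margins) -> f32 {
    A4_HEIGHT_PT - m.top - m.bottom
}
```

With a uniform margin of 57 points, the content area is 481 by 728 points. Increasing any one margin shrinks one of these two values, so fewer tokens fit on a line or fewer lines fit on a page.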
⓺ The new
pdf_02/src/text_layout.rs:
This is a refactored version of the main_text_layout.rs module as
discussed. It exposes the public
text_to_lines()
function.
👉 We will refer to the return value of the text_to_lines() function
as shaped lines.
💥 Please note, in the study module main_text_layout.rs we worked with only a
single line, which was essentially a paragraph. In this module, the input text contains
multiple paragraphs, i.e., multiple lines separated by newline characters. We first break
the entire input text into a vector of strings, lines_vec, using the
\n delimiter, and also keep the \n delimiters as entries in
lines_vec. When iterating over lines_vec, if an entry is either
\n or \r\n, we simply push a space (ASCII character 32) into
the returned shaped lines vector and continue to the next
lines_vec entry. These space-only shaped lines are rendered as blank lines
in the final PDF. As a consequence of this, when we copy the text out of the PDF,
including blank lines, the pasted text will not contain any blank lines: all newline
characters from the input text are lost. Most PDF documents I have examined
behave in a similar manner.
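The splitting step described above can be sketched as follows. This is my own illustration, not the code in text_layout.rs: split_keep_newlines() and blank_lines_as_spaces() are hypothetical names, and the second function skips the actual line breaking of each paragraph to keep the example short.

```rust
/// Split text on '\n', keeping each newline ("\n" or "\r\n") as its own entry.
fn split_keep_newlines(text: &str) -> Vec<String> {
    let mut entries = Vec::new();
    for piece in text.split_inclusive('\n') {
        // `piece` ends with "\n" or "\r\n", except possibly the last one.
        let body = piece.trim_end_matches(|c| c == '\r' || c == '\n');
        if !body.is_empty() {
            entries.push(body.to_string());
        }
        if piece.ends_with('\n') {
            entries.push(piece[body.len()..].to_string());
        }
    }
    entries
}

/// Replace newline entries with a single space, which is later rendered
/// as a blank line. Paragraph entries are passed through unchanged here.
fn blank_lines_as_spaces(entries: &[String]) -> Vec<String> {
    entries
        .iter()
        .map(|e| {
            if e == "\n" || e == "\r\n" {
                " ".to_string()
            } else {
                e.clone()
            }
        })
        .collect()
}
```

For the input "one\ntwo\r\n\nthree", the first function returns six entries: three paragraphs and three newline entries.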
Together, these modules establish the foundation for assembling text into properly sized and margined PDF pages.
⓻ The new
pdf_02/src/pdf_text.rs:
This module is responsible for reading the input text file and preparing the text-related
data for PDF output. The public API is the
prepare() method. The PDF-ready data are:
⑴
PdfTextContent::font_subset: Vec<u8>:
Defining the content of this vector is the responsibility of the
subset_builder.rs module as discussed.
We have also examined this task in detail in a previous article in this series.
The helper method
text_font_subset() is responsible for generating the content for
this PdfTextContent::font_subset: Vec<u8> field.
⑵
PdfTextContent::used_cids: Vec<u16>: We covered this
collection in the
last article.
In this article, we simply refactor the previous implementation into a new module and a
new struct, but the process remains the same.
The helper method
text_used_cids_glyph_bytes() is responsible for generating the content for
this PdfTextContent::used_cids: Vec<u16> field.
⑶
PdfTextContent::lines_glyph_bytes: Vec<Vec<u8>>:
The text_layout.rs module has already broken the input text into individual
lines that fit the given page width as described. Each
Vec<u8> in this vector is the glyph bytes representation
of a shaped line. We also encountered
glyph bytes in the
last article.
The vector of shaped lines is not directly useful for the PDF generation process. We
discard it after generating glyph bytes for each line and storing them in
the lines_glyph_bytes vector. This becomes the final text content written
to the PDF document.
The helper method
text_used_cids_glyph_bytes() is responsible for generating the content for
this PdfTextContent::lines_glyph_bytes: Vec<Vec<u8>> field.
⑷
PdfTextContent::copy_paste_unicodes: Vec<u16>: We also
implemented this in the
last article.
Please see the description of the make_to_unicode_cmap() function. Here, we
simply moved the data generation process into this module.
The helper method
text_copy_paste_unicodes() is responsible for generating the content for
this PdfTextContent::copy_paste_unicodes: Vec<u16> field.
⓼ The existing
pdf_02/src/pdf_gen.rs:
“Existing” here means it is a copy of the code from the
last article,
with changes:
⑴
The new PdfTextContent parameter replaces
the previous PdfPages.
⑵
In the ToUnicode map, the beginbfchar...endbfchar blocks now
contain at most 100 entries, as specified in the
PDF 1.7 Reference Document.
This is accomplished via a new helper function
tounicode_mapping().
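The 100-entry limit maps naturally onto slice chunking. The sketch below is not the tounicode_mapping() function from the repository: bfchar_blocks() is a hypothetical name, and it assumes the mappings are available as (CID, Unicode code unit) pairs of u16 values.

```rust
/// Emit `beginbfchar ... endbfchar` blocks with at most 100 entries each.
/// Each mapping is a (CID, Unicode code unit) pair, written as 4-digit hex.
fn bfchar_blocks(mappings: &[(u16, u16)]) -> String {
    let mut out = String::new();
    for chunk in mappings.chunks(100) {
        // The count before `beginbfchar` is the number of entries that follow.
        out.push_str(&format!("{} beginbfchar\n", chunk.len()));
        for (cid, unicode) in chunk {
            out.push_str(&format!("<{:04X}> <{:04X}>\n", cid, unicode));
        }
        out.push_str("endbfchar\n");
    }
    out
}
```

With 250 mappings, this produces three blocks containing 100, 100, and 50 entries.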
⑶
The function
prepare_page_content() was completely
refactored. It takes the PDF-ready glyph bytes from the
PdfTextContent::lines_glyph_bytes
vector and generates PDF pages.
The new logic should be self-explanatory. 💥 It is important to understand how the PDF
text operator Td behaves. Td tx ty moves the text
cursor relative to its current position by (tx, ty). It does
not set an absolute position on the page. The very first Td
after BT starts relative to the origin:
fn new_page(font_size_pt: f32) -> Vec<Operation> {
    vec![
        Operation::new("BT", vec![]),
        // Set font F1 at the requested font size
        Operation::new("Tf", vec!["F1".into(), font_size_pt.into()]),
        Operation::new("Td", vec![A4_DEFAULT_MARGINS.left.into(),
                                  a4_default_content_height().into()]), // start position
    ]
}
That is, for each new page, we position the text cursor at the top-left corner of the content area, inside the page margins.
Then we move down the page by line_height_pt PostScript points, and write
the line (or rather, its glyph bytes). We repeat this process until we reach the bottom
of the page: if current_y - line_height_pt <= A4_DEFAULT_MARGINS.bottom.
At that point, we flush the current PDF page and start a new one.
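The page-filling loop can be sketched independently of any PDF library. This is only an illustration of the logic described above, not the prepare_page_content() function: paginate() is a hypothetical name, and it records the absolute baseline of every line so that the effect of the relative Td moves is visible.

```rust
/// Split `lines` into pages. Each page starts with the cursor at `top_y`.
/// For every line the cursor first moves down by `line_height_pt`
/// (the effect of a relative `0 -line_height_pt Td`), then the line is written.
/// Each page is returned as (absolute y, line) pairs.
fn paginate<T: Clone>(
    lines: &[T],
    top_y: f32,
    bottom_margin: f32,
    line_height_pt: f32,
) -> Vec<Vec<(f32, T)>> {
    let mut pages = Vec::new();
    let mut page = Vec::new();
    let mut current_y = top_y;

    for line in lines {
        // Flush the page when the next line would reach the bottom margin.
        // The emptiness check avoids producing empty pages forever when
        // a single line is taller than the content area.
        if current_y - line_height_pt <= bottom_margin && !page.is_empty() {
            pages.push(std::mem::take(&mut page));
            current_y = top_y;
        }
        current_y -= line_height_pt;
        page.push((current_y, line.clone()));
    }
    if !page.is_empty() {
        pages.push(page);
    }
    pages
}
```

With a start position of y = 785, a bottom margin of 57, and a line height of 14.4 PostScript points, each full page holds 50 lines, so 120 lines produce pages of 50, 50, and 20 lines.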
⑷
Other functions also take
the new PdfTextContent parameter
instead of PdfPages, but their logic remains the same.
Together, these changes ensure that the PDF generation process integrates seamlessly with the new text content structures, paving the way for a complete end-to-end workflow from input text to final PDF output.
⓽ pdf_02/src/main.rs:
This module is brief and should be self-explanatory.
❼ Examine Generated PDFs with PDFXplorer
The screenshot below shows the content of the first PDF page on Windows:

The following screenshot shows the content of the first PDF page on Ubuntu:

We observe the following: the first text line starts at (dx=57, dy=785),
with units in PostScript points. Each subsequent line then begins at
(dx=0, dy=-14.400001), relative to the previous text position.
This is just basic text layout, and I don’t consider the final result acceptable for
production use. The code serves primarily as a learning exercise. Moving forward, we
will focus more deeply on layout.
🪟 On Windows, I have successfully built and installed
Pango, along with its two associated libraries:
GNU FriBidi
and CairoGraphics.
I plan to use Pango for text layout in future work.
For the time being, however, I am focusing on exploring additional text features such as bold, italic, mixed font sizes, and multiple font programs. There is still much to learn.
Thanks for reading! I hope this post helps others who are looking to deepen their understanding of PDF technology. As always—stay curious, stay safe 🦊
✿✿✿
Feature image sources:
- https://www.omgubuntu.co.uk/2024/03/ubuntu-24-04-wallpaper
- https://in.pinterest.com/pin/337277459600111737/
- https://www.rust-lang.org/
- https://www.pngitem.com/download/ibmJoR_rust-language-hd-png-download/
- https://ur.wikipedia.org/wiki/%D9%81%D8%A7%D8%A6%D9%84:HarfBuzz.svg
