Rust: PDFs — Pango and Cairo Layout — Supporting Bold, Italic, and Bold Italic Text | behai-nguyen software development learnings and documentation

This article implements support for bold, italic, and bold italic text in paragraphs. Following Markdown, three indicators are used: **, *, and ***. Adjacent and nested Markdown syntaxes, as well as escapes such as \* and \\, are supported. This article continues and extends the work from the eighth article. In addition to rendering all natural headers, the final PDF now styles paragraph text according to the Markdown instructions in the source text file.

🦀 Index of the Complete Series.



🚀 The code for this post is in the following GitHub repository: pdf_06_text_styling.

💡 Please note that Pango also supports HTML markup. I am not taking that route because I prefer to retain as much control as possible over how the input text is processed. For the same reason, I choose not to use any of the Rust Markdown parser crates, and instead implement a minimal parser that provides only the required support.

❶ The Parser

We describe the features the parser supports and some of its known limitations. The pdf_06_text_styling/src/inline_parser.rs test suite, in particular the test Markdown constants, should illustrate the parser’s capabilities.

Also, the pdf_06_text_styling/text/essay.txt file provides a complete example of the supported Markdown.

💡 Please note, the term marker event is used to refer to a valid opening marker followed by a valid closing marker.

⓵ Supported Features

● Adjacent marker: a sequence of marker events. For example, — **Tưởng Vĩnh Kính**, Hồ Chí Minh Tại *Trung Quốc*, Thượng Huyền dịch, ***trang 339***.

● Nested marker: some marker events are enclosed within an outer marker event. For example, **Không đọc *sử* không đủ tư cách nói chuyện *chính trị*.**

● Escaped: the character \ signifies that the character following it is escaped. For example, \*not bold\* is interpreted as the literal string *not bold*. \\Úc Đại Lợi\\ is interpreted as \Úc Đại Lợi\.
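The escape rule can be sketched in a few lines of plain Rust. This is only an illustration under stated assumptions, not the code in inline_parser.rs: `protect_escapes` and `restore_asterisks` are hypothetical names, and the handling of a lone backslash is a guess. The placeholder character \u{E000} is the one the parser overview later in this article mentions.

```rust
/// Placeholder for an escaped asterisk: a private-use code point that
/// should not occur in ordinary text (the article's parser uses U+E000).
const ESCAPED_ASTERISK: char = '\u{E000}';

/// Hypothetical helper: resolve `\*` and `\\` before any marker scanning.
/// `\*` becomes the placeholder, so later passes never mistake it for a
/// marker; `\\` becomes a single literal backslash.
fn protect_escapes(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.peek().copied() {
            Some('*') => {
                chars.next();
                out.push(ESCAPED_ASTERISK);
            }
            Some('\\') => {
                chars.next();
                out.push('\\');
            }
            // Assumption of this sketch: a backslash that escapes nothing
            // is kept as a literal backslash.
            _ => out.push('\\'),
        }
    }
    out
}

/// Hypothetical helper: turn placeholders back into literal asterisks
/// once all real markers have been stripped.
fn restore_asterisks(text: &str) -> String {
    text.replace(ESCAPED_ASTERISK, "*")
}
```

Running `protect_escapes` first and `restore_asterisks` last keeps escaped asterisks out of the way while markers are matched, which is the reason for using a placeholder at all.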

⓶ Known Limitations

● Uneven marker indicators: the result may not be what we expect.

**Tưởng Vĩnh Kính***: results in Tưởng Vĩnh Kính, followed by *.
***Tưởng Vĩnh Kính**: results in *Tưởng Vĩnh Kính.
***Tưởng Vĩnh Kính*: results in ** followed by Tưởng Vĩnh Kính.

● Bold nested inside italic: for example, *-- **Sir John Seeley**, 1885* is not supported. I discovered this at the last minute; it results in -- Sir John Seeley, 1885.

To get the intended effect of -- Sir John Seeley, 1885, use adjacent marker events: *--* **Sir John Seeley***, 1885*.

💥 It is best to construct marker events as cleanly as possible; ambiguous marker events can produce unexpected results.

Some software such as Visual Studio Code and https://markdownlivepreview.com/ do not suffer from these limitations. Bringing this parser up to par with such software is not my objective, and is beyond my capabilities as well. I only aim to support a subset of Markdown that is sufficient for creating presentable PDFs.

❷ Repository Layout

💡 Please note: on both Windows and Ubuntu, I’m running Rust version rustc 1.90.0 (1159e78c4 2025-09-14).

This is once again a one‑off project—I don’t plan to update it in future development. I want to keep a log of progress exactly as it occurred. Future code may copy this and make changes to it. I’ve placed the project under the pdf_06_text_styling directory. The structure is:

.
├── Cargo.toml
├── set_env.bat
├── config
│   └── config.toml
├── src
│   ├── config.rs
│   ├── document.rs
│   ├── font_utils.rs
│   ├── inline_parser.rs
│   ├── main.rs
│   ├── main_start_01.rs
│   ├── main_start_02.rs
│   └── page_geometry.rs
├── text
│   └── essay.txt
└── .vscode
    └── launch.json

We describe some modules in the following subsections. The rest will be covered in the sections that follow.

⓵ The src/page_geometry.rs module is copied unchanged from the Rust: PDFs — Text Rotation with Cairo and Pango article.
👉 Changing any margin value in the A4_DEFAULT_MARGINS constant will change the layout of the text in the PDF.

⓶ The src/config.rs module is copied unchanged from the Rust: PDFs — Pango and Cairo Layout — Supporting Headers article.

⓷ 💡 The code requires the Pango, HarfBuzz, Cairo, etc. libraries. 🐧 On Ubuntu, all required libraries are globally recognised. 🪟 On Windows, I haven’t added the paths for the libraries’ DLLs to the PATH environment variable. In each new Windows terminal session, I run the following once:

set PATH=C:\PF\harfbuzz\dist\bin\;%PATH%
set PATH=C:\PF\vcpkg\installed\x64-windows\bin\;%PATH%
set PATH=C:\PF\pango\dist\bin;C:\PF\cairo-1.18.4\dist\bin;C:\PF\fribidi\dist\bin;%PATH%

Alternatively, you can simply run set_env.bat.
After that, cargo run works as expected.

⓸ 💡 In the fifth article, we discussed the PKG_CONFIG_PATH user environment variable. This setting applies to all later articles. I did not mention it again from the sixth article onward. In the set_env.bat above, I include setting this variable so that we don’t forget it and avoid potential surprises.

⓹ The text/essay.txt file is copied from the last article, with Markdown added to the text in paragraphs.

❸ Text Styling In a Nutshell

Pango provides a powerful and straightforward approach to text styling. We can summarise it as follows: first, apply the base font as usual; next, determine the byte‑range of the sub‑text you want to style, and apply attributes to those byte‑ranges to achieve the desired effects. 🦀 To get bold italic text, apply both bold and italic attributes to the same byte‑range.
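Since the ranges are measured in bytes, it may help to see how a byte range differs from a character position in UTF-8 text. The following is a minimal sketch in plain Rust; `byte_range_of` and `char_index_of` are hypothetical helpers written for this illustration and are not part of the repository.

```rust
use std::ops::Range;

/// Hypothetical helper: byte range of the first occurrence of `needle`.
/// `str::find` reports a byte offset, which is the unit Pango's
/// attribute indices use.
fn byte_range_of(text: &str, needle: &str) -> Option<Range<usize>> {
    let start = text.find(needle)?;
    Some(start..start + needle.len())
}

/// For comparison: the character (not byte) position of the same match.
fn char_index_of(text: &str, needle: &str) -> Option<usize> {
    let start = text.find(needle)?;
    Some(text[..start].chars().count())
}
```

For ASCII-only text the two positions coincide. For Vietnamese text such as "Không đọc sử", the byte offset of "sử" is larger than its character position, because several of the preceding characters occupy more than one byte.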

We demonstrate this Pango approach in the pdf_06_text_styling/src/main_start_01.rs module. For the sake of simplicity, we use only single‑byte text: xy, bc, de. The new text‑styling code:

    // Assumes the layout text is "xy, bc, de" (10 single-byte characters).
    // Pango attribute ranges are half-open: the byte at the end index
    // is not included in the range.
    let attrs = pango::AttrList::new();

    // Bold over the whole string: bytes 0..10.
    let mut bold = AttrInt::new_weight(Weight::Bold);
    bold.set_start_index(0);
    bold.set_end_index(10);
    attrs.insert(bold);

    // Italic over "bc": bytes 4..6.
    let mut italic = AttrInt::new_style(Style::Italic);
    italic.set_start_index(4);
    italic.set_end_index(6);
    attrs.insert(italic);

    // Italic over "de": bytes 8..10.
    let mut italic = AttrInt::new_style(Style::Italic);
    italic.set_start_index(8);
    italic.set_end_index(10);
    attrs.insert(italic);

    layout.set_attributes(Some(&attrs));

💡 Please note: the index parameters passed to AttrInt::set_start_index() and AttrInt::set_end_index() are byte indices, not character indices, and UTF‑8 characters may span multiple bytes. The end index is exclusive: the byte at that position is not part of the range.

In the code above:

Bold is applied to the entire text, from x to e, inclusive.
Italic is applied to bc and de. Because bold already applies to the whole string, these segments become bold italic.
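To see how overlapping ranges combine, the overlay can be modelled without Pango. The sketch below is only an illustration and says nothing about Pango's internals: `Kind`, `Attr`, and `styled_runs` are hypothetical names, and ranges are treated as half-open, since Pango documents the end index as not included in the range.

```rust
/// Which toggle an attribute sets (a simplified stand-in for Pango's
/// weight and style attributes).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Bold,
    Italic,
}

/// A styling attribute over the half-open byte range `start..end`.
#[derive(Debug, Clone, Copy)]
struct Attr {
    kind: Kind,
    start: usize,
    end: usize,
}

/// Split `text` into runs that share the same (bold, italic) state.
fn styled_runs<'a>(text: &'a str, attrs: &[Attr]) -> Vec<(&'a str, bool, bool)> {
    // State of the byte at offset `i`: is any bold / italic range covering it?
    let state_at = |i: usize| {
        let on = |k: Kind| attrs.iter().any(|a| a.kind == k && a.start <= i && i < a.end);
        (on(Kind::Bold), on(Kind::Italic))
    };

    let mut runs: Vec<(&str, bool, bool)> = Vec::new();
    let mut run_start = 0;
    let mut current: Option<(bool, bool)> = None;

    // Walk character boundaries so slices never split a UTF-8 sequence.
    for (i, _) in text.char_indices() {
        let state = state_at(i);
        if let Some((bold, italic)) = current {
            if (bold, italic) != state {
                runs.push((&text[run_start..i], bold, italic));
                run_start = i;
            }
        }
        current = Some(state);
    }
    if let Some((bold, italic)) = current {
        runs.push((&text[run_start..], bold, italic));
    }
    runs
}
```

With bold over bytes 0..10 and italic over 4..6 and 8..10 of "xy, bc, de", this yields four runs: "xy, " and ", " are bold only, while "bc" and "de" are bold and italic. An italic range of 4..5 would cover only "b".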

The parser identifies these byte‑ranges automatically based on the positions of the marker events. Next, we look at the parser from an overview perspective.

❹ Overview of the Parser

The parser lives in the pdf_06_text_styling/src/inline_parser.rs module. Its API is simple:

pub fn parse_inline(markdown_text: &str) -> InlineParseResult

InlineParseResult encapsulates the result of parsing a single line (paragraph) of Markdown text. It exposes two pieces of data.

The first field is text: String. This is the text with all marker indicators (i.e. *) removed. Escaped asterisks are still represented by the 3‑byte character \u{E000}. Call the reserve_asterisk() function on this text to restore escaped * characters before giving it to Pango.

The second field is spans: Vec<Span>. The Span struct is defined in the pdf_06_text_styling/src/document.rs module. Each Span represents a byte‑range, as discussed earlier, of a slice in text and its associated style. Recall that ***bold italic*** produces two Spans over the same byte‑range: one for bold and one for italic, resulting in bold italic.

Stripping out all inline documentation and test‑related code, the actual parser is fewer than 300 lines. Given the amount of inline documentation, we will not discuss the parser code in detail here. The documentation and the test methods should be sufficient to guide your understanding of the implementation.
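To make the shape of the result concrete, here is a deliberately tiny parser that handles only flat, well-formed markers. It is a sketch under stated assumptions and is much simpler than inline_parser.rs: there is no nesting and no escape handling, mismatched marker runs are kept as literal text, and an unclosed opening marker is silently dropped. `Style`, `MiniSpan`, and `parse_flat` are hypothetical names. Following the description above, *** yields two spans over the same byte range.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Style {
    Bold,
    Italic,
}

/// A styled slice of the clean text, as the half-open byte range `start..end`.
#[derive(Debug, Clone, PartialEq)]
struct MiniSpan {
    start: usize,
    end: usize,
    style: Style,
}

/// Strip `*`, `**`, and `***` markers and record the byte ranges they enclosed.
fn parse_flat(markdown: &str) -> (String, Vec<MiniSpan>) {
    let mut text = String::new();
    let mut spans = Vec::new();
    // Currently open marker: (number of asterisks, start byte in `text`).
    let mut open: Option<(usize, usize)> = None;
    let mut chars = markdown.chars().peekable();

    while let Some(c) = chars.next() {
        if c != '*' {
            text.push(c);
            continue;
        }
        // Measure the whole run of consecutive asterisks.
        let mut len = 1;
        while chars.peek() == Some(&'*') {
            chars.next();
            len += 1;
        }
        if let Some((open_len, start)) = open {
            if open_len == len {
                // Matching closing marker: emit the span(s).
                let end = text.len();
                if len >= 2 {
                    spans.push(MiniSpan { start, end, style: Style::Bold });
                }
                if len != 2 {
                    spans.push(MiniSpan { start, end, style: Style::Italic });
                }
                open = None;
            } else {
                // Mismatched run: keep it as literal text in this sketch.
                text.push_str(&"*".repeat(len));
            }
        } else if len <= 3 {
            open = Some((len, text.len()));
        } else {
            text.push_str(&"*".repeat(len));
        }
    }
    (text, spans)
}
```

Note that the spans index into the clean text, not into the original Markdown, because that clean text is what Pango eventually receives.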

❺ A Simple Example On Using the Parser

We now look at a simple example of how to apply the parser. The code is intentionally minimal: it parses a single line of Markdown text and writes it to a PDF. It assumes that the final clean text fits on a single line, so no measurement or layout logic is required.

This example is the pdf_06_text_styling/src/main_start_02.rs module, which is a refactored version of the earlier pdf_06_text_styling/src/main_start_01.rs example:

● create_font_attrs(): a generic method that creates the styling attributes for the text. It is based on the code shown in a previous discussion.

● And in the main() function:

    let markdown_text = r"**Không đọc *sử* không đủ tư cách nói chuyện *chính trị*.** \*";
    // let markdown_text = "***Không đọc sử không đủ tư cách nói chuyện chính trị.***";
    // let markdown_text = "( **Chính Ðạo, *Việt Nam Niên Biểu*, *Tập 1A***, trang 347 )";

    let res = parse_inline(markdown_text);

    let attrs = pango::AttrList::new();
    for span in res.spans() {
        for attr in create_font_attrs(span) {
            attrs.insert(attr);
        }
    }
    layout.set_attributes(Some(&attrs));
    layout.set_text(&reserve_asterisk(res.text()));

Calls parse_inline() to parse the Markdown text.
Uses the resulting Spans to create the appropriate styles for each byte‑range, and applies those styles.
Calls reserve_asterisk() on the resulting clean text to restore any escaped asterisks, then gives Pango this final text to render using the selected font and applied styles.
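The loop above iterates over the result of create_font_attrs(), so one span can expand into several attributes. A Pango-free model of that expansion might look like the sketch below. Everything in it is an assumption made for illustration: the `SpanStyle` variants, the `AttrKind` and `AttrSpec` types, and the `attr_specs` function are hypothetical and may not match font_utils.rs. The checked conversion reflects that Pango's attribute indices are unsigned 32-bit integers, while Rust string offsets are usize.

```rust
use std::convert::TryFrom;

/// Assumed style variants; the real enum in document.rs may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SpanStyle {
    Bold,
    Italic,
    BoldItalic,
}

/// Stand-ins for a bold weight attribute and an italic style attribute.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AttrKind {
    WeightBold,
    StyleItalic,
}

/// One attribute to apply over the half-open byte range `start..end`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct AttrSpec {
    kind: AttrKind,
    start: u32,
    end: u32,
}

/// Expand one styled byte range into the attributes needed to render it.
/// Returns `None` if an offset does not fit into a `u32`.
fn attr_specs(start: usize, end: usize, style: SpanStyle) -> Option<Vec<AttrSpec>> {
    let start = u32::try_from(start).ok()?;
    let end = u32::try_from(end).ok()?;
    let kinds: &[AttrKind] = match style {
        SpanStyle::Bold => &[AttrKind::WeightBold],
        SpanStyle::Italic => &[AttrKind::StyleItalic],
        // Bold italic is simply both attributes over the same range.
        SpanStyle::BoldItalic => &[AttrKind::WeightBold, AttrKind::StyleItalic],
    };
    Some(kinds.iter().map(|&kind| AttrSpec { kind, start, end }).collect())
}
```

Each `AttrSpec` would then correspond to one attribute inserted into the attribute list, with its start and end indices set from the spec.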

Before we discuss the final main code, let’s briefly cover the auxiliary modules.

❻ The Auxiliary Modules

⓵ The pdf_06_text_styling/src/document.rs module — copied from the Rust: PDFs — Pango and Cairo Layout — Supporting Headers article, with some refactorings:

Added enum SpanStyle and struct Span — we covered these in the Overview of the Parser and A Simple Example On Using the Parser sections.

struct Block — in the Paragraph variant, a new field spans: Vec<Span> has been added, which we will discuss in a later section.

Removed line_height from struct PositionedBlock — we will discuss this in a later section.

⓶ The pdf_06_text_styling/src/font_utils.rs module — the code here is not new:

The previous article’s to_pango_description() function is copied over.
The create_font_attrs() function discussed in A Simple Example On Using the Parser.

We have now covered all the groundwork. Next, we discuss integrating the parser into the PDF creation process.

❼ The Main Code

The final module, pdf_06_text_styling/src/main.rs, is a copy of the previous article’s pdf_05_header/src/main.rs module, with some refactoring. We discuss those changes in the sections that follow.

● The parse_blocks_from_file() function — for paragraph text, we now assume it is Markdown and parse it accordingly:

	} else {
		let InlineParseResult { text, spans } = parse_inline(&line);
		blocks.push(Block::Paragraph { text, spans });
	}

We discussed spans in a previous section. With this information available, we now have all the data required for measuring and pagination.

● The new prepare_layout_text() function replaces the previous block_text() function. The code in this new function follows the approach we have already discussed, and should be self‑explanatory.

● The previous measure_block() and output_positioned_block() functions repeatedly created pango::Layout objects, set the font, and set the text in order to measure line heights, perform pagination, and finally render the output. In this article, we prepare everything once and cache it. These two functions then use the cached data to perform their work, rather than recalculating everything on the fly. We discuss this caching implementation next.

● The caching mechanism is made possible by the new struct PreparedBlock and the prepare_blocks() function, which returns a vector of PreparedBlock.

PreparedBlock: this struct represents a Pango-ready‑to‑render version of the semantic Block. The layout field contains complete layout data: individual lines derived from the Block::Paragraph’s text field that fit within the page width, right‑justified, and with font family, font size, and styling attributes already applied. Its line_heights vector stores the height of each individual line. Styling can cause line heights to vary, which is why we removed the line_height field from struct PositionedBlock, as previously discussed.

The new prepare_blocks() function is a simplified version of the earlier measure_block() function. For each semantic Block, it computes a Pango-ready PreparedBlock and finally returns a vector of PreparedBlock.

It follows naturally that the total number of PreparedBlocks should always match the number of Blocks, while there may be more PositionedBlocks.
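The relationship between the three kinds of block can be illustrated with a small Pango-free model. This is a sketch under heavy assumptions, not the code in main.rs: the struct fields are simplified stand-ins, the "measurement" just wraps text every `chars_per_line` characters at a fixed line height, and `paginate` is a hypothetical name. It shows the prepare-once idea and why one PreparedBlock can yield more than one PositionedBlock.

```rust
/// Simplified stand-ins for the article's types (fields are assumptions).
struct Block {
    text: String,
}

struct PreparedBlock {
    line_heights: Vec<f64>,
}

/// Lines `first_line..last_line` of one block, placed on one page.
#[derive(Debug, PartialEq)]
struct PositionedBlock {
    block_index: usize,
    page: usize,
    first_line: usize,
    last_line: usize,
}

/// Measure every block once. Real code would ask Pango for line extents;
/// here a line is simply `chars_per_line` characters of fixed height.
fn prepare_blocks(blocks: &[Block], chars_per_line: usize, line_height: f64) -> Vec<PreparedBlock> {
    let per_line = chars_per_line.max(1);
    blocks
        .iter()
        .map(|b| {
            let chars = b.text.chars().count();
            let lines = ((chars + per_line - 1) / per_line).max(1);
            PreparedBlock { line_heights: vec![line_height; lines] }
        })
        .collect()
}

/// Paginate using only the cached line heights; nothing is re-measured.
fn paginate(prepared: &[PreparedBlock], page_height: f64) -> Vec<PositionedBlock> {
    let mut out = Vec::new();
    let (mut page, mut used) = (0usize, 0.0f64);
    for (block_index, pb) in prepared.iter().enumerate() {
        let mut first_line = 0;
        for (i, &h) in pb.line_heights.iter().enumerate() {
            // The line does not fit: close the current piece, start a new page.
            if used + h > page_height && used > 0.0 {
                if i > first_line {
                    out.push(PositionedBlock { block_index, page, first_line, last_line: i });
                }
                page += 1;
                used = 0.0;
                first_line = i;
            }
            used += h;
        }
        let last_line = pb.line_heights.len();
        out.push(PositionedBlock { block_index, page, first_line, last_line });
    }
    out
}
```

In this model, two blocks of three and four lines on a page that holds four lines produce two PreparedBlocks but three PositionedBlocks, because the second block is split across a page boundary.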

● The new measure_block() function now receives, as its parameter, a reference to the vector of PreparedBlock returned by the prepare_blocks() function. It performs its measurements based on this vector.

● The new output_positioned_block() function now receives a reference to a PreparedBlock. The overall flow of the code remains largely unchanged.

The screenshots below show some PDF pages generated on 🐧 Ubuntu:

❽ What’s Next

Implementing the parser took a while, but it was satisfying to see it completed. The next feature I would like to support is images with captions, where images are specified using relative paths, similar to how it is done in LaTeX.

Thanks for reading! I hope this post helps others who are looking to deepen their understanding of PDF technology. As always—stay curious, stay safe 🦊

✿✿✿
