Rust: Working with UTF-8 Text
Extracting a specific substring from UTF-8 text requires accurate byte indexes for where the first character starts and where the last character ends. I recently encountered this challenge and am sharing my approach to resolving it. While I am unsure whether this is the best solution, it serves my needs for now.
❶ Understanding UTF-8 Character Counting in Rust
A UTF-8 character occupies between 1 and 4 bytes, so the total number of bytes in a UTF-8 string is greater than or equal to the total number of characters; the two are equal only when every character is ASCII. The len() method returns the total number of bytes, whereas chaining String's chars() and count() methods, as in chars().count(), returns the total number of characters.
The chars() method provides an iterator over the characters in a string, and count() returns the total number of items in the iterator.
The following example demonstrates these methods:
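fn main() {
    // The trailing backslash joins the two source lines into one string.
    let str = String::from("'This Autumn Will \
        End' (秋の終わり): a poem by Yosano Akiko.");

    // len() counts bytes; chars().count() counts characters.
    let byte_count = str.len();
    let char_count = str.chars().count();

    println!("byte_count: {byte_count}");
    println!("char_count: {char_count}");
}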
In this example, the string contains 65 bytes (byte_count: 65) but only 55 characters (char_count: 55).
You can run this code in Rust Playground to see the output.
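To make the byte-versus-character difference more concrete, here is a minimal sketch that prints the UTF-8 size of a few sample characters and compares an ASCII-only string with a Japanese one. The helper name byte_and_char_counts and the sample characters are my own choices for illustration, not part of the original example.

```rust
/// Hypothetical helper: returns (byte count, character count) for a string.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // Sample characters chosen for illustration:
    // 'a' (ASCII), U+00E9 (e with acute accent), '秋', U+1F980 (crab emoji).
    for c in ['a', '\u{e9}', '秋', '\u{1F980}'] {
        println!("[{}] takes {} byte(s) in UTF-8", c, c.len_utf8());
    }

    // For an ASCII-only string the two counts match; for Japanese text they differ.
    for s in ["poem", "秋の終わり"] {
        let (bytes, chars) = byte_and_char_counts(s);
        println!("[{s}] bytes: {bytes}, chars: {chars}");
    }
}
```

The ASCII-only string reports the same value for both counts, while each of the five Japanese characters contributes three bytes.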
❷ Determining Byte Boundaries in a UTF-8 String
The iterator's nth(usize) method returns the char at the given zero-based position wrapped in an Option (None if the position is past the end of the string), while the char type's len_utf8() method returns the number of bytes the character occupies in UTF-8.
By applying the methods discussed earlier, we can determine the byte boundaries of each character, whether single-byte ASCII or multibyte, within a string. The example below illustrates this:
fn main() {
    let str = String::from("'This Autumn Will \
        End' (秋の終わり): a poem by Yosano Akiko.");

    let char_count = str.chars().count();

    let mut total_byte_count: usize = 0; // byte offset just past the current character
    let mut current: usize = 0; // zero-based character position
    let mut char_slicing_index = 0; // byte offset where the current character starts

    while current < char_count {
        // nth() walks a fresh iterator from the start of the string on every call.
        if let Some(c) = str.chars().nth(current) {
            let byte_count = c.len_utf8();
            total_byte_count += byte_count;

            println!("{}. char: [{}], byte size: [{}], char slicing index: [{}], total byte count: [{}]",
                current+1, c, byte_count, char_slicing_index, total_byte_count);

            char_slicing_index += byte_count;
        }

        current += 1;
    }
}
Run this example in Rust Playground to observe its output.
Sample Output (Shortened):
1. char: ['], byte size: [1], char slicing index: [0], total byte count: [1]
2. char: [T], byte size: [1], char slicing index: [1], total byte count: [2]
...
24. char: [(], byte size: [1], char slicing index: [23], total byte count: [24]
25. char: [秋], byte size: [3], char slicing index: [24], total byte count: [27]
26. char: [の], byte size: [3], char slicing index: [27], total byte count: [30]
27. char: [終], byte size: [3], char slicing index: [30], total byte count: [33]
28. char: [わ], byte size: [3], char slicing index: [33], total byte count: [36]
29. char: [り], byte size: [3], char slicing index: [36], total byte count: [39]
30. char: [)], byte size: [1], char slicing index: [39], total byte count: [40]
...
54. char: [o], byte size: [1], char slicing index: [63], total byte count: [64]
55. char: [.], byte size: [1], char slicing index: [64], total byte count: [65]
Explanation:
The char slicing index is the byte position at which a character starts, while the total byte count is the byte position just past its end. Because the total byte count accumulates the bytes of every character up to and including the current one, the two values can be used directly as the start and the exclusive end of a slice range.
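The same bookkeeping can also be done in a single pass over chars(), which avoids calling nth() once per character. The sketch below is one possible variant, not the only way to do it; the helper name char_byte_ranges is hypothetical, and it returns each character together with its start byte and exclusive end byte.

```rust
/// Hypothetical helper: one entry per character as (char, start_byte, end_byte),
/// where end_byte is exclusive.
fn char_byte_ranges(s: &str) -> Vec<(char, usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = 0; // plays the role of the "char slicing index"

    for c in s.chars() {
        let end = start + c.len_utf8(); // plays the role of the "total byte count"
        ranges.push((c, start, end));
        start = end;
    }

    ranges
}

fn main() {
    let text = "'This Autumn Will End' (秋の終わり): a poem by Yosano Akiko.";
    let ranges = char_byte_ranges(text);

    // Print the entries for characters 24 to 30 (one-based), around the Japanese text.
    for (n, (c, start, end)) in ranges.iter().enumerate().skip(23).take(7) {
        println!("{}. char: [{}], bytes: [{}..{}]", n + 1, c, start, end);
    }
}
```

Each (start, end) pair corresponds to the char slicing index and total byte count columns in the sample output above.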
❸ Extracting a Single Character Using String Slicing in Rust
In this example, we use the char slicing index and total byte count values to extract the second character of the string. The target character is T, which has a char slicing index of 1 and a total byte count of 2:
fn main() {
    let str = String::from("'This Autumn Will \
        End' (秋の終わり): a poem by Yosano Akiko.");

    // Bytes 1..2 cover exactly the second character, T.
    let substr = str[1..2].to_string();
    println!("[{substr}]");
}
Run this example in Rust Playground; the output will be [T] as expected.
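One caveat worth knowing: slicing with the [start..end] syntax panics when either index falls in the middle of a multibyte character. The standard library offers is_char_boundary() to check an index and get() to slice without panicking. The sketch below wraps get() in a hypothetical helper named safe_slice purely for illustration.

```rust
/// Hypothetical helper: returns the slice for bytes start..end, or None when
/// the range is out of bounds or does not fall on character boundaries.
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let text = "'This Autumn Will End' (秋の終わり): a poem by Yosano Akiko.";

    // Byte 24 is where 秋 starts; byte 25 falls inside its three bytes.
    println!("boundary at 24: {}", text.is_char_boundary(24));
    println!("boundary at 25: {}", text.is_char_boundary(25));

    // A range on character boundaries yields a slice; one that cuts 秋 does not.
    for (start, end) in [(24, 27), (25, 27)] {
        match safe_slice(text, start, end) {
            Some(sub) => println!("{start}..{end}: [{sub}]"),
            None => println!("{start}..{end}: not a valid character range"),
        }
    }
}
```

Using get() is a design choice for input whose indexes are not known to be valid; when the indexes come from the boundary tracking shown earlier, plain slicing works as well.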
❹ Extracting UTF-8 Substrings in Rust
In this final example, we extract two specific substrings:
- The 5 Japanese characters (秋の終わり), each of which occupies 3 bytes:
  - char slicing index of the first character: 24
  - total byte count of the last character: 39
- The second-to-last character (o):
  - char slicing index: 63
  - total byte count: 64
The following Rust code demonstrates this:
fn main() {
    let str = String::from("'This Autumn Will \
        End' (秋の終わり): a poem by Yosano Akiko.");

    // The 5 Japanese characters: 秋の終わり
    let substr = str[24..39].to_string();
    println!("[{substr}]");

    // The last o:
    let substr = str[63..64].to_string();
    println!("[{substr}]");
}
Run this example in Rust Playground, and the expected output will be:
[秋の終わり]
[o]
❺ In short, we keep track of the byte boundaries of the characters in a string in order to extract the desired substrings. I also looked into String's char_indices() method while writing this post, but I haven't yet figured out how it could be applied to this task.
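For what it is worth, here is one way char_indices() might be applied, offered as a sketch rather than a definitive answer. The method yields each character together with the byte offset where it starts, which is the same value as the char slicing index above. The helper substring_by_chars and its parameters are hypothetical names introduced for this illustration: it takes a zero-based character position and a character count, and converts them to a byte range.

```rust
/// Hypothetical helper: returns the substring of `count` characters starting
/// at zero-based character position `start`, or None if the range runs past
/// the end of the string.
fn substring_by_chars(s: &str, start: usize, count: usize) -> Option<&str> {
    // Byte offset where each character starts, followed by the end of the string.
    let mut offsets = s
        .char_indices()
        .map(|(byte_index, _)| byte_index)
        .chain(std::iter::once(s.len()));

    // Start byte of the first wanted character.
    let begin = offsets.nth(start)?;
    // Start byte of the character just after the last wanted one (exclusive end).
    let end = if count == 0 { begin } else { offsets.nth(count - 1)? };

    Some(&s[begin..end])
}

fn main() {
    let text = "'This Autumn Will End' (秋の終わり): a poem by Yosano Akiko.";

    // The 5 Japanese characters are characters 25 to 29 (one-based), so start at 24.
    if let Some(sub) = substring_by_chars(text, 24, 5) {
        println!("[{sub}]");
    }

    // The second-to-last character is character 54 (one-based), so start at 53.
    if let Some(sub) = substring_by_chars(text, 53, 1) {
        println!("[{sub}]");
    }
}
```

With this approach the caller thinks in character positions only, and the byte indexes 24..39 and 63..64 are derived inside the helper instead of being read off a printed table.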
Thank you for reading. I hope you find this post helpful. Stay safe, as always.
✿✿✿