Extracting a specific UTF-8 substring from text requires accurate byte indexes for the starting and ending characters. I recently encountered this challenge and am sharing my approach to resolving it. While I am unsure if this is the most optimal solution, it serves my needs for now.

Feature image: Rust: Working with UTF-8 Text

Understanding UTF-8 Character Counting in Rust

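UTF-8 is a variable-width encoding: each character takes between one and four bytes. The total number of bytes in a UTF-8 string is therefore greater than or equal to the total number of characters, and strictly greater as soon as the string contains a non-ASCII character. The len() method returns the total number of bytes, whereas String's chars()'s count() returns the total number of characters (Unicode scalar values).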

The chars() method provides an iterator over the characters in a string, and count() returns the total number of items in the iterator.

The following example demonstrates these methods:

fn main() {
    let str = String::from("'This Autumn Will \
        End' (秋の終わり): a poem by Yosano Akiko.");

    let byte_count = str.len();
    let char_count = str.chars().count();

    println!("byte_count: {byte_count}");
    println!("char_count: {char_count}");
}

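In this example, the string contains 65 bytes (byte_count: 65) but only 55 characters (char_count: 55). The difference of 10 comes from the five Japanese characters, which take 3 bytes each instead of 1. You can run this code in Rust Playground to see the output.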
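To make the variable width concrete, here is a small sketch that prints the UTF-8 byte size of one sample character from each width class. The helper name byte_widths and the sample characters are my own choices for illustration, not part of the original example.

```rust
// Hypothetical helper: the UTF-8 byte size of every character in `s`.
fn byte_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    // One sample character for each possible width: 1, 2, 3 and 4 bytes.
    let sample = "aß秋🍁";

    for (c, width) in sample.chars().zip(byte_widths(sample)) {
        println!("char: [{c}], byte size: [{width}]");
    }

    // len() counts bytes, chars().count() counts characters.
    println!("byte_count: {}", sample.len());
    println!("char_count: {}", sample.chars().count());
}
```

Only the first character is ASCII, so the byte count (10) ends up well above the character count (4).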

Determining Byte Boundaries in a UTF-8 String

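The iterator's nth(usize) method takes a zero-based index and returns an Option<char>: Some(c) for the character at that position, or None if the index is past the end of the string. The char type's len_utf8() method returns the number of bytes the character occupies when encoded as UTF-8. Note that every call to chars().nth() walks the string again from the beginning, which is fine for a short string like the one used here.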

By applying the methods discussed earlier, we can determine the byte boundaries of each character, whether single-byte ASCII or multibyte, within a string. The example below illustrates this:

fn main() {
    let str = String::from("'This Autumn Will \
        End' (秋の終わり): a poem by Yosano Akiko.");

    let char_count = str.chars().count();
    let mut total_byte_count: usize = 0;
    
    let mut current: usize = 0;
    let mut char_slicing_index = 0;
    
    while current < char_count {
        if let Some(c) = str.chars().nth(current) {
            let byte_count = c.len_utf8();
            
            total_byte_count += byte_count;
            
            println!("{}. char: [{}], byte size: [{}], char slicing index: [{}], total byte count: [{}]", 
                     current+1, c, byte_count, char_slicing_index, total_byte_count);
                     
            char_slicing_index += byte_count;
        }
        
        current += 1;
    }
}

Run this example in Rust Playground to observe its output.

Sample Output (Shortened):

1. char: ['], byte size: [1], char slicing index: [0], total byte count: [1]
2. char: [T], byte size: [1], char slicing index: [1], total byte count: [2]
...
24. char: [(], byte size: [1], char slicing index: [23], total byte count: [24]
25. char: [秋], byte size: [3], char slicing index: [24], total byte count: [27]
26. char: [の], byte size: [3], char slicing index: [27], total byte count: [30]
27. char: [終], byte size: [3], char slicing index: [30], total byte count: [33]
28. char: [わ], byte size: [3], char slicing index: [33], total byte count: [36]
29. char: [り], byte size: [3], char slicing index: [36], total byte count: [39]
30. char: [)], byte size: [1], char slicing index: [39], total byte count: [40]
...
54. char: [o], byte size: [1], char slicing index: [63], total byte count: [64]
55. char: [.], byte size: [1], char slicing index: [64], total byte count: [65]

Explanation:

The char slicing index is the byte position where each character starts, while the total byte count is the byte position just past its end, which is the exclusive upper bound a Rust range expects. The total byte count is also the running total of bytes up to and including the current character.
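The same start and end values can also be collected in a single pass over chars(), without calling nth() repeatedly. This is only a sketch: char_byte_ranges is a hypothetical helper name, and each returned range pairs a char slicing index (start) with a total byte count (end).

```rust
use std::ops::Range;

// Hypothetical helper: the byte range (start..end) of every character,
// computed in one pass by keeping a running byte offset.
fn char_byte_ranges(s: &str) -> Vec<Range<usize>> {
    let mut start = 0;
    s.chars()
        .map(|c| {
            let end = start + c.len_utf8();
            let range = start..end;
            start = end; // the next character starts where this one ends
            range
        })
        .collect()
}

fn main() {
    let text = "'This Autumn Will End' (秋の終わり): a poem by Yosano Akiko.";
    let ranges = char_byte_ranges(text);

    // Character number 25 (zero-based index 24) is 秋.
    let r = &ranges[24];
    println!("{:?} -> [{}]", r, &text[r.clone()]);
}
```

Storing ranges rather than two separate counters keeps each start paired with its end, so a range can be passed straight to the slicing syntax.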

Extracting a Single Character Using String Slicing in Rust

In this example, we use the char slicing index and the total byte count as the start and end of a byte range to extract the second character of the string. The target character is T, which has a char slicing index of 1 and a total byte count of 2:

fn main() {
    let str = String::from("'This Autumn Will \
        End' (秋の終わり): a poem by Yosano Akiko.");

    let substr = str[1..2].to_string();

    println!("[{substr}]");
}

Run this example in Rust Playground; the output will be [T] as expected.
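Slicing with the [start..end] syntax panics if either index falls inside a multibyte character, which is why the byte boundaries matter. As a hedged sketch, the standard str methods get() and is_char_boundary() can be used to check a range first. The wrapper safe_slice is a hypothetical name used only for this illustration.

```rust
// Hypothetical wrapper: returns None instead of panicking when the
// byte range does not line up with character boundaries.
fn safe_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let text = "'This Autumn Will End' (秋の終わり): a poem by Yosano Akiko.";

    // 24..27 covers exactly the three bytes of 秋.
    println!("{:?}", safe_slice(text, 24, 27)); // Some("秋")

    // Byte 25 is in the middle of 秋, so it is not a valid boundary.
    println!("{}", text.is_char_boundary(25)); // false

    // &text[25..27] would panic here; get() returns None instead.
    println!("{:?}", safe_slice(text, 25, 27)); // None
}
```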

Extracting UTF-8 Substrings in Rust

In this final example, we extract two specific substrings:

  • The 5 Japanese characters (秋の終わり), which are multibyte characters of 3 bytes each:
    • char slicing index of the first character (秋): 24
    • total byte count of the last character (り): 39
  • The second-to-last character (o):
    • char slicing index: 63
    • total byte count: 64

The following Rust code demonstrates this:

fn main() {
    let str = String::from("'This Autumn Will \
        End' (秋の終わり): a poem by Yosano Akiko.");

    // The 5 Japanese characters: 秋の終わり
    let substr = str[24..39].to_string();

    println!("[{substr}]");

    // The last o:
    let substr = str[63..64].to_string();
    
    println!("[{substr}]");    
}

Run this example in Rust Playground, and the expected output will be:

[秋の終わり]
[o]

So basically, we are keeping track of the byte boundaries of characters in a string to extract desired substrings. I also looked into the String's char_indices() method while writing this post, but I haven’t yet figured out how it could be applied to this task.
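As a possible starting point, char_indices() yields each character together with its starting byte offset, which is the same value as the char slicing index above. Below is a minimal sketch under my own assumptions: substring_by_chars is a hypothetical helper name, and its start and end arguments are zero-based character positions with an exclusive end.

```rust
// Hypothetical helper: extract the substring covering characters
// start..end (zero-based character positions, end exclusive).
// Returns None if the positions are out of range or start > end.
fn substring_by_chars(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }

    // Starting byte offset of every character, followed by s.len()
    // so that the end of the last character is also a known boundary.
    let mut bounds = s
        .char_indices()
        .map(|(byte_index, _)| byte_index)
        .chain(std::iter::once(s.len()));

    let from = bounds.nth(start)?;
    let to = if end == start {
        from
    } else {
        // nth() is relative to what the iterator has not yet consumed.
        bounds.nth(end - start - 1)?
    };

    s.get(from..to)
}

fn main() {
    let text = "'This Autumn Will End' (秋の終わり): a poem by Yosano Akiko.";

    // Characters 25 to 29 (zero-based 24..29) are the Japanese ones.
    println!("{:?}", substring_by_chars(text, 24, 29));

    // The second-to-last character (zero-based 53..54).
    println!("{:?}", substring_by_chars(text, 53, 54));
}
```

With this approach the caller thinks in character positions, and the byte boundaries (24..39 and 63..64 here) are derived inside the helper instead of being tracked by hand.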

Thank you for reading. I hope you find this post helpful. Stay safe, as always.

✿✿✿

Feature image sources: