Skip to content

Conversation

@overlookmotel
Copy link
Contributor

@overlookmotel overlookmotel commented Nov 30, 2025

#144472 made str::floor_char_boundary a const function, but in doing so introduced a loop. This is unnecessary because the next UTF-8 character boundary can only be within a 4-byte range.

This produces excessive code for e.g. str.floor_char_boundary(20), because the loop is unrolled.

https://godbolt.org/z/5f3YsM6oK

This PR replaces the loop with 3 checks in series.

In addition to reducing code size in some cases, it also removes bounds checks from calling code when following floor_char_boundary with a call to String::truncate (which I assume might be a common pattern).

Notes

  • I tried using index.unchecked_sub(1), but found it doesn't remove bounds checks, whereas index.checked_sub(1).unwrap_unchecked() does. Surprising!

  • The assert_uncheckeds are required to elide checks from code following the floor_char_boundary call e.g.:

let index = string.floor_char_boundary(20);
string.truncate(index);

If this PR is accepted, I'll follow up with a similar PR for ceil_char_boundary.

Very happy to receive feedback and amend.

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Nov 30, 2025
@rustbot
Copy link
Collaborator

rustbot commented Nov 30, 2025

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@clarfonthey
Copy link
Contributor

clarfonthey commented Nov 30, 2025

Rather than manually unrolling the code here, it might be worth considering if the original implementation can be made const instead:

let lower_bound = index.saturating_sub(3);
let new_index = self.as_bytes()[lower_bound..=index]
.iter()
.rposition(|b| b.is_utf8_char_boundary());

I had this exact thought in the original implementation, which is why I explicitly only checked the four following bytes for a character boundary. Note that I haven't fully scrutinised the code and yours may be better regardless, but I figured I'd mention this for some extra context. I would expect LLVM to automatically unroll a 4-max-iteration loop in most contexts.

Comment on lines +416 to +439
let bytes = self.as_bytes();
let boundary_index = if bytes[index].is_utf8_char_boundary() {
index
} else {
let mut i = index;
while i > 0 {
if self.as_bytes()[i].is_utf8_char_boundary() {
break;
// SAFETY: `bytes[index]` is a UTF-8 continuation byte, therefore there must be a byte before it.
// Note: `index.unchecked_sub(1)` would be preferable, but it doesn't the remove bounds check
// from `bytes[previous_index]`.
let previous_index = unsafe { index.checked_sub(1).unwrap_unchecked() };
if bytes[previous_index].is_utf8_char_boundary() {
previous_index
} else {
// SAFETY: `bytes[index - 1]` is a UTF-8 continuation byte, therefore there must be a byte before it
let previous_previous_index = unsafe { index.checked_sub(2).unwrap_unchecked() };
if bytes[previous_previous_index].is_utf8_char_boundary() {
previous_previous_index
} else {
// UTF-8 character sequences are at most 4 bytes long.
// `bytes[index - 2]`, `bytes[index - 1]`, and `bytes[index]` are all continuation bytes,
// therefore `bytes[index - 3]` must be the 1st byte in a 4-byte UTF-8 character sequence.
debug_assert!(bytes[index - 3].is_utf8_char_boundary());
index - 3
}
i -= 1;
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deep nesting isn't ideal. You could try matching slice patterns:

let bytes = &self.as_bytes()[..=index];
let boundary_index = match bytes {
    [.., b] if b.is_utf8_char_boundary() => index,
    [.., b, _] if b.is_utf8_char_boundary() => index - 1,
    [.., b, _, _] if b.is_utf8_char_boundary() => index - 2,
    [.., b, _, _, _] if b.is_utf8_char_boundary() => index - 3,
    // SAFETY: TODO
    _ => unsafe {
        debug_assert!(bytes[index.saturating_sub(3)].is_utf8_char_boundary());
        unreachable_unchecked()
    }
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much nicer. I'll check if it optimizes equally as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's definitely more elegant, but it's definitely non-obvious to me that it could optimize well, since each arm has bounds check requirements that cannot be deduced from the previous.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, a failed bounds check implies the following ones also fail. That means that any bounds check that fails leads to the fallback arm, which is unreachable_unchecked. So, yes, it's extra steps compared to get_unchecked, but LLVM seems to deduce that fine: https://godbolt.org/z/jr48rjh9h (foo and bar get merged)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...oh, no, I was wrong. The unreachable_unchecked nukes any other checks. Cool!

(The scratch pad I was using: https://godbolt.org/z/46Msq6G85)

@okaneco
Copy link
Contributor

okaneco commented Nov 30, 2025

You might want to add a codegen test under tests/codegen-llvm to make sure the panics/slice fail checks that you're targeting are optimized out.

At the time, I tried to add a counter variable to limit the number of iterations but I wasn't able to remove the panics like the current implementation does.

slice_new and truncate_new still have panic paths when the index is variable. Variable index slice doesn't have a panic in the current implementation. Variable truncate has a panic for both implementations.
https://godbolt.org/z/T1TexYbE7

I wrote an example codegen test to help with testing this out (others may be able to improve it further).
You can run the tests with ./x test [path to file or directory] (save time by iterating over the IR/assembly in godbolt first).
Again, the variable truncate may not be possible to enable currently.
https://rustc-dev-guide.rust-lang.org/tests/compiletest.html#codegen-tests

Example codegen test file

// Test that panics and slice errors are elided from `floor_char_boundary` implementation.

//@ compile-flags: -Copt-level=3

#![crate_type = "lib"]

// CHECK-LABEL: @floor_fixed_index
#[no_mangle]
pub fn floor_fixed_index(s: &str) -> usize {
    // CHECK-NOT: panic
    s.floor_char_boundary(20)
}

// CHECK-LABEL: @floor_variable_index
#[no_mangle]
pub fn floor_variable_index(s: &str, i: usize) -> usize {
    // CHECK-NOT: panic
    s.floor_char_boundary(i)
}

// CHECK-LABEL: @floor_slice_fixed_index
#[no_mangle]
pub fn floor_slice_fixed_index(s: &str) -> &str {
    // CHECK-NOT: panic
    // CHECK-NOT: slice_error_fail
    let index = s.floor_char_boundary(20);
    &s[..index]
}

// CHECK-LABEL: @floor_slice_variable_index
#[no_mangle]
pub fn floor_slice_variable_index(s: &str, i: usize) -> &str {
    // CHECK-NOT: panic
    // CHECK-NOT: slice_error_fail
    let index = s.floor_char_boundary(i);
    &s[..index]
}

// CHECK-LABEL: @truncate_from_floor_fixed_index
#[no_mangle]
pub fn truncate_from_floor_fixed_index(s: &mut String) {
    // CHECK-NOT: panic
    let index = s.floor_char_boundary_new(20);
    s.truncate(index)
}

// CHECK-LABEL: @truncate_from_floor_variable_index
#[no_mangle]
pub fn truncate_from_floor_variable_index(s: &mut String, i: usize) {
    // CHECK-NOT: panic
    let index = s.floor_char_boundary_new(i);
    s.truncate(index)
}

@overlookmotel
Copy link
Contributor Author

slice_new and truncate_new still have panic paths when the index is variable. Variable index slice doesn't have a panic in the current implementation. Variable truncate has a panic for both implementations.

Yes, I noticed this today. Am hoping to eliminate all these panics. I feel like it should be possible, but... mysteries of the compiler...

@scottmcm
Copy link
Member

scottmcm commented Dec 6, 2025

  • I tried using index.unchecked_sub(1), but found it doesn't remove bounds checks, whereas index.checked_sub(1).unwrap_unchecked() does. Surprising!

This is the long-frustrating problem that sub nuw i32 %1, 1 gets turned into add i32 %1, -1 by LLVM and thus completely loses the "it's not going to overflow" information. (cc @nikic in case this real-life example is helpful)

Whereas x.check_sub(1).unwrap_unchecked() ends up with an llvm.assume(x >= 1) which doesn't get removed.

Comment on lines +421 to +422
// Note: `index.unchecked_sub(1)` would be preferable, but it doesn't the remove bounds check
// from `bytes[previous_index]`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious to me that it's better to do an unwrap_unchecked on the subtraction than just to do a get_unchecked on the array, if the goal is to remove the bounds checking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, this https://godbolt.org/z/d155x5hqj seems to work too:

impl StrExt for str {
    #[inline]
    fn floor_char_boundary_new(&self, index: usize) -> usize {
        if index >= self.len() {
            return self.len();
        }

        let bytes = self.as_bytes();
        let boundary_index = unsafe {
            if bytes.get_unchecked(index).is_utf8_char_boundary() {
                index
            } else if bytes.get_unchecked(index - 1).is_utf8_char_boundary() {
                index - 1
            } else if bytes.get_unchecked(index - 2).is_utf8_char_boundary() {
                index - 2
            } else if bytes.get_unchecked(index - 3).is_utf8_char_boundary() {
                index - 3
            } else {
                unreachable_unchecked()
            }
        };       

        // Inform compiler that returned index is `<= index` and on a char boundary.
        // This removes bounds check from e.g. `String::truncate` call after this.
        // While it looks redundant, it seems to make a difference as load of
        // `bytes[index - 3]` is actually removed above, not kept for the assume.
        // SAFETY: Calculations above only deduct from `index`, and cannot wrap around.
        // `boundary_index` is a char boundary.
        unsafe {
            assert_unchecked(bytes.get_unchecked(boundary_index).is_utf8_char_boundary());
        }

        boundary_index
    }
}

Copy link
Member

@scottmcm scottmcm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That godbolt demo is quite persuasive to me that this should work differently, but I think the critical thing missing is a codegen-llvm test that confirms that the new implementation doesn't have those problems -- because without that, it's just going to get broken again like it did before.

View changes since this review

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Dec 6, 2025
@rustbot
Copy link
Collaborator

rustbot commented Dec 6, 2025

Reminder, once the PR becomes ready for a review, use @rustbot ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants