@notJoon
Contributor

@notJoon notJoon commented Jun 8, 2024

Description

  1. The Debug implementation for Match has been updated to use DebugHaystack. This provides a way to handle the formatting of &[u8] for debug output.
  • Valid UTF-8 characters are output as is.
  • Invalid UTF-8 bytes are output as hex escape sequences (\xHH).
  • ASCII escape characters (e.g., \t, \n) are properly escaped.
  2. Additional test cases have been added.
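
The formatting rules listed above can be sketched with the standard library alone. This is a minimal illustration, not the `DebugHaystack` implementation: the names `escape_haystack` and `push_escaped` are hypothetical, and the uppercase `\xHH` form and the exact set of escaped control characters are assumptions based on the description.

```rust
use std::fmt::Write;

/// Hypothetical helper (not part of the regex crate): renders `bytes` for
/// debug output, keeping valid UTF-8 as is and escaping invalid bytes.
fn escape_haystack(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    while !bytes.is_empty() {
        match std::str::from_utf8(bytes) {
            Ok(valid) => {
                // Everything that remains is valid UTF-8.
                push_escaped(&mut out, valid);
                break;
            }
            Err(err) => {
                // The prefix before the error is guaranteed to be valid UTF-8.
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                push_escaped(&mut out, std::str::from_utf8(valid).unwrap());
                // `error_len()` is `None` when the input ends in the middle of
                // a multi-byte sequence; escape all remaining bytes then.
                let bad_len = err.error_len().unwrap_or(rest.len());
                for b in &rest[..bad_len] {
                    write!(out, "\\x{:02X}", b).unwrap();
                }
                bytes = &rest[bad_len..];
            }
        }
    }
    out
}

/// Hypothetical helper: copies valid text, escaping ASCII control characters.
fn push_escaped(out: &mut String, s: &str) {
    for ch in s.chars() {
        match ch {
            '\t' => out.push_str("\\t"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            c if c.is_ascii_control() => write!(out, "\\x{:02X}", c as u32).unwrap(),
            c => out.push(c),
        }
    }
}
```

Under these assumptions, the bytes of `café` come back unchanged, while `b"\xFFworld"` is rendered as `\xFFworld`.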

fmt.field("bytes", &s);

let bytes = self.as_bytes();
let formatted = bytes_to_string_with_invalid_utf8_escaped(bytes);
Member

Can you use regex_automata::util::escape::DebugHaystack instead? It will basically do what you have here, but will only escape invalid UTF-8. What you've implemented here will escape not only invalid UTF-8, but all UTF-8 that isn't ASCII. (I think that would be a cure worse than the disease.)

Contributor Author

Modified to use DebugHaystack. I thought there would be such a feature but couldn't find it. Thanks for your suggestion. 88112b3

debug_str,
r#"Match { start: 7, end: 13, bytes: "\\xFFworld" }"#
);
}
Member

Please add some tests with non-ASCII UTF-8.

Contributor Author

Added along with other tests.
d18841e

fn bytes_to_string_with_invalid_utf8_escaped(bytes: &[u8]) -> String {
let mut result = String::new();
for &byte in bytes {
if byte.is_ascii() {
Member

> outputs valid UTF-8 characters as is

This is why what you said isn't accurate here. This only outputs ASCII characters as-is. Everything else, including valid UTF-8 that isn't ASCII, is emitted as escape byte sequences.
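
To make the point concrete, here is a minimal sketch of a byte-wise escaper that branches on `byte.is_ascii()`. It is a hypothetical illustration, not the code from this pull request (the diff excerpt above is truncated); the name `escape_bytewise` and the `\xHH` output form are assumptions.

```rust
use std::fmt::Write;

/// Hypothetical byte-wise escaper, written only to illustrate the review
/// comment. It inspects one byte at a time, so it cannot tell a valid
/// multi-byte UTF-8 sequence apart from invalid bytes.
fn escape_bytewise(bytes: &[u8]) -> String {
    let mut result = String::new();
    for &byte in bytes {
        if byte.is_ascii() {
            // ASCII bytes are emitted as is.
            result.push(byte as char);
        } else {
            // Every non-ASCII byte is escaped, valid UTF-8 or not.
            write!(result, "\\x{:02X}", byte).unwrap();
        }
    }
    result
}
```

With this approach the valid string `café` is rendered as `caf\xC3\xA9`, because `é` is encoded as the two non-ASCII bytes `0xC3 0xA9`. Decoding the input as UTF-8 first and escaping only the bytes that fail to decode avoids this.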

@notJoon notJoon requested a review from BurntSushi June 9, 2024 01:11
Member

@BurntSushi BurntSushi left a comment

Thanks!

@BurntSushi BurntSushi merged commit 1f9f9cc into rust-lang:master Jun 9, 2024
@BurntSushi
Member

This PR is on crates.io in regex 1.10.5.
