-
Notifications
You must be signed in to change notification settings - Fork 16
Add faithful translation mode #54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for the contribution! This PR is quite large, reviewing! |
src/element_handler/mod.rs
Outdated
| // instances of two or more newlines with a newline, followed by escaped | ||
| // newlines. Tricky part: replace LFCR differently than CR. Probably with | ||
| // groups. | ||
| Regex::new(r#"(\r?\n\s*)(\r?\n\s*)"#).unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to get rid of regex? We just got pr #49 for removing regex. It would be nice if regex could be removed or opt-in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a look at this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done and pushed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! But didn't see the commit?
src/element_handler/mod.rs
Outdated
| Regex::new(r#"(\r?\n\s*)(\r?\n\s*)"#).unwrap()) | ||
| .replace_all(&s, |caps: &Captures| { | ||
| caps[1].to_string() | ||
| + &(caps[2].replace("\r", " ").replace("\n", " ")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two HTML entities seem to have been swapped? https://developer.mozilla.org/en-US/docs/Glossary/CRLF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! Thanks. Fixed!
| } | ||
|
|
||
| #[test] | ||
| fn code_blocks_with_lang_class_on_pre_tag() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the faithful mode break this test? Should the default convert() still work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this is a special test case in #14. Some libraries don't write the language class in the code tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks for pointing this out! I'll revert these changes to the test and run it in pure mode instead. I'll also add a link to #14 to explain the purpose of the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed and pushed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we keep this file unchanged? So we can easily pull the latest tests from https://github.com/mixmark-io/turndown/blob/master/test/index.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fully agree that we should be using tests from turndown. However:
- The tests don't cover some important cases; they're also wrong in some cases. For example, 2d5cdcc fixes Turndown's use of
_for emphasis, since it doesn't work inside words; the correct option is to use*; see Emphasis doesn't work within a word with_mixmark-io/turndown#441. - The tests are updated rarely. There's been one commit changing it this year, another commit last year, then the next commit is from 2020.
- The changes I introduced are easy to merge in.
So, I'd suggest merging in changes when they are (infrequently) made. I re-reviewed the latest from Turndown and saw one test I'd accidentally removed; the other changes were errors, poorer test coverage, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reasonable. Perhaps we can simply use pure mode in Rust for the turndown test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually found several bugs using faithful mode on the Turndown test cases; only a few are really pure mode only. So, I'd prefer to keep these tests. In general, faithful mode is a stricter test and helps point out bugs that pure mode ignores.
|
Thanks for the review! I've pushed changes that addresses two of the comments. Still thinking about replacing regex. |
|
I pushed a commit that replaced |
Oops, I realized I didn't push it. Pushed regex fixes plus test fixes. |
|
All look good to me, Merged! Thanks again for the work! |
|
Thanks for the merge! I'm really excited to see this feature in htmd -- it's unique and provides some neat capabilities. |

This PR adds what I call "faithful mode" -- if the HTML to translate can't be exactly captured by Markdown, then this mode outputs the HTML instead. For example, the HTML snippet
<em>Hello</em>becomes*Hello*, while<span id="foo">Hello</span>becomesHelloin the default translation mode (pure), but stays as<span id="foo">Hello</span>in faithful translation mode.