Skip to content

Conversation

@bjones1
Copy link

@bjones1 bjones1 commented Oct 28, 2025

This PR adds what I call "faithful mode" -- if the HTML to translate can't be exactly captured by Markdown, then this mode outputs the HTML instead. For example, the HTML snippet <em>Hello</em> becomes *Hello*, while <span id="foo">Hello</span> becomes Hello in the default translation mode (pure), but stays as <span id="foo">Hello</span> in faithful translation mode.

@letmutex
Copy link
Owner

Thanks for the contribution! This PR is quite large, reviewing!

// instances of two or more newlines with a newline, followed by escaped
// newlines. Tricky part: replace LFCR differently than CR. Probably with
// groups.
Regex::new(r#"(\r?\n\s*)(\r?\n\s*)"#).unwrap())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to get rid of regex? We just got pr #49 for removing regex. It would be nice if regex could be removed or opt-in.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look at this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and pushed.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! But didn't see the commit?

Regex::new(r#"(\r?\n\s*)(\r?\n\s*)"#).unwrap())
.replace_all(&s, |caps: &Captures| {
caps[1].to_string()
+ &(caps[2].replace("\r", "&#10;").replace("\n", "&#13;"))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two HTML entities seem to have been swapped? https://developer.mozilla.org/en-US/docs/Glossary/CRLF

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! Thanks. Fixed!

}

#[test]
fn code_blocks_with_lang_class_on_pre_tag() {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the faithful mode break this test? Should the default convert() still work?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test as written put the class attribute in the wrong place -- here's a screenshot from the CommonMark converter; note the location of the class attribute there:

image

So, this fixes the test to put the class attribute where it belongs.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is a special test case in #14. Some libraries don't write the language class in the code tag.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, thanks for pointing this out! I'll revert these changes to the test and run it in pure mode instead. I'll also add a link to #14 to explain the purpose of the test.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and pushed.

Copy link
Owner

@letmutex letmutex Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep this file unchanged? So we can easily pull the latest tests from https://github.com/mixmark-io/turndown/blob/master/test/index.html

Copy link
Author

@bjones1 bjones1 Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully agree that we should be using tests from turndown. However:

  1. The tests don't cover some important cases; they're also wrong in some cases. For example, 2d5cdcc fixes Turndown's use of _ for emphasis, since it doesn't work inside words; the correct option is to use *; see Emphasis doesn't work within a word with _ mixmark-io/turndown#441.
  2. The tests are updated rarely. There's been one commit changing it this year, another commit last year, then the next commit is from 2020.
  3. The changes I introduced are easy to merge in.

So, I'd suggest merging in changes when they are (infrequently) made. I re-reviewed the latest from Turndown and saw one test I'd accidentally removed; the other changes were errors, poorer test coverage, etc.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reasonable. Perhaps we can simply use pure mode in Rust for the turndown test cases.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually found several bugs using faithful mode on the Turndown test cases; only a few are really pure mode only. So, I'd prefer to keep these tests. In general, faithful mode is a stricter test and helps point out bugs that pure mode ignores.

@bjones1
Copy link
Author

bjones1 commented Oct 30, 2025

Thanks for the review! I've pushed changes that addresses two of the comments. Still thinking about replacing regex.

@bjones1
Copy link
Author

bjones1 commented Oct 30, 2025

I pushed a commit that replaced regex with equivalent code.

@bjones1
Copy link
Author

bjones1 commented Oct 31, 2025

I pushed a commit that replaced regex with equivalent code.

Oops, I realized I didn't push it. Pushed regex fixes plus test fixes.

@letmutex letmutex merged commit 987e180 into letmutex:main Nov 1, 2025
1 check passed
@letmutex
Copy link
Owner

letmutex commented Nov 1, 2025

All look good to me, Merged! Thanks again for the work!

@bjones1
Copy link
Author

bjones1 commented Nov 2, 2025

Thanks for the merge! I'm really excited to see this feature in htmd -- it's unique and provides some neat capabilities.

@bjones1 bjones1 mentioned this pull request Nov 5, 2025
@bjones1 bjones1 deleted the fixes branch November 10, 2025 21:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants