<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Jonathan Hedley</title>
  <link href="https://jhedley.com/feed.xml" rel="self"></link>
  <updated>2025-03-18T00:43:49Z</updated>
  <author>
    <name>Jonathan Hedley</name>
    <uri>https://jhedley.com/</uri>
  </author>
  <id>https://jhedley.com/</id>
  <icon>https://jhedley.com/favicon.ico</icon>
  <entry>
    <title>On Adding HTTP/2 Support to jsoup</title>
    <link href="https://jhedley.com/posts/jsoup-http2"></link>
    <id>https://jhedley.com/posts/jsoup-http2</id>
    <updated>2025-03-18T00:43:49Z</updated>
    <summary>jsoup 1.19.1 adds seamless HTTP/2 support using Java&#39;s HttpClient API, multi-release JARs (MRJARs), and maintains Android compatibility. Notes on the design patterns, testing strategies, and how to enable HTTP/2 in your projects.</summary>
    <content type="html">&lt;div class=&#34;meta pubdate&#34;&gt;&lt;time datetime=&#34;2025-03-08&#34;&gt;Mar 8, 2025&lt;/time&gt;&lt;/div&gt;&#xA;&#xA;&lt;p&gt;I recently released &lt;a href=&#34;//jsoup.org/news/release-1.19.1&#34;&gt;jsoup 1.19.1&lt;/a&gt;, which includes support for making HTTP/2 requests through jsoup’s &lt;a href=&#34;//jsoup.org/apidocs/org/jsoup/Jsoup.html#connect(java.lang.String)&#34;&gt;&lt;code&gt;Connection&lt;/code&gt;&lt;/a&gt; client. HTTP/2 support had been on my mind for a while, and although implementing it proved relatively straightforward, it involved some interesting design and integration work. The resulting pattern will be reusable for adding other future features, so I thought I’d share some notes on the process.&lt;/p&gt;&#xA;&lt;p&gt;As background, jsoup handles HTTP requests using Java’s built-in &lt;a href=&#34;//docs.oracle.com/javase/8/docs/api/java/net/HttpURLConnection.html&#34;&gt;&lt;code&gt;HttpUrlConnection&lt;/code&gt;&lt;/a&gt;, which has existed since time immemorial. When Java introduced HTTP/2 in Java 11, rather than extending the existing &lt;code&gt;HttpUrlConnection&lt;/code&gt; API, it created a completely new &lt;a href=&#34;//openjdk.org/groups/net/httpclient/intro.html&#34;&gt;&lt;code&gt;HttpClient&lt;/code&gt;&lt;/a&gt; interface. This meant existing code couldn’t benefit from HTTP/2 simply by upgrading the JDK. 
But the new API did clean up some old warts, and it provides a path for future version upgrades.&lt;/p&gt;&#xA;&lt;div class=&#34;notegroup&#34;&gt;&lt;p class=&#34;note&#34;&gt;&lt;a href=&#34;//developer.android.com/studio/write/java11-nio-support-table&#34;&gt;Android NIO desugaring&lt;/a&gt; is a bytecode transformation process that updates syntax and adds to the classpath.&lt;/p&gt;&lt;p&gt;Unfortunately, Android chose not to implement Java’s new &lt;code&gt;HttpClient&lt;/code&gt; interface, despite otherwise robust support for Java 8 and Java 11 language features through their “desugaring” implementation.&lt;/p&gt;&lt;/div&gt;&#xA;&#xA;&lt;p&gt;A key goal in adding HTTP/2 support was to make it a seamless, drop-in upgrade for jsoup users. At the same time, I wanted to maintain backward compatibility, keep support for Java 8 (the minimum required by jsoup), and continue full Android compatibility.&lt;/p&gt;&#xA;&lt;p&gt;So, how to square this circle?&lt;/p&gt;&#xA;&lt;div class=&#34;notegroup&#34;&gt;&lt;p class=&#34;note&#34;&gt;&lt;a href=&#34;//docs.oracle.com/javase/10/docs/specs/jar/jar.html&#34;&gt;Multi-release JAR files&lt;/a&gt; are a feature of the JAR file format that allows a single JAR file to support multiple major versions of the Java platform.&lt;/p&gt;&lt;p&gt;Java offers multi-release JARs (“MRJARs”), which allow code partitioning for different Java versions. The JVM&#xA;runtime just ignores classes targeted for newer Java versions than the one it’s running on.&lt;/p&gt;&lt;/div&gt;&#xA;&#xA;&lt;p&gt;To leverage this, I&#xA;refactored jsoup’s &lt;code&gt;HttpConnection&lt;/code&gt; implementation by abstracting out the &lt;code&gt;HttpURLConnection&lt;/code&gt; logic into a delegate &lt;code&gt;RequestExecutor&lt;/code&gt; abstract class. Then, I provided two implementations: one for &lt;code&gt;HttpURLConnection&lt;/code&gt;, and another for &lt;code&gt;HttpClient&lt;/code&gt; placed exclusively in the Java 11 version directory. 
This structure ensures that earlier Java versions continue to compile and run jsoup without issues.&lt;/p&gt;&#xA;&lt;p&gt;To maintain Android compatibility, I created a &lt;code&gt;RequestDispatch&lt;/code&gt; provider that attempts to instantiate the &lt;code&gt;HttpClient&lt;/code&gt; implementation and falls back to &lt;code&gt;HttpURLConnection&lt;/code&gt; if necessary.&lt;/p&gt;&#xA;&lt;p&gt;You can see it come together in the Pull Request: &lt;a href=&#34;//github.com/jhy/jsoup/pull/2257&#34;&gt;Support HTTP/2 requests via HttpClient&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;div class=&#34;tree card&#34;&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;java/&#xA;    &lt;ul&gt;&#xA;        &lt;li&gt;&lt;a href=&#34;//github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/helper/HttpConnection.java&#34;&gt;HttpConnection&lt;/a&gt;&lt;/li&gt;&#xA;        &lt;li&gt;&lt;a href=&#34;//github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/helper/RequestDispatch.java&#34;&gt;RequestDispatch&lt;/a&gt;&lt;/li&gt;&#xA;        &lt;li&gt;&lt;a href=&#34;//github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/helper/RequestExecutor.java&#34;&gt;RequestExecutor&lt;/a&gt;&lt;/li&gt;&#xA;        &lt;li&gt;&lt;a href=&#34;//github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/helper/UrlConnectionExecutor.java&#34;&gt;UrlConnectionExecutor&lt;/a&gt;&lt;/li&gt;&#xA;    &lt;/ul&gt;&#xA;&lt;/li&gt;&lt;li&gt;java11/&#xA;    &lt;ul&gt;&#xA;        &lt;li&gt;&lt;a href=&#34;//github.com/jhy/jsoup/blob/master/src/main/java11/org/jsoup/helper/HttpClientExecutor.java&#34;&gt;HttpClientExecutor&lt;/a&gt;&lt;/li&gt;&#xA;    &lt;/ul&gt;&#xA;&lt;/li&gt;&lt;/ul&gt;&#xA;&lt;/div&gt;&#xA;&lt;p&gt;One significant advantage of HTTP/2 for clients is faster repeated requests to the same host. In Java’s &lt;code&gt;HttpClient&lt;/code&gt;, this is implemented by maintaining a connection pool in an &lt;code&gt;HttpClient&lt;/code&gt; instance. 
I integrated this capability into jsoup’s existing &lt;a href=&#34;//jsoup.org/apidocs/org/jsoup/Jsoup.html#newSession()&#34;&gt;&lt;code&gt;Jsoup.newSession()&lt;/code&gt;&lt;/a&gt; method, as seen in &lt;a href=&#34;//github.com/jhy/jsoup/commit/a62c7f37d4132f3180bfef3e0d872d8e3cf87b5e&#34;&gt;this commit&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;h2 id=&#34;testing-the-implementation&#34;&gt;Testing the Implementation&lt;/h2&gt;&#xA;&lt;p&gt;jsoup already has comprehensive unit and integration tests for its connection code, and I wanted to ensure that the new &lt;code&gt;HttpClient&lt;/code&gt; implementation had at least the same coverage. This proved very simple to implement: I just added test classes (in the Java 11 version directory) that extend the existing test classes (and so execute all the existing tests in a new context), and use a setup method to toggle on the &lt;code&gt;HttpClient&lt;/code&gt; implementation. Other than a few implementation nuances, all the existing tests pass correctly.&lt;/p&gt;&#xA;&lt;p&gt;You can see the testing approach here:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;&lt;a href=&#34;//github.com/jhy/jsoup/blob/master/src/test/java/org/jsoup/integration/ConnectTest.java&#34;&gt;Integration tests&lt;/a&gt;&lt;/li&gt;&#xA;&lt;li&gt;&lt;a href=&#34;//github.com/jhy/jsoup/pull/2257/files#diff-5f07253b5c10d568299a7ccd25b803b7a020148b7f4f874b94dfd808350b8c04R1&#34;&gt;HttpClient-specific tests&lt;/a&gt;&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;I found using &lt;a href=&#34;//github.com/jenv/jenv&#34;&gt;&lt;code&gt;jenv&lt;/code&gt;&lt;/a&gt; invaluable to flip between different Java environments and verify everything worked OK.&lt;/p&gt;&#xA;&lt;hr/&gt;&#xA;&lt;p&gt;Given the significance of this change, the HTTP/2 feature in jsoup 1.19.1 is currently gated behind an opt-in flag (&lt;code&gt;jsoup.useHttpClient&lt;/code&gt;). Users can enable it to test and uncover potential integration issues. 
I’ve been running it in production (e.g., on &lt;a href=&#34;//try.jsoup.org/&#34;&gt;Try jsoup&lt;/a&gt;) and have found it stable and performant. Once it’s had more time to bake, I’ll enable it by default in future releases.&lt;/p&gt;&#xA;&lt;div class=&#34;notegroup&#34;&gt;&lt;p class=&#34;note&#34;&gt;See the &lt;a href=&#34;//en.wikipedia.org/wiki/HTTP/3&#34;&gt;Wikipedia&lt;/a&gt; article for more details on HTTP/3.&lt;br/&gt;&lt;br/&gt;A good backgrounder on the improvements between 1.1, 2, and 3 is: &lt;i&gt;&lt;a href=&#34;//httptoolkit.com/blog/http3-quic-open-source-support-nowhere/&#34;&gt;HTTP/3 is everywhere but nowhere&lt;/a&gt;&lt;/i&gt;.&lt;/p&gt;&lt;p&gt;There’s also a Java Enhancement Proposal (JEP) underway to &lt;a href=&#34;//bugs.openjdk.org/browse/JDK-8291976&#34;&gt;introduce HTTP/3 support to Java’s &lt;code&gt;HttpClient&lt;/code&gt;&lt;/a&gt;. This will enable jsoup to easily upgrade to HTTP/3 once it’s released.&lt;/p&gt;&lt;/div&gt;&#xA;&#xA;&lt;h2 id=&#34;try-it-out&#34;&gt;Try it out!&lt;/h2&gt;&#xA;&lt;p&gt;To enable HTTP/2 requests, make sure you upgrade to &lt;a href=&#34;//jsoup.org/download&#34;&gt;jsoup 1.19.1&lt;/a&gt;, and add this line to your request code:&lt;/p&gt;&#xA;&lt;pre class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kd&#34;&gt;static&lt;/span&gt;&lt;span class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;    &lt;/span&gt;&lt;span class=&#34;n&#34;&gt;System&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;na&#34;&gt;setProperty&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;jsoup.useHttpClient&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span 
class=&#34;w&#34;&gt; &lt;/span&gt;&lt;span class=&#34;s&#34;&gt;&amp;#34;true&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;);&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;w&#34;&gt;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;&lt;span class=&#34;w&#34;&gt;&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This will enable HTTP/2 support for all requests made through jsoup.&lt;/p&gt;&#xA;&lt;p&gt;If you find any issues, please report them on the &lt;a href=&#34;//github.com/jhy/jsoup/issues?q=sort%3Acreated-desc&#34;&gt;jsoup issue tracker&lt;/a&gt;.&lt;/p&gt;&#xA;&#xA;    </content>
  </entry>
  <entry>
    <title>jsoup SAX + DOM StreamParser</title>
    <link href="https://jhedley.com/links/streamparser-sax-cookbook"></link>
    <id>https://jhedley.com/links/streamparser-sax-cookbook</id>
    <updated>2025-03-02T04:11:30Z</updated>
    <summary>StreamParser is a hybrid Java SAX + DOM parser in jsoup, allowing efficient, incremental parsing without high memory usage.</summary>
    <content type="html">&lt;div class=&#34;meta pubdate&#34;&gt;&lt;time datetime=&#34;2025-03-02&#34;&gt;Mar 2, 2025&lt;/time&gt;&lt;/div&gt;&#xA;&lt;p&gt;I’ve added a cookbook example for jsoup’s StreamParser, that demonstrates how to use this hybrid &lt;a href=&#34;https://jsoup.org/cookbook/input/streamparser-dom-sax&#34;&gt;Java SAX + DOM parser&lt;/a&gt;. Based on examples from the original &lt;a href=&#34;https://github.com/jhy/jsoup/pull/2096&#34;&gt;PR&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;h2 id=&#34;problem&#34;&gt;Problem&lt;/h2&gt;&#xA;&lt;p&gt;You need to parse an HTML or XML document that is too large to fit entirely into memory, or you want to process elements progressively as they are encountered. A typical use case is extracting specific elements from a large document, or handling streamed HTML from a network source efficiently.&lt;/p&gt;&#xA;&lt;p&gt;Traditional Java SAX parsers offer efficient streaming parsing for XML and HTML, but they lack an ergonomic way to traverse or manipulate elements like a DOM parser. Meanwhile, standard DOM parsers, such as &lt;code&gt;Jsoup.parse()&lt;/code&gt;, require loading the entire document into memory, which may be inefficient for large files.&lt;/p&gt;&#xA;&lt;h2 id=&#34;solution&#34;&gt;Solution&lt;/h2&gt;&#xA;&lt;p&gt;Use the &lt;a href=&#34;https://jsoup.org/apidocs/org/jsoup/parser/StreamParser.html&#34;&gt;&lt;code&gt;StreamParser&lt;/code&gt;&lt;/a&gt;, which allows you to parse parsing an HTML or XML document in an event driven hybrid DOM + SAX style. Elements are emitted as they are completed, enabling efficient memory use and incremental processing. This hybrid approach allows you to process elements as they arrive, including their children and ancestors, while still leveraging jsoup’s intuitive API.&lt;/p&gt;&#xA;&lt;p&gt;This makes StreamParser a viable alternative to traditional SAX parsers while providing a more ergonomic and familiar API. 
And &lt;a href=&#34;https://jsoup.org/&#34;&gt;jsoup’s robust handling of malformed HTML and XML&lt;/a&gt; ensures that real-world documents can be processed effectively.&lt;/p&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;p&gt;&lt;a href=&#34;https://jsoup.org/cookbook/input/streamparser-dom-sax&#34;&gt;Link&lt;/a&gt;&lt;/p&gt;&#xA;&#xA;    </content>
  </entry>
  <entry>
    <title>Convert Microsoft Word to Plain Text</title>
    <link href="https://jhedley.com/tools/convert-word-to-plain-text"></link>
    <id>https://jhedley.com/tools/convert-word-to-plain-text</id>
    <updated>2025-03-02T04:09:49Z</updated>
    <summary>Convert Microsoft Word text to clean plain text. Fix smart quotes, em-dashes, ellipses, and more. Simple, fast, and web-friendly.</summary>
    <content type="html">&lt;div class=&#34;meta pubdate&#34;&gt;&lt;time datetime=&#34;2004-05-04&#34;&gt;May 4, 2004&lt;/time&gt;&lt;/div&gt;&#xA;&lt;!-- BEGIN SNIPPET --&gt;&#xA;&lt;form name=&#34;wordCleaner&#34; onsubmit=&#34;return false;&#34;&gt; &#xA;&lt;textarea id=&#34;wordInput&#34;&gt;• “Double Quotes”,&#xA;• ‘Single quotes’,&#xA;• Ellipsis …,&#xA;• em-dash —&lt;/textarea&gt; &#xA;&lt;textarea id=&#34;wordOutput&#34;&gt;&lt;/textarea&gt;&lt;br/&gt;&#xA;&lt;p&gt;&lt;button id=&#34;cleanButton&#34;&gt;Clean&lt;/button&gt;&lt;/p&gt;&#xA;&lt;/form&gt;&#xA;&#xA;&lt;!-- END SNIPPET  --&gt;&#xA;&lt;hr/&gt;&#xA;&lt;p&gt;&lt;i&gt;This is a repost of an entry from &lt;a href=&#34;//web.archive.org/web/20040602082728/http://jon.hedley.net/convert-ms-word-to-plain-text&#34;&gt;2004&lt;/a&gt;(!). This&#xA;Word-cleaning functionality is showing up in more and more&#xA;web editors, and most sites actually support UTF-8, but people might still find this useful.&lt;/i&gt;&lt;/p&gt;&#xA;&lt;p&gt;Most of the time when I’m writing content for the web (for this blog, or a forum comment, or whatever), I’ll write in Microsoft Word for the spell check and other features that aren’t in a standard &lt;code&gt;textarea&lt;/code&gt; widget, and then I’ll cut and paste into the form on the site.&lt;/p&gt;&#xA;&lt;p&gt;The problem is that this carries all the high characters (“smart-quotes” and the like) that MS Word makes straight through to the site — and unfortunately many sites still aren’t set up to handle them. They expect plain (“Latin”) text.&lt;/p&gt;&#xA;&lt;p&gt;&lt;b&gt;A solution&lt;/b&gt;: this script converts text copied from MS Word into plain text. 
Paste your input into the top box,&#xA;press &lt;b&gt;Clean&lt;/b&gt;, and the input will be scrubbed and sent to the lower box.&#xA;(If you want to clean up Word HTML, rather than just create plain text, I suggest that you use &lt;a href=&#34;//infohound.net/tidy/&#34; target=&#34;_top&#34;&gt;HTML Tidy&lt;/a&gt; with the “clean” and “Word 2000” boxes checked.)&lt;/p&gt;&#xA;&lt;p&gt;Web Developers: feel free to use this code on your own forms to clean your users’ input (although you’d probably be better off doing it server-side). Just change the &lt;code&gt;onClick()&lt;/code&gt; method to convert the text in place.&lt;/p&gt;&#xA;&#xA;    </content>
  </entry>
</feed>