Skip to content

kylehowells/swift-justhtml

Repository files navigation

Swift-JustHTML

A dependency-free HTML5 parser for Swift, following the WHATWG HTML parsing specification.

Swift port of justhtml (Python) and justjshtml (JavaScript).

Features

  • Full HTML5 Compliance - Passes all 1,831 html5lib tree construction tests
  • Zero Dependencies - Pure Swift using only standard library and Foundation
  • Cross-Platform - macOS, iOS, tvOS, watchOS, visionOS, and Linux
  • CSS Selectors - Query documents using standard CSS selector syntax
  • Multiple Output Formats - Serialize to HTML, plain text, or Markdown
  • Streaming API - Memory-efficient event-based parsing
  • Fragment Parsing - Parse HTML fragments in specific contexts

Installation

Swift Package Manager

Add swift-justhtml to your Package.swift:

dependencies: [
    .package(url: "https://github.com/kylehowells/swift-justhtml.git", from: "0.3.0")
]

Then add it to your target's dependencies:

targets: [
    .target(
        name: "YourTarget",
        dependencies: ["justhtml"]
    )
]

Xcode

  1. File > Add Package Dependencies...
  2. Enter: https://github.com/kylehowells/swift-justhtml.git
  3. Select version: 0.3.0 or later

Usage

Basic Parsing

import justhtml

// Parse an HTML document
let doc = try JustHTML("<html><body><p>Hello, World!</p></body></html>")

// Access the document tree
print(doc.root.children)  // [<html>]

CSS Selectors

// Query with CSS selectors
let paragraphs = try doc.query("p")
let byClass = try doc.query(".intro")
let byId = try doc.query("#header")
let complex = try doc.query("nav > ul > li > a[href]")

// Check if a node matches a selector
let matches = try node.matches(".highlight")

Extracting Content

// Get plain text
let text = doc.toText()

// Serialize to HTML
let html = doc.toHTML()
let prettyHtml = doc.toHTML(pretty: true, indentSize: 4)

// Convert to Markdown
let markdown = doc.toMarkdown()

Fragment Parsing

// Parse HTML as if inside a specific element
let ctx = FragmentContext("tbody")
let fragment = try JustHTML("<tr><td>Cell</td></tr>", fragmentContext: ctx)

Streaming API

// Memory-efficient event-based parsing
for event in HTMLStream("<p>Hello</p>") {
    switch event {
    case .start(let tag, let attrs):
        print("Start: \(tag)")
    case .end(let tag):
        print("End: \(tag)")
    case .text(let content):
        print("Text: \(content)")
    case .comment(let text):
        print("Comment: \(text)")
    case .doctype(let name, let publicId, let systemId):
        print("Doctype: \(name)")
    }
}

Error Handling

// Strict mode - throws on first parse error
do {
    let doc = try JustHTML("<p>Unclosed", strict: true)
} catch let error as StrictModeError {
    print("Error: \(error.parseError.code)")
}

// Collect errors without throwing
let doc = try JustHTML("<p>Unclosed", collectErrors: true)
for error in doc.errors {
    print("\(error.line):\(error.column): \(error.code)")
}

Spec Compliance

swift-justhtml implements the WHATWG HTML parsing specification exactly and passes all tests from the official html5lib-tests suite (used by browser vendors), the same as justhtml.

Test Results

Test Suite Passed Failed
Tree Construction 1,831 0
Tokenizer 6,810 0
Serializer 230 0
Encoding 82 0
Total 8,953 0

Fuzz Testing

The parser has been fuzz tested with millions of randomized and malformed HTML documents to ensure it never crashes or hangs on any input:

  • Random data fuzzing with varying document sizes
  • Fragment context fuzzing
  • Deep nesting stress tests
  • Malformed tag and entity sequences

Run the fuzzer: swift test --filter fuzzTest

Performance

swift-justhtml is optimized for performance, matching or exceeding JavaScript implementations:

Parse Time

Implementation Parse Time Comparison
Swift 97ms -
JavaScript 99ms 1.02x slower
Python 398ms 4.1x slower

Benchmark: Parsing 2.5MB of HTML across 5 Wikipedia articles

See Benchmarks/BENCHMARK_RESULTS.md for detailed performance comparison.

Memory Usage

Implementation Peak RSS Comparison
Swift 103 MB -
Python 106 MB 1.03x more
JavaScript 226 MB 2.2x more

Benchmark: Average peak memory across 6 test files including 20MB synthetic HTML

See Benchmarks/MEMORY_RESULTS.md for detailed memory comparison.

Swift Library Comparison

Library html5lib Pass Rate Crashes/Hangs Dependencies
swift-justhtml 100% (1831/1831) None None
Kanna 94.4% (1542/1633) None libxml2
SwiftSoup 87.9% (1436/1633) Infinite loop on 197 tests swift-atomics
LilHTML 47.4% (775/1634) Crashes on 855 tests libxml2

See notes/comparison.md for detailed library comparison.

Platform Support

Platform Minimum Version
macOS 13.0+
iOS 16.0+
tvOS 16.0+
watchOS 9.0+
visionOS 1.0+
Linux Swift 6.0+

Documentation

License

MIT License - see LICENSE for details.

Credits