A History of HTML Parsing at Cloudflare: Part 1

0 Comments

To coincide with the launch of streaming HTML rewriting functionality for Cloudflare Workers we are open sourcing the Rust HTML rewriter (LOL HTML) used to back the Workers HTMLRewriter API. We also thought it was about time to review the history of HTML rewriting at Cloudflare.

The first blog post will explain the basics of a streaming HTML rewriter and our particular requirements. We start around 8 years ago by describing the group of ‘ad-hoc’ parsers that were created with specific functionality such as to rewrite e-mail addresses or minify HTML. By 2016 the state machine defined in the HTML5 specification could be used to build a single spec-compliant HTML pluggable rewriter, to replace the existing collection of parsers. The source code for this rewriter is now public and available here: https://github.com/cloudflare/lazyhtml

The second blog post will describe the next iteration of rewriter. With the launch of the edge compute platform Cloudflare Workers we came to realise that developers wanted the same HTML rewriting capabilities with a JavaScript API. The post describes the thoughts behind a low latency streaming HTML rewriter with a CSS-selector based API. We open-sourced the Rust library as it can also be used as a stand-alone HTML rewriting/parsing library.

How useful was this post?

Click on a star to rate it!

Average rating / 5. Vote count:

No votes so far! Be the first to rate this post.

We are sorry that this post was not useful for you!

Let us improve this post!

Categories: