The moving parts

html5lib consists of a number of components, which are responsible for handling its features.

Parsing uses a tree builder to generate a tree, the in-memory representation of the document. Several tree representations are supported, as are translations to other formats via tree adapters. The tree may be translated to a token stream with a tree walker, from which HTMLSerializer produces a stream of bytes. The token stream may also be transformed by use of filters to accomplish tasks like sanitization.

Tree builders

The parser reads HTML by tokenizing the content and building a tree that the user can later access. html5lib can build three types of trees:

  • etree - this is the default; builds a tree based on xml.etree.ElementTree, which can be found in the standard library. Whenever possible, the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x) is used.
  • dom - builds a tree based on xml.dom.minidom.
  • lxml - uses the lxml.etree implementation of the ElementTree API. The performance gains are relatively small compared to using the accelerated ElementTree module.

You can specify the builder by name when using the shorthand API:

import html5lib
with open("mydocument.html", "rb") as f:
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

To get a builder class by name, use the getTreeBuilder() function.

When instantiating a HTMLParser object, you must pass a tree builder class via the tree keyword attribute:

import html5lib
TreeBuilder = html5lib.getTreeBuilder("dom")
parser = html5lib.HTMLParser(tree=TreeBuilder)
minidom_document = parser.parse("<p>Hello World!")

The implementation of builders can be found in html5lib/treebuilders/.

Tree walkers

In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it. html5lib provides walkers for etree, dom, and lxml trees, as well as genshi markup streams.

The implementation of walkers can be found in html5lib/treewalkers/.

html5lib provides HTMLSerializer for generating a stream of bytes from a token stream, and several filters which manipulate the stream.

HTMLSerializer

The serializer lets you write HTML back as a stream of bytes.

>>> import html5lib
>>> element = html5lib.parse('<p xml:lang="pl">Witam wszystkich')
>>> walker = html5lib.getTreeWalker("etree")
>>> stream = walker(element)
>>> s = html5lib.serializer.HTMLSerializer()
>>> output = s.serialize(stream)
>>> for item in output:
...   print("%r" % item)
'<p'
' '
'xml:lang'
'='
'pl'
'>'
'Witam wszystkich'

You can customize the serializer behaviour in a variety of ways. Consult the HTMLSerializer documentation.

Filters

html5lib provides several filters:

To use a filter, simply wrap it around a token stream:

>>> import html5lib
>>> from html5lib.filters import sanitizer
>>> dom = html5lib.parse("<p><script>alert('Boo!')", treebuilder="dom")
>>> walker = html5lib.getTreeWalker("dom")
>>> stream = walker(dom)
>>> clean_stream = sanitizer.Filter(stream)

Tree adapters

Tree adapters can be used to translate between tree formats. Two adapters are provided by html5lib:

Encoding discovery

Parsed trees are always Unicode. However a large variety of input encodings are supported. The encoding of the document is determined in the following way:

  • The encoding may be explicitly specified by passing the name of the encoding as the encoding parameter to the parse() method on HTMLParser objects.
  • If no encoding is specified, the parser will attempt to detect the encoding from a <meta> element in the first 512 bytes of the document (this is only a partial implementation of the current HTML specification).
  • If no encoding can be found and the chardet library is available, an attempt will be made to sniff the encoding from the byte pattern.
  • If all else fails, the default encoding will be used. This is usually Windows-1252, which is a common fallback used by Web browsers.