HTML diffs

Challenges

There are three challenges involved in creating HTML diffs:

  1. Parse the HTML properly
  2. Use a suitable diff algorithm
  3. Express the results as a sensible HTML tree

Sensible HTML is a sliding scale as well as an oxymoron.

A related problem is ‘blame’ in HTML. (You can actually implement diff in terms of suitable blame primitives, which is probably the best way to go about it.)

Capturing the raw results of Step 2, before they are rendered as HTML in Step 3, may also be useful.
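As a rough illustration of Steps 1 and 2, and of capturing the Step 2 result as plain data, here is a minimal sketch using only the Python standard library. The names Tokenizer, tokenize and raw_diff are hypothetical, and difflib.SequenceMatcher stands in for "a suitable diff algorithm" purely for convenience; nothing here is taken from the tools discussed below.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Flatten HTML into a list of hashable tokens (Step 1)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are kept in source order; see the granularity question.
        self.tokens.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Split text into words so that the diff is word-granular.
        for word in data.split():
            self.tokens.append(("text", word))


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def raw_diff(old_html, new_html):
    """Return the non-equal opcodes of Step 2, before any HTML rendering."""
    old, new = tokenize(old_html), tokenize(new_html)
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (op, old[i1:i2], new[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


if __name__ == "__main__":
    for change in raw_diff('<p><img src="test1.jpg" /></p>',
                           '<p><img src="test2.jpg" /></p>'):
        print(change)
```

A flat token list discards the tree structure, which is exactly why Step 3 is hard: the opcodes say which tokens changed, not where a wrapper element may legally be placed.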

Failures

The lxml.html.diff module fails at least Step 3: see Bug #315511.

Candid Dauth's htmldiff also produces malformed HTML:

Old:

<p><img src="test1.jpg" /></p>

New:

<p><img src="test2.jpg" /></p>

Output:

<span class="diff-html-removed" id="removed-htmldiff-0"><img
src="test1.jpg"></img></span><p><span class="diff-html-added"
id="added-htmldiff-0"><img src="test2.jpg"></img></span></p>

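The removed <img> is emitted outside the <p> that contained it, and each <img> is given an </img> end tag, which HTML does not allow for void elements. This ought to fail a Schematron-style DOM check.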

Aaron Swartz’s htmldiff.py produces output whose tags are not even properly nested:

Old:

<p class=a>example</p>

New:

<p class=b>example</p>

Output:

<del class="diff modified"><p class=a></del><ins class="diff modified"><p class=b></ins>example</p>
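In that output each <del> and <ins> wraps only a <p> start tag, so the elements overlap instead of nesting. A check for this kind of failure can be sketched with the standard library parser. This is an assumption-laden sketch, not a Schematron check: NestingChecker and nesting_errors are hypothetical names, the VOID set is typed in by hand, and the check is deliberately stricter than HTML because it ignores optional end tags and treats self-closing syntax as opening nothing.

```python
from html.parser import HTMLParser

# Void elements never take an end tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class NestingChecker(HTMLParser):
    """Record end tags that do not close the innermost open element."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <img /> is treated as opening nothing.
        pass

    def handle_endtag(self, tag):
        if tag in VOID:
            self.errors.append(f"end tag for void element <{tag}>")
        elif self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.errors.append(f"</{tag}> does not match the open element")


def nesting_errors(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    unclosed = [f"<{tag}> is never closed" for tag in checker.stack]
    return checker.errors + unclosed


if __name__ == "__main__":
    output = ('<del class="diff modified"><p class=a></del>'
              '<ins class="diff modified"><p class=b></ins>example</p>')
    for error in nesting_errors(output):
        print(error)
```

Run against the first failure above, the same check flags the </img> end tags; it says nothing about the removed <img> having moved outside its <p>, which needs a comparison against the input tree.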

For the <p><img> case above, the Bicking–Cyganiak htmldiff fails to recognize any change in the document at all. It does, however, do rather well on the case that lxml.html.diff failed so badly on.

Useful output

In some cases it isn't clear what useful output an HTML diff tool should produce. For example, when a <title> element or an attribute changes, how should the output indicate that? This implies that rendering an HTML diff is quite environment-specific, whereas generating the raw diff that powers the rendering is a general problem.
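One way to keep the general part separate from the environment-specific part is to report an attribute change as plain data and leave the rendering to the caller. The following is a minimal sketch under that assumption; attr_delta and its record layout are hypothetical, not the format of any existing tool.

```python
def attr_delta(old, new):
    """Describe an attribute change as data, leaving rendering to the caller.

    `old` and `new` map attribute names to values for the same element.
    """
    return {
        "removed": {k: old[k] for k in old if k not in new},
        "added": {k: new[k] for k in new if k not in old},
        "changed": {
            k: (old[k], new[k])
            for k in old
            if k in new and old[k] != new[k]
        },
    }


if __name__ == "__main__":
    print(attr_delta({"class": "a"}, {"class": "b"}))
```

A visual renderer might turn such a record into a tooltip or a margin note, while a review tool might list it in a side panel; neither choice needs to be baked into the diff itself.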

Sometimes granularity is a problem too. For example, if the lexical order of an element's attributes changes but the mapping from attribute names to values does not, should that be captured? And if so, how?
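The distinction can at least be detected cheaply: compare the attributes once as an ordered list and once as a mapping. A minimal sketch, assuming the standard library parser and looking only at the first start tag of each input; FirstTag, first_attrs and compare_attrs are hypothetical names.

```python
from html.parser import HTMLParser


class FirstTag(HTMLParser):
    """Capture the attributes of the first start tag, in source order."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if self.attrs is None:
            self.attrs = attrs


def first_attrs(html):
    parser = FirstTag()
    parser.feed(html)
    parser.close()
    return parser.attrs or []


def compare_attrs(old_html, new_html):
    """Classify the attribute difference between two start tags."""
    old, new = first_attrs(old_html), first_attrs(new_html)
    if dict(old) != dict(new):
        return "changed"    # the name-to-value mapping differs
    if old != new:
        return "reordered"  # same mapping, different lexical order
    return "same"


if __name__ == "__main__":
    print(compare_attrs('<a href="/x" class="y">', '<a class="y" href="/x">'))
```

Whether a "reordered" result is worth reporting is still a policy decision for the tool; the sketch only shows that the two levels of granularity can be told apart.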