I've just quickly put together a little script that computes the difference between two texts in a human-readable format. The output you get from running diff is a bit difficult for a human being to understand, and I wanted something more "humane" that quickly summarises what's different on one simple line. E.g. "Added 2 lines, changed 1 line".
This little script is going to be part of an undo function in the new CMS that I'm working on. Instead of just pinpointing which revision date you want to go back to, you'll also be able to see what the differences were between each revision in the undo history for the CMS.
It's important to note that my target usage is for a CMS where the texts to compare are average chunks of HTML. The script works like this:
>>> from humanreadablediff import compare
>>> before = open('version1.1.txt').read()
>>> after = open('version1.2.txt').read()
>>> compare(before, before)
No difference
>>> compare(before, after)
Added 2 lines, removed 1 line
>>> compare(after, before)
Added 1 line, removed 2 lines
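For readers who want to experiment, here is a minimal sketch of how a one-line summary like this could be produced with Python's standard difflib module. It is not the code in humanreadablediff.py: the function name summarise is hypothetical, it returns the summary as a string, and it only counts added and removed lines without trying to detect changed ones.

```python
import difflib


def summarise(before, after):
    """Return a one-line summary of line-level differences (hypothetical helper)."""
    a = before.splitlines()
    b = after.splitlines()
    added = removed = 0
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A 'replace' block both removes old lines and adds new ones.
        if tag in ('delete', 'replace'):
            removed += i2 - i1
        if tag in ('insert', 'replace'):
            added += j2 - j1
    parts = []
    for verb, n in (('added', added), ('removed', removed)):
        if n:
            parts.append('%s %d line%s' % (verb, n, '' if n == 1 else 's'))
    if not parts:
        return 'No difference'
    text = ', '.join(parts)
    return text[0].upper() + text[1:]


print(summarise("a\nb\nc", "a\nx\ny\nc"))  # Added 2 lines, removed 1 line
```

Swapping the two arguments swaps the counts, which matches the behaviour shown in the session above.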
To see it in action, use the test page.
You can download it here: humanreadablediff.py
Questions and challenges
Sometimes it's not easy to tell a change apart from a remove+add. If, for example, you start with:
Peter
David
Andrew
and change the text to:
Petter
David
Zahid
The result should be "Added 1 line, removed 1 line, changed 1 line", shouldn't it? My script claims to understand and spot that.
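One way to draw the line between a change and a remove+add is to compare each replaced line with its replacement and treat the pair as a change only when the two are similar enough. The sketch below illustrates that idea with difflib; it is an assumption about how this could be done, not the distance calculation used in humanreadablediff.py. The function name classify and the 0.6 threshold are hypothetical.

```python
import difflib

SIMILARITY_THRESHOLD = 0.6  # assumed cut-off, not taken from humanreadablediff.py


def classify(before, after, threshold=SIMILARITY_THRESHOLD):
    """Count added, removed and changed lines (hypothetical helper)."""
    a, b = before.splitlines(), after.splitlines()
    counts = {'added': 0, 'removed': 0, 'changed': 0}
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'delete':
            counts['removed'] += i2 - i1
        elif tag == 'insert':
            counts['added'] += j2 - j1
        elif tag == 'replace':
            old, new = a[i1:i2], b[j1:j2]
            # Pair replaced lines by position; similar pairs count as changes.
            for x, y in zip(old, new):
                ratio = difflib.SequenceMatcher(None, x, y, autojunk=False).ratio()
                if ratio >= threshold:
                    counts['changed'] += 1
                else:
                    counts['removed'] += 1
                    counts['added'] += 1
            # Unpaired leftovers are plain removals or additions.
            counts['removed'] += max(0, len(old) - len(new))
            counts['added'] += max(0, len(new) - len(old))
    return counts


print(classify("Peter\nDavid\nAndrew", "Petter\nDavid\nZahid"))
```

With these inputs, "Peter" and "Petter" are similar enough to count as one changed line, while "Andrew" and "Zahid" share almost nothing and count as one removal plus one addition. Where exactly to put the threshold is a judgement call and would need testing against real CMS content.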
Another challenge is of course word wrapping. Imagine a text which is just one long line of about 240 characters. When you view it in a small textarea (typical of a CMS) it will appear to be 3 lines and if you make 3 changes in what to you appears to be three different lines, you'll expect the result "Changed 3 lines". I think I've got that under control too. Have a play with this text and play with the word wrap number.
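The word-wrap idea can be sketched by re-wrapping both texts to the assumed textarea width before comparing them line by line. This is an illustration only: wrap_lines and the width of 80 are hypothetical, and the example deliberately uses same-length edits so that the wrapping points do not shift. Edits that change the length of a line can reflow everything after them, which a real implementation would have to handle.

```python
import textwrap


def wrap_lines(text, width=80):
    """Re-wrap every line of text to at most `width` characters (hypothetical helper)."""
    wrapped = []
    for line in text.splitlines():
        # textwrap.wrap returns [] for an empty line, so keep blank lines explicitly.
        wrapped.extend(textwrap.wrap(line, width=width) or [''])
    return wrapped


before = ' '.join(['word'] * 48)   # one long line of 239 characters
words = before.split()
for index in (0, 20, 40):          # one same-length edit in each visual line
    words[index] = 'WORD'
after = ' '.join(words)

before_lines = wrap_lines(before)
after_lines = wrap_lines(after)
differing = sum(1 for x, y in zip(before_lines, after_lines) if x != y)
print(len(before_lines), differing)  # 3 3
```

Compared without wrapping, the same two texts are a single line each, so the result would be one changed line rather than three.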
Comments
I am not sure of your result for Example1.
Going from:
Peter
David
Andrew
To:
Peter
David
Zahid
Can be done with either:
A.1) Locate the Third line.
A.2) Change the line to be Zahid.
Or:
B.1) Locate the second line.
B.2) remove the next line.
B.3) Add a line after the current, of Zahid.
You could do sequence A followed by sequence B but sequence B after sequence A would not change the text.
So, the correct result for me would be either:
Removed 1 line, added 1 line.
Or:
Changed 1 line.
But not both.
Cheers, Paddy.
Paddy, the line that was changed was the first line, from "Peter" to "Petter", so this is OK.
I have more trouble understanding the 0'th example though.
It looks to me like either 2 added and 1 removed OR 2 changed and 1 added. One thing is for sure, the difference shows in THREE lines, so the explanation's numbers need to add up. What am I missing here?
Good point, Jan. I'll adjust the distance calculation with this as an example. The numbers I used are rough guesses and I wanted to test my way to the best match.
Fixed that 0th example now.