Cheerio is a fantastic Node library for parsing HTML, manipulating it, and serializing it back out. But you can also use it just to parse HTML and pluck out what you need. We use it that way to prepare the text that goes into the search index for our site. It basically works like this:
const body = await getBody('http://localhost:4002' + eachPage.path)
const $ = cheerio.load(body)
const title = $('h1').text()
const intro = $('p.intro').text()
...
But it hit me: can we speed that up? cheerio actually ships with two different parsers: parse5, which is the default, and htmlparser2, which is what you get with xmlMode.
One is faster and one is more strict.
But I wanted to see this in a real-world example.
So I made two runs where I used:
const $ = cheerio.load(body)
in one run, and:
const $ = cheerio.load(body, { xmlMode: true })
in another.
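The post doesn't show how each run was timed, so here is a minimal sketch of one way to do it. The helper names timeParse and timeAll are made up for illustration, and the parser is passed in as a function so the same harness can wrap either variant of cheerio.load. It assumes a Node version where performance is available as a global.

```javascript
// Hypothetical helper: time a single call to `parse(body)` in milliseconds.
// `parse` is whatever you want to measure, for example
// (body) => cheerio.load(body) or (body) => cheerio.load(body, { xmlMode: true })
function timeParse(parse, body) {
  const t0 = performance.now()
  const result = parse(body)
  const t1 = performance.now()
  return { result, ms: t1 - t0 }
}

// Hypothetical helper: time `parse` once per page and return the durations,
// ready to be written out one number per line.
function timeAll(parse, bodies) {
  return bodies.map((body) => timeParse(parse, body).ms)
}
```

Writing the returned durations to load.txt for one run and load-xmlmode.txt for the other, one number per line, would give two files that can be compared afterwards.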
After having parsed 1,635 pages of HTML of various sizes the results are:
FILE: load.txt MEAN: 13.19457640586797 MEDIAN: 10.5975
FILE: load-xmlmode.txt MEAN: 3.9020372860635697 MEDIAN: 3.1020000000000003
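How those summary lines were produced isn't shown, so the following is only a sketch of the arithmetic, not the script that was actually used. It assumes each file contains one duration per line, and the function names are invented for this example.

```javascript
// Assumption: `text` is the content of a timings file, one number per line.
function parseTimings(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line !== '')
    .map(Number)
}

// Arithmetic mean of a non-empty array of numbers.
function mean(numbers) {
  return numbers.reduce((sum, n) => sum + n, 0) / numbers.length
}

// Median: the middle value of the sorted array, or the average of the
// two middle values when the count is even.
function median(numbers) {
  const sorted = [...numbers].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid]
}

// Build one summary line in the same shape as the output above.
function summarize(file, text) {
  const numbers = parseTimings(text)
  return `FILE: ${file} MEAN: ${mean(numbers)} MEDIAN: ${median(numbers)}`
}
```

Feeding it the content of each timings file, read with something like fs.readFileSync(file, 'utf8'), gives one line per file. Looking at the median next to the mean is useful here because a handful of very large pages can pull the mean up a lot.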
So, using {xmlMode: true} leads to roughly a 3.4x speedup, on both the mean and the median.
I think it pretty much confirms the original benchmark, but now I know based on a real application.