Benchmark comparison of Elasticsearch highlighters

Wednesday, Jul 5, 2023
Elasticsearch

tl;dr; fvh is marginally faster than unified and unified is a bit faster than plain.

When you send a full-text search query to Elasticsearch, you can specify whether, and how, it should highlight the matched terms with HTML tags. E.g.


The correct way to index data into <mark>Elasticsearch</mark> with (Python) <mark>elasticsearch</mark>-dsl

Among other configuration options, you can pick one of 3 different highlighter algorithms:

unified (default)
plain
fvh

The last one, fvh, requires that you store more at index time (in particular, you need to add term_vector="with_positions_offsets" to the mapping). In a previous benchmark I did, the total document size on disk, as reported by http://localhost:9200/_cat/indices?v, grew by 38%.
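For illustration, this is roughly what such a mapping looks like as a plain request body. It's a minimal sketch: the title and body field names are assumptions for the example, and only the term_vector setting itself comes from this post.

```python
import json

# Hypothetical mapping for illustration. The field names are assumptions;
# the term_vector value is the one fvh needs.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "term_vector": "with_positions_offsets"},
            "body": {"type": "text", "term_vector": "with_positions_offsets"},
        }
    }
}

# This is the JSON body you'd send when creating the index.
print(json.dumps(mapping, indent=2))
```

Because the setting lives in the mapping, switching an existing field over to fvh means reindexing, not just changing the query.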

I bombarded my local Elasticsearch 7.7 instance with thousands of queries collected from logs. Some single-word, some multi-word. The fields it highlights are things like title (~5-50 words) and body (~100-2,000 words).
Basically, I edited the search query to test one highlighter type at a time. For example:


search_query = search_query.highlight(
-   "title", fragment_size=120, number_of_fragments=1, type="unified"
+   "title", fragment_size=120, number_of_fragments=1, type="plain"
)

...etc.
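A timing loop for this kind of comparison might look something like the sketch below. It is not the script I actually ran: the match query, the build_search_body and benchmark helpers, and the fake_search stand-in are all assumptions made so the example runs without a cluster. Only the highlight options (type, fragment_size, number_of_fragments) mirror the snippet above.

```python
import statistics
import time

HIGHLIGHTER_TYPES = ("unified", "plain", "fvh")


def build_search_body(text, highlighter_type):
    """Build a search request body that highlights `title` with the given type."""
    return {
        # Assumed query; the real one also involved a scoring function.
        "query": {"match": {"title": text}},
        "highlight": {
            "fields": {
                "title": {
                    "type": highlighter_type,
                    "fragment_size": 120,
                    "number_of_fragments": 1,
                }
            }
        },
    }


def benchmark(search, queries, highlighter_type):
    """Run each query once and return (mean, median) in milliseconds."""
    times_ms = []
    for text in queries:
        body = build_search_body(text, highlighter_type)
        t0 = time.perf_counter()
        search(body)
        t1 = time.perf_counter()
        times_ms.append((t1 - t0) * 1000)
    return statistics.mean(times_ms), statistics.median(times_ms)


def fake_search(body):
    # Stand-in so the sketch runs anywhere; swap in a real client call.
    return {"hits": {"hits": []}}


for highlighter_type in HIGHLIGHTER_TYPES:
    mean, median = benchmark(fake_search, ["foo", "foo bar"], highlighter_type)
    print(f"{highlighter_type.upper()}: mean {mean:.3f}ms, median {median:.3f}ms")
```

Passing the search function in as an argument keeps the timing code separate from whichever client you use to talk to Elasticsearch.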

After running 1,000 searches, 3 separate times for each highlighter type, and timing each search, I recorded the following:

(milliseconds per query, lower is better)

UNIFIED:
  MEAN  18.1ms
  MEDIAN 19.0ms

PLAIN:
  MEAN  24.5ms
  MEDIAN 27.5ms

FVH:
  MEAN  16.1ms
  MEDIAN 17.6ms

A thin win for fvh over unified.
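To put the gaps in relative terms, here's a small calculation using the numbers above. The percent_slower helper is just something made up for this example. Going by the means, fvh comes out about 11% faster than unified, and plain about 35% slower.

```python
# Mean and median milliseconds per query, copied from the results above
results = {
    "unified": {"mean": 18.1, "median": 19.0},
    "plain": {"mean": 24.5, "median": 27.5},
    "fvh": {"mean": 16.1, "median": 17.6},
}


def percent_slower(candidate, baseline):
    """Percentage by which candidate is slower than baseline (negative = faster)."""
    return (candidate - baseline) / baseline * 100


baseline = results["unified"]
for name in ("plain", "fvh"):
    for stat in ("mean", "median"):
        diff = percent_slower(results[name][stat], baseline[stat])
        print(f"{name} vs unified ({stat}): {diff:+.1f}%")
```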

Conclusion

Conclusion? Or should I say "Caveats" instead? There's a lot more to it than raw speed. In this benchmark, it takes ~20 milliseconds to search 2 different indexes, each with a scoring function and each containing between 1,000 and 5,000 documents with hundreds of thousands of words. So the differences between the highlighters are pretty minor.

Each highlighter behaves slightly differently too, so you'd have to study the output a bit more carefully to get a better feel for whether it works the way you and your team prefer.

If there's any conclusion, other than the boring usual "it depends on your setup and preferences", it's that the performance difference is noticeable but won't blow you away. It makes sense that fvh is a bit faster because you've paid for it by indexing more upfront (the offsets), at the expense of storage.
