What English stop words overlap with JavaScript reserved keywords?

May 7, 2021
2 comments JavaScript, MDN

The list of stop words in Elasticsearch is:

a, an, and, are, as, at, be, but, by, for, if, in, into, 
is, it, no, not, of, on, or, such, that, the, their, 
then, there, these, they, this, to, was, will, with

The list of JavaScript reserved keywords is:

abstract, arguments, await, boolean, break, byte, case, 
catch, char, class, const, continue, debugger, default, 
delete, do, double, else, enum, eval, export, extends, 
false, final, finally, float, for, function, goto, if, 
implements, import, in, instanceof, int, interface, let, 
long, native, new, null, package, private, protected, 
public, return, short, static, super, switch, synchronized, 
this, throw, throws, transient, true, try, typeof, var, 
void, volatile, while, with, yield

That means that the overlap is:

for, if, in, this, with

And the remainder of the English stop words is:

a, an, and, are, as, at, be, but, by, into, is, it, no, 
not, of, on, or, such, that, the, their, then, there, 
these, they, to, was, will
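
The overlap and the remainder above can be reproduced with a few lines of Node. This is only a sketch that copies the two lists from this post as strings; the variable names are made up for the example.

```javascript
// Both lists are copied verbatim from the post above.
const stopWords = `a, an, and, are, as, at, be, but, by, for, if, in, into,
is, it, no, not, of, on, or, such, that, the, their,
then, there, these, they, this, to, was, will, with`
  .split(",")
  .map((word) => word.trim());

const reservedKeywords = new Set(
  `abstract, arguments, await, boolean, break, byte, case,
catch, char, class, const, continue, debugger, default,
delete, do, double, else, enum, eval, export, extends,
false, final, finally, float, for, function, goto, if,
implements, import, in, instanceof, int, interface, let,
long, native, new, null, package, private, protected,
public, return, short, static, super, switch, synchronized,
this, throw, throws, transient, true, try, typeof, var,
void, volatile, while, with, yield`
    .split(",")
    .map((word) => word.trim())
);

// Stop words that are also reserved keywords, and the ones that are not.
const overlap = stopWords.filter((word) => reservedKeywords.has(word));
const remainder = stopWords.filter((word) => !reservedKeywords.has(word));

console.log(overlap.join(", "));
console.log(remainder.join(", "));
```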

Why does this matter? It matters when you're writing a search engine on English text that is about JavaScript, such as MDN Web Docs. At the time of writing, you can search for this because there's a special case explicitly for that word. But you can't search for for, which is unfortunate.

But there's more! I think we should treat certain prototype words as "reserved" too, because they are important JavaScript words that should not be treated as stop words. For example...

How to simulate slow lazy chunk-loading in React

March 25, 2021
0 comments React, JavaScript

Suppose you have one of those React apps that lazy-load some chunk. That basically means it injects a .js static asset URL into the DOM and once it's downloaded by the browser, it carries on the React rendering with the new code loaded. Well, what if the network is really slow? In local development, it can be hard to simulate this. You can mess with the browser's DevTools to try to slow down the network, but even that can be too fast sometimes.

What I often do is, I take this:


const SettingsApp = React.lazy(() => import("./app"));

...and change it to this:


const SettingsApp = React.lazy(() =>
  import("./app").then((module) => {
    return new Promise((resolve) => {
      setTimeout(() => {
        resolve(module as any);
      }, 10000);
    });
  })
);

Now, it won't load that JS chunk until 10 seconds later. Only temporarily, in local development.

I know it's admittedly just a hack but it's nifty. Just don't forget to undo it when you're done simulating your snail-speed web app.

PS. That resolve(module as any); is for TypeScript. You can just change that to resolve(module); if it's regular JavaScript.
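
If you find yourself doing this often, the same trick can be pulled out into a tiny helper. This is just a sketch; delayed is a made-up name and not part of React. It waits for the original promise and then holds the value back for the given number of milliseconds.

```javascript
// Hypothetical helper: resolves with the same value as `promise`,
// but only `ms` milliseconds after that promise has resolved.
function delayed(promise, ms) {
  return promise.then(
    (value) => new Promise((resolve) => setTimeout(() => resolve(value), ms))
  );
}

// Usage inside a React app (not runnable on its own):
// const SettingsApp = React.lazy(() => delayed(import("./app"), 10000));

// Stand-alone demonstration with an already resolved promise.
delayed(Promise.resolve("chunk loaded"), 100).then((value) => {
  console.log(value);
});
```

Same caveat as before: it's for local development only, so remember to take it out again.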

Umlauts (non-ascii characters) with git on macOS

March 22, 2021
0 comments Python, macOS

I edit a file called files/en-us/glossary/bézier_curve/index.html and then type git status and I get this:

▶ git status
...
Changes not staged for commit:
  ...
    modified:   "files/en-us/glossary/b\303\251zier_curve/index.html"

...

What's that?! First of all, I actually had this wrapped in a Python script that uses GitPython to analyze the output of for change in repo.index.diff(None):. So I got...

FileNotFoundError: [Errno 2] No such file or directory: '"files/en-us/glossary/b\\303\\251zier_curve/index.html"'

What's that?!

At first, I thought it was something wrong with how I use GitPython and thought I could force some sort of conversion to UTF-8 with Python. That, and to strip the quotation parts with something like path = path[1:-1] if path.startswith('"') else path
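
For the curious, the \303\251 in that output is the two UTF-8 bytes of é written as octal escapes. A manual decode could look something like the sketch below. It's written in Node rather than Python, unquoteGitPath is a made-up name, and it only handles octal escapes plus \" and \\ (not \t, \n and friends), so treat it as an illustration rather than a complete parser.

```javascript
// A sketch of undoing git's default path quoting (core.quotePath=true).
// Assumption: inside the quotes, every non-ASCII byte is written as a
// three-digit octal escape, so the remaining characters are plain ASCII.
function unquoteGitPath(quoted) {
  if (!(quoted.startsWith('"') && quoted.endsWith('"'))) {
    return quoted; // not quoted, nothing to do
  }
  const inner = quoted.slice(1, -1);
  const bytes = [];
  for (let i = 0; i < inner.length; i++) {
    const octal = inner.slice(i + 1, i + 4);
    if (inner[i] === "\\" && /^[0-7]{3}$/.test(octal)) {
      bytes.push(parseInt(octal, 8)); // e.g. \303 becomes byte 0xC3
      i += 3;
    } else if (inner[i] === "\\" && i + 1 < inner.length) {
      bytes.push(inner.charCodeAt(i + 1)); // only correct for \" and \\
      i += 1;
    } else {
      bytes.push(inner.charCodeAt(i));
    }
  }
  return Buffer.from(bytes).toString("utf8");
}

console.log(
  unquoteGitPath('"files/en-us/glossary/b\\303\\251zier_curve/index.html"')
);
```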

After much googling and experimentation, what totally solved all my problems was to run:

▶ git config --global core.quotePath false

Now you get...:

▶ git status
...
Changes not staged for commit:
  ...
    modified:   files/en-us/glossary/bézier_curve/index.html

...

And that also means it works perfectly fine with any GitPython code that does something with the repo.index.diff(None) or repo.index.diff(repo.head.commit).

Also, I use the get-diff-action GitHub Action which would fail to spot files that contained umlauts but now I run this:


    steps:
       - uses: actions/checkout@v2
+
+      - name: Config git core.quotePath
+        run: git config --global core.quotePath false
+
       - uses: technote-space/get-diff-action@v4.0.6
         id: git_diff_content
         with:

In JavaScript (Node) which is fastest, generator function or a big array function?

March 5, 2021
0 comments Node, JavaScript

Sorry about the weird title of this blog post. Not sure what else to call it.

I have a function that recursively traverses the file system. You can iterate over this function to do something with each found file on disk. Silly example:


for (const filePath of walker("/lots/of/files/here")) {
  count += filePath.length;
}

The implementation looks like this:


const fs = require("fs");
const path = require("path");

function* walker(root) {
  const files = fs.readdirSync(root);
  for (const name of files) {
    const filepath = path.join(root, name);
    const isDirectory = fs.statSync(filepath).isDirectory();
    if (isDirectory) {
      yield* walker(filepath);
    } else {
      yield filepath;
    }
  }
}

But I wondered: is it faster to not use a generator function, since there might be an overhead in swapping from the generator to whatever callback does something with each yielded thing? A pure big-array function looks like this:


function walker(root) {
  const files = fs.readdirSync(root);
  const all = [];
  for (const name of files) {
    const filepath = path.join(root, name);
    const isDirectory = fs.statSync(filepath).isDirectory();
    if (isDirectory) {
      all.push(...walker(filepath));
    } else {
      all.push(filepath);
    }
  }
  return all;
}

It gets the same result/outcome.
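
A quick way to convince yourself of that is to run both variants against a small throwaway directory and compare what comes out. The sketch below does that; the names walkGenerator and walkArray and the temp directory layout are made up for the example.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Generator variant, same logic as in the post.
function* walkGenerator(root) {
  for (const name of fs.readdirSync(root)) {
    const filepath = path.join(root, name);
    if (fs.statSync(filepath).isDirectory()) {
      yield* walkGenerator(filepath);
    } else {
      yield filepath;
    }
  }
}

// Big-array variant, same logic as in the post.
function walkArray(root) {
  const all = [];
  for (const name of fs.readdirSync(root)) {
    const filepath = path.join(root, name);
    if (fs.statSync(filepath).isDirectory()) {
      all.push(...walkArray(filepath));
    } else {
      all.push(filepath);
    }
  }
  return all;
}

// Build a tiny directory tree with three files at different depths.
const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "walker-"));
fs.mkdirSync(path.join(tmp, "a", "b"), { recursive: true });
fs.writeFileSync(path.join(tmp, "top.txt"), "");
fs.writeFileSync(path.join(tmp, "a", "mid.txt"), "");
fs.writeFileSync(path.join(tmp, "a", "b", "deep.txt"), "");

const fromGenerator = [...walkGenerator(tmp)];
const fromArray = walkArray(tmp);
console.log(fromGenerator.length, fromArray.length);

// Clean up the throwaway directory.
fs.rmSync(tmp, { recursive: true, force: true });
```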

It's hard to measure this but I pointed it to some large directory with many files and did something silly with each one just to make sure it does something:


const label = "generator";
console.time(label);
let count = 0;
for (const filePath of walker(SEARCH_ROOT)) {
  count += filePath.length;
}
console.timeEnd(label);
const heapBytes = process.memoryUsage().heapUsed;
console.log(`HEAP: ${(heapBytes / 1024.0).toFixed(1)}KB`);

I ran it a bunch of times. After a while, the numbers settle and you get:

  • Generator function: (median time) 1.74s
  • Big array function: (median time) 1.73s

In other words, no speed difference.

Obviously building up a massive array in memory will increase the heap memory usage. Taking a snapshot at the end of the run and printing it each time, you can see that...

  • Generator function: (median heap memory) 4.9MB
  • Big array function: (median heap memory) 13.9MB

Conclusion

The potential swap overhead for a Node generator function is absolutely minuscule. At least in contexts similar to mine.

It's not unexpected that the generator function uses less heap memory because it doesn't build up a big array at all.

How MDN's site-search works

February 26, 2021
3 comments Web development, Django, Python, MDN, Elasticsearch

tl;dr; Periodically, the whole of MDN is built, by our Node code, in a GitHub Action. A Python script bulk-publishes this to Elasticsearch. Our Django server queries the same Elasticsearch via /api/v1/search. The site-search page is a static single-page app that sends XHR requests to the /api/v1/search endpoint. Search results' sort-order is determined by match and "popularity".

Jamstack'ing

The challenge with "Jamstack" websites is data that is so vast and dynamic that it doesn't make sense to build statically. Search is one of those. For the record, as of Feb 2021, MDN consists of 11,619 documents (aka. articles) in English, plus roughly another 40,000 translated documents. In English alone, there are 5.3 million words. So to build a good search experience we need to, as a static site build side-effect, index all of this in a full-text search database. Elasticsearch is one such database and it's good. In particular, Elasticsearch is something MDN is already quite familiar with because it's what was used from within the Django app when MDN was a wiki.

Note: MDN gets about 20k site-searches per day from within the site.

Build

When we build the whole site, it's a script that basically loops over all the raw content, applies macros and fixes, dumps one index.html (via React server-side rendering) and one index.json. The index.json contains all the fully rendered text (as HTML!) in blocks of "prose". It looks something like this:


{
  "doc": {
    "title": "DOCUMENT TITLE",
    "summary": "DOCUMENT SUMMARY",
    "body": [
      {
        "type": "prose",
        "value": {
          "id": "introduction",
          "title": "INTRODUCTION",
          "content": "<p>FIRST BLOCK OF TEXTS</p>"
        }
      },
      ...
    ],
    "popularity": 0.12345,
    ...
  }
}

You can see one here: /en-US/docs/Web/index.json

Indexing

Next, after all the index.json files have been produced, a Python script takes over and it traverses all the index.json files and based on that structure it figures out the title, summary, and the whole body (as HTML).
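
The real indexer is a Python script, but the shape of that step can be sketched in a few lines of JavaScript. Everything here is an assumption for illustration: the function name toSearchDocument is made up, and so is the idea that only "prose" blocks contribute to the body.

```javascript
// Hypothetical helper, mirroring the index.json shape shown earlier.
function toSearchDocument(indexJson) {
  const { title, summary, body, popularity } = indexJson.doc;
  // Assumption: only blocks of type "prose" carry searchable HTML.
  const html = body
    .filter((section) => section.type === "prose")
    .map((section) => section.value.content)
    .join("\n");
  return { title, summary, popularity, html };
}

const example = {
  doc: {
    title: "DOCUMENT TITLE",
    summary: "DOCUMENT SUMMARY",
    body: [
      {
        type: "prose",
        value: {
          id: "introduction",
          title: "INTRODUCTION",
          content: "<p>FIRST BLOCK OF TEXTS</p>",
        },
      },
      {
        type: "prose",
        value: {
          id: "more",
          title: "MORE",
          content: "<p>SECOND BLOCK OF TEXTS</p>",
        },
      },
    ],
    popularity: 0.12345,
  },
};

console.log(toSearchDocument(example));
```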

Next up, before sending this into the bulk-publisher in Elasticsearch it strips the HTML. It's a bit more than just turning <p>Some <em>cool</em> text.</p> to Some cool text. because it also cleans up things like <div class="hidden"> and certain <div class="notecard warning"> blocks.

One thing worth noting is that this whole thing runs roughly every 24 hours and then it builds everything. But what if, between two runs, a certain page has been removed (or moved)? How do you remove what was previously added to Elasticsearch? The solution is simple: it deletes and re-creates the index from scratch every day. The whole bulk-publish takes a while so right after the index has been deleted, the searches won't be that great. Someone could be unlucky in that they're searching MDN a couple of seconds after the index was deleted, while it's still being built up again.
It's an unfortunate reality but it's a risk worth taking for the sake of simplicity. Also, most people are searching for things in English and specifically the Web/ tree so the bulk-publishing is done in a way that the most popular content is bulk-published first and the rest is done after. Here's what the build output logs:

Found 50,461 (potential) documents to index
Deleting any possible existing index and creating a new one called mdn_docs
Took 3m 35s to index 50,362 documents. Approximately 234.1 docs/second
Counts per priority prefixes:
    en-us/docs/web                 9,056
    *rest*                         41,306

So, yes, for 3m 35s there's stuff missing from the index and some unlucky few will get fewer search results than they should. But we can optimize this in the future.

Searching

The way you connect to Elasticsearch is simply by a URL. It looks something like this:

https://USER:PASSWD@HASH.us-west-2.aws.found.io:9243

It's an Elasticsearch cluster managed by Elastic running inside AWS. Our job is to make sure that we put the exact same URL in our GitHub Action ("the writer") as we put it into our Django server ("the reader").
In fact, we have 3 Elastic clusters: Prod, Stage, Dev.
And we have 2 Django servers: Prod, Stage.
So we just need to carefully make sure the secrets are set correctly to match the right environment.

Now, in the Django server, we just need to convert a request like GET /api/v1/search?q=foo&locale=fr (for example) to a query to send to Elasticsearch. We have a simple Django view function that validates the query string parameters, does some rate-limiting, creates a query (using elasticsearch-dsl) and packages the Elasticsearch results back to JSON.

How we make that query is important. In here lies the most important feature of the search; how it sorts results.

In one simple explanation, the sort order is a combination of popularity and "matchness". The assumption is that most people want the popular content. For example, they search for foreach and mean to go to /en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/forEach not /en-US/docs/Web/API/NodeList/forEach, both of which contain forEach in the title. The "popularity" is based on Google Analytics pageviews which we download periodically and normalize into a floating-point number between 0 and 1. At the time of writing the scoring function does something like this:

rank = doc.popularity * 10 + search.score

This seems to produce pretty reasonable results.
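
To make the formula a bit more tangible, here's a small sketch that applies it to two hits. The popularity and score numbers are made up purely for illustration; only the formula itself comes from above.

```javascript
// rank = doc.popularity * 10 + search.score, as in the formula above.
function rank(hit) {
  return hit.popularity * 10 + hit.score;
}

// Made-up numbers, purely for illustration.
const hits = [
  {
    url: "/en-US/docs/Web/API/NodeList/forEach",
    popularity: 0.02,
    score: 9.1,
  },
  {
    url: "/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/forEach",
    popularity: 0.35,
    score: 8.7,
  },
];

// Highest rank first.
const sorted = [...hits].sort((a, b) => rank(b) - rank(a));
console.log(sorted.map((hit) => hit.url));
```

With these numbers the NodeList page has the slightly better text match, but the Array page still comes out on top because of its popularity.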

But there's more to the "matchness" too. Elasticsearch has its own API for defining boosting and the way we apply it is:

  • match phrase in the title: Boost = 10.0
  • match phrase in the body: Boost = 5.0
  • match in title: Boost = 2.0
  • match in body: Boost = 1.0

This is then applied on top of whatever else Elasticsearch does such as "Term Frequency" and "Inverse Document Frequency" (TF and IDF). This article is a helpful introduction.
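
Expressed as raw Elasticsearch query JSON, those four boosts could look roughly like the sketch below. The actual server builds its query with elasticsearch-dsl in Python, so this is not the real code; the field names title and body and the function name buildMatchQuery are assumptions.

```javascript
// Field names "title" and "body" are assumptions for this sketch.
function buildMatchQuery(q) {
  return {
    bool: {
      should: [
        { match_phrase: { title: { query: q, boost: 10.0 } } },
        { match_phrase: { body: { query: q, boost: 5.0 } } },
        { match: { title: { query: q, boost: 2.0 } } },
        { match: { body: { query: q, boost: 1.0 } } },
      ],
    },
  };
}

console.log(JSON.stringify(buildMatchQuery("foreach"), null, 2));
```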

We're most likely not done with this. There's probably a lot more we can do to tune this myriad of knobs and sliders to get the best possible ranking of documents that match.

Web UI

The last piece of the puzzle is how we display all of this to the user. The way it works is that developer.mozilla.org/$locale/search returns a static page that is blank. As soon as the page has loaded, it lazy-loads JavaScript that can actually issue the XHR request to get and display search results. The code looks something like this:


function SearchResults() {
  const [searchParams] = useSearchParams();
  const sp = createSearchParams(searchParams);
  // add defaults and stuff here
  const fetchURL = `/api/v1/search?${sp.toString()}`;

  const { data, error } = useSWR(
    fetchURL,
    async (url) => {
      const response = await fetch(url);
      // various checks on the response.status here
      return await response.json();
    }
  );

  // render 'data' or 'error' accordingly here

A lot of interesting details are omitted from this code snippet. You have to check it out for yourself to get a more up-to-date insight into how it actually works. But basically, the window.location (and pushState) query string drives the fetch() call and then all the component has to do is display the search results with some highlighting.

The /api/v1/search endpoint also runs a suggestion query as part of the main search query. This extracts interesting alternative search queries. These are filtered and scored and we issue "sub-queries" just to get a count for each. Now we can do one of those "Did you mean...". For example: search for intersections.

In conclusion

There are a lot of interesting, important, and careful details that are glossed over here in this blog post. It's a constantly evolving system and we're constantly trying to improve and perfect the system in a way that it fits what users expect.

A lot of people reach MDN via a Google search (e.g. mdn array foreach) but despite that, nearly 5% of all traffic on MDN is the site-search functionality. The /$locale/search?... endpoint is the most frequently viewed page of all of MDN. And having a good search engine that's reliable is nevertheless important. Owning and controlling the whole pipeline allows us to do specific things that are unique to MDN that other websites don't need. For example, we index a lot of raw HTML (e.g. <video>) and we have code snippets that need to be searchable.

Hopefully, the MDN site-search will go from being known as very limited to something that can genuinely help people get to the exact page better than Google can. Yes, it's worth aiming high!

What's lighter than ExpressJS?

February 25, 2021
0 comments Node, JavaScript

tl;dr; polka is the lightest Node HTTP server package.

Highly unscientific but nevertheless worth writing down. Lightest here refers to the eventual weight added to the node_modules directory which is a reflection of network and disk use.

When you write a serious web server in Node you probably don't care about which one is lightest. It's probably more important which ones are actively maintained, reliable, well documented, and generally "more familiar". However, I was interested in setting up a little Node HTTP server for the benefit of wrapping some HTTP endpoints for an integration test suite.

The test

In a fresh new directory, right after having run yarn init -y, run the yarn add ... and see how big the node_modules directory becomes afterward (du -sh node_modules).

The results

  1. polka: 116K
  2. koa: 1.7M
  3. express: 2.4M
  4. fastify: 8.0M

Conclusion

polka is the lightest. But I'm not so sure it matters. But it could if this has to be installed a lot. For example, in CI where you run that yarn install a lot. Then it might save quite a bit of electricity for the planet.

The best and simplest way to parse an RSS feed in Node

February 13, 2021
0 comments Node, JavaScript

There are a lot of 'rss' related NPM packages but I think I've found a combination that is great for parsing RSS feeds. Something that takes up minimal space in node_modules and works great. I think the killer combination is got and fast-xml-parser.

The code is impressively simple:


const got = require("got");
const parser = require("fast-xml-parser");

(async function main() {
  const buffer = await got("https://hacks.mozilla.org/feed/", {
    responseType: "buffer",
    resolveBodyOnly: true,
    timeout: 5000,
    retry: 5,
  });
  const feed = parser.parse(buffer.toString());
  for (const item of feed.rss.channel.item) {
    console.log({ title: item.title, url: item.link });
    break;
  }
})();


// Outputs...
// {
//   title: 'MDN localization update, February 2021',
//   url: 'https://hacks.mozilla.org/2021/02/mdn-localization-update-february-2021/'
// }

What I like about fast-xml-parser is that it has no dependencies. And it's tiny:

▶ du -sh node_modules/fast-xml-parser
104K    node_modules/fast-xml-parser

The got package is quite a bit larger and has more dependencies. But I still love it. It's proven itself to be very reliable and has a very pleasant API. Both packages support TypeScript too.

A particular detail I like about fast-xml-parser is that it doesn't try to do the downloading part too. This way, I can use my own preferred library and I could potentially write my own caching code if I want to protect against a flaky network.
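
As an idea of what such caching code could look like, here's a minimal in-memory sketch. withCache is a made-up name, and the download function is passed in (it could be a thin wrapper around the got call above), so nothing here depends on a particular HTTP library. If a fresh download fails, it falls back to the last good copy.

```javascript
// Hypothetical wrapper: caches downloaded text per URL in memory and
// falls back to a stale copy if a later download fails.
function withCache(fetchText, maxAgeMs) {
  const cache = new Map();
  return async function cachedFetch(url) {
    const hit = cache.get(url);
    const now = Date.now();
    if (hit && now - hit.time < maxAgeMs) {
      return hit.text; // still fresh, skip the network entirely
    }
    try {
      const text = await fetchText(url);
      cache.set(url, { text, time: now });
      return text;
    } catch (err) {
      if (hit) {
        return hit.text; // network trouble: serve the stale copy
      }
      throw err; // nothing cached yet, so the error has to surface
    }
  };
}

// Demonstration with a fake downloader that fails after the first call.
let calls = 0;
const flakyDownload = async () => {
  calls += 1;
  if (calls > 1) throw new Error("network down");
  return "<rss></rss>";
};
const cachedDownload = withCache(flakyDownload, 0);

(async () => {
  console.log(await cachedDownload("https://hacks.mozilla.org/feed/"));
  console.log(await cachedDownload("https://hacks.mozilla.org/feed/"));
})();
```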

Sneaky block-scoping variables in JavaScript that eslint can't even detect

February 3, 2021
0 comments JavaScript

What do you think this code will print out?


function validateURL(url) {
  if (url.includes("://")) {
    const url = new URL(url);
    return url.protocol === "https:";
  } else {
    return "dunno";
  }
}
console.log(validateURL("http://www.peterbe.com"));

I'll give you a clue that isn't helpful,


▶ eslint --version
v7.19.0

▶ eslint code.js

▶ echo $?
0

OK, the answer is that it crashes:

▶ node code.js
/Users/peterbe/dev/JAVASCRIPT/catching_consts/code.js:3
    const url = new URL(url);
                        ^

ReferenceError: Cannot access 'url' before initialization
    at validateURL (/Users/peterbe/dev/JAVASCRIPT/catching_consts/code.js:3:25)
    at Object.<anonymous> (/Users/peterbe/dev/JAVASCRIPT/catching_consts/code.js:9:13)
...

▶ node --version
v15.2.1

It's an honest and easy mistake to make. If the code was this:


function validateURL(url) {
  const url = new URL(url);
  return url.protocol === "https:";
}
// console.log(validateURL("http://www.peterbe.com"));

you'd get this error:

▶ node code2.js
/Users/peterbe/dev/JAVASCRIPT/catching_consts/code2.js:2
  const url = new URL(url);
        ^

SyntaxError: Identifier 'url' has already been declared

which means node refuses to even start it. But it can't do that with the original code because there the inner const url is legal shadowing inside a block scope, and the use-before-initialization only surfaces at runtime.
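
The "only at runtime" part is easy to demonstrate. In this sketch, the same function works fine as long as the if block is never entered, and throws as soon as it is (the so-called temporal dead zone of the inner const).

```javascript
function validateURL(url) {
  if (url.includes("://")) {
    // The inner `const url` shadows the parameter for this whole block,
    // so the `url` inside `new URL(url)` is the not-yet-initialized const.
    const url = new URL(url);
    return url.protocol === "https:";
  } else {
    return "dunno";
  }
}

// This input never enters the block, so nothing goes wrong.
const harmless = validateURL("no-scheme-here");
console.log(harmless);

// This input enters the block and hits the temporal dead zone.
let caught = null;
try {
  validateURL("http://www.peterbe.com");
} catch (err) {
  caught = err;
}
console.log(caught instanceof ReferenceError);
```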

Easiest solution


function validateURL(url) {
  if (url.includes("://")) {
-   const url = new URL(url);
+   const parsedURL = new URL(url);
-   return url.protocol === "https:";
+   return parsedURL.protocol === "https:";
  } else {
    return "dunno";
  }
}
console.log(validateURL("http://www.peterbe.com"));

Best solution

Switch to TypeScript.

▶ cat code.ts
function validateURL(url: string) {
  if (url.includes('://')) {
    const url = new URL(url);
    return url.protocol === 'https:';
  } else {
    return "dunno";
  }
}
console.log(validateURL('http://www.peterbe.com'));

▶ tsc --noEmit --lib es6,dom code.ts
code.ts:3:25 - error TS2448: Block-scoped variable 'url' used before its declaration.

3     const url = new URL(url);
                          ~~~

  code.ts:3:11
    3     const url = new URL(url);
                ~~~
    'url' is declared here.


Found 1 error.

useSearchParams as a React global state manager

February 1, 2021
0 comments React, JavaScript

tl;dr; The useSearchParams hook from react-router is great as a hybrid state manager in React.

The wonderful react-router has a v6 release coming soon. At the time of writing, 6.0.0-beta.0 is the release to play with. It comes with a React hook called useSearchParams and it's fantastic. It's not a global state manager, but it can be used as one. It's not persistent, but it's semi-persistent in that state can be recovered/retained in browser refreshes.

Basically, instead of component state (e.g. React.useState()) you use:


import React from "react";
import { createSearchParams, useSearchParams } from "react-router-dom";
import "./styles.css";

export default function App() {
  const [searchParams, setSearchParams] = useSearchParams();

  const favoriteFruit = searchParams.get("fruit");
  return (
    <div className="App">
      <h1>Favorite fruit</h1>
      {favoriteFruit ? (
        <p>
          Your favorite fruit is <b>{favoriteFruit}</b>
        </p>
      ) : (
        <i>No favorite fruit selected yet.</i>
      )}

      {["🍒", "🍑", "🍎", "🍌"].map((fruit) => {
        return (
          <p key={fruit}>
            <label htmlFor={`id_${fruit}`}>{fruit}</label>
            <input
              type="radio"
              value={fruit}
              checked={favoriteFruit === fruit}
              onChange={(event) => {
                setSearchParams(
                  createSearchParams({ fruit: event.target.value })
                );
              }}
            />
          </p>
        );
      })}
    </div>
  );
}

See Codesandbox demo here

To get a feel for it, try the demo page in Codesandbox and note how it basically sets ?fruit=🍌 in the URL and if you refresh the page, it just continues as if the state had been persistent.

Basically, that's it. You never have a local component state but instead, you use the current URL as your store, and useSearchParams is your conduit for it. The advantages are:

  1. It's dead simple to use
  2. You get "shared state" across components without needing to manually inform them through prop drilling
  3. At any time, the current URL is a shareable snapshot of the state

The disadvantages are:

  1. It needs to be realistic to serialize it through the URLSearchParams web API
  2. The keys used need to be globally reserved for each distinct component that uses it
  3. You might not want the URL to change

That's all you need to know to get started. But let's dig into some more advanced examples, with some abstractions, to "workaround" the limitations.

To append or to reset

Suppose you have many different components. It's very likely that they don't really know or care about each other. Suppose the current URL is /page?food=🍔 and one component does: setSearchParams(createSearchParams({fruit: "🍑"})). What will happen is that the URL will "start over" and become /page?fruit=🍑. In other words, the food=🍔 was lost. Well, this might be a desired effect, but let's assume it's not, so we'll have to make it "append" instead. Here's one such solution:


function appendSearchParams(obj) {
  const sp = createSearchParams(searchParams);
  Object.entries(obj).forEach(([key, value]) => {
    if (Array.isArray(value)) {
      sp.delete(key);
      value.forEach((v) => sp.append(key, v));
    } else if (value === undefined) {
      sp.delete(key);
    } else {
      sp.set(key, value);
    }
  });
  return sp;
}

Now, you can do things like this:


onChange={(event) => {
  setSearchParams(
-    createSearchParams({ fruit: event.target.value })
+    appendSearchParams({ fruit: event.target.value })
  );
}}

See Codesandbox demo here

Now, the two keys work independently of each other. It has a nice "just works feeling".

Note that this appendSearchParams() function implementation solves the case of arrays. You could now call it like this:


{/* Untested, but hopefully the point is demonstrated */}
<div>
  <ul>
    {(searchParams.getAll("languages") || []).map((language) => (
      <li key={language}>{language}</li>
    ))}
  </ul>
  <button
    type="button"
    onClick={() => {
      setSearchParams(
        appendSearchParams({ languages: ["en-US", "sv-SE"] })
      );
    }}
  >
    Select 'both'
  </button>
</div>

...and that will update the URL to become ?languages=en-US&languages=sv-SE.
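
The set/delete/array behavior can also be tried outside of React, since it's all plain URLSearchParams underneath. The sketch below uses the same logic as appendSearchParams() but takes the current params as an explicit argument (that signature, and the name appendParams, are assumptions made so it can run in Node).

```javascript
// Same logic as appendSearchParams() above, except the current params are
// passed in explicitly so it can run outside a React component.
function appendParams(current, obj) {
  const sp = new URLSearchParams(current);
  for (const [key, value] of Object.entries(obj)) {
    if (Array.isArray(value)) {
      sp.delete(key);
      value.forEach((v) => sp.append(key, v));
    } else if (value === undefined) {
      sp.delete(key);
    } else {
      sp.set(key, value);
    }
  }
  return sp;
}

const start = new URLSearchParams("food=burger");

// Adding a key keeps the existing one.
const withFruit = appendParams(start, { fruit: "peach" });
console.log(withFruit.toString());

// Arrays become repeated keys and `undefined` removes a key.
const withLanguages = appendParams(withFruit, {
  languages: ["en-US", "sv-SE"],
  food: undefined,
});
console.log(withLanguages.toString());
```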

Serialize it into links

The useSearchParams hook returns a callable setSearchParams() which is basically doing a redirect (uses the useNavigate() hook). But suppose you want to make a link that serializes a "future state". Here's a very basic example:


// Assumes 'import { Link } from "react-router-dom";'

<Link to={`?${appendSearchParams({fruit: "🍌"})}`}>Switch to 🍌</Link>

See Codesandbox demo here

Now, you get nice regular hyperlinks that users can right-click and "Open in a new tab" and it'll just work.

Type conversion and protection

The above simple examples use strings and arrays of strings. But suppose you need to do more advanced type conversions. For example: /tax-calculator?rate=3.14 where you might have something that needs to be deserialized and serialized as a floating point number. Basically, you have to wrap the deserializing in a more careful way. E.g.


function TaxYourImagination() {
  const [searchParams, setSearchParams] = useSearchParams();

  // URLSearchParams.get() only takes the key, so apply the default here
  const taxRaw = searchParams.get("tax") || DEFAULT_TAX_RATE;
  let tax;
  let taxError;
  try {
    tax = castAndCheck(taxRaw);
  } catch (err) {
    taxError = err;
  }

  if (taxError) {
    return (
      <div className="error-alert">
        The provided tax rate is invalid: <code>{taxError.toString()}</code>
      </div>
    );
  }
  return <DisplayTax value={tax} onUpdate={(newValue) => { 
    setSearchParams(
      createSearchParams({ tax: newValue.toFixed(2) })
    );
   }}/>;
}
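
The castAndCheck() function isn't shown above. One possible version is sketched below; it's hypothetical, and the 0 to 100 range in particular is just an assumption about what a sensible tax rate is.

```javascript
// Hypothetical implementation; the 0-100 range is an assumption.
function castAndCheck(raw) {
  const value = Number(raw);
  // Number(null) and Number("") are both 0, so rule those out explicitly.
  if (raw === null || String(raw).trim() === "" || Number.isNaN(value)) {
    throw new Error(`not a number: ${JSON.stringify(raw)}`);
  }
  if (value < 0 || value > 100) {
    throw new RangeError(`out of range (0-100): ${value}`);
  }
  return value;
}

console.log(castAndCheck("3.14"));
```

Anything it throws ends up in taxError, so the component can show a friendly message instead of rendering a broken number.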

Fastest way to turn HTML into text in Python

January 8, 2021
3 comments Python

tl;dr; selectolax is best for stripping HTML down to plain text.

The problem is that I have 10,000+ HTML snippets that I need to index into Elasticsearch as plain text. (Before you ask, yes I know Elasticsearch has an html_strip character filter but it's not what I want/need to use in this context).
Turns out, stripping the HTML into plain text was actually quite expensive at that scale. So what's the most performant way?

PyQuery


from pyquery import PyQuery as pq

text = pq(html).text()

selectolax


from selectolax.parser import HTMLParser

text = HTMLParser(html).text()

regular expression


import re

clean_regex = re.compile(r'<.*?>')
text = clean_regex.sub('', html)

Results

I wrote a script that iterated through 10,000 files that contain HTML snippets. Note! The snippets aren't complete <html> documents (with a <head> and <body> etc.), just blobs of HTML. The average size is 10,314 bytes (5,138 bytes median).

pyquery
  SUM:    18.61 seconds
  MEAN:   1.8633 ms
  MEDIAN: 1.0554 ms
selectolax
  SUM:    3.08 seconds
  MEAN:   0.3149 ms
  MEDIAN: 0.1621 ms
regex
  SUM:    1.64 seconds
  MEAN:   0.1613 ms
  MEDIAN: 0.0881 ms

I've run it a bunch of times. The results are pretty stable.

Point is: selectolax is ~6 times faster than PyQuery

Regex? Really?

No, I don't think I want to use that. It makes me nervous without even attempting to dig up some examples where it goes wrong. It might work just fine for the most basic blobs of HTML. Actually, if the HTML is <p>Foo &amp; Bar</p>, I expect the plain text transformation should be Foo & Bar, not Foo &amp; Bar.

More pressing, both PyQuery and selectolax support something very specific but important to my use case. I need to remove certain tags (and their content) before I proceed. For example:


<h4 class="warning">This should get stripped.</h4>
<p>Please keep.</p>
<div style="display: none">This should also get stripped.</div>

That can never be done with a regex.
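
To make both concerns concrete, here's what the same naive tag-stripping regex does to that kind of input. The sketch is in JavaScript (the regex behaves the same way as the Python one for this input) and the sample HTML is pieced together from the examples above.

```javascript
// Sample input pieced together from the examples in this post.
const html = [
  '<h4 class="warning">This should get stripped.</h4>',
  "<p>Foo &amp; Bar</p>",
  '<div style="display: none">This should also get stripped.</div>',
].join("\n");

// The same naive approach as the Python regex: drop anything tag-shaped.
const text = html.replace(/<.*?>/g, "");
console.log(text);
```

The tags are gone, but the &amp; entity is left undecoded and the content of the warning and hidden elements is still in there.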

Version 2.0

So my requirement will probably change but basically, I want to delete certain tags. E.g. <div class="warning"> and <div class="hidden"> and <div style="display: none">. So let's implement that:

PyQuery


import re

from pyquery import PyQuery as pq

_display_none_regex = re.compile(r'display:\s*none')

doc = pq(html)
doc.remove('div.warning, div.hidden')
for div in doc('div[style]').items():
    style_value = div.attr('style')
    if _display_none_regex.search(style_value):
        div.remove()
text = doc.text()

selectolax


import re

from selectolax.parser import HTMLParser

_display_none_regex = re.compile(r'display:\s*none')

tree = HTMLParser(html)
for tag in tree.css('div.warning, div.hidden'):
    tag.decompose()
for tag in tree.css('div[style]'):
    style_value = tag.attributes['style']
    if style_value and _display_none_regex.search(style_value):
        tag.decompose()
text = tree.body.text()

This actually works. When I now run the same benchmark for 10,000 of these, here are the new results:

pyquery
  SUM:    21.70 seconds
  MEAN:   2.1701 ms
  MEDIAN: 1.3989 ms
selectolax
  SUM:    3.59 seconds
  MEAN:   0.3589 ms
  MEDIAN: 0.2184 ms
regex
  Skip

Again, selectolax beats PyQuery by a factor of ~6.

Conclusion

Regular expressions are fast but weak in power. Makes sense.

This selectolax is very impressive.
I got the inspiration from this blog post which sets out to do something very similar to what I'm doing.

I hope this helps someone. Thank you Artem Golubin of selectolax and @lexborisov for Modest which selectolax is built upon.