How I upload Firebase images optimized

September 2, 2021
0 comments JavaScript, Web development, Firebase

I have an app that allows you to upload images. The images are stored using Firebase Storage. Then, once uploaded, a Firebase Cloud Function turns the image into a thumbnail. The problem is that waking up the cloud function (the first time) and generating that thumbnail takes a long time. Not to mention downloading the thumbnail payload to the client. It's not unrealistic that the whole thumbnail generation plus download takes multiple (single-digit) seconds. But you don't want the user to sit and wait that long. My solution is to display the uploaded file in an <img> tag using URL.createObjectURL().
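
In isolation, the core idea is tiny (a minimal sketch; fileInput here is a hypothetical reference to an <input type="file"> element):

// Show the just-picked File immediately, without waiting for any
// server-generated thumbnail to exist yet.
const file = fileInput.files[0];
const img = document.createElement("img");
img.src = URL.createObjectURL(file); // object URL pointing at the in-memory File
document.body.appendChild(img);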

The following code is mostly pseudo-code but should look familiar if you're used to how Firebase and React/Preact work. Here's the FileUpload component:


interface Props {
  onUploaded: ({ file, filePath }: { file: File; filePath: string }) => void;
  onSaved?: () => void;
}

function FileUpload({ onSaved, onUploaded }: Props) {
  const [file, setFile] = useState<File | null>(null);

  // ...some other state stuff omitted for this example.

  useEffect(() => {
    if (file) {
      const metadata = {
        contentType: file.type,
      };

      // `prefix`, `item`, `list`, `storage`, and `db` come from the
      // surrounding application and are omitted here.
      const filePath = getImageFullPath(prefix, item ? item.id : list.id, file);
      const storageRef = storage.ref();

      const uploadTask = storageRef.child(filePath).put(file, metadata);
      uploadTask.on(
        "state_changed",
        (snapshot) => {
          // ...set progress percentage
        },
        (error) => {
          setUploadError(error);
        },
        () => {
          onUploaded({ file, filePath }); // THE IMPORTANT BIT!

          db.collection("pictures")
            .add({ filePath })
            .then(() => {
              if (onSaved) onSaved();
            });
        }
      );
    }
  }, [file]);

  return (
      <input
        type="file"
        accept="image/jpeg, image/png"
        onInput={(event) => {
          if (event.target.files) {
            const file = event.target.files[0];
            validateFile(file);
            setFile(file);
          }
        }}
      />
  );
}

The important "trick" is that we call back after the storage is complete by sending the filePath and the file back to whatever component triggered this component. Now, you can know, in the parent component, that there's going to soon be an image reference with a file path (filePath) that refers to that File object.

Here's a rough version of how I use this <FileUpload> component:


function Images() {

  const [uploadedFiles, setUploadedFiles] = useState<Map<string, File>>(
    new Map()
  );

  return (
    <div>
      <FileUpload
        onUploaded={({ file, filePath }: { file: File; filePath: string }) => {
          const newMap: Map<string, File> = new Map(uploadedFiles);
          newMap.set(filePath, file);
          setUploadedFiles(newMap);
        }}
      />

      <ListUploadedPictures uploadedFiles={uploadedFiles} />
    </div>
  );
}

function ListUploadedPictures({
  uploadedFiles,
}: {
  uploadedFiles: Map<string, File>;
}) {
  // Imagine some Firebase Firestore subscriber here
  // that watches for uploaded pictures and fills a `pictures` array.
  return (
    <div>
      {pictures.map((picture) => (
        <Picture
          key={picture.filePath}
          picture={picture}
          uploadedFiles={uploadedFiles}
        />
      ))}
    </div>
  );
}

function Picture({
  uploadedFiles,
  picture,
}: {
  uploadedFiles: Map<string, File>;
  picture: {
    filePath: string;
  };
}) {
  // getThumbnailURL() and PLACEHOLDER_IMAGE come from the surrounding app.
  const thumbnailURL = getThumbnailURL(picture.filePath, 500);
  const [loaded, setLoaded] = useState(false);

  useEffect(() => {
    let mounted = true;
    const preloadImg = new Image();
    preloadImg.src = thumbnailURL;

    const callback = () => {
      if (mounted) {
        setLoaded(true);
      }
    };
    if (preloadImg.decode) {
      preloadImg.decode().then(callback, callback);
    } else {
      preloadImg.onload = callback;
    }

    return () => {
      mounted = false;
    };
  }, [thumbnailURL]);

  // The File object we kept in memory right after the upload (if any).
  const file = uploadedFiles.get(picture.filePath);

  return (
    <img
      style={{
        width: 500,
        height: 500,
        objectFit: "cover",
      }}
      src={
        loaded
          ? thumbnailURL
          : file
          ? URL.createObjectURL(file)
          : PLACEHOLDER_IMAGE
      }
    />
  );
}

Phew! That was a lot of code. Sorry about that. But still, this is just a summary of the real application code.

The point is this: I send the File object back to the parent component immediately after having uploaded it to Firebase Cloud Storage. Then, having access to that File object, I can use it as the thumbnail while I wait for the real thumbnail to come in. Now, it doesn't matter that it takes 1-2 seconds to wake up the cloud function, 1-2 seconds to perform the thumbnail creation, and then 0.1-2 seconds to download the thumbnail. All the while this is happening, you're looking at the File object that was uploaded. Visually, the user doesn't even notice the difference. If you refresh the page, that temporary in-memory uploadedFiles (Map instance) is empty, so you're now relying on the loading of the thumbnail, which should hopefully, at this point, be stored in the browser's native HTTP cache.

The other important part of the trick is that we're using const preloadImg = new Image() for loading the thumbnail. And by relying on preloadImg.decode ? preloadImg.decode().then(...) : preloadImg.onload = ..., we only make the swap once the thumbnail has been successfully created and successfully downloaded.
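
If you want to reuse that preloading logic elsewhere, it can be extracted into a small hook (a sketch of my own, not code from the app; the hook name is made up):

import { useEffect, useState } from "preact/hooks"; // or "react"

// Returns true once the image at `url` has been downloaded (and, where
// supported, decoded), so it's safe to swap it in without a flash.
function useImagePreloaded(url) {
  const [loaded, setLoaded] = useState(false);

  useEffect(() => {
    let mounted = true;
    const preloadImg = new Image();
    preloadImg.src = url;

    const callback = () => {
      if (mounted) setLoaded(true);
    };
    if (preloadImg.decode) {
      preloadImg.decode().then(callback, callback);
    } else {
      preloadImg.onload = callback;
    }

    return () => {
      mounted = false;
    };
  }, [url]);

  return loaded;
}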

10 years a Mozillian, always a Mozillian

August 30, 2021
2 comments Web development, Mozilla, MDN

As of September 2021, I am leaving Mozilla after 10 years. It hasn't been perfect but it's been a wonderful time with fond memories and an amazing career rocket ship.

In April 2011, I joined as a web developer to work on internal web applications that support Firefox development and engineering. In rough order, I worked on...

  • Elmo: The web application for managing the state of Firefox localization
  • Socorro: When Firefox crashes and asks to send a crash dump, this is the storage plus website for analyzing that
  • Peekaboo: When people come to visit a Mozilla office, they sign in on a tablet at the reception desk
  • Balrog: For managing what versions are available for Firefox products to query when it's time to self-upgrade
  • Air Mozilla: For watching live streams and video archive of all recordings within the company
  • MozTrap: For QA engineers to track what was tested, and the results, when QA testing Firefox products
  • Symbol Server: Where all C++ debug symbols are stored from the build pipeline to be used to source-map crash stack traces
  • Buildhub: To get a complete database of all and every individual build shipped of Firefox products
  • Remote Settings: Managing experiments and for Firefox to "phone home" for smaller updates/experiments between releases
  • MDN Web Docs: Where web developers go to look up all the latest and most detailed details about web APIs

This is an incomplete list because at Mozilla you get to help each other and I shipped a lot of smaller projects too, such as Contribute.json, Whatsdeployed, GitHub PR Triage, Bugzilla GitHub Bug Linker.

Reflecting back, the highlight of any project is when you get to meet or interact with the people you help. Few things are as rewarding as when someone you don't know finds out, in person, what you do and says: "Are you Peter?! The one who built XYZ? I love that stuff! We use it all the time now in my team. Thank you!" It's not a brag, because oftentimes what you build for fellow humans isn't brilliant engineering in any way. It's just something that someone needed. Perhaps the lesson is the importance of not celebrating what you've built, but putting yourself in the same room as the people who use it. And, in fact, if what you've built for someone else isn't particularly loved, meeting and fully interacting with the people who use "your stuff" gives you the best feedback. And who doesn't love constructive criticism that empowers you to build better stuff?

Mozilla is a great company. There is no doubt in my mind. We ship high-quality products and we do it with pride. There have definitely been some rough patches over the years but that happens and you just have to carry on and try to focus on delivering value. Firefox Nightly will continue to be my default browser and I'll happily click any Google search ads to help every now and then. THANK YOU everyone I've ever worked with at Mozilla! You are a wonderful bunch of people!

How to fadeIn and fadeOut like jQuery but with Cash

August 24, 2021
0 comments JavaScript

Remember jQuery? Yeah, it was great. But it was also horrible in its own ways, though only when compared to the more powerful tools we have now in 2021. I still (almost) use it here on my site. Actually, I use a "fork" of jQuery called Cash which calls itself: "An absurdly small jQuery alternative for modern browsers."
Cash is written in TypeScript, which gives me peace of mind, and as a JS bundle, it's only 19KB minified (5.3KB Brotli compressed) whereas jQuery is 87KB minified (27KB Brotli compressed).

But something that jQuery has, that Cash doesn't, is animations. E.g. $('myselector').fadeIn(). If you need to do this with Cash you can use the following pure JavaScript solution:


// Example implementation

const msg = $('<div class="message">')
  .text(`Random message: ${Math.random()}`)
  .css("opacity", 0)
  .css("transition", "opacity 600ms")
  .prependTo($("#root"));
setTimeout(() => msg.css("opacity", 1), 0);

setTimeout(() => {
  msg.css("transition", "opacity 1000ms").css("opacity", 0);
  setTimeout(() => msg.remove(), 1000);
}, 3000);

What this demonstrates is the creation of a <div> that's immediately injected into the DOM but slowly fades into view. And 3 seconds later it fades out and is removed. Full demo/sample application here.

Sample application using cash like jQuery's $.fadeIn().

The point of the demo is that you can cause the fade-in effect with just Cash but still rely on CSS for the actual animation.
The trick is to, ultimately, create it first like this:


<div class="message" style="opacity:0; transition: opacity 600ms">
  Random message: 0.6517198324628395
</div>

and then, right after it's been added to the DOM, change the style=... to:


-<div class="message" style="opacity:0; transition: opacity 600ms">
+<div class="message" style="opacity:1; transition: opacity 600ms">

What's neat about this is that you use the transition shorthand so it's done entirely with CSS instead of a requestAnimationFrame and/or while-loop like jQuery's effects.js does it.
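
If you find yourself doing this a lot, you could wrap the same trick into a couple of tiny helpers (a sketch; the function names are my own, not part of Cash):

// Fade a Cash collection in by toggling opacity after it's in the DOM.
function fadeIn($el, duration = 600) {
  $el.css("opacity", 0).css("transition", `opacity ${duration}ms`);
  setTimeout(() => $el.css("opacity", 1), 0);
  return $el;
}

// Fade it out, then remove it once the transition has had time to finish.
function fadeOut($el, duration = 600) {
  $el.css("transition", `opacity ${duration}ms`).css("opacity", 0);
  setTimeout(() => $el.remove(), duration);
  return $el;
}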

Note! This is not a polyfill since jQuery's fadeIn() (etc.) can do a lot more such as callbacks. The example might not be great but I hope this little solution becomes useful for someone else who needs this.

Shut the door! How to automate getting the kids to close the door

August 23, 2021
0 comments Family

Like any responsible parent, I get heart palpitations when my kids (and especially their friends) rush through the door, in or out, and just leave the door wide open letting all that sweet sweet air-conditioned coldness rush out like a prison break. My wife gave me these Gibcloser "Safety Door Closer" as a present and we installed them today on all 3 doors (front door, to-basement door, back porch door).

Gibcloser

They're ~$18 on Amazon.com and it appears they come in many different colors. Our doors were white so that's what we used. It took less than 30 minutes to install on all 3 doors. All you need is a screwdriver and, I would suggest, some thin foam command strips.

Door held at 90° and released.

I know it might sound silly but I've wanted this for so long and it never occurred to me that such a simple solution might exist. Clearly, we've all seen door closers like this before, but almost always bigger. Especially on commercial buildings. But they all seem so complicated and expensive-looking and definitely strike fear in you of: "that'd never work on my home doors". What's neat about these is that they are gentle. It's just a gentle (albeit accelerating) push but if someone/something was to get stuck as it closes, it wouldn't chop off a finger or a foot.

Truth be told, they don't close the door all the way. In the above video, it closes nicely because I held the door open at 90°, which is a wider angle than someone would realistically leave the door at when forgetting to close it. But at least it'll close it enough to stop the flow of cold/warm air going through.

How to submit a form with Playwright

August 3, 2021
0 comments JavaScript

Because it was driving me insane, and because I don't want to ever forget...

Playwright is a wonderful alternative to jest-puppeteer for doing automated headless browser end-to-end testing. But one thing I couldn't find in the documentation, Google searches, or Stack Overflow was: how do you submit a form without clicking a button? I.e. you have focus in an input field and hit Enter. Here's how you do it:


await page.$eval('form[role="search"]', (form) => form.submit());

The first part is any CSS selector that gets you to the <form> element. In this case, imagine it was:


<form action="/search" role="search">
  <input type="search" name="q">
</form>

You, or my future self, might be laughing at me for missing something obvious but this one took me forever to solve so I thought I'd better blog about it in case someone else gets into the same jam.

UPDATE (Sep 2021)

I found a much easier way:


await page.keyboard.press("Enter");

This obviously only works when you've typed something into an input so the focus is on that <input> element. E.g.:


await page.fill('input[aria-label="New shopping list item"]', "Carrots");
await page.keyboard.press("Enter");
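
Putting the two approaches together into one runnable sketch (the URL and page content here are made up for illustration):

const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com/shopping");

  // Option 1: submit the <form> element directly.
  await page.$eval('form[role="search"]', (form) => form.submit());

  // Option 2: fill the input (which gives it focus) and press Enter.
  await page.fill('input[aria-label="New shopping list item"]', "Carrots");
  await page.keyboard.press("Enter");

  await browser.close();
})();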

How to install Python Poetry in GitHub Actions in MUCH faster way

July 27, 2021
0 comments Python

We use Poetry in a GitHub project. There's a pyproject.toml file (and a poetry.lock file) which, with the help of the poetry executable, gets you a very reliable Python environment. The only problem is that installing the poetry executable is slow. Like 10+ seconds slow. It might seem silly, but in the project I'm working on, that 10+ second delay is the slowest part of a GitHub Actions workflow which needs to be fast because it's trying to post a comment on a pull request as soon as it possibly can.

Installing poetry being the slowest part

First I tried caching $(pip cache dir) so that the underlying python -m pip install virtualenv -t $tmp_dir that install-poetry.py does would get a boost from avoiding the network. The difference was negligible. I also didn't want to get too weird by overriding how install-poetry.py works or even making my own hacky copy. I like being able to just rely on the snok/install-poetry GitHub Action to do its thing (and its future thing).

The solution was to cache the whole $HOME/.local directory. It's as simple as this:


- name: Load cached $HOME/.local
  uses: actions/cache@v2.1.6
  with:
    path: ~/.local
    key: dotlocal-${{ runner.os }}-${{ hashFiles('.github/workflows/pr-deployer.yml') }}

The key is important. If you do copy-n-paste this block of YAML to speed up your GitHub Action, please remember to replace .github/workflows/pr-deployer.yml with the name of your .yml file that uses this. It's important because otherwise, the cache might be overzealously hot when you make a change like:


       - name: Install Python poetry
-        uses: snok/install-poetry@v1.1.6
+        uses: snok/install-poetry@v1.1.7
         with:

...for example.

Now, thankfully install-poetry.py (which is the recommended way to install poetry by the way) can notice that it's already been created and so it can omit a bunch of work. The result of this is as follows:

A fast install poetry

From 10+ seconds to 2 seconds. And what's neat is that the optimization is very "unintrusive" because it doesn't mess with how the snok/install-poetry workflow works.

But wait, there's more!

If you dig up our code where we use poetry you might find that it does a bunch of other caching too. In particular, it caches the .venv it creates too. That's related but ultimately a separate concern: it caches the virtualenv generated by the poetry install command. It works like this:


- name: Load cached venv
  id: cached-poetry-dependencies
  uses: actions/cache@v2.1.6
  with:
    path: deployer/.venv
    key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock') }}-${{ hashFiles('.github/workflows/pr-deployer.yml') }}

...

- name: Install deployer
  run: |
    cd deployer
    poetry install
  if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'

In this example, deployer is just the name of the directory, in the repository root, where we have all the Python code and the pyproject.toml etc. If you have yours at the root of the project you can just do: run: poetry install and in the caching step change it to: path: .venv.
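
For example, a root-of-project variant could look something like this (a sketch; remember to point hashFiles at the name of your own workflow file instead of the placeholder below):

- name: Load cached venv
  id: cached-poetry-dependencies
  uses: actions/cache@v2.1.6
  with:
    path: .venv
    key: venv-${{ runner.os }}-${{ hashFiles('**/poetry.lock') }}-${{ hashFiles('.github/workflows/your-workflow.yml') }}

- name: Install dependencies
  run: poetry install
  if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'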

Now, you get a really powerful complete caching strategy. When the caches are hot (i.e. no changes to the .yml, poetry.lock, or pyproject.toml files) you get the executable (so you can do poetry run ...) and all its dependencies in roughly 2 seconds. That'll be hard to beat!

An effective and immutable way to turn two Python lists into one

June 23, 2021
7 comments Python

tl;dr; To make 2 lists into 1 without mutating them use list1 + list2.

I'm blogging about this because today I accidentally complicated my own code. From now on, let's just focus on the right way.

Suppose you have something like this:


winners = [123, 503, 1001]
losers = [45, 812, 332]

combined = winners + losers

that will create a brand new list. To prove that the originals aren't mutated:

>>> combined.insert(0, 100)
>>> combined
[100, 123, 503, 1001, 45, 812, 332]
>>> winners
[123, 503, 1001]
>>> losers
[45, 812, 332]

What I originally did was:


winners = [123, 503, 1001]
losers = [45, 812, 332]

combined = [*winners, *losers]

This works the same and that syntax feels very JavaScript'y. E.g.

> var winners = [123, 503, 1001]
[ 123, 503, 1001 ]
> var losers = [45, 812, 332]
[ 45, 812, 332 ]
> var combined = [...winners, ...losers]
[ 123, 503, 1001, 45, 812, 332 ]
> combined.pop()
332
> losers
[ 45, 812, 332 ]

By the way, if you want to filter out duplicates, do this:


>>> a = [1, 2, 3]
>>> b = [2, 3, 4]
>>> list(dict.fromkeys(a + b))
[1, 2, 3, 4]

It's the most performant way to do it if the order is important.
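
For example, the order that's preserved is the order in which the elements are first seen:

>>> a = [3, 1, 2]
>>> b = [2, 5, 1]
>>> list(dict.fromkeys(a + b))
[3, 1, 2, 5]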

And if you don't care about the order you can use this:

>>> a = [1, 2, 3]
>>> b = [2, 3, 4]
>>> list(set(a + b))
[1, 2, 3, 4]
>>> list(set(b + a))
[1, 2, 3, 4]

How to get all of MDN Web Docs running locally

June 9, 2021
1 comment Web development, MDN

tl;dr; git clone https://github.com/mdn/content.git && cd content && yarn install && yarn start && open http://localhost:5000/ will get you all of MDN Web Docs running on your laptop.

The MDN Web Docs is built from a git repository: github.com/mdn/content. It contains all you need to get all the content running locally, including search. Embedded inside that repository is a package.json which helps you start a Yari server, a.k.a. the preview server. It's a static build of the github.com/mdn/yari project which handles client-side rendering, search, and just-in-time server-side rendering.

Basics

All you need is the following:

▶ git clone https://github.com/mdn/content.git
▶ cd content
▶ yarn install
▶ yarn start

And now open http://localhost:5000 in your browser.

This will now run in "preview server" mode. It's meant for contributors (and core writers) to use when they're working on a git branch. Because of that, you'll see a "Writer's homepage" at the root URL. And when viewing each document, you get buttons about "flaws" and stuff. Looks like this:

Preview server

Alternative ways to download

If you don't want to use git clone you can download the ZIP file. For example:

▶ wget https://github.com/mdn/content/archive/refs/heads/main.zip
▶ unzip main.zip
▶ cd content-main
▶ yarn install
▶ yarn start

At the time of writing, the downloaded ZIP file is 86MB and, unzipped, the directory is 278MB on disk.

When you use git clone, by default it will download all the git history. That can actually be useful. This way, when rendering each document, it can figure out from the git logs when each individual document was last modified. For example:

"Last modified"

If you don't care about the "Last modified" date, you can do a "shallow git clone" instead. Replace the above-mentioned first command with:

▶ git clone --depth 1 https://github.com/mdn/content.git

At the time of writing the shallow cloned content folder becomes 234MB instead of (the deep clone) 302MB.

Just the raw rendered data

Every MDN Web Docs page has an index.json equivalent. Take any MDN page and add /index.json to the URL. For example /en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/slice/index.json

Essentially, this is the intermediate state that's used for server-side rendering the page. A glorified way of sandwiching the content in a header, a footer, and a sidebar to the side. These URLs work on localhost:5000 too. Try http://localhost:5000/en-US/docs/Web/API/Fetch_API/Using_Fetch/index.json for example.
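
If you want to poke at that data programmatically, you could fetch it directly (a sketch against the local preview server; the exact JSON shape may vary, but it contains a doc object with things like title, as used later in this post):

fetch("http://localhost:5000/en-US/docs/Web/API/Fetch_API/Using_Fetch/index.json")
  .then((response) => response.json())
  .then((data) => {
    console.log(data.doc.title);
  });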

The content for that index.json is built just in time. It also contains a bunch of extra metadata about "flaws"; a system used to highlight things that should be fixed and that are somewhat easy to detect automatically. So it doesn't flag things like spelling mistakes or code snippets that are actually invalid.

But suppose you want all that raw (rendered) data, without any of the flaw detections, you can run this command:

▶ BUILD_FLAW_LEVELS="*:ignore" yarn build

It'll take a while (because it produces an index.html file too). But now you have all the index.json files for everything in the newly created ./build/ directory. It should have created a lot of files:

▶ find build -name index.json | wc -l
   11649

If you just want a subtree of files you could have run it like this instead:

▶ BUILD_FOLDERSEARCH=web/javascript BUILD_FLAW_LEVELS="*:ignore" yarn build

Programmatic API access

The programmatic APIs are all about finding the source files. But you can use the sources to turn that into the built files you might need. Or just to get a list of URLs. To get started, create a file called find-files.js in the root:


const { Document } = require("@mdn/yari/content");

console.log(Document.findAll().count);

Now, run it like this:

▶ export CONTENT_ROOT=files

▶ node find-files.js
11649

Other things you can do with that findAll function:


const { Document } = require("@mdn/yari/content");

const found = Document.findAll({
  folderSearch: "web/javascript/reference/statements/f",
});
for (const document of found.iter()) {
  console.log(document.url);
}

Or, suppose you want to actually build each of these that you find:


const { Document } = require("@mdn/yari/content");
const { buildDocument } = require("@mdn/yari/build");

const found = Document.findAll({
  folderSearch: "web/javascript/reference/statements/f",
});

Promise.all([...found.iter()].map((document) => buildDocument(document))).then(
  (built) => {
    for (const { doc } of built) {
      console.log(doc.title.padEnd(20), doc.popularity);
    }
  }
);

That'll output something like this:

▶ node find-files.js
for                  0.0143
for await...of       0.0129
for...in             0.0748
for...of             0.0531
function declaration 0.0088
function*            0.0122

All the HTML content in production-grade mode

In the most basic form, it will start the "preview server" which is tailored towards building just in time and has all those buttons at the top for writers/contributors. If you want the more "production-grade" version, you can't use the copy of @mdn/yari that is "included" in the mdn/content repo. To do this, you need to git clone mdn/yari and install that. Hang on, this is about to get a bit more advanced:

▶ git clone https://github.com/mdn/yari.git
▶ cd yari
▶ yarn install
▶ yarn build:client
▶ yarn build:ssr
▶ CONTENT_ROOT=../files REACT_APP_DISABLE_AUTH=true BUILD_FLAW_LEVELS="*:ignore" yarn build
▶ CONTENT_ROOT=../files node server/static.js

Now, if you go to something like http://localhost:5000/en-US/docs/Web/Guide/ you'll get the same thing as you get on https://developer.mozilla.org but all on your laptop. Should be pretty snappy.

Is it really entirely offline?

No, it leaks a little. For example, there are interactive examples that use an iframe that's hardcoded to https://interactive-examples.mdn.mozilla.net/.

There are also external images for example. You might get a live sample that refers to sample images on https://mdn.mozillademos.org/files/.... So that'll fail if you're without WiFi in a spaceship.

Conclusion

Making all of MDN Web Docs available offline is, honestly, not a priority. The focus is on A) a secure production build, and B) a good environment for previewing content changes. But all the pieces are there. Search is a little bit tricky, as an example. When you're running it as a preview server you can't do a full-text search on all the content, but you get a useful autocomplete search widget for navigating between different titles. And the full-text search engine is a remote centralized server that you can't take with you offline.

But all the pieces are there. Somehow. It all depends on your use case and what you're willing to "compromise" on.

The correct way to index data into Elasticsearch with (Python) elasticsearch-dsl

May 14, 2021
0 comments Python, MDN, Elasticsearch

This is how MDN Web Docs uses Elasticsearch. Daily, we build all the content and then upload it all with elasticsearch-dsl, using aliases. Because there are no good complete guides to do this, I thought I'd write it down for the next person who needs to do something similar. Let's jump straight into the code. The reader will need a healthy dose of imagination to fill in their details.

Indexing


# models.py

from datetime import datetime

from elasticsearch_dsl import Document, Text

PREFIX = "myprefix"


class MyDocument(Document):
    title = Text()
    body = Text()
    # ...

    class Index:
        name = (
            f'{PREFIX}_{datetime.utcnow().strftime("%Y%m%d%H%M%S")}'
        )

What's important to note here is that the MyDocument.Index.name is dynamically allocated every single time the module is imported. It's not very important exactly what it is called but it's important that it becomes unique each time.
This means that when you start using MyDocument it will automatically figure out which index to use. Now, it's time to create the index and bulk publish it.


# index.py
# Note! This example code skips over things like progress bars
# and verbose logging and misc sanity checks and stuff.

import json
from pathlib import Path

from elasticsearch.helpers import parallel_bulk
from elasticsearch_dsl import Index
from elasticsearch_dsl.connections import connections

from .models import MyDocument, PREFIX


class IndexAliasError(Exception):
    """Raised when no existing index matching the prefix can be found."""


def index(buildroot: Path, url: str, update=False):
    """
    * 'buildroot' is where the files are we're going to read and index
    * 'url' is the host URL for the Elasticsearch server
    * 'update' is if you just want to "cake on" a couple of documents 
      instead of starting over and doing a complete indexing.
    """

    # Connect and stuff
    connections.create_connection(hosts=[url], retry_on_timeout=True)
    connection = connections.get_connection()
    health = connection.cluster.health()
    status = health["status"]
    if status not in ("green", "yellow"):
        raise Exception(f"status {status} not green or yellow")

    if update:
        for name in connection.indices.get_alias():
            if name.startswith(f"{PREFIX}_"):
                document_index = Index(name)
                break
        else:
            raise IndexAliasError(
                f"Unable to find an index called {PREFIX}_*"
            )

    else:
        # Confusingly, `._index` is actually not a private API.
        # It's the documented way you're supposed to reach it.
        document_index = MyDocument._index
        document_index.create()

    def generator():
        # Assuming the build root contains one JSON file per document.
        for doc in Path(buildroot).rglob("*.json"):
            # The reason for specifying the exact index name is that we might
            # be doing an update and if you don't specify it, elasticsearch_dsl
            # will fall back to using whatever Document._meta.Index automatically
            # becomes in this moment.
            yield to_search(doc, _index=document_index._name).to_dict(True)

    for success, info in parallel_bulk(connection, generator()):
        # 'success' is a boolean
        # 'info' has stuff like:
        #  - info["index"]["error"]
        #  - info["index"]["_shards"]["successful"]
        #  - info["index"]["_shards"]["failed"]
        pass

    if update:
        # When you do an update, Elasticsearch will internally delete the
        # previous docs (based on the _id primary key we set).
        # Normally, Elasticsearch will do this when you restart the cluster
        # but that's not something we usually do.
        # See https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
        document_index.forcemerge()
    else:
        # Now we're going to bundle the change to set the alias to point
        # to the new index and delete all old indexes.
        # The reason for doing this together in one update is to make it atomic.
        alias_updates = [
            {"add": {"index": document_index._name, "alias": PREFIX}}
        ]
        for index_name in connection.indices.get_alias():
            if index_name.startswith(f"{PREFIX}_"):
                if index_name != document_index._name:
                    alias_updates.append({"remove_index": {"index": index_name}})
        connection.indices.update_aliases({"actions": alias_updates})

    print("All done!")



def to_search(file: Path, _index=None):
    with open(file) as f:
        data = json.load(f)
    return MyDocument(
        _index=_index,
        _id=data["identifier"],
        title=data["title"],
        body=data["body"]
    )

A lot is left to the reader as an exercise to fill in, but these are the most important operations. It demonstrates how you can:

  1. Correctly create indexes
  2. Atomically create an alias and clean up old indexes (and aliases)
  3. Add to an existing index

After you've run this you'll see something like this:

$ curl http://localhost:9200/_cat/indices?v
...
health status index                   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   myprefix_20210514141421 vulVt5EKRW2MNV47j403Mw   1   1      11629            0     28.7mb         28.7mb

$ curl http://localhost:9200/_cat/aliases?v
...
alias    index                   filter routing.index routing.search is_write_index
myprefix myprefix_20210514141421 -      -             -              -

Searching

When it comes to using the index, well, it depends on where your code for that is. For example, on MDN Web Docs, the code that searches the index is in an entirely different code-base. It's incidentally Python (and elasticsearch-dsl) in both places but other than that they have nothing in common. So for the searching, you need to manually make sure you write down the name of the index (or name of the alias if you prefer) into the code that searches. For example:


from elasticsearch_dsl import Search

def search(params):
    search_query = Search(index=settings.SEARCH_INDEX_NAME)

    # Do stuff to 'search_query' based on 'params'

    response = search_query.execute()   
    for hit in response:
        # ...

If you're within the same code that has that models.MyDocument in the first example code above, you can simply do things like this:


from elasticsearch_dsl import Index
from elasticsearch_dsl.connections import connections

from .models import PREFIX


def analyze(
    url: str,
    text: str,
    analyzer: str,
):
    connections.create_connection(hosts=[url])
    index = Index(PREFIX)
    analysis = index.analyze(body={"text": text, "analyzer": analyzer})
    # ...

What English stop words overlap with JavaScript reserved keywords?

May 7, 2021
2 comments JavaScript, MDN

The list of stop words in Elasticsearch is:

a, an, and, are, as, at, be, but, by, for, if, in, into, 
is, it, no, not, of, on, or, such, that, the, their, 
then, there, these, they, this, to, was, will, with

The list of JavaScript reserved keywords is:

abstract, arguments, await, boolean, break, byte, case, 
catch, char, class, const, continue, debugger, default, 
delete, do, double, else, enum, eval, export, extends, 
false, final, finally, float, for, function, goto, if, 
implements, import, in, instanceof, int, interface, let, 
long, native, new, null, package, private, protected, 
public, return, short, static, super, switch, synchronized, 
this, throw, throws, transient, true, try, typeof, var, 
void, volatile, while, with, yield

That means that the overlap is:

for, if, in, this, with

And the remainder of the English stop words is:

a, an, and, are, as, at, be, but, by, into, is, it, no, 
not, of, on, or, such, that, the, their, then, there, 
these, they, to, was, will
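
If you want to double-check the overlap, here's a quick sketch that computes both lists from the words above:

// Both word lists copied from above.
const stopWords = `a an and are as at be but by for if in into is it no not
of on or such that the their then there these they this to was will with`
  .split(/\s+/);

const reserved = new Set(`abstract arguments await boolean break byte case
catch char class const continue debugger default delete do double else enum
eval export extends false final finally float for function goto if implements
import in instanceof int interface let long native new null package private
protected public return short static super switch synchronized this throw
throws transient true try typeof var void volatile while with yield`.split(/\s+/));

console.log(stopWords.filter((word) => reserved.has(word)));
// [ 'for', 'if', 'in', 'this', 'with' ]
console.log(stopWords.filter((word) => !reserved.has(word)));
// ...the remaining English stop words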

Why does this matter? It matters when you're writing a search engine on English text that is about JavaScript. Such as MDN Web Docs. At the time of writing, you can search for "this" because there's a special case explicitly for that word. But you can't search for "for", which is unfortunate.

But there's more! I think certain prototype words should also be treated as "reserved" because they are important JavaScript words that should not be treated as stop words. For example...