
When Docker is too slow, use your host

January 11, 2018
3 comments Web development, Django, macOS, Docker

I have a side-project that is basically a React frontend, a Django API server and a Node universal React renderer. The killer feature is its Elasticsearch database that searches almost 2.5M large texts and 200K named objects. All the data is stored in a PostgreSQL database and there's some Python code that copies that stuff over to Elasticsearch for indexing.

Timings for searches in Songsearch
The PostgreSQL database is about 10GB and the Elasticsearch (version 6.1.0) indices are about 6GB. It's moderately big, and even though individual searches take, on average, ~75ms in production, it's hefty. At least for a side-project.

On my MacBook Pro laptop, I use Docker for development. Docker makes it really easy to run one command that starts memcached, Django, an AWS Product API Node app, create-react-app for the search and a separate create-react-app for the stats web app.

At first I tried to run PostgreSQL and Elasticsearch in Docker too, but after many attempts I just gave up. It was too slow. Elasticsearch kept crashing even after I raised the memory I give Docker to 4GB.

This very blog (www.peterbe.com) has a similar stack. Redis, PostgreSQL, Elasticsearch all running in Docker. It works great. One single docker-compose up web starts everything I need. But when it comes to much larger databases, I found my macOS host to be much more performant.

So the dark side of this is that I have to remember to do more things when starting work on this project. My PostgreSQL was installed with Homebrew and is always running on my laptop. For Elasticsearch I have to open a dedicated terminal and go to a specific location to start Elasticsearch for this project (e.g. make start-elasticsearch).

The way I do this is that I have this in my Django project's settings.py:


import dj_database_url
from decouple import config, Csv


DATABASES = {
    'default': config(
        'DATABASE_URL',
        # Hostname 'docker.for.mac.host.internal' assumes
        # you have at least Docker 17.12.
        # For older versions of Docker use 'docker.for.mac.localhost'
        default='postgresql://peterbe@docker.for.mac.host.internal/songsearch',
        cast=dj_database_url.parse
    )
}

ES_HOSTS = config('ES_HOSTS', default='docker.for.mac.host.internal:9200', cast=Csv())

(Actually, in reality the defaults in the settings.py code are localhost and I use docker-compose.yml environment variables to override this, but the point is hopefully still there.)
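
Just to illustrate how that override behaves, here's a rough sketch (not from the project): python-decouple's config() only falls back to the default when the environment variable isn't set, so docker-compose.yml can inject a different DATABASE_URL without touching settings.py.

# Sketch only: the environment variable wins over the default.
import os

from decouple import config

os.environ['DATABASE_URL'] = (
    'postgresql://peterbe@docker.for.mac.host.internal/songsearch'
)
print(config(
    'DATABASE_URL',
    default='postgresql://peterbe@localhost/songsearch',
))
# -> postgresql://peterbe@docker.for.mac.host.internal/songsearch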

And that's basically it. Now I get Docker to do what various virtualenvs and terminal scripts used to do, but with the performance of running the big databases directly on the host.

How to rotate a video on OSX with ffmpeg

January 3, 2018
5 comments Linux, macOS

Every now and then, I take a video with my iPhone and even though I hold the camera in landscape mode, the video gets recorded in portrait mode. Probably because it somehow started in portrait and didn't notice that I rotated the phone.

So I'm stuck with a 90° video. Here's how I rotate it:

ffmpeg -i thatvideo.mov -vf "transpose=2" ~/Desktop/thatvideo.mov

then I check that ~/Desktop/thatvideo.mov looks like it should.

I can't remember where I got this command originally but I've been relying on my bash history for a looong time, so it's best to write this down.
The "transpose=2" means 90° counterclockwise. "transpose=1" means 90° clockwise.

What is ffmpeg??

If you're here because you Googled it and you don't know what ffmpeg is, it's a command line program that lets you "programmatically" do almost anything to videos, such as converting between formats, overlaying text, and chopping and trimming videos. To install it, install Homebrew and then type:

brew install ffmpeg

How's My WiFi?

December 8, 2017
2 comments macOS, JavaScript, Node

This was one of those late-evening-after-the-kids-are-asleep projects. Followed by some next-morning-sober-readme-fixes-and-npmjs-paperwork.

It's a little Node script that will open https://fast.com with puppeteer and record, using document.querySelector('#speed-value'), what my current Internet speed is according to that app. It currently only works on OSX but it should be easy to fix for someone handy on Linux or Windows.
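
The real thing is a Node script built on puppeteer, but just to illustrate the idea, here's a rough Python sketch of the same approach using Playwright (my substitution, not what howsmywifi actually uses):

# Not howsmywifi itself (that's Node + puppeteer); a rough Python/Playwright
# sketch of the same idea: load fast.com and read out #speed-value.
import time

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://fast.com')
    time.sleep(15)  # give fast.com a while to settle on a number
    print('Speed:', page.text_content('#speed-value'))
    browser.close()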

You can either run it just once and get a readout. That's basically as useful as opening fast.com in a new browser tab.
The other way is to run it in a loop with howsmywifi --loop and sit and watch as it tries to figure out what your Internet speed is after multiple measurements.

Screenshot

That's it!

The whole point of this was for me to get an understanding of what my Internet speed is and whether I'm being screwed by Comcast. The measurements are very erratic and they might sporadically depend on channel noise on the WiFi or just packet crowding when other devices are overcrowding the pipes with heavy traffic such as video chatting or watching movies or whatever.

I've seen 98 Mbps with my iPhone on this network. Not so much today.

And Screenshots!

As a bonus, it will take a screenshot (if you pass the --screenshots flag) of the fast.com page each time it has successfully measured. Not sure what to do with this. If you have ideas, let me know.

Yet another Docker 'A ha!' moment

November 5, 2017
2 comments macOS, Docker

tl;dr; To build once and run Docker containers with different files, use a volume mount. If that's not an option, like in CircleCI, avoid the volume mount and rely on a container build every time.

What the heck is a volume mount anyway?

Laugh all you like but after almost a year of using Docker I'm still learning the basics. Apparently. This, now, feels laughable but there's a small chance someone else stumbles like I did, and they might appreciate this.

If you have a volume mounted for a service in your docker-compose.yml it will basically take whatever you mount and lay that on top of what was in the Docker container. Doing a volume mount into the same working directory as your container is totally common. When you do that, the files on the host (the files/directories mounted) get used on each run. If you don't do that, you're stuck with the files from your host as they were the last time you built the image.

Consider...:

# Dockerfile
FROM python:3.6-slim
LABEL maintainer="mail@peterbe.com"
COPY . /app
WORKDIR /app
CMD ["python", "run.py"]

and...:


#!/usr/bin/env python
# run.py
if __name__ == '__main__':
    print("hello!")

Let's build it:

$ docker image build -t test:latest .
Sending build context to Docker daemon   5.12kB
Step 1/5 : FROM python:3.6-slim
 ---> 0f1dc0ba8e7b
Step 2/5 : LABEL maintainer "mail@peterbe.com"
 ---> Using cache
 ---> 70cf25f7396c
Step 3/5 : COPY . /app
 ---> 2e95935cbd52
Step 4/5 : WORKDIR /app
 ---> bc5be932c905
Removing intermediate container a66e27ecaab3
Step 5/5 : CMD python run.py
 ---> Running in d0cf9c546fee
 ---> ad930ce66a45
Removing intermediate container d0cf9c546fee
Successfully built ad930ce66a45
Successfully tagged test:latest

And run it:

$ docker container run test:latest
hello!

So basically my little run.py got copied into the container by the Dockerfile. Let's change the file:

$ sed -i.bak s/hello/allo/g run.py
$ python run.py
allo!

But it won't run like that if we run the container again:

$ docker container run test:latest
hello!

So, the container is now running a Python file from back when the image was built. Two options:

1) Rebuild, or
2) Volume mount in the host directory

This is it! That's your choice.

Rebuild might take time. So, let's mount the current directory from the host:

$ docker container run -v `pwd`:/app test:latest
allo!

So yay! Now it runs the container with the latest file from my host directory.

The dark side of volume mounts

So, if it's more convenient to "refresh the files in the container" with a volume mount instead of container rebuild, why not always do it for everything?

For one thing, there might be files built inside the container that cease to be visible if you override that workspace with your own volume mount. For example, a node_modules directory installed when the image was built gets hidden if you mount your host directory on top of it.

The other crucial thing I learned the hard way (it seems so obvious now!) is that there isn't always a host directory to mount. In particular, in tecken we use a base ubuntu image, and in the run parts of the CircleCI configuration we were using docker-compose run ... with directives (in the docker-compose.yml file) that use volume mounts. So, the rather cryptic effect was that the files mounted into the container were not the files checked out from the git branch.

The resolution in this case was to be explicit when running Docker commands in CircleCI and only do a build followed by a run without a volume mount. In particular, for us it meant changing from docker-compose run frontend lint to docker-compose run frontend-ci lint. Basically, it's a separate directive in the docker-compose.yml file that is exclusive to CI.

In conclusion

I feel dumb for not seeing this clearly before.

The mistake that triggered me was that when I ran docker-compose run test test (the first test is the docker-compose directive, the second test is the name of the script sent to CMD) it didn't change the outputs when I edited the files in my editor. Adding a volume mount to that directive solved it for me locally on my laptop but didn't work in CircleCI for reasons (I can't remember exactly how it errored).

So now we have this:

# In docker-compose.yml

  frontend:
    build:
      context: .
      dockerfile: Dockerfile.frontend
    environment:
      - NODE_ENV=development
    ports:
      - "3000:3000"
      - "35729:35729"
    volumes:
      - $PWD/frontend:/app
    command: start

  # Same as 'frontend' but no volumes or command
  frontend-ci:
    build:
      context: .
      dockerfile: Dockerfile.frontend

"No space left on device" on OSX Docker

October 3, 2017
13 comments Web development, macOS, Docker

UPDATE 2020

As Greg Brown pointed out, the new way is:

docker container prune
docker image prune

Original blog post...

If you run out of disk space in your Docker containers on OSX, this is probably the best thing to run:

docker rm $(docker ps -q -f 'status=exited')
docker rmi $(docker images -q -f "dangling=true")

The Problem

This isn't the first time it's happened so I'm blogging about it to not forget. My postgres image in my docker-compose.yml didn't start, and since it's linked, its problem is "hidden". Running it in the foreground instead, you can see what the problem is:

▶ docker-compose run db
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
initdb: could not create directory "/var/lib/postgresql/data/pg_xlog": No space left on device
initdb: removing contents of data directory "/var/lib/postgresql/data"

Docker on OSX

I admit that I have so much to learn about Docker and the learning is slow. Docker is amazing but I think I'm slow to learn because I'm just not that interested as long as it works and I can work on my apps.

It seems to me that on OSX all the storage of all Docker containers lives in one big file. It's capped at 64GB:

▶ cd ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/

com.docker.docker/Data/com.docker.driver.amd64-linux
▶ ls -lh Docker.qcow2
-rw-r--r--@ 1 peterbe  staff    63G Oct  3 08:51 Docker.qcow2

If you run the above-mentioned commands (docker rm ...) this file does not shrink, but space is freed up inside it. Just like how MongoDB allocates (or used to allocate) much more disk space than it actually uses.

If you delete that Docker.qcow2 and restart Docker the space problem goes away but then the problem is that you lose all your active containers which is especially annoying if you have useful data in database containers.

Why didn't I know about machma?!

June 7, 2017
0 comments Linux, macOS, Go

"machma - Easy parallel execution of commands with live feedback"

This is so cool! https://github.com/fd0/machma

It's a command line program that makes it really easy to run any command line program in parallel. I.e. in separate processes with separate CPUs.

Something network bound

Suppose I have a file like this:

▶ wc -l urls.txt
      30 urls.txt

▶ cat urls.txt | head -n 3
https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/wntdll.pdb/D74F79EB1F8D4A45ABCD2F476CCABACC2/wntdll.sym
https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/firefox.pdb/448794C699914DB8A8F9B9F88B98D7412/firefox.sym
https://s3-us-west-2.amazonaws.com/org.mozilla.crash-stats.symbols-public/v1/d2d1.pdb/CB8FADE9C48E44DA9A10B438A33114781/d2d1.sym

If I wanted to download all of these files with wget the traditional way would be:

▶ time cat urls.txt | xargs wget -q -P ./downloaded/
cat urls.txt  0.00s user 0.00s system 53% cpu 0.005 total
xargs wget -q -P ./downloaded/  0.07s user 0.24s system 2% cpu 14.913 total

▶ ls downloaded | wc -l
      30

▶ du -sh downloaded
 21M    downloaded

So it took 15 seconds to download 30 files that total 21MB.

Now, let's do it with machma instead:

▶ time cat urls.txt | machma -- wget -q -P ./downloaded/ {}
cat urls.txt  0.00s user 0.00s system 55% cpu 0.004 total
machma -- wget -q -P ./downloaded/ {}  0.53s user 0.45s system 12% cpu 7.955 total

That uses 8 separate processes (one per CPU, because my laptop has 8 CPUs).
Because 30 / 8 ~= 4, each process handles roughly 4 downloads.

But note, it took 15 seconds to download the 30 files synchronously. That's an average of 0.5s per file. The reason the parallel run takes about 8 seconds rather than roughly 4 x 0.5 = 2 seconds is that it's at the mercy of bad luck and some of those 30 downloads spiking a bit.
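
For comparison, this is roughly the fan-out idea machma gives you, sketched in Python with a worker pool (just an illustration, not how machma is implemented):

# A sketch of the same fan-out idea in Python; not machma internals.
import pathlib
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def download(url, dest=pathlib.Path('downloaded')):
    dest.mkdir(exist_ok=True)
    target = dest / url.rsplit('/', 1)[-1]
    urllib.request.urlretrieve(url, target)
    return target


with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

# one worker per CPU, mirroring the 8 processes machma used here
with ThreadPoolExecutor(max_workers=8) as pool:
    for saved in pool.map(download, urls):
        print('downloaded', saved)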

Something CPU bound

Now let's do something really CPU intensive: Guetzli compression.

▶ ls images | wc -l
  7

▶ time find images -iname '*.jpg' | xargs -I {} guetzli --quality 85 {} compressed/{}
find images -iname '*.jpg'  0.00s user 0.00s system 40% cpu 0.009 total
xargs -I {} guetzli --quality 85 {} compressed/{}  35.74s user 0.68s system 99% cpu 36.560 total

And now the same but with machma:

▶ time find images -iname '*.jpg' | machma -- guetzli --quality 85 {} compressed/{}

processed 7 items (0 failures) in 0:10
find images -iname '*.jpg'  0.00s user 0.00s system 51% cpu 0.005 total
machma -- guetzli --quality 85 {} compressed/{}  58.47s user 0.91s system 546% cpu 10.857 total

Basically, it took only 11 seconds. This time there were fewer images (7) than there were CPUs (8), so basically the poor computer is doing super intensive CPU (and memory) work across all CPUs at the same time. The average time for each of these files is ~5 seconds, so it's really interesting that even with parallel execution, instead of taking a total of ~5 seconds it took almost double that.

In conclusion

Such a handy tool to have around for command line stuff. I haven't looked at its code much but it's almost a shame that the project only has 300+ GitHub stars. Perhaps because it's kinda complete and doesn't need much more work.

Also, if you attempt all the examples above you'll notice that when you use the ... | xargs ... approach the stdout and stderr are a mess. For wget, that's why I used -q to silence it a bit. With machma you get really pleasant color-coded live output that tells you the state of the queue, possible failures and an ETA.

Experimenting with Guetzli

May 24, 2017
0 comments Linux, Web development, macOS

tl;dr; Guetzli, the new JPEG compression program from Google, can save quite a few bytes with little loss of quality.

Inspired by this blog post about Guetzli I thought I'd try it out with something that's relevant to my project, 300x300 JPGs that can be heavily compressed.

So I installed it (with Homebrew) on my MacBook Pro (late 2013) and picked 7 JPGs I had, and use in SongSearch. Which is interesting because these JPEGs have already been compressed once. They come from converting much larger PNGs with PIL (Pillow) at a quality rating of 80%. In other words, this is Guetzli on top of PIL.

I ran one iteration for every image for the following qualities: 85%, 90%, 95%, 99%, 100%.

The results on the size are as follows:

Image      Average Size (bytes)   % Smaller
original   23497.0                0
85%        16025.4                32%
90%        18829.4                20%
95%        21338.1                9.2%
99%        22705.3                3.4%
100%       22919.7                2.5%

So, for example, if you choose the 90% quality you save, on average, 4,667B (4.6KB).

As you might already know, Guetzli is incredibly memory hungry and very, very slow. Each image compression took, on average, 4-6 seconds (higher quality, shorter times). Meaning, if you like Guetzli you probably need to build around it so that the compression happens in a build step or asynchronously somewhere, and ideally you don't want to run too many compressions in parallel as that might overload CPU and memory.
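
To make that concrete, here's a sketch (not from any real build step of mine) of shelling out to guetzli from a small, bounded worker pool so only a couple of compressions run at once; the paths and pool size are made up.

# Sketch: run Guetzli with a bounded pool to avoid CPU/memory overload.
import pathlib
import subprocess
from concurrent.futures import ThreadPoolExecutor


def compress(source, quality=90, out_dir=pathlib.Path('compressed')):
    out_dir.mkdir(exist_ok=True)
    destination = out_dir / source.name
    subprocess.run(
        ['guetzli', '--quality', str(quality), str(source), str(destination)],
        check=True,
    )
    return destination


images = sorted(pathlib.Path('images').glob('*.jpg'))

# max_workers=2 keeps the CPU and memory appetite in check
with ThreadPoolExecutor(max_workers=2) as pool:
    for result in pool.map(compress, images):
        print('wrote', result)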

Now, how does it look?

Go to https://codepen.io/peterbe/pen/rmPMpm and stare at the screen to see if you can A) see which one is more compressed and B) if the one that is more compressed is too low quality.

What do you think?

Is it worth it?

Is the quality drop too much to save 10% on image sizes?

Please share your thoughts. Perhaps we can re-do this experiment with some slightly larger JPGs.

Time to do concurrent CPU bound work

May 13, 2016
3 comments Python, Linux, macOS

Did you see my blog post about Decorated Concurrency - Python multiprocessing made really really easy? If not, fear not. There, I'm demonstrating how I take a task of creating 100 thumbnails from a large JPG. First in serial, then concurrently, with a library called deco. The total time to get through the work massively reduces when you do it concurrently. No surprise. But what's interesting is that each individual task takes a lot longer. Instead of 0.29 seconds per image it took 0.65 seconds per image (...inside each dedicated processor).

The simple explanation, even from a layman like myself, must be that when doing so much more, concurrently, the whole operating system struggles to keep up with other little subtle tasks.

With deco you can either let Python's multiprocessing just use as many CPUs as your computer has (8 in the case of my Macbook Pro) or you can manually set it. E.g. @concurrent(processes=5) would spread the work across a max of 5 CPUs.

So, I ran my little experiment again for every number from 1 to 8 and plotted the results:

Time elapsed vs. work time

What to take away...

The blue bars show the time it takes, in total, from starting the program till the program ends. The lower the better.

The red bars show the time it takes, in total, to complete each individual task.

Meaning, when the number of CPUs is low you have to wait longer for all the work to finish, and when the number of CPUs is high the computer needs more time to finish its work. This is an insight into over-use of operating system resources.

If the work is much, much more demanding than this experiment (the JPG is only 3.3MB and one thumbnail only takes 0.3 seconds to make) you might have a red bar on the far right that is too expensive for your server. Or worse, it might break things so that everything stops.

In conclusion...

Choose wisely. Be aware how "bound" the task is.

Also, remember that if the work of each individual task is too "light", the overhead of messing with multiprocessing might actually cost more than it's worth.
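
Here's a toy illustration of that last point (my example, not part of the original experiment): for work this trivial, the process pool's startup and pickling overhead typically dwarfs any gain from parallelism.

# Toy example: trivial per-task work, where multiprocessing overhead dominates.
import time
from multiprocessing import Pool


def square(n):
    return n * n


if __name__ == '__main__':
    numbers = list(range(100000))

    t0 = time.time()
    serial = [square(n) for n in numbers]
    print('serial took', time.time() - t0)

    t0 = time.time()
    with Pool(8) as pool:
        parallel = pool.map(square, numbers)
    print('multiprocessing took', time.time() - t0)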

The code

Here's the messy code I used:


import time
from PIL import Image
from deco import concurrent, synchronized
import sys

processes = int(sys.argv[1])
assert processes >= 1
assert processes <= 8


@concurrent(processes=processes)
def slow(times, offset):
    t0 = time.time()
    path = '9745e8.jpg'
    img = Image.open(path)
    size = (100 + offset * 20, 100 + offset * 20)
    img.thumbnail(size, Image.ANTIALIAS)
    img.save('thumbnails/{}.jpg'.format(offset), 'JPEG')
    t1 = time.time()
    times[offset] = t1 - t0


@synchronized
def run(times):
    for index in range(100):
        slow(times, index)

t0 = time.time()
times = {}
run(times)
t1 = time.time()
print "TOOK", t1-t0
print "WOULD HAVE TAKEN", sum(times.values())

UPDATE

I just wanted to verify that the experiment is valid, i.e. that it proves CPU bound work hogs resources across CPUs in a way that affects their individual performance.

Let's try a similar but totally different workload: a network bound task. This time, instead of resizing JPEGs, it waits for HTTP GET requests to finish.

Network bound

So clearly it makes sense. The individual work within each process is not generally slowed down much. A tiny bit, but not much. Also, I like the smoothness of the curve of the blue bars going from left to right. You can clearly see that it's reverse logarithmic.

.git/info/exclude, .gitignore and ~/.gitignore_global

April 20, 2016
4 comments Linux, macOS

How did I not know about this until now?! .git/info/exclude is like .gitignore but yours to mess with. Thanks @willkg!

There are three ways to tell Git to ignore files.

.gitignore

A file you check in to the project. It's shared amongst developers on the project. It's just a plain text file where you write one line per file pattern that Git should not ask "Have you forgotten to check this in?" about.

Certain things that are good to put in there are...:

node_modules/
*.py[co]
.coverage

Ideally, this file should be as small as possible and every entry should confidently be something 100% of the developers on the team will want to ignore. If your particular editor has some convention for storing state or revision files, that does not belong in this file.

A reason to keep it short is purity and simplicity: every edit of this file will require a git commit.

~/.gitignore_global

This is yours to keep and maintain. The file doesn't have to be in your home directory. (The ~/ is UNIX nomenclature for your OS user home directory). You can set it to be anything. Like:

$ git config --global core.excludesfile ~/projects/dotfiles/gitignore-global.txt

Here you put stuff you want to personally ignore in every Git project. New and old.

Good examples of things to put in it are...:

*~
.DS_Store
.env
settings/local.py
pip-log.txt

.git/info/exclude

This is kind of a mix between the two above-mentioned ignore files. This is for things only you want to ignore in a specific project. More or less "junk files" specific to a project. For example, if you, in your Git clone, have some test scripts or a specific log file.

Suppose you have a little hack script or some specific config that is only applicable to the project at hand; this is where you add it. For example...:

run_webapp_uwsgi.sh
analyze_correlation_json_dumps.py

I hope this helps someone else who, like me, didn't know about .git/info/exclude until 2016.

Ctags in Atom on OSX

February 26, 2016
0 comments Web development, macOS

Symbols View setting page
In Atom, by default there's a package called symbols-view. It basically allows you to search for particular functions, classes, variables etc. Most often not by typing but by searching based on whatever word the cursor is currently on.

With this all installed and set up I can now press Cmd-alt-Down and it automatically jumps to the definition of that thing. If the result is ambiguous (e.g. two functions called get_user_profile) it'll throw up the usual search dialog at the top.

To have this set up you need to use something called ctags. It's a command line tool.

This Stack Overflow post helped tremendously. The ctags I had installed was something else (presumably put there by installing emacs). So I did:

$ brew install ctags

And then added

alias ctags="`brew --prefix`/bin/ctags"

...in my ~/.bash_profile

Now I can run ctags -R . and it generates a binary'ish file called tags in the project root.

However, the index of symbols in a project varies greatly between branches. So I need a different tags file for each branch. How to do that? By hijacking the .git/hooks/post-checkout hook.

Now, this is where things get interesting. Every project has "junk": stuff in your project that isn't files you're likely to edit. So you'll need to list those exclusions one by one. Anyway, here's what my post-checkout looks like:


#!/bin/bash

set -x

ctags -R \
  --exclude=build \
  --exclude=.git \
  --exclude=webapp-django/static \
  --exclude=webapp-django/node_modules \
  .

This'll be run every time I check out a branch, e.g. git checkout master.