Peterbe.com

Peter Bengtsson's blog

Filtered by JavaScript, Python

Page 28

Best non-cryptographic hashing function in Python (size and speed)

February 21, 2015
11 comments Python

First of all; hashing is hard. But fortunately it gets a little bit easier if it doesn't have to cryptographic. A non-cryptographic hashing function is basically something that takes a string and converts it to another string in a predictable fashion and it tries to do it with as few clashes as possible and as fast as possible.

MD5 is a non-cryptographic hashing function. Unlike things like sha256 or sha512 the MD5 one is a lot more predictable.

Now, how do you make a hashing function that yields a string that is as short as possible? The simple answer is to make the output use as many different characters as possible. If a hashing function only returns integers you only have 10 permutations per character. If you instead use a-z and A-Z and 0-9 you now have 26 + 26 + 10 permutations per character.

A hex on the other hand only uses 0-9 and a-f which is only 10 + 6 permutations. So you need a longer string to be sure it's unique and can't clash with another hash output. Git for example uses a 40 character log hex string to prepresent a git commit. GitHub is using an appreviated version of that in some of the web UI of only 7 characters which they get away with because things are often in a context of a repo name or something like that. For example github.com/peterbe/django-peterbecom/commit/462ae0c

So, what other choices do you have when it comes to returning a hash output that is sufficiently long that it's "almost guaranteed" to be unique but sufficiently short that it becomes practical in terms of storage space? I have an app for example that turns URLs into unique IDs because they're shorter that way and more space efficient to store as values in a big database. One such solution is to use a base64 encoding.

Base64 uses a-zA-Z0-9 but you'll notice it doesn't have the "hashing" nature in that it's just a direct translation character by character. E.g.


>>> base64.encodestring('peterbengtsson')
'cGV0ZXJiZW5ndHNzb24=\n'
>>> base64.encodestring('peterbengtsson2')
'cGV0ZXJiZW5ndHNzb24y\n'

I.e. these two strings are different but suppose you were to take only the first 10 characters these would be the same. Basically, here's a terrible hashing function:


def hasher(s):  # this is not a good hashing function
    return base64.encodestring(s)[:10]

So, what we want is a hashing function that returns output that is short and very rarely clashing and does this as fast as possible.

To test this I wrote a script that tried a bunch of different ad-hoc hashing functions. I generate a list of 130,000+ different words with an average length of 15 characters. Then I loop over these words until a hashed output is repeated for a second time. And for each, I take the time it takes to generate the 130,000+ hashes and I multiply that with the total number of bytes. For example, if the hash output is 9 characters each in length that's (130000 * 9) / 1024 ~= 1142Kb. And if it took 0.25 seconds to generate all of those the combined score is 1142 * 0.24 ~= 286 bytes second.

Anyway, here are the results:

h11 100.00  0.217s     1184.4 Kb   257.52 kbs
h6  100.00  1.015s  789.6 Kb    801.52 kbs
h10 100.00  1.096s  789.6 Kb    865.75 kbs
h1  100.00  0.215s  4211.2 Kb   903.46 kbs
h4  100.00  1.017s  921.2 Kb    936.59 kbs

(kbs means "kilobytes seconds")

These are the functions that returned 0 clashes amongst 134,758 unique words. There were others too that I'm not bothering to include because they had clashes. So let's look at these functions:



def h11(w):
    return hashlib.md5(w).hexdigest()[:9]

def h6(w):
    h = hashlib.md5(w)
    return h.digest().encode('base64')[:6]

def h10(w):
    h = hashlib.sha256(w)
    return h.digest().encode('base64')[:6]

def h1(w):
    return hashlib.md5(w).hexdigest()

def h4(w):
    h = hashlib.md5(w)
    return h.digest().encode('base64')[:7]

It's kinda arbitrary to say the "best" one is the one that takes the shortest time multipled by size. Perhaps the size matters more to you in that case the h6() function is better because it returns 6 character strings instead of 9 character strings in h11.

I'm apprehensive about publishing this blog post because I bet I'm doing this entirely wrong. Perhaps there are better ways to digest a hashing function that returns strings that don't need to be base64 encoded. I just haven't found any in the standard library yet.

Almost premature optimization

January 2, 2015
0 comments Python, Web development, Django

In airmozilla the tests almost all derive from one base class whose tearDown deletes the automatically generated settings.MEDIA_ROOT directory and everything in it.

Then there's some code that makes sure a certain thing from the fixtures has a picture uploaded to it.

That means it has do that shutil.rmtree(directory) and that shutil.copy(src, dst) on almost every single test. Some might also not need or depend on it but it's conveninent to put it here.

Anyway, I thought this is all a bit excessive and I could probably optimize that by defining a custom test runner that is first responsible for creating a clean settings.MEDIA_ROOT with the necessary file in it and secondly, when the test suite ends, it deletes the directory.

But before I write that, let's measure how many gazillion milliseconds this is chewing up.

Basically, the tearDown was called 361 times and the _upload_media 281 times. In total, this adds to a whopping total of 0.21 seconds! (of the total of 69.133 seconds it takes to run the whole thing).

I think I'll cancel that optimization idea. Doing some light shutil operations are dirt cheap.

AJAX or not

December 22, 2014
1 comment Web development, AngularJS, JavaScript

From the It-Depends-on-What-You're-Building department.

As a web developer you have a job:

Display a certain amount of database data on the screen
Do it as fast as possible

The first point is these days easily taken care of with the likes of Django or Rails which makes it über easy to write queries that you then use in templates to generate the HTML and voila you have a web page.

The second point is taken care of with a myriad of techniques. It's almost a paradox. The fastest way to render something on the screen is to generate everything on the server and send it wholesome. It means the browser can very quickly (and boosted by GPU) render something on the screen. But if you have a lot of data that needs to be displayed it's often better to send just a little bit of HTML and then let some Javascript kick in and take care of extracting the rest of the information using AJAX.

Here I have prepared three different versions of ways to display a bunch of information on the screen:

https://www.peterbe.com/ajaxornot/

What you should note and take away from this little experimental playground:

All server-side work is done in Django but it's served straight out of memcache so it should be fast server-side.
The content is NOT important. It's just a list of blog posts and their categories and keywords.
To make it somewhat realistic, each version needs to 1) display a JPG and 2) have a Javascript onclick event that throws a confirm() dialog box.
The AngularJS version loads significantly slower but it's not because AngularJS is slow, but because it's able to do so much more later. Loading a Javascript framework is like an investment. Big cost upfront and small cost later when you need more magic to happen without having a complete server refresh.
View 1, 2 and 3 are all three imperfect versions but they illustrate the three major groups of solving the problem stated at the top of this blog post. The other views are attempts of optimizations.
Clearly the "visually fastest" version is the optimization version 5 which is a fork of version 2 which loads, on the server-side, everything that is above the fold and then take care of the content below the fold with AJAX.
See this visual comparison
Optimization version 4 was a silly optimization. It depends on the fact that JSON is more "compact" than HTML. When you Gzip the content, the difference in size doesn't matter anymore. However, it's an interesting technique because it means you can do all business logic rendering stuff in one language without having to depend on AJAX.
Open the various versions in your browser and try to "feel" how pages the load. Ask your inner gutteral heart which version you prefer; do you prefer a completely blank screen and a browser loading spinner or do you prefer to see some skeleton structure first whilst waiting for the bulk content comes in?
See this as a basis of thoughts and demonstration. Remember the very first sentence in this blog post.

One-way data bindings in AngularJS 1.3

December 11, 2014
1 comment AngularJS, JavaScript

You might have heard that AngularJS 1.3 has "one-time bindings" which is that you can print the value of a scope variable with {{ ::somevar }} and that this is really good for performance because it means that once rendered it doesn't add to the list of things that the angular app needs to keep worrying about. I.e. it's one less thing to watch.

But what's a good use case of this? This is a good example.

Because ng-if="true" will cause the DOM element to be re-created it will go back to the scope variable and re-evaluate it.

To then() or to success() in AngularJS

November 27, 2014
19 comments AngularJS, JavaScript

By writing this I'm taking a risk of looking like an idiot who has failed to read the docs. So please be gentle.

AngularJS uses a promise module called $q. It originates from this beast of a project.

You use it like this for example:


angular.module('myapp')
.controller('MainCtrl', function($scope, $q) {
  $scope.name = 'Hello ';
  var wait = function() {
    var deferred = $q.defer();
    setTimeout(function() {
      // Reject 3 out of 10 times to simulate 
      // some business logic.
      if (Math.random() > 0.7) deferred.reject('hell');
      else deferred.resolve('world');
    }, 1000);
    return deferred.promise;
  };

  wait()
  .then(function(rest) {
    $scope.name += rest;
  })
  .catch(function(fallback) {
    $scope.name += fallback.toUpperCase() + '!!';
  });
});

Basically you construct a deferred object and return its promise. Then you can expect the .then and .catch to be called back if all goes well (or not).

There are other ways you can use it too but let's stick to the basics to drive home this point to come.

Then there's the $http module. It's where you do all your AJAX stuff and it's really powerful. However, it uses an abstraction of $q and because it is an abstraction it renames what it calls back. Instead of .then and .catch it's .success and .error and the arguments you get are different. Both expose a catch-all function called .finally. You can, if you want to, bypass this abstraction and do what the abstraction does yourself. So instead of:


$http.get('https://api.github.com/users/peterbe/gists')
.success(function(data) {
  $scope.gists = data;
})
.error(function(data, status) {
  console.error('Repos error', status, data);
})
.finally(function() {
  console.log("finally finished repos");
});

...you can do this yourself...:


$http.get('https://api.github.com/users/peterbe/gists')
.then(function(response) {
  $scope.gists = response.data;
})
.catch(function(response) {
  console.error('Gists error', response.status, response.data);
})
.finally(function() {
  console.log("finally finished gists");
});

It's like it's built specifically for doing HTTP stuff. The $q modules doesn't know that the response body, the HTTP status code and the HTTP headers are important.

However, there's a big caveat. You might not always know you're doing AJAX stuff. You might be using a service from somewhere and you don't care how it gets its data. You just want it to deliver some data. For example, suppose you have an AJAX request cached so that only the first time it needs to do an HTTP GET but all consecutive times you can use the stuff already in memory. E.g. Something like this:


angular.module('myapp')
.controller('MainCtrl', function($scope, $q, $http, $timeout) {

  $scope.name = 'Hello ';
  var getName = function() {
    var name = null;
    var deferred = $q.defer();
    if (name !== null) deferred.resolve(name);
    $http.get('https://api.github.com/users/peterbe')
    .success(function(data) {
      deferred.resolve(data.name);
    }).error(deferred.reject);
    return deferred.promise;
  };

  // Even though we're calling this 3 different times
  // you'll notice it only starts one AJAX request.
  $timeout(function() {
    getName().then(function(name) {
      $scope.name = "Hello " + name;
    });    
  }, 1000);

  $timeout(function() {
    getName().then(function(name) {
      $scope.name = "Hello " + name;
    });    
  }, 2000);

  $timeout(function() {
    getName().then(function(name) {
      $scope.name = "Hello " + name;
    });    
  }, 3000);
});

And with all the other promise frameworks laying around like jQuery's you will sooner or later forget if it's success() or then() or done() and your goldfish memory (like mine) will cause confusion and bugs.

So is there a way to make $http.<somemethod> return a $q like promise but with the benefit of the abstractions that the $http layer adds?

Here's one such possible solution maybe:


var app = angular.module('myapp');

app.factory('httpq', function($http, $q) {
  return {
    get: function() {
      var deferred = $q.defer();
      $http.get.apply(null, arguments)
      .success(deferred.resolve)
      .error(deferred.resolve);
      return deferred.promise;
    }
  }
});

app.controller('MainCtrl', function($scope, httpq) {

  httpq.get('https://api.github.com/users/peterbe/gists')
  .then(function(data) {
    $scope.gists = data;
  })
  .catch(function(data, status) {
    console.error('Gists error', response.status, response.data);
  })
  .finally(function() {
    console.log("finally finished gists");
  });
});

That way you get the benefit of a one same way for all things that get you data some way or another and you get the nice AJAXy signatures you like.

This is just a prototype and clearly it's not generic to work with any of the shortcut functions in $http like .post(), .put() etc. That can maybe be solved with a Proxy object or some other hack I haven't had time to think of yet.

So, what do you think? Am I splitting hairs or is this something attractive?

A "perma search" in AngularJS

November 18, 2014
0 comments AngularJS, JavaScript

A common thing in many (AngularJS) apps is to have an ng-model input whose content is used to as a filter on an ng-repeat somewhere within the page. Something like this:


<input ng-model="search">
<div ng-repeat="item in items | filter:search">...

Well, what if you want the search you make to automatically become part of the URL so that if you bookmark the search or copy the URL to someone else, the search is still there? It would be really practical. Granted, it's not always that you want this but that's something you can decide.

AngularJS 1.2 (I think) introduced the ability to set reloadOnSearch: false on a route provider and that means that you can do things like $location.hash('something') without it triggering the route provider to re-map the URL and re-start the revelant controller.

So here's a good example of (ab)using that to do a search filter which automatically updates the URL.

Check out the demo: https://www.peterbe.com/permasearch/index.html

This works in HTML5 mode too if you're wondering.

Suppose you use many more things in your filter function other than just a free text ng-modal. Like this:


<input type="text" ng-model="filters.search">
<select ng-model="filters.year">
<option value="">All</option>
<option value="2014">2014</option>
<option value="2013">2013</option>
</select>

You might have some checkboxes and stuff too. All you need to do then is to encode that information in the hash. Something like this might be a good start:


$scope.filters = {};
$scope.$watchCollection('filters', function(value) {
    $location.hash($.param(value)); // a jQuery function
});

And something like this to "unparse" the params.

uwsgi and uid

November 3, 2014
4 comments Python, Linux, Django

So recently, I moved home for this blog. It used to be on AWS EC2 and is now on Digital Ocean. I wanted to start from scratch so I started on a blank new Ubuntu 14.04 and later rsync'ed over all the data bit by bit (no pun intended).

When I moved this site I copied the /etc/uwsgi/apps-enabled/peterbecom.ini file and started it with /etc/init.d/uwsgi start peterbecom. The settings were the same as before:

# this is /etc/uwsgi/apps-enabled/peterbecom.ini
[uwsgi]
virtualenv = /var/lib/django/django-peterbecom/venv
pythonpath = /var/lib/django/django-peterbecom
user = django
master = true
processes = 3
env = DJANGO_SETTINGS_MODULE=peterbecom.settings
module = django_wsgi2:application

But I kept getting this error:

Traceback (most recent call last):
...
  File "/var/lib/django/django-peterbecom/venv/local/lib/python2.7/site-packages/django/db/backends/postgresql_psycopg2/base.py", line 182, in _cursor
    self.connection = Database.connect(**conn_params)
  File "/var/lib/django/django-peterbecom/venv/local/lib/python2.7/site-packages/psycopg2/__init__.py", line 164, in connect
    conn = _connect(dsn, connection_factory=connection_factory, async=async)
psycopg2.OperationalError: FATAL:  Peer authentication failed for user "django"

What the heck! I thought. I was able to connect perfectly fine with the same config on the old server and here on the new server I was able to do this:

django@peterbecom:~/django-peterbecom$ source venv/bin/activate
(venv)django@peterbecom:~/django-peterbecom$ ./manage.py shell
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from peterbecom.apps.plog.models import *
>>> BlogItem.objects.all().count()
1040

Clearly I've set the right password in the settings/local.py file. In fact, I haven't changed anything and I pg_dump'ed the data over from the old server as is.

I edit edited the file psycopg2/__init__.py and added a print "DSN=", dsn and those details were indeed correct.
I'm running the uwsgi app as user django and I'm connecting to Postgres as user django.

Anyway, what I needed to do to make it work was the following change:

# this is /etc/uwsgi/apps-enabled/peterbecom.ini
[uwsgi]
virtualenv = /var/lib/django/django-peterbecom/venv
pythonpath = /var/lib/django/django-peterbecom
user = django
uid = django   # THIS IS ADDED
master = true
processes = 3
env = DJANGO_SETTINGS_MODULE=peterbecom.settings
module = django_wsgi2:application

The difference here is the added uid = django.

I guess by moving across (I'm currently on uwsgi 1.9.17.1-debian) I get a newer version of uwsgi or something that simply can't just take the user directive but needs the uid directive too. That or something else complicated to do with the users and permissions that I don't understand.

Hopefully, by having blogged about this other people might find it and get themselves a little productivity boost.

Go vs. Python

October 24, 2014
42 comments Python, Go

tl;dr; It's not a competition! I'm just comparing Go and Python. So I can learn Go.

So recently I've been trying to learn Go. It's a modern programming language that started at Google but has very little to do with Google except that some of its core contributors are staff at Google.

The true strength of Go is that it's succinct and minimalistic and fast. It's not a scripting language like Python or Ruby but lots of people write scripts with it. It's growing in popularity with systems people but web developers like me have started to pay attention too.

The best way to learn a language is to do something with it. Build something. However, I don't disagree with that but I just felt I needed to cover the basics first and instead of taking notes I decided to learn by comparing it to something I know well, Python. I did this a zillion years ago when I tried to learn ZPT by comparing it DTML which I already knew well.

My free time is very limited so I'm taking things by small careful baby steps. I read through An Introduction to Programming in Go by Caleb Doxey in a couple of afternoons and then I decided to spend a couple of minutes every day with each chapter and implement something from that book and compare it to how you'd do it in Python.

I also added some slightly more full examples, Markdownserver which was fun because it showed that a simple Go HTTP server that does something can be 10 times faster than the Python equivalent.

What I've learned

Go is very unforgiving but I kinda like it. It's like Python but with pyflakes switched on all the time.
Go is much more verbose than Python. It just takes so much more lines to say the same thing.
Goroutines are awesome. They're a million times easier to grok than Python's myriad of similar solutions.
In Python, the ability to write to a list and it automatically expanding at will is awesome.
Go doesn't have the concept of "truthy" which I already miss. I.e. in Python you can convert a list type to boolean and the language does this automatically by checking if the length of the list is 0.
Go gives you very few choices (e.g. there's only one type of loop and it's the for loop) but you often have a choice to pass a copy of an object or to pass a pointer. Those are different things but sometimes I feel like the computer could/should figure it out for me.
I love the little defer thing which means I can put "things to do when you're done" right underneath the thing I'm doing. In Python you get these try: ...20 lines... finally: ...now it's over... things.
The coding style rules are very different but in Go it's a no brainer because you basically don't have any choices. I like that. You just have to remember to use gofmt.
Everything about Go and Go tools follow the strict UNIX pattern to not output anything unless things go bad. I like that.
godoc.org is awesome. If you ever wonder how a built in package works you can just type it in after godoc.org like this godoc.org/math for example.
You don't have to compile your Go code to run it. You can simply type go run mycode.go it automatically compiles it and then runs it. And it's super fast.
go get can take a url like github.com/russross/blackfriday and just install it. No PyPI equivalent. But it scares me to depend on peoples master branches in GitHub. What if master is very different when I go get something locally compared to when I run go get weeks/months later on the server?

UPDATE

Here's a similar project comparing Python vs. JavaScript by Ilya V. Schurov

localForage vs. XHR

October 22, 2014
9 comments JavaScript

tl;dr; Fetching from IndexedDB is about 5-15 times faster than fetching from AJAX.

localForage is a wrapper for the browser that makes it easy to work with any local storage in the browser. Different browsers have different implementations. By default, when you use localForage in Firefox is that it used IndexedDB which is asynchronous by default meaning your script don't get blocked whilst waiting for data to be retrieved.

A good pattern for a "fat client" (lots of javascript, server primarly speaks JSON) is to download some data, by AJAX using JSON and then store that in the browser. Next time you load the page, you first read from the local storage in the browser whilst you wait for a fresh new JSON from the server. That way you can present data to the screen sooner. (This is how Buggy works, blogged about it here)

Another similar pattern is that you load everything by AJAX from the server, present it and store it in the local storage. Then you perdiocally (or just on onload) you send the most recent timestamp from the data you've received and the server gives you back everything new and everything that has changed by that timestamp. The advantage of this is that the payload is continuously small but the server has to make a custom response for each client whereas a big fat blob of JSON can be better cached and such. However, oftentimes the data is dependent on your credentials/cookie anyway so most possibly you can't do much caching.

Anyway, whichever pattern you attempt I thought it'd be interesting to get a feel for how much faster it is to retrieve from the browsers memory compared to doing a plain old AJAX GET request. After all, browsers have seriously optimized for AJAX requests these days so basically the only thing standing in your way is network latency.

So I wrote a little comparison script that tests this. It's here: https://www.peterbe.com/localforage-vs-xhr/index.html

It retrieves a 225Kb JSON blob from the server and measures how long that took to become an object. Equally it does the same with localforage.getItem and then it runs this 10 times and compares the times. It's obviously not a surprise the local storage retrieval is faster, what's interesting is the difference in general.

What do you think? I'm sure both sides can be optimized but at this level it feels quite realistic scenarios.

django-html-validator

October 20, 2014
2 comments Python, Web development, Django

A couple of weeks ago we had accidentally broken our production server (for a particular report) because of broken HTML. It was an unclosed tag which rendered everything after that tag to just plain white. Our comprehensive test suite failed to notice it because it didn't look at details like that. And when it was tested manually we simply missed the conditional situation when it was caused. Neither good excuses. So it got me thinking how can we incorporate HTML (html5 in particular) validation into our test suite.

So I wrote a little gist and used it a bit on a couple of projects and was quite pleased with the results. But I thought this might be something worthwhile to keep around for future projects or for other people who can't just copy-n-paste a gist.

With that in mind I put together a little package with a README and a setup.py and now you can use it too.

There are however some caveats. Especially if you intend to run it as part of your test suite.

Caveat number 1

You can't flood htmlvalidator.nu. Well, you can I guess. It would be really evil of you and kittens will die. If you have a test suite that does things like response = self.client.get(reverse('myapp:myview')) and there are many tests you might be causing an obscene amount of HTTP traffic to them. Which brings us on to...

Caveat number 2

The htmlvalidator.nu site is written in Java and it's open source. You can basically download their validator and point django-html-validator to it locally. Basically the way it works is java -jar vnu.jar myfile.html. However, it's slow. Like really slow. It takes about 2 seconds to run just one modest HTML file. So, you need to be patient.