Peterbe.com

Peter Bengtsson's blog

Filtered by Python

Page 11

Autocompeter is Dead. Long live Autocompeter!

January 9, 2017
0 comments Python, Web development, Go

About 2 years ago I launched Autocompeter.com. It was two parts:

1) A autocompeter.js pure JavaScript solution to add autocomplete to a search input field.
2) A REST API where you can submit titles with a HTTP header key, and a fancy autocomplete search.

Only Rewrote the Go + Redis part

The second part has now been completely re-written. The server was originally written in Go and used Redis. Now it's Django and ElasticSearch.

The ultimate reason for this was that Redis was, by far, the biggest memory consumer on my shared DigitalOcean server. The way it worked was that every prefix of every word in every title was indexes as a key. For example the words p, pe, pet, pete, peter and peter$ are all keys and they point to an array of IDs that you then look up to get the distinct set of titles and their URLs. This makes it really really fast but since redis doesn't support namespaces, or multiple columns it means that for every prefix it needs a prefix of its own for the domain they belong to. So the hash for www.peterbe.com is eb9f747 so the strings to store are instead eb9f747p, eb9f747pe, eb9f747pet, eb9f747pete, eb9f747peter and eb9f747peter$.

ElasticSearch on the other hand has ALL of this built in deep in Lucene. AND you can filter. So the way it's queried now instead is something like this:


search = TitleDoc.search()
search = search.filter('term', domain=domain.name)
search = search.query(Q('match_phrase', title=request.GET['q']))
search = search.sort('-popularity', '_score')
search = search[:size]
response = search.execute()
...

And here's how the mapping is defined:


from elasticsearch_dsl import (
    DocType,
    Float,
    Text,
    Index,
    analyzer,
    Keyword,
    token_filter,
)


edge_ngram_analyzer = analyzer(
    'edge_ngram_analyzer',
    type='custom',
    tokenizer='standard',
    filter=[
        'lowercase',
        token_filter(
            'edge_ngram_filter', type='edgeNGram',
            min_gram=1, max_gram=20
        )
    ]
)


class TitleDoc(DocType):
    id = Keyword()
    domain = Keyword(required=True)
    url = Keyword(required=True, index=False)
    title = Text(
        required=True,
        analyzer=edge_ngram_analyzer,
        search_analyzer='standard'
    )
    popularity = Float()
    group = Keyword()

I'm learning ElasticSearch rapidly but I still feel like I have so much to learn. This solution I have here is quite good and I'm pretty happy with the results but I bet there's a lot of things I can learn to make it even better.

Why Ditch Go?

I actually had a lot of fun building the first server version of Autocompeter in Go but Django is just so many times more convenient. It's got management commands, ORM, authentication system, CSRF protection, awesome error reporting, etc. All built in! With Go I had to build everything from scratch.

Also, I felt like the important thing here is the JavaScript client and the database. Now that I've proven this to work with Django and elasticsearch-dsl I think it wouldn't be too hard to re-write the critical query API in Go or in something like Sanic for maximum performance.

All Dockerized

Oh, one of the reasons I wanted to do this new server in Python is because I want to learn Docker better and in particular Docker with Python projects.

The project is now entirely contained in Docker so you can start the PostgreSQL, ElasticSearch 5.1.1 and Django with docker-compose up. There might be a couple of things I've forgot to document for how to configure things but this is actually the first time I've developed something entirely in Docker.

ElasticSearch 5 in Travis-CI

January 6, 2017
0 comments Python, Linux, Web development

tl;dr; Here's a working .travis.yml file that works with ElasticSearch 5.1.1

I had to jump through hoops to get Travis-CI to run with ElasticSearch 5.1.1 and I thought I'd share. If you just do:

services:
  - elasticsearch

This is from the Travis-CI documentation but this installs ElasticSearch 1.4. Not good enough. The instructions on the same page for using higher versions did not work for me.

To get a specific version you need to download it yourself and install it with dpkg -i but the problem is that if you want to use ElasticSearch version 5, you need to have Java 1.8. The short answer is that this is how you install Java 1.8:

addons:
  apt:
    packages:
      - oracle-java8-set-default

But now you need to sudo so you need to add sudo: true in your .travis.yml. Bummer, because it makes the build a bit slower. However, a necessary evil.

The critical line I use to install it is this:

curl -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.1.1.deb && \
sudo dpkg -i --force-confnew elasticsearch-5.1.1.deb && \
sudo service elasticsearch start

I thought I could "upgrade" the existing install, but that breaks thinks. In other words you have to remove the services: - elasticsearch line or else it can't upgrade.

Now, during debugging I was not getting errors on the line:

sudo service elasticsearch start

So I add this to be sure the right version got installed:

#!/bin/bash
curl -v http://localhost:9200/

and then I can see that the right version was installed. It should look something like this:

* About to connect() to localhost port 9200 (#0)
*   Trying 127.0.0.1... connected
> GET / HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Host: localhost:9200
> Accept: */*
> 
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 327
< 
{
  "name" : "m_acpqT",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "b4_KnK6KQmSx64C9o-81Ug",
  "version" : {
    "number" : "5.1.1",
    "build_hash" : "5395e21",
    "build_date" : "2016-12-06T12:36:15.409Z",
    "build_snapshot" : false,
    "lucene_version" : "6.3.0"
  },
  "tagline" : "You Know, for Search"
}
* Connection #0 to host localhost left intact
* Closing connection #0

Note the line that says "number" : "5.1.1",.

So, yay! Hopefully this will help someone else because it took me quite a while to get right.

Using Fanout.io in Django

December 13, 2016
1 comment Python, Web development, Django, Mozilla, JavaScript

Earlier this year we started using Fanout.io in Air Mozilla to enhance the experience for users awaiting content updates. Here I hope to flesh out its details a bit to inspire others to deploy a similar solution.

What It Is

First of all, Fanout.io is basically a service that handles your WebSockets. You put in some of Fanout's JavaScript into your site that handles a persistent WebSocket connection between your site and Fanout.io. And to push messages to your user you basically send them to Fanout.io from the server and they "forward" it to the WebSocket.

The HTML page looks like this:


<html>
<body>

  <h1>Web Page</h1>

<!-- replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page -->
<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
></script>
<script src="fanout.js"></script>
</body>
</html>

And the fanout.js script looks like this:


window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {  
     console.log('Incoming updated data from the server:', data);
  })
};

And in server it looks something like this:


from django.conf import settings
import fanout

fanout.realm = settings.FANOUT_REALM_ID
fanout.key = settings.FANOUT_REALM_KEY


def post_comment(request):
    """A django view function that saves the posted comment"""
   text = request.POST['comment']
   saved_comment = Comment.objects.create(text=text, user=request.user)
   fanout.publish('mycomments', {'new_comment': saved_comment.id})
   return http.JsonResponse({'comment_posted': True})

Note that, in the client-side code, there's no security since there's no authentication. Any client can connect to any channel. So it's important that you don't send anything sensitive. In fact, you should think of this pattern simply as a hint that something has changed. For example, here's a slightly more fleshed out example of how you'd use the subscription.


window.onload = function() {
  // replace the FANOUT_REALM_ID with the ID you get in the Fanout.io admin page
  var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux')
  client.subscribe('/mycomments', function(data) {  
    if (data.new_comment) {
      // server says a new comment has been posted in the server
      $.json('/comments', function(response) {
        $('#comments .comment').remove();
        $.each(response.comments, function(comment) {        
          $('<div class="comment">')
          .append($('<p>').text(comment.text))
          .append($('<span>').text('By: ' + comment.user.name))
          .appendTo('#comments');
        });
      });
    }
  })
};

Yes, I know jQuery isn't hip but it demonstrates the pattern well. Also, in the real world you might not want to ask the server for all comments (and re-render) but instead do an AJAX query to get all new comments since some parameter or something.

Why It's Awesome

It's awesome because you can have a simple page that updates near instantly when the server's database is updated. The alternative would be to do a setInterval loop that frequently does an AJAX query to see if there's new content to update. This is cumbersome because it requires a lot heavier AJAX queries. You might want to make it secure so you engage sessions that need to be looked up each time. Or, since you're going to request it often you have to write a very optimized server-side endpoint that is cheap to query often.

And last but not least, if you rely on an AJAX loop interval, you have to pick a frequency that your server can cope with and it's likely to be in the range of several seconds or else it might overload the server. That means that updates are quite delayed.

But maybe most important, you don't need to worry about running a WebSocket server. It's not terribly hard to do one yourself on your laptop with a bit of Node Express or Tornado but now you have yet another server to maintain and it, internally, needs to be connected to a "pub-sub framework" like Redis or a full blown message queue.

Alternatives

Fanout.io is not the only service that offers this. The decision to use Fanout.io was taken about a year ago and one of the attractive things it offers is that it's got a freemium option which is ideal for doing local testing. The honest truth is that I can't remember the other justifications used to chose Fanout.io over its competitors but here are some alternatives that popped up on a quick search:

It seems they all (including Fanout.io) has freemium plans, supports authentication, REST APIs (for sending and for querying connected clients' stats).

There are also some more advanced feature packed solutions like Meteor, Firebase and GunDB that act more like databases that are connected via WebSockets or alike. For example, you can have a database as a "conduit" for pushing data to a client. Meaning, instead of sending the data from the server directly you save it in a database which syncs to the connected clients.

Lastly, I've heard that Heroku has a really neat solution that does something similar whereby it sets up something similar as an extension.

Let's Get Realistic

The solution sketched out above is very simplistic. There are a lot more fine-grained details that you'd probably want to zoom in to if you're going to do this properly.

Throttling

In Air Mozilla, we call fanout.publish(channel, message) from a post_save ORM signal. If you have a lot of saves for some reason, you might be sending too many messages to the client. A throttling solution, per channel, simply makes sure your "callback" gets called only once per channel per small time frame. Here's the solution we employed:


window.Fanout = (function() {
  var _locks = {};
  return {
    subscribe: function subscribe(channel, callback) {
      _client.subscribe(channel, function(data) {
          if (_locks[channel]) {
              // throttled
              return;
          }
          _locks[channel] = true;
          callback(data);
          setTimeout(function() {
              _locks[channel] = false;
          }, 500);
      });        
    };
  }
})();

Subresource Integrity

Subresource integrity is an important web security technique where you know in advance a hash of the remote JavaScript you include. That means that if someone hacks the result of loading https://cdn.example.com/somelib.js the browser compares the hash of that with a hash mentioned in the <script> tag and refuses to load it if the hash doesn't match.

In the example of Fanout.io it actually looks like this:


<script 
  src="https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux/static/faye-browser-1.1.2-fanout1-min.js"
  crossOrigin="anonymous"
  integrity="sha384-/9uLm3UDnP3tBHstjgZiqLa7fopVRjYmFinSBjz+FPS/ibb2C4aowhIttvYIGGt9"
></script>

The SHA you get from the Fanout.io documentation. It requires, and implies, that you need to use an exact version of the library. You can't use it like this: <script src="https://cdn.example/somelib.latest.min.js" ....

WebSockets vs. Long-polling

Fanout.io's JavaScript client follows a pattern that makes it compatible with clients that don't support WebSockets. The first technique it uses is called long-polling. With this the server basically relys on standard HTTP techniques but the responses are long lasting instead. It means the request simply takes a very long time to respond and when it does, that's when data can be passed.

This is not a problem for modern browsers. They almost all support WebSocket but you might have an application that isn't a modern browser.

Anyway, what Fanout.io does internally is that it first creates a long-polling connection but then shortly after tries to "upgrade" to WebSockets if it's supported. However, the projects I work only need to support modern browsers and there's a trick to tell Fanout to go straight to WebSockets:


var client = new Faye.Client('https://{{ FANOUT_REALM_ID }}.fanoutcdn.com/bayeux', {
    // What this means is that we're opting to have
    // Fanout start with fancy-pants WebSocket and
    // if that doesn't work it falls back on other
    // options, such as long-polling.
    // The default behaviour is that it starts with
    // long-polling and tries to "upgrade" itself
    // to WebSocket.
    transportMode: 'fallback'
});

Fallbacks

In the case of Air Mozilla, it already had a traditional solution whereby it does a setInterval loop that does an AJAX query frequently.

Because the networks can be flaky or because something might go wrong in the client, the way we use it is like this:


var RELOAD_INTERVAL = 5;  // seconds

if (typeof window.Fanout !== 'undefined') {
    Fanout.subscribe('/' + container.data('subscription-channel-comments'), function(data) {
        // Supposedly the comments have changed.
        // For security, let's not trust the data but just take it
        // as a hint that it's worth doing an AJAX query
        // now.
        Comments.load(container, data);
    });
    // If Fanout doesn't work for some reason even though it
    // was made available, still use the regular old
    // interval. Just not as frequently.
    RELOAD_INTERVAL = 60 * 5;
}
setInterval(function() {
    Comments.reload_loop(container);
}, RELOAD_INTERVAL * 1000);

Use Fanout Selectively/Progressively

In the case of Air Mozilla, there are lots of pages. Some don't ever need a WebSocket connection. For example, it might be a simple CRUD (Create Update Delete) page. So, for that I made the whole Fanout functionality "lazy" and it only gets set up if the page has some JavaScript that knows it needs it.

This also has the benefit that the Fanout resource loading etc. is slightly delayed until more pressing things have loaded and the DOM is ready.

You can see the whole solution here. And the way you use it here.

Have Many Channels

You can have as many channels as you like. Don't create a channel called comments when you can have a channel called comments-123 where 123 is the ID of the page you're on for example.

In the case of Air Mozilla, there's a channel for every single page. If you're sitting on a page with a commenting widget, it doesn't get WebSocket messages about newly posted comments on other pages.

Conclusion

We've now used Fanout for almost a year in our little Django + jQuery app and it's been great. The management pages in Air Mozilla use AngularJS and the integration looks like this in the event manager page:


window.Fanout.subscribe('/events', function(data) {
    $scope.$apply(lookForModifiedEvents);
});

Fanout.io's been great to us. Really responsive support and very reliable. But if I were to start a fresh new project that needs a solution like this I'd try to spend a little time to investigate the competitors to see if there are some neat features I'd enjoy.

UPDATE

Fanout reached out to help explain more what's great about Fanout.io

"One of Fanout's biggest differentiators is that we use and promote open technologies/standards. For example, our service supports the open Bayeux protocol, and you can connect to it with any compatible client library, such as Faye. Nearly all competing services have proprietary protocols. This "open" aspect of Fanout aligns pretty well with Mozilla's values, and in fact you'd have a hard time finding any alternative that works the same way."

Cope with JSONDecodeError in requests.get().json() in Python 2 and 3

November 16, 2016
6 comments Python

UPDATE June 2020

Thanks to commentors, I've been made aware that requests will use simplejson if it's installed to handle the deserialization of the JSON. So if you have simplejson in your requirements.txt or pyproject.toml you have to change to this:


import requests
import simplejson

response = requests.get(url)
try:
    print(response.json())
except simplejson.errors.JSONDecodeError:
    print("N'est pas JSON")

Or, make your app more solid by doing the same trick that requests does:


import requests
try:
    from simplejson.errors import JSONDecodeError
except ImportError:
    from json.decoder import JSONDecodeError

response = requests.get(url)
try:
    print(response.json())
except JSONDecodeError:
    print("N'est pas JSON")

Now, back to the old blog post...

Suppose you don't know with a hundred percent certainty that an API will respond in with a JSON payload you need to protect yourself.

This is how you do it in Python 3:


import json
import requests

response = requests.get(url)
try:
    print(response.json())
except json.decoder.JSONDecodeError:
    print("N'est pas JSON")

This is how you do it in Python 2:


import requests

response = requests.get(url)
try:
    print response.json()
except ValueError:
    print "N'est pas JSON"

Here's how you make the code work across both:


import json
import requests

try:
    from json.decoder import JSONDecodeError
except ImportError:
    JSONDecodeError = ValueError

response = requests.get(url)
try:
    print(response.json())
except JSONDecodeError:
    print("N'est pas JSON")

Optimization of QuerySet.get() with or without select_related

November 3, 2016
1 comment Python, Django, PostgreSQL

If you know you're going to look up a related Django ORM object from another one, Django automatically takes care of that for you.

To illustrate, imaging a mapping that looks like this:


class Artist(models.Models):
    name = models.CharField(max_length=200)
    ...


class Song(models.Models):
    artist = models.ForeignKey(Artist)
    ...

And with that in mind, suppose you do this:


>>> Song.objects.get(id=1234567).artist.name
'Frank Zappa'

Internally, what Django does is that it looks the Song object first, then it does a look up automatically on the Artist. In PostgreSQL it looks something like this:

SELECT "main_song"."id", "main_song"."artist_id", ... FROM "main_song" WHERE "main_song"."id" = 1234567
SELECT "main_artist"."id", "main_artist"."name", ... FROM "main_artist" WHERE "main_artist"."id" = 111

Pretty clear. Right.

Now if you know you're going to need to look up that related field you can ask Django to make a join before the lookup even happens. It looks like this:


>>> Song.objects.select_related('artist').get(id=1234567).artist.name
'Frank Zappa'

And the SQL needed looks like this:

SELECT "main_song"."id", ... , "main_artist"."name", ... 
FROM "main_song" INNER JOIN "main_artist" ON ("main_song"."artist_id" = "main_artist"."id") WHERE "main_song"."id" = 1234567

The question is; which is fastest?

Well, there's only one way to find out and that is to measure with some relatistic data.

Here's the benchmarking code:


def f1(id):
    try:
        return Song.objects.get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def f2(id):
    try:
        return Song.objects.select_related('artist').get(id=id).artist.name
    except Song.DoesNotExist:
        pass

def _stats(r):
    #returns the median, average and standard deviation of a sequence
    tot = sum(r)
    avg = tot/len(r)
    sdsq = sum([(i-avg)**2 for i in r])
    s = list(r)
    s.sort()
    return s[len(s)//2], avg, (sdsq/(len(r)-1 or 1))**.5

times = defaultdict(list)
functions = [f1, f2]
for id in range(100000, 103000):
    for f in functions:
        t0 = time.time()
        r = f(id)
        t1 = time.time()
        if r:
            times[f.__name__].append(t1-t0)
    # Shuffle the order so that one doesn't benefit more
    # from deep internal optimizations/caching in Postgre.
    random.shuffle(functions)

for k, values in times.items():
    print(k, [round(x * 1000, 2) for x in _stats(values)])

For the record, here are the parameters of this little benchmark:

The server is a 4Gb DigitialOcean server that is being used but not super busy
The PostgreSQL server is running on the same virtual machine as the Python process used in the benchmark
Django 1.10, Python 3.5, PostgreSQL 9.5.4
1,588,258 "Song" objects, 102,496 "Artist" objects
Note that the range of IDs has gaps but they're not counted

The Result

Function	Median	Average	Std Dev
f1	3.19ms	9.17ms	19.61ms
f2	2.28ms	6.28ms	15.30ms

The Conclusion

If you use the median, using select_related is 30% faster and if you use the average, using select_related is 46% faster.

So, if you know you're going to need to do that lookup put in .select_related(relation) before every .get(id=...) in your Django code.

Deep down in PostgreSQL, the inner join is ultimately two ID-by-index lookups. And that's what the first method is too. It's likely that the reason the inner join approach is faster is simply because there's less connection overheads.

Lastly, YOUR MILEAGE WILL VARY. Every benchmark is flawed but this quite realistic because it's not trying to be optimized in either way.

Django test optimization with no-op PIL engine

October 27, 2016
6 comments Python, Django

The Air Mozilla project is a regular Django webapp. It's reasonably big for a more or less one man project. It's ~200K lines of Python and ~100K lines of JavaScript. There are 816 "unit tests" at the time of writing. Most of them are kinda typical Django tests. Like:


def test_some_feature(self):
    thing = MyModel.objects.create(key='value')
    url = reverse('namespace:name', args=(thing.id,))
    response = self.client.get(url)
    ....

Also, the site uses sorl.thumbnail to automatically generate thumbnails from uploaded images. It's a great library.

However, when running tests, you almost never actually care about the image itself. Your eyes will never feast on them. All you care about is that there is an image, that it was resized and that nothing broke. You don't write tests that checks the new image dimensions of a generated thumbnail. If you need tests that go into that kind of detail, it best belongs somewhere else.

So, I thought, why not fake ALL operations that are happening inside sorl.thumbnail to do with resizing and cropping images.

Here's the changeset that does it. Note, that the trick is to override the default THUMBNAIL_ENGINE that sorl.thumbnail loads. It usually defaults to sorl.thumbnail.engines.pil_engine.Engine and I just wrote my own that does no-ops in almost every instance.

I admittedly threw it together quite quickly just to see if it was possible. Turns out, it was.


# Depends on setting something like:
#    THUMBNAIL_ENGINE = 'airmozilla.base.tests.testbase.FastSorlEngine'
# in your settings specifically for running tests.


from sorl.thumbnail.engines.base import EngineBase


class _Image(object):
    def __init__(self):
        self.size = (1000, 1000)
        self.mode = 'RGBA'
        self.data = '\xa0'


class FastSorlEngine(EngineBase):

    def get_image(self, source):
        return _Image()

    def get_image_size(self, image):
        return image.size

    def _colorspace(self, image, colorspace):
        return image

    def _scale(self, image, width, height):
        image.size = (width, height)
        return image

    def _crop(self, image, width, height, x_offset, y_offset):
        image.size = (width, height)
        return image

    def _get_raw_data(self, image, *args, **kwargs):
        return image.data

    def is_valid_image(self, raw_data):
        return bool(raw_data)

So, was it much faster?

It's hard to measure because the time it takes to run the whole test suite depends on other stuff going on on my laptop during the long time it takes to run the tests. So I ran them 8 times with the old code and 8 times with this new hack.

Iteration	Before	After
1	82.789s	73.519s
2	82.869s	67.009s
3	77.100s	60.008s
4	74.642s	58.995s
5	109.063s	80.333s
6	100.452s	81.736s
7	85.992s	61.119s
8	82.014s	73.557s
Average	86.865s	69.535s
Median	82.869s	73.519s
Std Dev	11.826s	9.0757s

So rougly 11% faster. Not a lot but it adds up when you're doing test-driven development or debugging where you run a suite or a test over and over as you're saving the files/tests you're working on.

Room for improvement

In my case, it just worked with this simple solution. Your site might do fancier things with the thumbnails. Perhaps we can combine forces on this and finalize a working solution into a standalone package.

hashin 0.7.0 and multiple packages

August 30, 2016
0 comments Python

My colleague Andrew Halberstadt stepped up with a great contribution on hashin (on PyPI). Now you can install multiple packages in one sweep. Like this:

$ hashin requests Django premailer mincss

And if you need to specify a different requirements file than the default (./requirements.txt) or a different algorithm than the default (sha256) you can do that like this:

$ hashin requests Django premailer mincss --algorithm=sha512 --requirements-file=dev/reqs.txt

$ hashin requests Django premailer mincss -a sha512 -r dev/reqs.txt

This is an important change if you were used to typing:

$ hashin somepackage dev/reqs.txt

...because if you continue to do that it's going to try to fetch the hash for a PyPI package supposedly called "dev/reqs.txt".

Thanks @ahal!

Note! The operation is not atomic. So if you do hashin requests somejunk it will hash in the latest requests to your requirements.txt and error on the second one.

django-html-validator - now locally, fast!

August 12, 2016
1 comment Python, Web development, Django

A couple of years ago I released a project called django-html-validator (GitHub link) and it's basically a Django library that takes the HTML generated inside Django and sends it in for HTML validation.

The first option is to send the HTML payload, over HTTPS, to https://validator.nu/. Not only is this slow but it also means sending potentially revealing HTML. Ideally you don't have any passwords in your HTML and if you're doing HTML validation you're probably testing against some test data. But... it sucked.

The other alternative was to download a vnu.jar file from the github.com/validator/validator project and executing it in a subprocess with java -jar vnu.jar /tmp/file.html. Problem with this is that it's really slow because java programs take such a long time to boot up.

But then, at the beginning of the year some contributors breathed fresh life into the project. Python 3 support and best of all; the ability to start the vnu.jar as a local server on http://localhost:8888 and HTTP post HTML over to that. Now you don't have to pay the high cost of booting up a java program and you don't have to rely on a remote HTTP call.

Now it becomes possible to have HTML validation checked on every rendered HTML response in the Django unit tests.

To try it, check out the new instructions on "Setting the vnu.jar path".

The contributor who's made this possible is Ville "scop" Skyttä, as well as others. Thanks!!

How to identify/classify what language a piece of text is

August 9, 2016
0 comments Misc. links, Python

Suppose you have a piece of text but you don't know what language it is. If you speak English and the text looks English, it's easy. But what about "Den snabba bruna räven hoppar över den lata hunden" or "haraka kahawia mbweha anaruka juu ya mbwa wavivu" or "A ligeira raposa marrom ataca o cão preguiçoso"? Can you guess?

MeaningCloud can guess. They have a Language Identification API that you can use for free. Their freemium plan allows for 40,000 API requests per month.

So to get started, you have to register, verify your email and sig in to get your "license key". Now when you have that you simply use it like this:

>>> import requests
>>> url = 'http://api.meaningcloud.com/lang-1.1'
>>> payload={'key': 'b49....................ee',
... 'txt': 'Den snabba bruna räven hoppar över den lata hunden'}
>>>
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39999', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sv', 'da', 'no', 'es']}
>>>

If you look at the lang_list list, the first one is sv for Swedish.

If you want the full name of a language code, look it up in the "ISO 639-1 Code" table.

Let's do the other ones too:

>>> payload['txt'] = 'A ligeira raposa marrom ataca o cão preguiçoso'
>>> # Portugese
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '39998', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['pt', 'ro']}
>>> payload['txt'] = 'haraka kahawia mbweha anaruka juu ya mbwa wavivu'
>>> # Swahili
>>> requests.post(url, data=payload).json()
{'status': {'remaining_credits': '37363', 'credits': '1', 'msg': 'OK', 'code': '0'}, 'lang_list': ['sw']}

The service isn't perfect. It struggles on shorter texts using non-western alphabet. But it's pretty easy to use and delivers pretty good results.

UPDATE

Note! If you intend to do this in bulk and you have access to Python and NLTK use this script instead.

I tried it on my nltk install and I have 14 languages that it can detect.

UPDATE 2

A much better solution than NLTK is guess_language-spirit. It's superfast and I spotchecked a bunch of its outputs and put the non-English text into Google Translate and a it almost always gets it right.

json-schema-reducer

August 2, 2016
0 comments Python

Last week I made a little library called json-schema-reducer. It's a simple function that takes a JSON Schema and dict (or a JSON string or a .json file path), and makes a new dict that only contains the keys listed in the JSON Schema.

This is handy if you have a JSON Schema which dictates what you can/want to share/publish/save, but you have a data structure that contains keys and values you don't want to share/publish/save.

I built this because there are a couple of projects that can turn data structures into models from a JSON Schema but none that have the ability to reduce stuff from a data structure. Here's an example:

Sample JSON Schema (schema.json)

{
    "type": "object", 
    "$schema": "http://json-schema.org/draft-04/schema#", 
    "title": "Sample JSON Schema", 
    "required": [
        "name", 
        "sex"
    ], 
    "properties": {
        "name": {
            "type": "string"
        }, 
        "sex": {
            "type": "string"
        }
        "title": {
            "type": "string"
        }    
    }
}

Sample data structure (sample.json)

{
    "name": "Peter",
    "sex": "male",
    "email": "peterbe@example.com",
}

Usage


>>> from json_schema_reducer import make_reduced_dict
>>> make_reduced_dict('schema.json', 'sample.json')
{'name': 'Peter', 'sex': 'male'}  # Note! No "email" key

The project works in Python 2 and Python 3. See tests.

Also, the function tries to be convenient in that it can accept either a dict, a JSON string or path to a .json file.