Fastest way to uniquify a list in Python >=3.6

Saturday, Dec 23, 2017
7 comments Python

URL: https://gist.github.com/peterbe/67b9e40af60a1d5bcb1cfb4b2937b088

This is an update to a old blog post from 2006 called Fastest way to uniquify a list in Python. But this, time for Python 3.6. Why, because Python 3.6 preserves the order when inserting keys to a dictionary. How, because the way dicts are implemented in 3.6, the way it does that is different and as an implementation detail the order gets preserved. Then, in Python 3.7, which isn't released at the time of writing, that order preserving is guaranteed.

Anyway, Raymond Hettinger just shared a neat little way to uniqify a list. I thought I'd update my old post from 2006 to add list(dict.fromkeys('abracadabra')).

Functions

Reminder, there are two ways to uniqify a list. Order preserving and not order preserving. For example, the unique letters in peter is p, e, t, r in their "original order". As opposed to t, e, p, r.


def f1(seq):  # Raymond Hettinger
    hash_ = {}
    [hash_.__setitem__(x, 1) for x in seq]
    return hash_.keys()


def f3(seq):
    # Not order preserving
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()


def f5(seq, idfun=None):  # Alex Martelli ******* order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen:
            continue
        seen[marker] = 1
        result.append(item)
    return result


def f5b(seq, idfun=None):  # Alex Martelli ******* order preserving
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker not in seen:
            seen[marker] = 1
            result.append(item)

    return result


def f7(seq):
    # Not order preserving
    return list(set(seq))


def f8(seq):  # Dave Kirby
    # Order preserving
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]


def f9(seq):
    # Not order preserving, even in Py >=3.6
    return {}.fromkeys(seq).keys()


def f10(seq, idfun=None):  # Andrew Dalke
    # Order preserving
    return list(_f10(seq, idfun))

def _f10(seq, idfun=None):
    seen = set()
    if idfun is None:
        for x in seq:
            if x in seen:
                continue
            seen.add(x)
            yield x
    else:
        for x in seq:
            x = idfun(x)
            if x in seen:
                continue
            seen.add(x)
            yield x


def f11(seq):  # f10 but simpler
    # Order preserving
    return list(_f10(seq))


def f12(seq):
    # Raymond Hettinger
    # https://twitter.com/raymondh/status/944125570534621185
    return list(dict.fromkeys(seq))

Results

FUNCTION        ORDER PRESERVING     MEAN       MEDIAN
f12             yes                  111.0      112.2
f8              yes                  266.3      266.4
f10             yes                  304.0      299.1
f11             yes                  314.3      312.9
f5              yes                  486.8      479.7
f5b             yes                  494.7      498.0
f7              no                   95.8       95.1
f9              no                   100.8      100.9
f3              no                   143.7      142.2
f1              no                   406.4      408.4

Not order preserving

Order preserving

Conclusion

The fastest way to uniqify a list of hashable objects (basically immutable things) is:


list(set(seq))

And the fastest way, if the order is important is:


list(dict.fromkeys(seq))

Now we know.

Comments

Post your own comment

Stefan Pochmann December 26, 2017

What is this mysterious sequence "seq" that you're testing with? Can you please show the *whole* benchmark code?

Peter Bengtsson December 26, 2017

See the link in the top to that gist.github.com

Stefan Pochmann December 26, 2017

Thanks, missed that since it's "above the article". Other things:

- You say in Python 3.7 "order preserving is guaranteed" but following that link I only see Raymond saying "a guarantee for 3.7 is almost inevitable". Am I missing something again or are you overstating that?

- Better remove f2 from the list of functions to test, as it's suuuper slow.

- Would be good to include *min* as well, in addition to mean and median. As Veedrac argued at https://stackoverflow.com/a/24105845/1672429 - "all error in timing is positive [so] the shortest time has the least error in it". And *Raymond* uses min there as well (in his answer, the accepted one).

- I came up with four order-preserving solutions slower than Raymond's but much faster than the others. Then I also found them in the comments of the old blog post, ha. Here they are:

    def f13(seq):
        seen = set()
        add = seen.add
        return [add(x) or x for x in seq if x not in seen]
        return [add(x) or x for x in seq if not x in seen]
        return [x for x in seq if not (x in seen or add(x))]
        return [x for x in seq if x not in seen and not add(x)]

Stefan Pochmann December 26, 2017

- f1, f3 and f9 aren't quite right. They don't return a list but a "dict_keys" view object, which isn't fair, as that takes only O(1) time.

- What makes you think f9 doesn't preserve order? I don't see why it wouldn't, and it did preserve order in some tests I just did, including your script (after fixing the "dict_keys" issue by wrapping "list(...)" around it).

- In the comments under Raymond's answer https://stackoverflow.com/a/8220943/1672429 he also very strongly suggested using *min*: "Use the min() rather than the average of the timings. That is a recommendation from me, from Tim Peters, and from Guido van Rossum. The fastest time represents the best an algorithm can perform when the caches are loaded and the system isn't busy with other tasks. All the timings are noisy -- the fastest time is the least noisy. It is easy to show that the fastest timings are the most reproducible and therefore the most useful when timing two different implementations."

Michael Kennedy December 26, 2017

Nice analysis. f7 is now order preserving because it's backed by dictionaries, which are now going to be order preserving.

Michael Kennedy December 26, 2017

Sorry, make that will be... when 3.7 comes out. :)

Meiyer Goren October 15, 2018

Since Python 3.7, insertion-order dict’s are part of the language, as per the “official verdict” by BDFL: https://mail.python.org/pipermail/python-dev/2017-December/151283.html.

Fastest way to uniquify a list in Python >=3.6

Functions

Results

Conclusion

Comments

Related posts