python: sort by partially reversed key tuple
this problem occurs quite common in real applications: given a list of data,
sort them with a key tuple (key_1, key_2, ..., key_n)
where some but not all
of them are reversed;
for example, we may have a list of site configs; each site config has a url, a user and a quota:
items = [
{
'url': 'http://aaa/',
'user': 'xxx',
'quota': 100,
},
{
'url': 'http://aaa/',
'user': 'yyy',
'quota': 200,
},
{
'url': 'http://bbb/',
'user': 'xxx',
'quota': 100,
},
{
'url': 'http://bbb/',
'user': 'yyy',
'quota': 200,
},
]
if we want to sort by (url, quota)
in asc order, we can write:
from operator import itemgetter
ans = sorted(items, key=itemgetter('url', 'quota'), reverse=False)
if we want to sort by (url, quota)
in desc order, we can write:
ans = sorted(items, key=itemgetter('url', 'quota'), reverse=True)
but what if we want to sort by (url, quota)
, with url
in asc order but
quota
in desc order? the reverse
parameter no longer works here;
clever people will notice quota
is a number, so we can write:
ans = sorted(items, key=lambda x: (x['url'], -x['quota']), reverse=False)
but what if we want to sort by (url, user)
, with url
in asc order but user
in desc order? this time user
is not a number so the minus sign wont work any
more;
solution
the general solution is to write an advanced sorted
function, which takes in a
richer format of keys:
def my_sorted(items, keys):
class _inv:
def __init__(self, val):
self.val = val
def __eq__(self, other):
return other.val == self.val
def __lt__(self, other):
return other.val < self.val
def _key(item):
ans = []
for key, reverse in keys:
val = key(item)
ans.append(_inv(val) if reverse else val)
return ans
return sorted(items, key=_key)
with this advanced function, we can write:
ans = my_sorted(items, keys=[
(itemgetter('url'), False),
(itemgetter('user'), True),
])
output:
[
{
"url": "http://aaa/",
"user": "yyy",
"quota": 200
},
{
"url": "http://aaa/",
"user": "xxx",
"quota": 100
},
{
"url": "http://bbb/",
"user": "yyy",
"quota": 200
},
{
"url": "http://bbb/",
"user": "xxx",
"quota": 100
}
]
this approach is general enough in that it works with all data types supporting
__eq__
and __lt__
(except for the _inv
class itself);
using a key adapter
we can make the above approach even more general by organizing everything inside
a key adapter so that it not only works with sorted
but also with list.sort
or anywhere a key is needed:
keys = [
(itemgetter('url'), False),
(itemgetter('user'), True),
]
def adapter(keys):
class _:
def __init__(self, v): self.v = v
def __eq__(self, o): return o.v == self.v
def __lt__(self, o): return o.v < self.v
return lambda x: [ _(k(x)) if r else k(x) for k, r in keys ]
items.sort(key=adapter(keys))
the adapter fits nicely into the current sort interface where a key is expected; you are using the same sort function so that client code needs minimal change;
multi-pass sort
the approaches above perform single-pass sort by generating a tuple key from a key tuple; there is another multi-pass approach to solve the same problem; the key insight is that sorts are guaranteed to be stable in python3 (at least >= python3.5 per doc); so a sort on multiple columns can be broken into a series of sorts on each column in reverse column order:
for k, r in reversed(keys):
items.sort(key=k, reverse=r)
remember: for this approach to work, sort must be stable; this is true for builtin sort algos, but be careful if you use sort algos from other sources;
performance
in general, single-pass sort and multi-pass sort have the same worst-case time
complexity; assuming there are n
items and each has dimension m
, then both
have worst-case time complexity O(m * n * log n)
; which is faster depends on
the input; on different inputs, one can be 80%-100% faster than the other:
https://bugs.python.org/file47877/performance-1.py
however, in the above single-pass sort impl, there is a performance hit with
function adapter
due to the large amount of wrapper objects; this is really a
python artifact than a algo defect; the algo merely needs a piece of small info
about which columns should be sorted in reverse order; using wrapper objects
makes the code neat but creating a large amount of objects can be a performance
overkill;
there is another single-pass sort algo proposed here; it wraps whole tuples instead of scalars in the tuple; a performance evaluation is here: https://bugs.python.org/file47880/performance-2.py
conclusion
-
the single-pass sort algo using adapter pattern introduced here features a simple calling interface supporting both scalar and vector keys; it adapts the key instead of introducing another sort function, which improves code readability and minimizes code change when the requirement varies; but be aware that performance may degrade (due to wrapper objects) as data scales;
-
the multi-pass sort algo may run faster when you have a large amount of data (not guaranteed, as it is input-dependent); but you need to make sure the underlying sort algo is stable and passes go in reverse column order; you may write it as another function (such as
my_sort
) and call it instead of builtinsort
; but this can make code unnecessary complicated;