Last week, a link to the article Closures And Objects Are Equivalent was posted on Hacker News. It reminded me of an old university project I did that uses generators to generate closures (pun intended) and add caching mechanisms to online algorithms.
Closures are equivalent to objects
The article explains why closures and objects are equivalent. As a reminder, here is what a closure looks like:
def Counter():
    count = 0
    def counter():
        nonlocal count
        count += 1
        return count
    return counter
The article explains that the closure above is equivalent to the following object:
class Counter:
    def __init__(self):
        self.count = 0
    def __call__(self):
        self.count += 1
        return self.count
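Both versions expose the same behaviour. A quick sketch (assuming the two definitions live in separate namespaces, since they share the name Counter):

counter = Counter()  # either the closure or the class version
counter()  # 1
counter()  # 2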
You might argue that the object allows accessing the value of count directly, while the closure does not. This can be fixed:
def Counter():
    count = 0
    def counter(increment=True):
        nonlocal count
        count += increment
        return count
    return counter
Now, counter(False) will return the current value of count without incrementing it (count += False adds 0, since booleans are integers in Python).
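In practice:

counter = Counter()
counter()       # 1
counter()       # 2
counter(False)  # 2: read the value without incrementing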
You might also argue that the object can implement multiple methods, while the closure can only implement one. While this is true, the closure can receive arbitrary arguments and return arbitrary values, so it can dispatch to multiple "methods":
def Counter():
    count = 0
    def counter(name="increment", *args, **kwargs):
        nonlocal count
        if name == 'increment':
            count += 1
            return count
        elif name == 'decrement':
            count -= 1
            return count
        elif name == 'get':
            return count
        else:
            raise ValueError(f'Unknown method: {name}')
    return counter
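The dispatch works like so:

counter = Counter()
counter('increment')  # 1
counter('increment')  # 2
counter('decrement')  # 1
counter('get')        # 1
counter('reset')      # ValueError: Unknown method: reset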
You can see how one would implement any “object” with a closure. Of course, just because you can does not mean you should.
What about generators?
In general, people will use classes (objects) over closures, as they are more expressive: it is much nicer to write counter.decrement() than counter('decrement').
There is a special use case where a third option exists: iterators and generators.
Iterators are simply objects that implement the __next__ special method:
class Counter:
    def __init__(self):
        self.count = 0
    def __next__(self):
        self.count += 1
        return self.count
We can call an instance with next(counter). This is equivalent to counter.__next__().
Nothing new here, but Python also happens to have a special syntax for writing iterators: generators. Generators are functions that use the yield keyword instead of return:
def Counter():
    count = 0
    while True:
        count += 1
        yield count
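Calling this Counter() returns a generator object that behaves exactly like an instance of the class above:

counter = Counter()
next(counter)  # 1
next(counter)  # 2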
Generators as improved closures
Generators will produce an iterator when called, but they are more powerful than class-based iterators because the yield keyword is embedded in the code itself, unlike a return statement from a __next__ method, which simply ends the execution of the function.
As an illustration, it is very easy to replicate a generator from a class-based iterator:
def generator(IteratorClass):
    iterator = IteratorClass()
    while True:
        yield next(iterator)
        # or simply:
        # yield from iterator
        # but that would hide the point
An iterator just has a state that is initialized by __init__ and updated by __next__. Hence we could easily transform the code above to remove the need for IteratorClass.
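For the Counter iterator, for instance, the transformation maps __init__ to a local-variable initialisation and __next__ to the loop body, giving back the generator from above:

def Counter():
    count = 0        # was: self.count = 0, in __init__
    while True:
        count += 1   # was: the body of __next__
        yield count  # was: the return of __next__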
On the other hand, it is much harder, from a refactoring point of view, to replace a generator with a class-based iterator, that is, to get rid of the yield statement, because it is very expressive. yield does multiple things, among others:
- it tells the interpreter that the function is a generator, and that calling it will return an iterator with a __next__ method instead of executing its code
- it allows the generator to return a value and to resume execution at multiple points in the code
- it also allows the calling frame to send a value into the generator, and even to throw an exception into it
The last feature is the most powerful and least-known one. In this sense, generators are closures, since they store both a function and an environment. But they are also more expressive than closures, because they can resume execution at multiple points in the code: the instruction pointer is part of the state.
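As a quick illustration of send() and throw() (a minimal sketch of mine, not from the article):

def echo():
    value = None
    while True:
        try:
            # execution pauses here; send(x) resumes it with value = x
            value = yield value
            value = value * 2
        except ValueError:
            # throw(ValueError) raises the exception at the paused yield
            value = 'reset'

gen = echo()
next(gen)              # run up to the first yield
gen.send(21)           # 42
gen.throw(ValueError)  # 'reset'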
The point I wanted to make is that automatically capturing the environment of a function, including the instruction pointer, is a very powerful feature, because it can semantically transform any function into a closure. This feature can be useful in some cases.
A practical example: AWESOME ONLINE MEMOISED DIJKSTRA
The best use I have seen of generators was in a university project I did 6 years ago: we had to compute a bunch of shortest paths in a large graph during the computation of reach. We were given queries with starting and ending nodes, but we could not predict them in advance. However, the starting nodes were all close to the node whose reach we wanted to compute, hence there were not a lot of them.
The assignment was quite computationally intensive, and the instructions advised using C++ or Java.
Since the graph was quite big (4.1M vertices, 9.4M directed edges), the only algorithm that could be used was Dijkstra’s algorithm.
I tried two optimizations that each worked well but pulled in opposite directions:
- return the result as soon as the target node is reached
- compute the shortest paths to all nodes and cache the result
But what if we could resume the computation from the last known state of Dijkstra's algorithm?
That would give the best of both worlds (given enough memory): no computation would ever be wasted, since new queries would resume from the last known state, and each result would be returned as soon as possible.
To implement it, I could either:
- create a class, list all the variables of my Dijkstra's algorithm, and implement a query method that would resume the computation
- write a generator that looks very much like the textbook Dijkstra's and use Python's send() method to pass queries, while the state variables AND the instruction pointer are stored in the closure
Of course, the difference is not that big, but the second option is much more elegant and concise. The resulting code of an “AWESOME ONLINE MEMOISED DIJKSTRA”, as I wrote in the docstring back then, is stupidly small and simple to read:
from heapq import heappop, heappush

@memoize_generator
def dijkstra_with_target(graph, start):
    target = yield
    dist = {}
    ans = None
    Q = [(0, start)]
    while Q:
        # answer every query we can before doing more work
        while ans is not None:
            target = yield ans
            ans = dist.get(target)
        d, s = heappop(Q)
        if s not in dist:
            dist[s] = d
            if s == target:
                ans = d
            for v in graph[s]:
                if v not in dist:
                    # optional condition, kept as an optimisation
                    heappush(Q, (d + graph[s][v], v))
    # Q is exhausted: dist now holds every reachable node
    while True:
        target = yield dist[target]
The generator is produced with gen = dijkstra_with_target(graph, source) and can be called with gen.send(target). Since the generator is memoized, we can always call the same instance with a one-liner: dijkstra_with_target(graph, source).send(target).
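For illustration, a minimal sketch on a toy adjacency-dict graph (this example is mine, not from the original project):

graph = {
    'a': {'b': 1, 'c': 4},
    'b': {'c': 2},
    'c': {},
}
dijkstra_with_target(graph, 'a').send('c')  # 3, explores the graph
dijkstra_with_target(graph, 'a').send('b')  # 1, resumes the same generator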
I reproduced my implementation of memoize_generator below. The graph is not a hashable object, so we have to remember it by its id.
from collections.abc import Hashable
from functools import wraps

def memoize_generator(f):
    @wraps(f)
    def g(*args):
        # unhashable arguments (like the graph) are keyed by their id
        targs = tuple(a if isinstance(a, Hashable) else id(a)
                      for a in args)
        if targs not in g.mem:
            ans = f(*args)
            next(ans)
            g.mem[targs] = ans
        return g.mem[targs]
    g.mem = {}
    def clean():
        for i in g.mem:
            g.mem[i].close()
        g.mem = {}
    g.clean = clean
    return g
Note the first call to next, which is used to initialize the generator: without it, the first call to send would throw TypeError: can't send non-None value to a just-started generator.
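For an undecorated copy of the generator function, the priming step would look like this (sketch):

gen = dijkstra_with_target(graph, 'a')
# gen.send('c') here would raise:
# TypeError: can't send non-None value to a just-started generator
next(gen)      # run up to the first `target = yield`
gen.send('c')  # now returns the distance to 'c'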
I also included a clean function that forces the generators to release their memory. Note that g.mem = {} alone would be enough if the generators are not referenced anywhere else: dropping the old dictionary releases the last reference to each generator, which triggers their __del__ method, a wrapper for close(), as PEP 342 states.
In the end, my Python code (executed with PyPy) outperformed all C++ and Java implementations by an order of magnitude.