Tuesday, August 14, 2007

Referential transparency and memoization

Recently I wrote a Python function which performed some processor-intensive mathematical operations. Its return value depended only on the input parameters; no global state was read or modified. Moreover, this function was invoked a few thousand times, often with the same set of parameters.

I felt that repeatedly calculating the same values which had already been generated in an earlier invocation was wasteful. So I went for the obvious solution of using a cache - with the parameter list as the key, and the generated value as the value for that key. Right at the start of the function I check whether the cache already contains a mapping for the current input parameters. If it does, I return the cached object. Only otherwise do I calculate a new value, making sure to add it to the cache before returning from the function.

Now you can imagine this entire if-else block sitting at the beginning of the function, looking up the cache - along with a global useCache flag in case I ever decide to bypass the cache. Not at all elegant. What if I have to add this feature to other functions tomorrow? The cache lookup code would need to be copied into every such function!

I started searching for how other people had solved this. I knew that a function whose return value depends solely on its input parameters (with no modification of global state) is called a referentially transparent function. So I naturally started at that Wikipedia entry. And lo, in para 3 you have: "referentially transparent functions can be memoized (transformed into equivalent functions which cache results)"

I had come across "memoize" a couple of times before, but only this time did it make perfect sense. Wikipedia says about memoization: "It is an optimization technique used primarily to speed up computer programs by storing the results of function calls for later reuse, rather than recomputing them at each invocation of the function"

The article links to a Python recipe for memoization, using the recursive Fibonacci function as an example. This is probably one of the best candidates for the technique: since fib(n) is given by fib(n-1) + fib(n-2), you can pick up the stored values of fib(n-1) and fib(n-2) from earlier invocations of fib instead of recomputing them.

The key line is where the normal fib function is replaced by its memoized equivalent as fib = Memoize(fib). That adds the caching layer, much like the Decorator pattern in Java.

Well, the Decorator pattern hints that you can use a Python decorator for the same purpose. So here's a simple memoization decorator in Python, based on code from another recipe:

class Memoize:
    def __init__(self, fn):
        # Store the original function and create a cache
        self.fn = fn
        self.cache = {}

    # __call__ is invoked when the instance is used like
    # a function
    def __call__(self, *args):
        if args in self.cache:
            return self.cache[args]
        else:
            # Not cached. Call the original function and
            # add the value to the cache
            result = self.fn(*args)
            self.cache[args] = result
            return result

# The decorator - returns a Memoize instance
# created from the input function
def memoize(func):
    return Memoize(func)

# Use the decorator on the Fibonacci function
@memoize
def fib(n):
    if n == 0:
        return 0
    elif n == 1 or n == 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)

# If I remove the decorator, fib() performs
# much slower for n > 30 on my machine
for i in range(0, 50):
    print fib(i)

This simple decorator does not work if any of the input parameters is a list or any other unhashable type. It also works only for global functions, not for instance methods of an object. At this point I need neither, and I don't know enough Python yet either :-) You can certainly find something better on the Net.
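For what it's worth, the unhashable-parameter limitation can be patched over by normalizing the cache key. This is only a sketch of my own - the class and method names are invented, and it still fails for unhashable types other than lists:

```python
class MemoizeFlexible:
    """Like Memoize, but tolerates list arguments."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = {}

    def make_key(self, args):
        # Convert list arguments to tuples so the key is hashable;
        # dicts, sets, etc. would still raise TypeError
        return tuple(tuple(a) if isinstance(a, list) else a for a in args)

    def __call__(self, *args):
        key = self.make_key(args)
        if key not in self.cache:
            self.cache[key] = self.fn(*args)
        return self.cache[key]

@MemoizeFlexible
def total(values):
    # A stand-in for a function taking a list parameter
    return sum(values)
```

Calling total([1, 2, 3]) twice computes the sum only once, because the list is converted to the tuple (1, 2, 3) for the cache key.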
