Generators: A Socratic Inquiry

In this post, Athena, I am going to try to work through some questions I’ve had about generators. I assume you’ve had these questions about generators, too, because you are a cat and it’s my impression that cats spend all day thinking about generators. I can’t imagine what else you would do with all that time you spend lounging under the bed.

Since you’ve spent so much time contemplating generators, I’ll let you ask the questions. Or at least, I assume these are the questions you would be asking, if I could just understand your meowing.

What’s a generator?

Suppose you wanted to write a function that calculates some value and then gives you that value. In Python, the keyword you would typically use at the end of that function is return: the function calculates the desired value and returns it to the user. After the value is returned, the function ends.

But what if instead of ending the function, we simply wanted to pause it and leave open the possibility of asking the function to give us a different value in the future? This is the premise of the yield keyword. When you yield a value, the function does not end, it simply pauses to wait for your next instruction. If you call a function with a yield keyword, it will continue executing until the next yield keyword, at which point it pauses again.

Essentially, a generator is any function that produces a value (or values) via the yield keyword rather than by returning the value (and thus ending the function). In Python, when you call a function containing the yield keyword, the function returns not a value but rather a generator object that can produce your desired value(s) when instructed to do so.

What are the benefits of generators?

In the canonical example of a generator, we could produce the first ten values of the Fibonacci sequence. We create a generator object fib(), and then use the islice() command to tell fib() to give us the first ten Fibonacci numbers.

fib_generators — I first came across this code in James Powell’s excellent PyData 2014 talk “Generators Will Free Your Mind”, which is available on YouTube. The IPython notebooks he used for the talk are available at http://gist.github.com/dutc.

The generator yields the value of the first stored Fibonacci number, and then when it is called again, it increments the values of both stored Fibonacci numbers. Note that after a Fibonacci value is yielded, the generator does not store or keep track of it. It simply produces the values and moves on. This is a HUGE benefit – generators are more memory-efficient than loops because they only produce the values the user wants to produce. For instance, if we only wanted the the twentieth Fibonacci number, the generator wouldn’t store any of the previous Fibonacci numbers, saving a great deal of memory.

Significantly, generators make no assumptions about which values the user would like to receive, but rather give users agency to specify them. James Powell explains the benefits of this setup far better than I could, in his excellent PyData 2014 talk “Generators Will Free Your Mind“.

Why would you want to use generators?

Generators are especially useful when making calls that require us to process a lot of data (but not necessarily store it in local memory). Databases and HTTP requests are probably the two most common such examples. Let’s give an example with HTTP requests, since I’ve discussed that concept in a previous post.

This is a generator that gives the user data from an HTTP request. Each time this generator is called, it makes a server request, and yields data to the user if the request is successful. If called again, it will update the server request parameters (e.g., it will ask the server for a different page of data) and repeat the process. It seems reasonable that we, the user, might not want to request every page available on this server. A generator allows us to specify which pages we want, and what we want to do with each page, rather than storing every page we encounter.

But isn’t that just a glorified for loop?

It kind of looks that way, and it’s often used in conjunction with a for loop, but no. To understand this we have to look at how Python makes function calls, which requires us to understand a computer’s memory.

A computer has two ways to store information: the stack and the heap. The stack follows a first-in-first-out protocol, which means that it acts on the last piece of information it stored. The heap, on the other hand, stores information wherever it happens to have space, and uses that information only when the need arises. This is very abstract, so let’s use an example.

stackheap — Remembering your favorite hiding spots, using a stack vs. a heap.†

Suppose you wanted to store information about your favorite hiding spots, so that you could visit those hiding spots later. You first notice you like hiding under the bed, then that you like hiding behind a chair, and finally that you like hiding in the closet. Later, when scared, you would go hide in the closet, because that is the last hiding spot you remember – this is like the stack. Alternatively, you could just pick the hiding spot that happened to be closest – this is like the heap.†

So how is that related to generators?

I’m glad you asked! Typically, the computer stores function calls in the stack, since it’s important to know the order in which your functions should execute. This is a useful system – when the computer makes a function call, it assigns that function call to a spot in the stack in order to execute it later. But the stack frame – the object that keeps track of the instructions your function gave to the computer – is stored in the heap, to be used only when called upon. Normally when a function gets called, the stack frame gets executed, and then it is removed from the heap because its data have been used already.

For generators, the stack frame is stored but not executed. When you call next() on a generator, the function calls are made in the order specified by the stack frame and the generator yields a result, but the stack frame is not deleted. This means you don’t need to evaluate generators in any particular order. You just call on the generator stack frame whenever you need it.†† Clearly, generators are executed in a different way than functions – and that’s pretty useful.

So, is it true that generators are the greatest?

Yes, Athena. Yes they are.

† This is not a perfect analogy, because the stack and the heap correspond to physical processes in the computer’s memory, and the heap is not random but rather has to do with where there is space in memory for storing an object of a given size. That said, this analogy will give enough of an idea of what’s happening to make sense of generators.

†† If you want more detailed information, check out this blog post about using generators to write an asynchronous web scraper.

I Explain Statistics To My Cat

Sundry musings of an aspiring data scientist.

Generators: A Socratic Inquiry

Leave a comment Cancel reply

Share this:

Leave a comment Cancel reply