Blameless Postmortem 1: The Dataviz Disappointment

Athena, have you ever tried to hunt a feather toy, and completely missed it? Maybe you spent a half hour chasing after a toy that you never successfully caught? The experience of failure, in toy-hunting and in software development, can be immensely frustrating, but it’s difficult to learn from these experiences unless you set aside your feelings and consider what went wrong.

That’s the idea behind a blameless postmortem. A blameless postmortem is a document that a person or team writes in response to a project that just didn’t work. It highlights the things that went wrong, with the intent of learning from those experiences rather than of passing judgment.

Recently I tried to build a data visualization using a library that was new to me, and the experience ended in a fair bit of frustration. So here’s a blameless postmortem describing that experience and what I’m learning from it. For context, I am pulling the format of the postmortem from Dan Puttick’s excellent blog post.

Background

I am currently applying to a company (let’s call it TeachCo.) that values education. I wanted to highlight my education experience in my application, so at the suggestion of Jared Garst, I decided to build an interactive data visualization of my teaching experience. I wanted a visualization that could convey how long each teaching experience lasted, the age(s) of the students I taught, and the subject I was teaching. It would also be nice to include information about how many students I was teaching during each experience, and, if relevant, a brief description of the technologies I used. Clearly, this is a lot to include in a visualization.

I have a reasonable background in creating data visualizations, but most of them were made for scientific analyses and none of them involved interactive displays. Much of my visualization work has used the robust but sometimes-painful library matplotlib or, for astronomy-related figures, the robust and less-painful library astropy.visualization. The visualization I wanted to present to TeachCo. was fairly different from anything I had done before: it had to display both categorical and numerical data properly, and it needed to be an exploratory visualization rather than a set of plots for scientific explanation.

The Incident

I decided to create a visualization that put dates on the x-axis (allowing me to demonstrate how long I had been teaching) and ages of students on the y-axis, with different colors representing different subjects I was teaching. I included a projection of the x-axis to highlight that I have been continuously teaching since high school, even if specific jobs only occurred on a short-term basis. I knew that seaborn (a wrapper for matplotlib) included a straightforward and nice-looking projection plot object with a fairly simple API. Using seaborn, I created the first draft of my plot.

[Figure: first draft of the visualization, not yet interactive]
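For readers who want to see the shape of the data behind a plot like this, here is a minimal sketch using only the Python standard library. It is not the code I used for the draft. The class name, the field names, and the example rows are all hypothetical and invented for illustration, and the "projection" here is a simplified stand-in (the set of calendar years that contain any teaching) for the marginal plot that seaborn draws along the x-axis.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TeachingExperience:
    """One row of the visualization (hypothetical structure)."""
    subject: str   # shown as a color
    start: date    # left edge on the x-axis
    end: date      # right edge on the x-axis
    min_age: int   # lower bound on the y-axis
    max_age: int   # upper bound on the y-axis


def years_covered(experiences):
    """Project the experiences onto the x-axis as a sorted list of years."""
    covered = set()
    for exp in experiences:
        covered.update(range(exp.start.year, exp.end.year + 1))
    return sorted(covered)


def is_continuous(years):
    """True when a sorted list of years has no gaps."""
    return all(later - earlier == 1 for earlier, later in zip(years, years[1:]))


# Placeholder rows, invented for illustration only.
EXAMPLE = [
    TeachingExperience("math", date(2005, 6, 1), date(2005, 8, 15), 8, 10),
    TeachingExperience("astronomy", date(2006, 9, 1), date(2008, 5, 31), 18, 22),
    TeachingExperience("programming", date(2008, 7, 1), date(2009, 7, 31), 12, 14),
]

if __name__ == "__main__":
    years = years_covered(EXAMPLE)
    print(years, is_continuous(years))
```

Keeping the data in a small structure like this, separate from any plotting calls, would also make it easier to hand the same rows to a different visualization library later.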

I next tried to introduce interactivity using mpld3, a library that adds interactive widgets to matplotlib figures. mpld3 works by exporting a matplotlib figure to a form that can be rendered in the browser with d3, a JavaScript library widely considered to be the gold standard for browser-based data visualizations. mpld3 seemed to have exactly what I needed: when a user hovers over an area of the visualization, mpld3 can show text or highlight related data. However, as I discovered after searching StackOverflow and the mpld3 issue tracker, mpld3 did not support the axis customization I needed. I couldn’t create axis labels, and it rendered years as floats, making them much more difficult for a user to interpret.
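To make the float problem concrete, here is a small sketch of the kind of conversion I wanted the axis to do for me. It assumes the floats are fractional years such as 2014.5, which is my assumption for illustration and may not match the exact numbers mpld3 produced. The function names are hypothetical, and nothing here is part of the mpld3 or matplotlib API.

```python
from datetime import date, timedelta


def fractional_year_to_date(value: float) -> date:
    """Convert a fractional year (assumed form, e.g. 2014.5) to a calendar date."""
    year = int(value)
    start = date(year, 1, 1)
    # Use the real length of the year so leap years are handled.
    days_in_year = (date(year + 1, 1, 1) - start).days
    # Truncate to whole days elapsed since January 1.
    return start + timedelta(days=int((value - year) * days_in_year))


def year_label(value: float) -> str:
    """Format a fractional year as a short, readable label."""
    return fractional_year_to_date(value).strftime("%b %Y")


if __name__ == "__main__":
    for tick in (2014.0, 2014.5, 2016.5):
        print(tick, year_label(tick))
```

A label like this is much easier for a reader to interpret than a bare float, which is why the missing axis customization mattered so much for this project.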

Aftermath and Response

By the time it became clear that mpld3 would not render my axes correctly, I had spent about four hours working on my visualization. I responded to the problem by spending another two hours trying over and over to make the plot work with some combination of mpld3/matplotlib/seaborn, libraries that were clearly insufficient for the task I was trying to accomplish. I wanted to submit my application to TeachCo. that day, but I would need to use a different library if I wanted to include the visualization in my application. I decided not to give in to the sunk cost fallacy and simply submitted the application without the visualization.

Ultimate Causes

There were two root causes of this problem: my choice of data visualization libraries and my approach to them.

I chose the combination of mpld3/matplotlib/seaborn because these libraries were the most familiar to me. I had worked with matplotlib/seaborn before, and mpld3 appeared to modify matplotlib in a fairly straightforward manner. I wanted to submit my data visualization as part of a job application, and I didn’t want to spend too long working on that particular application. However, these libraries did not play well with each other, and together they could not do what was needed to create a satisfactory visualization. Moreover, mpld3 is not very robust; indeed, the primary developers have abandoned this project in favor of contributing to another data visualization library. Even after this became evident, I kept trying to fix the problem with the incorrect tools rather than using different tools.

The other root cause was that I prolonged a negative attitude toward data visualization. One of the things I enjoy about coding is that if I encounter a problem, I know there is a specific reason why that problem exists, even if I don’t understand the reason. The more I learn about my code or about the library I’m using, the more likely I will be able to solve the problem. But sometimes there is a trade-off between understanding a library and getting a job done quickly, and I notice that trade-off more prominently when using data visualization libraries. Past problems I’ve solved haven’t required much understanding of my visualization libraries and have usually occurred during a time crunch (e.g., I’m trying to finish a paper to submit to an academic journal). So the commands/objects I use to create visualizations feel like black boxes. I don’t understand them very well, and they usually frustrate me. Rather than rethinking that approach to data visualization libraries, I kept being frustrated that the libraries didn’t work the way I wanted them to.

Analysis and Prevention

The most immediate solution to this sort of problem is to use a different library for interactive exploratory data visualizations. Several Python libraries are specifically designed for this sort of problem, including Bokeh, Dash, and Altair. I could even learn some JavaScript and use d3 (or its slightly friendlier cousins, Vega/Vega-Lite) to avoid the problems of a visualization library that may be unfinished or half-baked. Clearly, using the right tools is a more important part of the process than I had previously thought.

One of the lessons I’ve learned as part of this process is that I need to be more careful about making sure that any new library I include in my workflow is robust and well-maintained. I chose mpld3 because it was closest to the library with which I was most familiar, and I didn’t ask the right questions before using it in my code. If I had read the mpld3 issue tracker or documentation in more detail before using the library, it would have been fairly easy to figure out that it was the wrong tool for my project.

A larger problem was that I wasted several hours trying to make my visualization work even after I knew the tool wasn’t right. I didn’t use another library because I assumed a defeatist attitude (e.g., “it is impossible to understand the internal logic of data visualization libraries”), rather than approaching the code with the learning-oriented mindset I apply to other programming tasks.

This experience has convinced me that I need to spend some time truly understanding my data visualization libraries. I find it much easier to learn about my tools when I am trying to solve a specific problem, so I am planning to complete this data visualization task using another library some time in the next few weeks. Stay tuned, Athena!
