In the last post, Athena, we talked about building a recommender system. I used a very small database in my example because I mostly wanted to check that the recommender system worked. But where do you go to obtain real-life data? If that’s what you’re looking for, read on to find out about:
- Possible sources of interesting data sets
- How to access servers that have a public database
- A short introduction to using Python’s requests library
If I know anything about you, Athena, I know that you are obsessed with head scratches. As soon as an amiable human arrives in the house, you are waiting in expectation for some good head scratching, the more the better. Likewise, for the typical data scientist, more data are usually better.†
Data scientists get their data sets from a variety of sources. For one, we may have collected the data ourselves! This is convenient because we know exactly what assumptions were made during the data collection, and we know exactly how the data are organized. (We’ll talk more about data organization and cleaning in a future post, although that’s a topic worthy of books.) Or we may have employers or co-workers who have collected the data for us, and it’s our job to clean up and analyze the data.
But what if we don’t own a data set, or otherwise have immediate access to it? If the data set is online, there are two primary ways to obtain it: through web scraping and through requesting the data from a server. Web scraping involves reading the data directly from a web page; we will save that topic for another post. Today we will focus on requesting data from a cooperative server using HTTP requests.
First let’s define some terms. HTTP stands for “HyperText Transfer Protocol”. This is the protocol that clients (such as web browsers or Python scripts) and servers use to talk to each other. An HTTP request is another way of saying “I’m asking your server for information”. Let’s go back to our analogy of head-scratching. Suppose you have had a long day in the house, and then suddenly, one of your humans comes home! You want to have your head scratched, so you might make a request to that human by walking up and meowing at them. But you don’t yet know how they will respond to your request, or whether they understood your request, or even if they heard it. That is like making an HTTP request: you have to wait for the server to respond before you know whether your request was understood.
The server’s response includes a status code. A status code might let you know that your request was successful, that your request was denied, or that your request format is broken, among a number of possible messages. Here are some ways I’ve seen a server respond to an HTTP request:
- Status code 200 OK: your server request was successful, and the server will respond by giving you data! This is like if the human you meowed at turns around and starts immediately scratching you.
- Status code 400 Bad Request: your request was formed improperly and the server didn’t understand what you meant. If you meow at a human and they act confused, that is like a status code 400. You might need to send a different request in a format the server can understand, analogous to meowing and rolling over to show off your belly.
- Status code 401 Unauthorized: the server understood your request, but you haven’t proven who you are. Your credentials (such as an API key) are missing or invalid. Fortunately, you’ve never encountered the analogous head-scratching situation; I always recognize you when you ask for head-scratches 🙂
- Status code 403 Forbidden: the server understood your request and knows who you are, but you are not allowed to access those data. This is like you meowing at me while I am on a video call. I understand that you want me to scratch you, but I won’t do it because my hands are 3000 miles away.
- Status code 404 Not Found: the server understood your request, but still couldn’t find your data. This is like if the human you meowed at turns around to acknowledge you, but their hands are full and they’re not going to scratch you right now. But if they update the server later (or put down their groceries), those data (scratches) might become available.
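If you ever forget what a status code means, Athena, Python’s standard library can remind you. The sketch below uses the built-in http.HTTPStatus enum to turn a number into its official phrase. The helper names describe_status() and is_success() are my own inventions for this illustration, not part of any library.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found' for a numeric status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The number is not one of the codes the standard library knows about.
        return f"{code} (not a standard status code)"
    return f"{status.value} {status.phrase}"


def is_success(code: int) -> bool:
    """Codes in the 200 range mean the request succeeded."""
    return 200 <= code < 300


# Print the five codes from the list above.
for code in (200, 400, 401, 403, 404):
    print(describe_status(code), "| success:", is_success(code))
```

Only the 200 in that list counts as a success; every code in the 400 range means something about the request needs to change before the scratches arrive.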
So now that you understand the procedure for talking to a server, how might we actually write code to do so? In Python, the requests library is a great place to start. We might use code like this to make an HTTP request.
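Here is a minimal sketch of what an http_request() function could look like. Treat it as an illustration under assumptions rather than the exact code behind this post: the URL and the api_key and page parameter names are hypothetical placeholders, and I am assuming the server sends its data back as JSON.

```python
import requests


def http_request(url, params=None, timeout=10):
    """Send a GET request and return the decoded JSON body, or None on failure.

    Assumption: the server replies with JSON when the request succeeds.
    """
    response = requests.get(url, params=params, timeout=timeout)

    if response.status_code != 200:
        # Anything other than 200 OK means we did not get our data.
        print(f"Request failed: {response.status_code} {response.reason}")
        return None

    return response.json()


# Hypothetical usage; the URL and parameter names are placeholders:
# data = http_request(
#     "https://api.example.com/v1/records",
#     params={"api_key": "YOUR_KEY_HERE", "page": 1},
# )
```

The timeout argument is worth keeping: without it, requests will wait indefinitely for a server that never answers, which is a bit like meowing at an empty house.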
Notice how simple it is! All we have to do is give this function a URL and certain parameters (like our authentication keys and what database page we want). The requests.get() function returns a response object that has a status code (like we discussed above) and, if the request was successful, also has some data to analyze!††
It’s so great to be using a library that makes HTTP requests just as easily as you make head-scratching requests. Without too much trouble, I can write a function that gets me data from an external server – and then I can sit back and relax.
† To be clear, this is NOT universally true! Sometimes people decide to blindly collect data when they should be reconsidering the methods they’re using to analyze those data. No amount of data collection will help them if those data are biased or their data analysis methods are somehow misguided. In analogy to scratching: if someone is trying to scratch your ears when really you want your whole head scratched, more scratches are not better – they’re just not doing it right.
†† Note that in my test_http_request() function, I am calling a function complete_http_request_generators() rather than the function http_request() I showed in the blog. The complete_http_request_generators() function behaves similarly to the http_request() function, except that it correctly handles calling multiple pages of the database. If you are interested in looking at the entire code, check it out on my GitHub.
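For the curious, here is one way that paging through a database could be sketched with a generator. This is not the complete_http_request_generators() function itself. It is a hypothetical iter_pages() helper, and it assumes the server accepts a 1-indexed page query parameter and returns an empty JSON body once you ask for a page past the last one. Real servers differ, so check their documentation.

```python
import requests


def iter_pages(url, params=None, max_pages=100, timeout=10):
    """Yield the decoded JSON body of each page until one is empty or fails.

    Assumptions: the server takes a 1-indexed `page` query parameter and
    returns an empty JSON body (such as []) after the last page.
    """
    params = dict(params or {})  # copy so the caller's dict is not modified
    for page in range(1, max_pages + 1):
        params["page"] = page
        response = requests.get(url, params=params, timeout=timeout)
        if response.status_code != 200:
            break  # stop on any error response
        data = response.json()
        if not data:
            break  # an empty page means there is nothing left to fetch
        yield data
```

Because it is a generator, each page is requested only when the caller asks for the next one, so a loop such as `for page in iter_pages(url): ...` can stop early without fetching the rest.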