One of the main research topics in our lab is social thermoregulation. Therefore, much of our research involves the collection of temperature data in various forms (like the participant’s core or peripheral body temperature or the ambient temperature in the lab).
For one of our projects this year, we focused on a slightly different kind of temperature data: historical weather data from over 26,000 locations between 2012 and 2020. We also collected weather data for other projects, including projects by the Psychological Science Accelerator. Overall, we needed to collect weather data for more than 10,000,000 data points.
Unless you are willing to retrieve all these data manually, some programming skills are necessary to complete this task. This series of three posts will introduce you to the large-scale extraction of online data, a practice known as “web scraping”.
In the first post of this series, we define web scraping and introduce its key principles. In the second post, we will introduce the API (defined below) that we chose for data scraping, discuss the key data scraping scripts, and make a public version of those scripts available. In the third post, we will show how to use the scripts, even if you are only minimally familiar with programming.
What is web scraping?
The Wikipedia definition of web scraping is “data scraping used for extracting data from websites”. In other words, web scraping consists of extracting data stored on a website.
Here is a concrete example: when you go to https://www.weather.com, you get the current temperature and weather conditions for a location. Here is what we got when we went to this site:
Scraping the temperature and weather conditions displayed on this page would consist of extracting the following data: 27°C and Très nuageux (“very cloudy” in English).
Extracting these data manually is easy: you just need to open the URL above and read the page. But now imagine you have to do this 10,000,000 times for many different locations. Done by hand, this would be a royal pain and simply not worth the effort. Data scraping helps you automate what would otherwise be a colossal task.
How can we automate data extraction from a website?
The data you see on a web page are all stored in the source code of that web page (in a more or less transparent manner). Here is a sample of the source code of the weather.com page that we previously accessed (you can see how to access the source code of a web page in Google Chrome here):
As you can see, the temperature and weather conditions data we were interested in (27°C and Très nuageux) are present in the source code of that web page. Because the information is available in the source code, you can write a script that extracts the data by “parsing” the source code of the web page. Writing a web scraping script requires knowledge of a programming language that can send Internet requests (like Python or R).
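To make the idea of parsing concrete, here is a minimal Python sketch. It does not use the real weather.com source code: the HTML snippet and the class names `temperature` and `phrase` are made up for illustration, and the real page is structured differently. The sketch only relies on Python's built-in `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical snippet standing in for a page's source code.
# The real weather.com markup and class names are different.
SAMPLE_HTML = """
<div class="current-conditions">
  <span class="temperature">27°</span>
  <div class="phrase">Très nuageux</div>
</div>
"""


class WeatherParser(HTMLParser):
    """Collects the text found inside elements with the target class names."""

    TARGETS = {"temperature", "phrase"}  # assumed class names

    def __init__(self):
        super().__init__()
        self._current = None  # target class of the element being read, if any
        self.values = {}

    def handle_starttag(self, tag, attrs):
        # An element can carry several classes; look for one we care about.
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.TARGETS:
                self._current = name

    def handle_data(self, data):
        # Store the first non-empty text that follows a target opening tag.
        if self._current and data.strip():
            self.values[self._current] = data.strip()
            self._current = None


def extract_weather(html):
    """Return a dict mapping each target class name to the text it contains."""
    parser = WeatherParser()
    parser.feed(html)
    return parser.values


print(extract_weather(SAMPLE_HTML))
```

In a real script, the string passed to `extract_weather` would be the source code downloaded from the website rather than a hard-coded sample.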
Extracting data from the source code of a web page is possible because the source code tends to keep the same structure from one access to the next. Compare it to a house: if we built several exact copies of your house, you would not need a map to find your way to the bathroom in any of them, which is reassuring in case of an emergency.
You can try this yourself by searching for another place in the search bar of https://www.weather.com: you will find that the temperature and weather conditions appear in the source code in a very similar way across accesses (again, see here how to find the source code of a web page in Google Chrome):
The role of an API
You may have gotten the impression that web scraping is not always an easy task. That is because most online services are not exactly thrilled to share their data without financial compensation, which leads most of them to block data scraping attempts.
Here are a couple of roadblocks you can encounter when scraping the content of web pages:
- The structure of the source code can vary from access to access, with variable names being generated dynamically.
- The website can detect and block your scraping attempts by capping your number of accesses per minute (or by blocking all accesses outright).
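The second roadblock is one reason scraping scripts usually space out their requests. The Python sketch below shows one simple way to do that, under our own assumptions: the function name `fetch_all`, its parameters, and the limit of 30 accesses per minute are all hypothetical, and `fetch` stands for whatever function actually downloads a page.

```python
import time


def fetch_all(urls, fetch, max_per_minute=30, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, keeping at most max_per_minute calls per minute.

    sleep and clock can be replaced (for example in tests) so that
    no real waiting takes place.
    """
    min_interval = 60.0 / max_per_minute  # seconds between two consecutive calls
    results = []
    last_call = None
    for url in urls:
        if last_call is not None:
            # Wait for whatever remains of the minimum interval, if anything.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in for a real download function, so the demo needs no network.
    def fake_fetch(url):
        return f"<html for {url}>"

    pages = fetch_all(["page-1", "page-2", "page-3"], fake_fetch, max_per_minute=600)
    print(pages)
```

Spacing out requests does not get around a website's terms of use. It only keeps a script under a known limit.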
In our case, extracting massive historical weather data required us to subscribe to an Application Programming Interface (better known as an API).
In short, an API is an interface between you (counterpart A) and the service from which you want to extract data (counterpart B). This interface is built to ease the communication between the two counterparts, with a standardized way of both requesting and receiving the data.
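To give a feel for what this standardized communication can look like, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the address `api.example.com`, the parameter names (`lat`, `lon`, `date`, `key`), the coordinates, and the fields of the response are hypothetical and do not describe the API we used. The request is a URL with parameters, and the answer is structured text (here JSON) that needs no parsing of a web page's source code.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://api.example.com/v1/history"


def build_request_url(latitude, longitude, date, api_key):
    """Build the URL asking for one location's weather on one day (YYYY-MM-DD)."""
    query = urlencode({"lat": latitude, "lon": longitude, "date": date, "key": api_key})
    return f"{BASE_URL}?{query}"


def parse_response(body):
    """Extract the temperature and conditions from a JSON response body."""
    payload = json.loads(body)
    return payload["temperature_c"], payload["conditions"]


# Made-up response in the assumed format. A real script would obtain it
# by sending the URL over HTTP, e.g. with urllib.request.urlopen(url).
SAMPLE_RESPONSE = '{"temperature_c": 27, "conditions": "Very cloudy"}'

url = build_request_url(45.19, 5.72, "2019-07-01", "YOUR_API_KEY")
print(url)
print(parse_response(SAMPLE_RESPONSE))
```

Because both the request and the response follow a fixed format, the same two functions can be reused for every location and date.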
Different APIs exist for the same information (like the weather). Their prices vary depending on multiple factors such as the quality of the data provided or the number of accesses granted within a given time period.
In this post, we described the basic principles of web scraping and how to perform it.
In the second post of this series, we will introduce you to the API we chose to perform data scraping and we will describe the key part of the data scraping scripts one of us (Bastien) programmed.
This post was written by Bastien Paris and Hans IJzerman