In the previous post of this series, we defined the concept of data scraping and introduced its key principles. If you haven’t already done so, we suggest you read that post first, as it will help you better understand this one.
In this second post, we will introduce the API we chose to perform data scraping, describe the key parts of the data scraping scripts one of us (Bastien) programmed, and provide a link to the GitHub repo where the scripts are available.
AerisWeather provides an API that satisfied all the needs of our research projects:
- Their API provides historical weather data from 2011 onwards for various weather variables (e.g., temperature, wind, sky coverage, or humidity).
- Their API supports most locations across the world.
Importantly, their pricing is among the most competitive: our $245 subscription plan allowed us to collect weather data for three different research projects (totaling more than 75,000 participants and more than 10,000,000 data points), and we were still far from the plan’s usage limit. Note that it is important to evaluate pricing before programming, as prices can vary widely across providers.
Once we had found the API, the next step was to program a script that would communicate with it to collect the data we needed. To facilitate this task, API providers typically publish documentation on how to use their API (e.g., https://www.aerisweather.com/support/docs/api/).
Below we describe the key part of the data scraping scripts Bastien programmed to collect data from AerisWeather.
Collecting the data and saving them
Below is a sample of the dataset for which we want to collect weather data:
You can see that each participant is associated with a date (start_date variable, which is the time at which the participant started the study) and geographic coordinates (geo_coordinates variable, which is the latitude and longitude combination of the place where the person participated in the study).
Let’s say that, for each participant, our goal is to retrieve the weather data of the day during which the participant completed the study, for the location to which the participant is associated.
Our script will have to repeat one key step for each participant: collect and save the weather data for the time/location combination associated with the participant.
To collect these data, a brief look at the API documentation suggests the use of the following URL structure:
https://api.aerisapi.com/observations/archive/LOCATION?from=DATE&fields=VARIABLES&client_id=ID_KEY&client_secret=SECRET_KEY
Here, LOCATION is the latitude/longitude combination, DATE is the date for which we want data, and VARIABLES lists the variables we want to extract from AerisWeather’s database (ID_KEY and SECRET_KEY are the credentials tied to our subscription).
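The URL above can be assembled programmatically. Below is a minimal sketch in Python: the coordinates, field names, and credentials are illustrative placeholders, not the actual values from our projects.

```python
from urllib.parse import urlencode

def build_observation_url(lat, lon, date, fields, client_id, client_secret):
    """Build an archive-observations URL for one location/date combination."""
    base = f"https://api.aerisapi.com/observations/archive/{lat},{lon}"
    params = urlencode({
        "from": date,                # e.g. "2021-01-27"
        "fields": ",".join(fields),  # comma-separated variable names
        "client_id": client_id,
        "client_secret": client_secret,
    })
    return f"{base}?{params}"

# Example call with placeholder coordinates, field names, and credentials
url = build_observation_url(
    51.48, -3.18, "2021-01-27",
    ["periods.ob.tempC", "periods.ob.humidity"],
    "MY_ID", "MY_SECRET",
)
```

Using `urlencode` rather than pasting values into the string by hand ensures special characters in dates or field lists are escaped correctly.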
Below is the URL that will return the weather data of interest for participant_1 (of course, you could pick different variables; these are the ones we deemed of interest for our projects. For the list of available variables and what the abbreviations mean, see here):
Below is a sample of the data for our participant from Cardiff on January 27th, 2021 when accessing the URL above:
AerisWeather’s API documentation describes how the data are structured. In short, the data are returned in JSON format, a standard format for storing data in a structured fashion (like XML).
Once the data are returned, you just need to save them on your computer with a filename format your script applies to each file (so that you can easily find the data you’re interested in). Here, a filename format such as “PARTICIPANTID_TIMING.json” (e.g., “participant1_DayOfDataCollection.json”) would be appropriate.
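The fetch-and-save step might look like the sketch below. The sample response is made up purely for illustration (the real structure is described in AerisWeather’s documentation), and a real run would call `fetch_json` with an actual request URL.

```python
import json
import os
import tempfile
import urllib.request

def fetch_json(url):
    """Download and parse the API's JSON response (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def save_json(data, participant_id, timing, out_dir="."):
    """Save one participant's weather data as PARTICIPANTID_TIMING.json."""
    path = os.path.join(out_dir, f"{participant_id}_{timing}.json")
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
    return path

# Demo with a made-up response; a real run would use fetch_json(url) instead
sample = {"response": [{"ob": {"tempC": 6.0}}]}
out_dir = tempfile.mkdtemp()
saved_path = save_json(sample, "participant1", "DayOfDataCollection", out_dir)
```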
Now that we have a method to collect and save the weather data for one participant, we can apply it to all participants with a loop.
In this post, we introduced you to the API we chose to perform data scraping and described the key parts of the data scraping scripts Bastien programmed. If you want to download these scripts, you can do so from our GitHub repo. Once you have downloaded them, you are ready to read the third and final post, where we will show you how to use these scripts with only minimal programming knowledge.
This post was written by Bastien Paris and Hans IJzerman