In the previous post of this series, we introduced the API we chose for data scraping, described the key parts of the scraping scripts one of us (Bastien) programmed, and provided the link to the GitHub repo where the scripts are available. If you haven’t already done so, we suggest you read that post and the one before it, as they may help you better understand this one.
In this third and last post of the series, we will show you how to use the scripts Bastien programmed with only minimal programming knowledge.
The scripts are written in Python, a free and open-source programming language. The code should run on any system (Windows, Mac, or Linux), but since the scripts were developed for and tested on Windows, you may run into errors elsewhere. If you can’t get them to work, don’t hesitate to contact Bastien (he even helps Mac users). Here are the four steps to take before using the scripts:
- Install Python – If you don’t have Python on your device, you can download its latest version from this page. We recommend running through the installation with the default settings.
- Download the required Python modules – Some modules that are not included in the default Python environment (if you are familiar with R, modules are comparable to packages) are needed for the scripts to run: requests, pandas, and pause. The Python documentation provides enough information on how to install them.
- Download the scripts – The scripts to do the web scraping are deposited on the following GitHub Repo. Download the repository as a .zip file and extract it where you prefer to store it.
- Subscribe to the Aerisweather API – Aerisweather currently offers different packages whose prices range from $80 to $690 monthly. You can find a breakdown of the different packages they offer here. In order to collect historical weather data, you’ll need a package with the “Archive add-on” selected. If you don’t have the funds and want to combine forces with other readers of this blog, leave a comment so you can coordinate with each other.
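After installing the modules (for instance with `python -m pip install requests pandas pause`), you can quickly check from Python that all three are available. This standalone snippet is ours, not part of Bastien’s scripts:

```python
# Quick check that the three required modules are importable.
import importlib.util

for module in ("requests", "pandas", "pause"):
    found = importlib.util.find_spec(module) is not None
    # Print the install command for any module that is still missing
    print(f"{module}: {'installed' if found else 'MISSING -- run: python -m pip install ' + module}")
```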
Preparing your dataset
You will need a dataset on the basis of which to scrape data, and you will need to prepare it before you start. The scripts require your dataset to be in wide format (that is, a single participant occupies a single line), so you will have to transform it from long to wide format if it is not already. Your dataset must also be saved in .csv format, with “;” as the separator.
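If your dataset is in long format, pandas can do the reshaping for you. Here is a minimal sketch with made-up column names (“measure”, “value”) and data; substitute your own:

```python
# Hypothetical example: reshape a long-format dataset (one row per
# participant-measurement) into wide format (one row per participant).
import pandas as pd

long_df = pd.DataFrame({
    "id": ["participant_1", "participant_1", "participant_2", "participant_2"],
    "measure": ["mood", "stress", "mood", "stress"],
    "value": [4, 2, 5, 1],
})

# One column per measure, one row per participant
wide_df = long_df.pivot(index="id", columns="measure", values="value").reset_index()
wide_df.columns.name = None

# Save with ";" as the separator, as the scripts expect
wide_df.to_csv("dataset.csv", sep=";", index=False)
```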
In order to collect weather data for your dataset, some information is mandatory for each participant:
- id – This corresponds to the unique identifier of your participant. It can simply be a series of letters/numerical values. For instance, “participant_1”.
- location – This corresponds to the latitude/longitude combination of the location for which you want to collect weather data (for instance, the location where the participant completed your study). Each value must have the following format: “latitude,longitude”. Example for Paris: “48.856613,2.352222”. Different tools are available online to estimate the latitude/longitude coordinates of a location (e.g., https://www.latlong.net/). If you need to retrieve a massive number of coordinates (as was the case for us), you’ll probably have to automate this step. For ideas on how to do so, check the following GitHub repo. It contains an annotated script (as well as sample files) that Bastien programmed to map 11,000 US zip codes to latitude/longitude coordinates.
- date – This corresponds to the date for which you want to collect weather data (for instance, the date on which the participant completed your study). The script works with a specific date and time, so each value must have the following format: “DD/MM/YYYY HH:MM”. Example: “21/06/2020 16:20”. Importantly, the date must be saved in the time zone of the location for which you want to collect weather data.
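Putting the three mandatory columns together, here is a minimal example of a valid dataset, saved as a “;”-separated .csv. The filename and the Lyon coordinates are just illustrations, and any column names work, since you declare them in “aerisweather_keys.py” (see below):

```python
# A minimal dataset with the three mandatory columns
import pandas as pd

df = pd.DataFrame({
    "id": ["participant_1", "participant_2"],
    "location": ["48.856613,2.352222", "45.764043,4.835659"],  # Paris, Lyon
    "date": ["21/06/2020 16:20", "22/06/2020 09:05"],          # DD/MM/YYYY HH:MM, local time
})
df.to_csv("my_study.csv", sep=";", index=False)
```

Note that the “;” separator matters here: the location values themselves contain a comma, so a comma-separated file would split them in two.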
Collecting the data
After you have downloaded the files, locate the “aerisweather_keys.py” file and edit it with the text editor of your choice. You’ll have to replace the default values of several variables with values that match your setup. Here is a breakdown of the variables you’ll have to edit (the format each variable must take is documented in the “aerisweather_keys.py” file):
- CLIENT_ID: Id key of your Aerisweather account.
- CLIENT_SECRET: Secret key of your Aerisweather account.
- DATAFILE: Name of your datafile.
- PARTICIPANT_ID_COLUMN: Name of the column in your datafile that contains the participant identifier.
- TIMESTAMP_COLUMN: Name of the column in your datafile that contains the date for which you want to collect weather data.
- EXACT_LOCATION_COLUMN: Name of the column in your datafile that contains the latitude/longitude combination of the location for which you want to collect weather data.
- UTC_TIMEZONE: UTC time zone of the time displayed on your computer.
- YEAR: Year for which you want to collect the “yearly averages data” (see below).
- SUMMARY_FOI: Weather variables of interest for the “yearly averages data” (see below).
- ARCHIVE_FOI: Weather variables of interest for the “specified date data” (see below).
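To make this concrete, here is a hypothetical edited “aerisweather_keys.py”. Every value below is a placeholder (the column names match the example dataset described above), and for the variables whose format we do not reproduce here, follow the formats documented inside the file itself:

```python
# Hypothetical values only -- replace with your own.
CLIENT_ID = "your_aerisweather_id_key"
CLIENT_SECRET = "your_aerisweather_secret_key"
DATAFILE = "my_study.csv"
PARTICIPANT_ID_COLUMN = "id"
TIMESTAMP_COLUMN = "date"
EXACT_LOCATION_COLUMN = "location"
# For UTC_TIMEZONE, YEAR, SUMMARY_FOI, and ARCHIVE_FOI, use the exact
# formats documented in "aerisweather_keys.py".
```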
Once you have set the values that match your setup, you are ready to collect your weather data.
You can now run the “aerisweather_run.py” script. You will be prompted to choose the type of data you want to collect:
- Specified date data
- The script will collect, for each participant in your dataset, the weather data of the day documented in the “date” column (“ontime” timing), of the day before (“day-1” timing), and of two days before (“day-2” timing). Data will be saved in a “results” folder (1 file per timing), with the filename format being “specified_date_TIMING_PARTICIPANTID.json”. For instance, “specified_date_ontime_participant_1.json”
- Yearly averages data
- The script will collect the weather data of the 12 months of the year you specified in the “aerisweather_keys.py” file, for each unique location in your dataset. Data will be saved in a “results” folder (1 file per month), with the filename format being “averages_YEAR_LOCATION_FIRSTDAYOFMONTH_LASTDAYOFMONTH.json”. For instance, “averages_2019_48.856613,2.352222_2019-01-01_2019-01-31.json”
Once you have chosen the type of data you want to collect, the final step is to choose the number of threads (between 1 and 10) that the script will run simultaneously. This corresponds to the number of simultaneous accesses to Aerisweather. Unless you are doing some testing, we recommend going with 10, as it will collect the weather data faster.
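To give an intuition of what this thread setting does, here is a standalone sketch (our own, not the actual script) that runs several dummy “API calls” in parallel with a thread pool; the fetch_weather function is a hypothetical stand-in for one request to Aerisweather:

```python
# Run several (dummy) API calls in parallel, as the script does
from concurrent.futures import ThreadPoolExecutor

def fetch_weather(participant_id):
    # Placeholder: the real script sends an HTTP request here
    return f"collected data for {participant_id}"

participants = [f"participant_{i}" for i in range(1, 6)]

# max_workers = number of simultaneous accesses (1 to 10 in the script)
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(fetch_weather, participants))

print(results[0])  # -> "collected data for participant_1"
```

With 10 workers, up to 10 requests are in flight at once, which is why the higher setting finishes sooner.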
Merging the data
If you have successfully collected the data from Aerisweather, you should have a “results” folder filled with .json files. Analyzing these files directly would be challenging without further processing.
The “aerisweather_statscomputing.py” script allows you to process and merge the data into one single .csv file (where the separator will be “;”). Importantly, you can only run this script if you keep the default variable choices in the “aerisweather_keys.py” file (at least for now).
When running the script, you will be prompted with the same message you got when you started the “aerisweather_run.py” script (i.e., you must choose which type of data you want to merge).
Below you can also find the codebook of the resulting .csv file, for each type of data:
- Specified date data (filename format is “analytic_data_specified_date.csv”)
- Yearly averages data (filename format is “analytic_data_averages_YEAR.csv”)
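For readers curious about what the merging step does under the hood, here is a minimal sketch of the general idea: read every .json file in the “results” folder and flatten them into one “;”-separated .csv. The dummy file contents below are stand-ins; the real columns depend on the Aerisweather response structure:

```python
# Sketch: merge a folder of .json files into a single ";"-separated .csv
import json
import pathlib
import pandas as pd

results_dir = pathlib.Path("results")
results_dir.mkdir(exist_ok=True)

# Dummy files standing in for the script's actual output
for pid in ("participant_1", "participant_2"):
    (results_dir / f"specified_date_ontime_{pid}.json").write_text(
        json.dumps({"id": pid, "tempC": 21})
    )

# One row per .json file, merged into one analytic dataset
rows = [json.loads(p.read_text()) for p in sorted(results_dir.glob("*.json"))]
pd.DataFrame(rows).to_csv("analytic_data.csv", sep=";", index=False)
```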
This was the last post of our series on data scraping. In it, we gave you the remaining guidelines on how to use the scripts Bastien programmed to collect your own weather data from Aerisweather.
Finally, if our posts on data scraping sparked your interest in this skill, we can only encourage you to jump in. Knowing how to scrape data is certainly a valuable skill at a time when the amount of data available online is increasing exponentially. As long as you feel comfortable with a programming language that allows you to perform web requests and you understand the basic principles of data scraping, you shouldn’t need any specific training to code your first data scraping scripts.
This post was written by Bastien Paris and Hans IJzerman