
Natural History Museum - Dinosaurs

Platforms used: Python, BigQuery, Tableau

Python: Web Scraping 

I love dinosaurs. I absolutely loved the Jurassic Park movies (the originals, not so much the new ones), so I wanted to do a fun project related to dinosaurs. I didn't want to pull a free dataset from Kaggle; I wanted to create my own.
 
So, for this project, I thought it would be fun to scrape the Natural History Museum's website for information on every dinosaur they list.
 
Before starting, here are a few things I needed to consider:
  1. There are many, many dinosaurs, and each dinosaur is on a separate URL. I will have to run code to pull each dinosaur name and add it to the end of a base URL to build the correct links.
  2. Some of the data I would like to include lives within the same 'div' classes in the HTML of each dinosaur page.
  3. Not every page has the same data. I will need to account for fields that come back as a different type, such as "NoneType", when an element isn't found.
  4. Once I extract the data, I would like to store it in BigQuery. I will first add it to a pandas DataFrame, then import it from pandas into my data table in BigQuery.
  5. From there, since Tableau Public does not support BigQuery, I will use SQL to extract the data I would like to visualize, then download it as an Excel file. Otherwise, I would link this BigQuery table directly to Tableau.
First Step: Scrape every dinosaur URL available and push the DataFrame to BigQuery
  1. Import the packages needed: BeautifulSoup, Selenium, and ChromeDriver.
  2. Each dinosaur URL shares the same base, "https://www.nhm.ac.uk/discover/dino-directory/", with the name of the dinosaur appended at the end, like so: "https://www.nhm.ac.uk/discover/dino-directory/aardonyx.html"
  3. I didn't want to manually enter each dinosaur URL into my code, so I found the dinosaur directory here: "https://www.nhm.ac.uk/discover/dino-directory/name/name-az-all.html", which lists all the dinosaur names, each linked to its URL.
  4. So, I ran code to find all the dinosaur name links in the HTML using the selector 'ul.dinosaurfilter--dino-list a' and extracted each URL from its 'href' attribute value. Then, I added the base URL "https://www.nhm.ac.uk" to each link to get the full URLs.
  5. Then, I found the 'div' classes I would need within each page. This is where I noticed that some fields live within the same 'div' class. My workaround for length and type of dinosaur: the code looks for an element labeled 'length' and stores it as 'length-element'. If it doesn't exist, 'length' is assigned an empty string; if it does, the code takes the text of the sibling element next to 'length-element' and assigns it to 'length'. The same approach applies to 'dinosaur_type'.
  6. Then, I added the extracted data into a dictionary to prep for a pandas DataFrame.
  7. The last step is to push this pandas DataFrame into BigQuery with 'pandas_gbq'.
Python Code:
[Code screenshots: Python 1.png, Python 2.png, Python 3.png, Python 4.png]
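In case the screenshots are hard to read, here's a simplified sketch of the approach above (not my exact code). The A–Z directory URL, the 'ul.dinosaurfilter--dino-list a' selector, and the pandas_gbq push come from the steps above; the per-page tags ('h1', 'dt'), the column names, and the BigQuery project/table IDs are just placeholders for illustration.

```python
# Simplified sketch of the scrape-and-load flow; the A-Z list selector comes from
# the write-up, while the per-page tags and BigQuery IDs are placeholders.
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import pandas_gbq

BASE = "https://www.nhm.ac.uk"
AZ_LIST = "https://www.nhm.ac.uk/discover/dino-directory/name/name-az-all.html"

driver = webdriver.Chrome()

# 1. Collect every dinosaur URL from the A-Z directory page.
driver.get(AZ_LIST)
soup = BeautifulSoup(driver.page_source, "html.parser")
dino_urls = [BASE + a["href"] for a in soup.select("ul.dinosaurfilter--dino-list a")]

def sibling_text(page, label):
    """Find the element whose text contains `label` and return its sibling's text,
    or an empty string when the label (or its sibling) is missing."""
    element = page.find("dt", string=lambda s: s and label in s)  # 'dt' is an assumption
    if element is None or element.find_next_sibling() is None:
        return ""
    return element.find_next_sibling().get_text(strip=True)

# 2. Visit each dinosaur page and pull the fields of interest.
rows = []
for url in dino_urls:
    driver.get(url)
    page = BeautifulSoup(driver.page_source, "html.parser")
    heading = page.find("h1")  # dinosaur name; 'h1' is an assumption
    rows.append({
        "name": heading.get_text(strip=True) if heading else "",
        "length": sibling_text(page, "Length"),
        "type_of_dinosaur": sibling_text(page, "Type"),
    })

driver.quit()

# 3. Build the DataFrame and push it to BigQuery (project/table IDs are placeholders).
df = pd.DataFrame(rows)
pandas_gbq.to_gbq(df, "dino_dataset.dinosaurs", project_id="my-project", if_exists="replace")
```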
Second Step: Create BigQuery Table and Clean Data with SQL
  1. First, I built a new project, dataset, and table in BigQuery.
  2. Then, I built the schema for the table. I kept all fields as STRING and all fields as "NULLABLE" except for the dinosaur name, which was required; I didn't want any data that wasn't tied to a dinosaur name, given that each record came from a separate URL. (A sketch of an equivalent schema definition follows this list.)
  3. I queried the table as is and looked for data in the preview that needed to be updated:
    1. Several columns needed proper title case (Diet, etc.).
    2. I wanted to separate the names from the year discovered and add a separate column for "year". Upon import, the data in that column looked like "Last Name (1994)".
    3. I used regex functions to remove the 'm' from the length column and the punctuation in the Name Meaning column.
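For reference, here's roughly what that schema would look like if it were defined in code with the google-cloud-bigquery client instead of the console. This is just a sketch: the project, dataset, table, and column names are placeholders, not necessarily the ones I used.

```python
# Sketch of the table schema using the google-cloud-bigquery client.
# All identifiers below (project, dataset, table, column names) are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("name", "STRING", mode="REQUIRED"),  # the only required field
    bigquery.SchemaField("diet", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("length", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("type_of_dinosaur", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("named_by", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("name_meaning", "STRING", mode="NULLABLE"),
]

table = bigquery.Table("my-project.dino_dataset.dinosaurs", schema=schema)
client.create_table(table)
```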

 

Querying the table as is with the following SQL code left me with the results below:

[Screenshots: SQL1.png, Data1.png]
Prior to exporting the data to an Excel file (which, by the way, I wish Tableau Public connected directly to BigQuery!), I had some cleaning to do. See below for my code and output. You'll notice the length column has the 'm' removed, Type_of_Dinosaur is in title case, and the Named_by column has the year separated out into its own column. This will help with data visualization.
[Screenshots: SQL2.png, Data2.png]
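Here's an approximation of that cleaning query for reference. The transformations (title case, stripping the 'm', splitting the year out of Named_by, removing punctuation from Name Meaning) match the list above, but the table and column names are placeholders and the exact regexes may differ from what's in the screenshots.

```python
# Approximate version of the cleaning query; table and column names are placeholders.
import pandas_gbq

CLEAN_QUERY = r"""
SELECT
  name,
  INITCAP(diet) AS diet,
  INITCAP(type_of_dinosaur) AS type_of_dinosaur,
  REGEXP_REPLACE(length, r'm$', '') AS length,                      -- strip trailing 'm'
  REGEXP_REPLACE(name_meaning, r'[^a-zA-Z ]', '') AS name_meaning,  -- drop punctuation
  REGEXP_REPLACE(named_by, r'\s*\(\d{4}\)', '') AS named_by,        -- keep just the name
  REGEXP_EXTRACT(named_by, r'\((\d{4})\)') AS year                  -- split out the year
FROM `my-project.dino_dataset.dinosaurs`
"""

clean_df = pandas_gbq.read_gbq(CLEAN_QUERY, project_id="my-project")
print(clean_df.head())
```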
From here, I exported the results to Google Sheets and downloaded them as an Excel file. As I mentioned, if this were a very large dataset, or if Tableau Public could connect to BigQuery, I would connect BigQuery directly to Tableau and skip this step.
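If you'd rather skip the Google Sheets round trip entirely, the query result can also be saved straight from pandas to a file Tableau Public accepts. A tiny sketch (the query, project ID, and file names are placeholders; to_excel needs openpyxl installed):

```python
# Alternative to the Google Sheets step: save a query result locally for Tableau Public.
# The query, project ID, and file names are placeholders; to_excel requires openpyxl.
import pandas_gbq

df = pandas_gbq.read_gbq(
    "SELECT * FROM `my-project.dino_dataset.dinosaurs`",  # or the cleaning query above
    project_id="my-project",
)
df.to_csv("dinosaurs.csv", index=False)
df.to_excel("dinosaurs.xlsx", index=False)
```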

Tableau Dashboard: Data Visualization

Please click this link to navigate to Tableau Public, where you can see the correct formatting of the dashboard. The embedded version changes the formatting, but it is here for easy reference.

Key Insights from this data set: 

  • From 1826 to 2017, 309 dinosaurs in this dataset were discovered.

  • Of all these dinosaurs, 60.2% were herbivorous and 30.1% were carnivorous. 

  • 57 dinosaurs were discovered in China, 50 in the USA, and 38 were found in Mongolia. These are the top 3 countries by dinosaurs found.

  • 1979 & 2006 had the highest number of dinosaur discoveries per year, with 11 dinosaurs found each year.

  • 140 out of 309 dinosaurs discovered (45.3%) were from the Late Cretaceous period.

  • 114 out of 309 dinosaurs discovered (36.9%) were large or small theropods.

  • The 3 largest dinosaurs discovered by length in meters were the Patagotitan (37.5 meters), Supersaurus (35 meters), and the Argentinosaurus (35 meters).
