By Elizabeth Slesarev
Don't confuse meteors with meteorites!
"Like meteorites, meteors are objects that enter Earth's atmosphere from space. But meteors—which are typically pieces of comet dust no larger than a grain of rice—burn up before reaching the ground. ... The term “meteorite” refers only to those bodies that survive the trip through the atmosphere and reach Earth's surface." -AMNH
In this tutorial, I will go over some of Python's most useful libraries and tools for analyzing coordinates, using meteorite impact sites as our data. We will also cover some ML methods to analyze our chosen datasets, so we can develop and test a hypothesis about any patterns or relationships we find.
We will visualize where meteorites land around the world and investigate whether meteorites truly stem from supernovas.
All relevant files, images, and databases will be contained within this repo on GitHub.
Information on DBScan: https://towardsdatascience.com/dbscan-clustering-for-data-shapes-k-means-cant-handle-well-in-python-6be89af4e6ea
Information on KMeans: https://medium.datadriveninvestor.com/weighted-k-means-clustering-of-gps-coordinates-python-7c6270846163
Information on Linear Regression: https://machinelearningmastery.com/linear-regression-for-machine-learning/
Information on meteorites: https://www.amnh.org/explore/news-blogs/on-exhibit-posts/meteor-meteorite-asteroid
Information on databases, libraries and any other methods will be linked as introduced below.
For this project, we will begin by importing the basics: numpy, pandas, and matplotlib, to name a few...
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
We will collect data from two sources. The first is NASA's Meteorite Landings database, which holds data on meteorite landings dating back to the year 1409. It is a smaller version (ending at the year 2013) of The Meteoritical Society's more up-to-date database, which records landings up to the present.
The second source, which we will parse later in the tutorial, is a database of all identified supernovas from antiquity up to the present. Hosted by SNE Space, this database contains relevant information about when supernovas were observed.
Both datasets will be imported from CSV files, for ease of parsing.
df = pd.read_csv('Meteorite_Landings.csv', index_col=0)
# drop all rows with missing values so every record has complete coordinates and dates
df = df.dropna()
For this dataset, we will be working with these columns:
name: meteorite name
id: unique meteorite ID tag
recclass: classification of the meteorite, based on composition and other physical characteristics
mass (g): mass of the meteorite in grams
fall: whether the meteorite's fall was observed ("Fell") or the meteorite was discovered later ("Found")
year: DateTime of the year and date the meteorite impacted Earth
reclat: latitude of the impact site
reclong: longitude of the impact site
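Before parsing anything, here is a quick way to confirm those columns exist as named (a small sanity-check snippet; nothing here modifies the data):

# inspect the column names and a few sample rows before any parsing
print(df.columns.tolist())
print(df[['recclass', 'mass (g)', 'fall', 'year']].head(3))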
Here we will strip the day, month, and time out of the year column, because we do not need to look at impacts in such fine detail; years alone will do for now (and will also be useful when we import the supernova database).
We will then limit the data to 1901 onward, where the records are denser and more relevant.
# extract the 4-digit year from each date string (format: MM/DD/YYYY HH:MM:SS AM/PM)
y = list()
for index, row in df.iterrows():
    y.append(row['year'][6:10])

df['years'] = y
df["years"] = pd.to_numeric(df["years"])
# keep only landings from 1901 onward
df = df.loc[df.years > 1900]
df.head()
| name | id | nametype | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | years |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Aarhus | 2 | Valid | H6 | 720.0 | Fell | 01/01/1951 12:00:00 AM | 56.18333 | 10.23333 | (56.18333, 10.23333) | 1951 |
| Abee | 6 | Valid | EH4 | 107000.0 | Fell | 01/01/1952 12:00:00 AM | 54.21667 | -113.00000 | (54.21667, -113.0) | 1952 |
| Acapulco | 10 | Valid | Acapulcoite | 1914.0 | Fell | 01/01/1976 12:00:00 AM | 16.88333 | -99.90000 | (16.88333, -99.9) | 1976 |
| Achiras | 370 | Valid | L6 | 780.0 | Fell | 01/01/1902 12:00:00 AM | -33.16667 | -64.95000 | (-33.16667, -64.95) | 1902 |
| Adhi Kot | 379 | Valid | EH4 | 4239.0 | Fell | 01/01/1919 12:00:00 AM | 32.10000 | 71.80000 | (32.1, 71.8) | 1919 |
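As an aside, the same year extraction can be done without a Python-level loop. Here is a vectorized sketch (assuming every year string follows the MM/DD/YYYY HH:MM:SS AM/PM layout seen above):

# vectorized equivalent of the iterrows() loop: slice characters 6-9
# (the four-digit year) out of every date string at once
df['years'] = pd.to_numeric(df['year'].str[6:10])
df = df.loc[df.years > 1900]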
For this set, we will display an interactive map using folium. Let's create a new column to hold a description of each meteorite.
We will define the description as the meteorite's name and its mass in grams.
# for the map description, let's add a description column
description = "Mass: " + df['mass (g)'].astype(str) + ", Name: " + df.index.values.astype(str)
df.insert(2, "Description", description, True)
df
| name | id | nametype | Description | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | years |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Aarhus | 2 | Valid | Mass: 720.0, Name: Aarhus | H6 | 720.0 | Fell | 01/01/1951 12:00:00 AM | 56.18333 | 10.23333 | (56.18333, 10.23333) | 1951 |
| Abee | 6 | Valid | Mass: 107000.0, Name: Abee | EH4 | 107000.0 | Fell | 01/01/1952 12:00:00 AM | 54.21667 | -113.00000 | (54.21667, -113.0) | 1952 |
| Acapulco | 10 | Valid | Mass: 1914.0, Name: Acapulco | Acapulcoite | 1914.0 | Fell | 01/01/1976 12:00:00 AM | 16.88333 | -99.90000 | (16.88333, -99.9) | 1976 |
| Achiras | 370 | Valid | Mass: 780.0, Name: Achiras | L6 | 780.0 | Fell | 01/01/1902 12:00:00 AM | -33.16667 | -64.95000 | (-33.16667, -64.95) | 1902 |
| Adhi Kot | 379 | Valid | Mass: 4239.0, Name: Adhi Kot | EH4 | 4239.0 | Fell | 01/01/1919 12:00:00 AM | 32.10000 | 71.80000 | (32.1, 71.8) | 1919 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Zillah 002 | 31356 | Valid | Mass: 172.0, Name: Zillah 002 | Eucrite | 172.0 | Found | 01/01/1990 12:00:00 AM | 29.03700 | 17.01850 | (29.037, 17.0185) | 1990 |
| Zinder | 30409 | Valid | Mass: 46.0, Name: Zinder | Pallasite, ungrouped | 46.0 | Found | 01/01/1999 12:00:00 AM | 13.78333 | 8.96667 | (13.78333, 8.96667) | 1999 |
| Zlin | 30410 | Valid | Mass: 3.3, Name: Zlin | H4 | 3.3 | Found | 01/01/1939 12:00:00 AM | 49.25000 | 17.66667 | (49.25, 17.66667) | 1939 |
| Zubkovsky | 31357 | Valid | Mass: 2167.0, Name: Zubkovsky | L6 | 2167.0 | Found | 01/01/2003 12:00:00 AM | 49.78917 | 41.50460 | (49.78917, 41.5046) | 2003 |
| Zulu Queen | 30414 | Valid | Mass: 200.0, Name: Zulu Queen | L3.7 | 200.0 | Found | 01/01/1976 12:00:00 AM | 33.98333 | -115.68333 | (33.98333, -115.68333) | 1976 |

37393 rows × 11 columns
Since "Fell" marks meteorites whose fall to Earth was actually observed, we will map out only the rows with these coordinates in mind.
We will create a sub-dataframe with this specified information (and keep the main dataframe, "Found" meteorites included, for future analysis).
df_fell = df[(df['fall'] == 'Fell')]
df_fell
| name | id | nametype | Description | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | years |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Aarhus | 2 | Valid | Mass: 720.0, Name: Aarhus | H6 | 720.0 | Fell | 01/01/1951 12:00:00 AM | 56.18333 | 10.23333 | (56.18333, 10.23333) | 1951 |
| Abee | 6 | Valid | Mass: 107000.0, Name: Abee | EH4 | 107000.0 | Fell | 01/01/1952 12:00:00 AM | 54.21667 | -113.00000 | (54.21667, -113.0) | 1952 |
| Acapulco | 10 | Valid | Mass: 1914.0, Name: Acapulco | Acapulcoite | 1914.0 | Fell | 01/01/1976 12:00:00 AM | 16.88333 | -99.90000 | (16.88333, -99.9) | 1976 |
| Achiras | 370 | Valid | Mass: 780.0, Name: Achiras | L6 | 780.0 | Fell | 01/01/1902 12:00:00 AM | -33.16667 | -64.95000 | (-33.16667, -64.95) | 1902 |
| Adhi Kot | 379 | Valid | Mass: 4239.0, Name: Adhi Kot | EH4 | 4239.0 | Fell | 01/01/1919 12:00:00 AM | 32.10000 | 71.80000 | (32.1, 71.8) | 1919 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Zemaitkiemis | 30399 | Valid | Mass: 44100.0, Name: Zemaitkiemis | L6 | 44100.0 | Fell | 01/01/1933 12:00:00 AM | 55.30000 | 25.00000 | (55.3, 25.0) | 1933 |
| Zhaodong | 30404 | Valid | Mass: 42000.0, Name: Zhaodong | L4 | 42000.0 | Fell | 01/01/1984 12:00:00 AM | 45.81667 | 125.91667 | (45.81667, 125.91667) | 1984 |
| Zhovtnevyi | 30407 | Valid | Mass: 107000.0, Name: Zhovtnevyi | H6 | 107000.0 | Fell | 01/01/1938 12:00:00 AM | 47.58333 | 37.25000 | (47.58333, 37.25) | 1938 |
| Zhuanghe | 30408 | Valid | Mass: 2900.0, Name: Zhuanghe | H5 | 2900.0 | Fell | 01/01/1976 12:00:00 AM | 39.66667 | 122.98333 | (39.66667, 122.98333) | 1976 |
| Zvonkov | 30415 | Valid | Mass: 2568.0, Name: Zvonkov | H6 | 2568.0 | Fell | 01/01/1955 12:00:00 AM | 50.20000 | 30.25000 | (50.2, 30.25) | 1955 |

674 rows × 11 columns
Before we can map anything, all we have are two columns of raw coordinates. To plot a location on a map, we need the full latitude/longitude pair, plus a readable name for that location to display to readers.
The first problem is solved by the GeoLocation column, which already concatenates the lat and long coordinates into a single point.
To solve the second problem, I will introduce a dataset we can query with these points to get the names of the countries we are dealing with.
For this portion of the project, we are introduced to shapely as well as the requests library. shapely will help us with all things geospatial, while requests lets us pull the data we need from the geo-countries GeoJSON dataset on GitHub, which maps country names to their border polygons; we will test our points against those polygons.
Another possibility is Google's Geocoding API (the same one behind Google Maps); however, I found it much, much slower than the raw GeoJSON file the GitHub repo provides.
import matplotlib
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20, 'figure.figsize': (15, 8)})
# for coordinates and reverse geocoding
import requests
from shapely.geometry import shape, Point
from shapely.prepared import prep
# reverse-geocoding coordinates to countries
# using a GitHub-hosted dataset, which was faster than the Google API
geoData = requests.get("https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson").json()
country_list = list()
geo = df["GeoLocation"].values.tolist()
lat = df["reclat"].values.tolist()
lon = df["reclong"].values.tolist()
# parse the "(lat, lon)" strings in the GeoLocation column into float pairs
geo = [i.strip('()').replace(' ', '').split(',') for i in geo]
geo = [[float(lat_), float(lon_)] for lat_, lon_ in geo]
countries = {}
# build a dict of country name -> prepared polygon for fast containment tests
for i in geoData["features"]:
    geom = i["geometry"]
    country = i["properties"]["ADMIN"]
    countries[country] = prep(shape(geom))

def get_country(lat, lon):
    # shapely points are (x, y) = (longitude, latitude)
    point = Point(lon, lat)
    for country, geom in countries.items():
        if geom.contains(point):
            return country
    return "unknown"
# assign a country to every coordinate pair; label unmatched points "Unknown"
for i in range(len(lat)):
    country = get_country(lat[i], lon[i])
    country_list.append(country if country != "unknown" else "Unknown")
# adding a country column
df["country"] = country_list
# Displays country with most/least frequent landings
cc = [i for i in country_list if i != "Unknown"]
biggest_country = max(set(cc), key=cc.count)
smallest_country = min(set(cc), key=cc.count)
print("Country with the most frequent landings is:", biggest_country)
print("Country with the least frequent landings is:", smallest_country)
Country with the most frequent landings is: Sudan
Country with the least frequent landings is: Cape Verde
Looking at the output (after dropping coordinates that could not be matched to a country, likely points in the ocean or in unnamed territory), we see that from 1901 to 2013, Sudan had the most meteorite landings.
In contrast, Cape Verde comes in with the least frequent landings.
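The same most/least lookup can also be done with pandas; here is a minimal sketch using value_counts() on the filtered list:

# value_counts() sorts by frequency; idxmax()/idxmin() pull the
# most and least common country names
freq = pd.Series(cc).value_counts()
print("Most frequent:", freq.idxmax())
print("Least frequent:", freq.idxmin())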
For this part of the tutorial, we will plot graphs to better show the distribution of meteorite landings across named countries.
We first drop rows with unknown locations, as they will not be of much use to us.
# dropping unknown countries
df = df[df.country != "Unknown"]
df
| name | id | nametype | Description | recclass | mass (g) | fall | year | reclat | reclong | GeoLocation | years | country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Aioun el Atrouss | 423 | Valid | Mass: 1000.0, Name: Aioun el Atrouss | Diogenite-pm | 1000.0 | Fell | 01/01/1974 12:00:00 AM | 16.39806 | -9.57028 | (16.39806, -9.57028) | 1974 | Angola |
| Aïr | 424 | Valid | Mass: 24000.0, Name: Aïr | L6 | 24000.0 | Fell | 01/01/1925 12:00:00 AM | 19.08333 | 8.38333 | (19.08333, 8.38333) | 1925 | Central African Republic |
| Akwanga | 432 | Valid | Mass: 3000.0, Name: Akwanga | H | 3000.0 | Fell | 01/01/1959 12:00:00 AM | 8.91667 | 8.43333 | (8.91667, 8.43333) | 1959 | Nigeria |
| Al Zarnkh | 447 | Valid | Mass: 700.0, Name: Al Zarnkh | LL5 | 700.0 | Fell | 01/01/2001 12:00:00 AM | 13.66033 | 28.96000 | (13.66033, 28.96) | 2001 | Libya |
| Alberta | 454 | Valid | Mass: 625.0, Name: Alberta | L | 625.0 | Fell | 01/01/1949 12:00:00 AM | 2.00000 | 22.66667 | (2.0, 22.66667) | 1949 | Algeria |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Zerhamra | 30403 | Valid | Mass: 630000.0, Name: Zerhamra | Iron, IIIAB-an | 630000.0 | Found | 01/01/1967 12:00:00 AM | 29.85861 | -2.64500 | (29.85861, -2.645) | 1967 | Rwanda |
| Zillah 001 | 31355 | Valid | Mass: 1475.0, Name: Zillah 001 | L6 | 1475.0 | Found | 01/01/1990 12:00:00 AM | 29.03700 | 17.01850 | (29.037, 17.0185) | 1990 | Sudan |
| Zillah 002 | 31356 | Valid | Mass: 172.0, Name: Zillah 002 | Eucrite | 172.0 | Found | 01/01/1990 12:00:00 AM | 29.03700 | 17.01850 | (29.037, 17.0185) | 1990 | Sudan |
| Zinder | 30409 | Valid | Mass: 46.0, Name: Zinder | Pallasite, ungrouped | 46.0 | Found | 01/01/1999 12:00:00 AM | 13.78333 | 8.96667 | (13.78333, 8.96667) | 1999 | Cameroon |
| Zlin | 30410 | Valid | Mass: 3.3, Name: Zlin | H4 | 3.3 | Found | 01/01/1939 12:00:00 AM | 49.25000 | 17.66667 | (49.25, 17.66667) | 1939 | Yemen |

3765 rows × 12 columns
Here we plot the landing frequency of each country to check whether our earlier observation, that Sudan has the highest number of landings, holds up.
# Displays a bar chart of landing counts per country
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (29, 26)
plt.hist(df['country'], bins=70, alpha=0.5, histtype='bar', ec='black')
plt.grid()
plt.yticks(np.arange(0, 1500, 30))
plt.xticks(
    rotation=45,
    horizontalalignment='right',
    fontweight='light',
    fontsize=16
)
plt.ylabel('Landing Frequency', size=20)
plt.xlabel('Countries', size=20)
plt.title('Meteor Landing Frequency per Recorded Country (1901-2013)', size=20)
# save before show(); once show() renders, the figure is closed and a
# later savefig() would write a blank canvas
plt.savefig("Meteor_Frequency_Countries_Plot.pdf")
plt.show()
Our observation holds. Sudan tops the other countries with a landing count up in the 1,470s, far ahead of the competition. This raises the question: why?
We can also observe the surrounding countries and note that Ethiopia, Cameroon, and Poland have high landing counts too. Hmmm...I wonder what all these places have in common?
Now we will display these features on an interactive map using folium, as mentioned earlier, with sklearn handling the ML part.
This part will be a little different from the rest: we will use DBSCAN, an unsupervised learning technique that clusters points based on two main parameters, an epsilon value and a minimum sample size.
Hypothesis: just from knowing the geographical regions of Poland, Cameroon, Sudan, etc. (the countries with the highest landing frequencies), we can expect the majority of our points to be classified into clusters around those countries, toward the middle of the map.
Why DBSCAN? DBSCAN stands for "density-based spatial clustering of applications with noise," and the main reason we use it is that it can recognize clusters with complex shapes. This matters because we are mapping raw coordinates, not aggregated landing frequencies (if we aggregated first, we could safely expect more distinct groups of points around the high-landing countries). Plotting every coordinate gives us a diverse map of points from the North Pole to the South with no obvious pattern. DBSCAN's epsilon and min-samples parameters let us cluster by density, which leaves some points labelled as noise (outliers) because they are not close to any group. That is perfectly fine for our hypothesis.
This is preferable to KMeans, which repeatedly assigns points to cluster centers and re-estimates those centers. KMeans would work well if our data came pre-shaped into clear clusters; as we will see after plotting, the clusters here are not explicitly defined. Zooming in on the map reveals smaller, more expressed clusters, but forcing those out would just make our map messy and unreadable.
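For contrast, here is what a bare-bones KMeans run on the same coordinates would look like (a sketch only, not part of the pipeline; k=6 is an arbitrary choice matching the six colours we define next):

from sklearn.cluster import KMeans

# KMeans forces every point into one of k clusters -- it has no noise
# label, which is why scattered coordinates suit it poorly
coords = list(zip(df["reclat"], df["reclong"]))
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(coords)
print(km.cluster_centers_)  # six centres in (lat, lon) space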
# what we will color each cluster group on the map
cols = ['red', 'green', 'blue', 'orange', 'yellow', 'purple']
Note: DBSCAN labels each point with a cluster index, starting at 0.
For example, if we end up with 4 clusters, points in the first cluster are labeled 0, points in the second cluster 1, and so on up to 3.
If a point is considered an outlier, it is given the label -1 to distinguish it from the non-negative cluster labels.
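Here is a tiny, self-contained illustration of that labelling convention (toy points, not our dataset):

from sklearn.cluster import DBSCAN
import numpy as np

pts = np.array([[0, 0], [0, 0.1], [0.1, 0], [5, 5], [5, 5.1], [10, 10]])
labels = DBSCAN(eps=0.5, min_samples=2).fit(pts).labels_
print(labels)  # [ 0  0  0  1  1 -1]: two clusters plus one noise point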
import folium

def mapping(df, cluster_column):
    # center the map on the mean coordinate, zoomed out to world level
    m = folium.Map(location=[df.reclat.mean(), df.reclong.mean()], zoom_start=2)
    for index, row in df.iterrows():
        if row[cluster_column] == -1:  # outliers will be left black
            cluster_colour = 'black'
        else:
            cluster_colour = cols[row[cluster_column]]
        # one small circle per meteorite, colored by cluster,
        # with the Description column as a clickable popup
        folium.CircleMarker(
            location=[row['reclat'], row['reclong']],
            radius=3,
            popup=row['Description'],
            color=cluster_colour,
            fill=True,
            fill_color=cluster_colour
        ).add_to(m)
    return m
from sklearn.cluster import DBSCAN

location = list(zip(df["reclat"], df["reclong"]))
# the haversine metric expects coordinates in radians;
# eps is also in radians (0.1 rad is roughly 637 km on Earth's surface)
db = DBSCAN(eps=0.1, min_samples=30, algorithm='ball_tree', metric='haversine').fit(np.radians(location))
classes = db.labels_
df.insert(2, "db_labels", classes, True)
landing_map = mapping(df, "db_labels")
landing_map
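If you want to share the map outside of a notebook, folium can write it to a standalone HTML file (the filename below is arbitrary):

# save the interactive map as a self-contained HTML page
landing_map.save("meteorite_landings_map.html")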
As we can see, the data does not come pre-divided into nicely clustered groups, so an algorithm like KMeans would struggle to define dense populations like these.
We can also pick out 5 distinct clusters, with the majority of points trending toward the center of the map.
More specifically, the high landing frequencies fall along the coastlines of the African and European continents, consistent with our earlier hypothesis that meteorite landings frequent this part of the world more than its edges (remember: Poland, Cameroon, and Sudan had the most frequent landings, and all of them sit inside these clustered regions).
Note the black points, our outliers, and how they stretch from Antarctica to Russia. They sit near clusters, but given our epsilon range they are still too far away to join one. Try zooming in to see just how far these outliers are from the clusters.
Also, try clicking on a data point! You will see the metadata for that meteorite. Isn't that cool?
Looking at all these occurrences over time, if we plot meteorite landing frequency against year, we see a sudden spike around the 2000s.
plt.xticks(
    rotation=45,
    horizontalalignment='right',
    fontweight='light',
    fontsize=13
)
plt.hist(df['years'], bins=70, alpha=0.5, histtype='bar', ec='black')
plt.xlabel("Years")
plt.ylabel("Meteorite Frequency")
plt.title("Meteorite Frequency as Years Increase")
# save before show() so the file isn't blank
plt.savefig("Meteorite_Frequency_Plot.pdf")
plt.show()
This brought up the question: what other events were occurring around that time that could explain this spike? I looked into what causes meteorites to form and found that they come from asteroids, which mainly come from supernovas.
Let's see if we can find a similar spike in the frequency of supernovas.
Note: this is a record of all observed supernovas, within and beyond our own galaxy.
This dataset comes with a lot of scientific, hard-to-parse columns, so we will only concern ourselves with three:
Name: the name of the supernova.
Disc. Date: the date the supernova was observed.
Host Name: the name of the supernova's host galaxy.
# let's see if supernovas line up with the years meteorites hit Earth
df2 = pd.read_csv('The Open Supernova Catalog (1).csv', index_col=0)
df2 = df2.dropna(subset=['Disc. Date'])  # drop all rows missing a discovery date
df2 = df2.drop(['R.A.', 'Dec.', 'Type', 'Phot.', 'Spec.', 'Radio', 'X-ray', 'mmax'], axis=1)  # columns we won't use
df2
| Name | Disc. Date | Host Name |
| --- | --- | --- |
| SN2011fe | 2011/08/24 | M101 |
| SN1987A | 1987/02/24 | LMC |
| SN2003dh | 2003/03/31 | A104450+2131 |
| SN2013dy | 2013/07/10 | NGC 7250 |
| SN2013ej | 2013/07/25 | NGC 628 |
| ... | ... | ... |
| SN1985M | 1985/06/16 | A220830-4830 |
| SN1988M | 1988/04/07 | NGC 4496B |
| SN386A | 386/04/30 | Milky Way |
| SN393A | 393/02/27 | Milky Way |
| SN837A | 837/04/29 | Milky Way |

89617 rows × 2 columns
We will similarly parse the date column down to just years, since we do not care about the specific months and days.
We will then trim the data to the years 1900-2013, to line up with the meteorite range.
# Let's break the date column down to just years
year = list()
for index, row in df2.iterrows():
    year.append(row['Disc. Date'][:4])

# getting rid of invalid years
df2['year'] = year
df2 = df2[df2.year.str.contains(r"^(?:19|20)\d{2}$")]
Note: ^(?:19|20)\d{2}$ is a regex pattern. We use it here because the supernova records go back to triple-digit years, which would be annoying to weed out manually. The pattern matches exactly four-digit years from 1900 through 2099 and drops anything else not meeting that criterion.
For more info, see Python's built-in regex library, re.
I recommend playing around with patterns using https://regex101.com/ — it helps you build a regex pattern and test it against sample values for whatever you need.
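As a quick illustration, here is the pattern run against a few sample values with Python's re module:

import re

pattern = re.compile(r"^(?:19|20)\d{2}$")
for value in ["1951", "2013", "386", "20130"]:
    # match() anchors at the start; the trailing $ enforces exactly four digits
    print(value, bool(pattern.match(value)))
# 1951 True, 2013 True, 386 False, 20130 False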
df2
| Name | Disc. Date | Host Name | year |
| --- | --- | --- | --- |
| SN2011fe | 2011/08/24 | M101 | 2011 |
| SN1987A | 1987/02/24 | LMC | 1987 |
| SN2003dh | 2003/03/31 | A104450+2131 | 2003 |
| SN2013dy | 2013/07/10 | NGC 7250 | 2013 |
| SN2013ej | 2013/07/25 | NGC 628 | 2013 |
| ... | ... | ... | ... |
| SN1935B | 1935/04/01 | NGC 3115 | 1935 |
| SN1954J | 1954/11/26 | NGC 2403 | 1954 |
| SN1982aa | 1982/01/01 | NGC 6052 | 1982 |
| SN1985M | 1985/06/16 | A220830-4830 | 1985 |
| SN1988M | 1988/04/07 | NGC 4496B | 1988 |

89603 rows × 3 columns
# let's cap the years at 2013 to match the range in df
df2.insert(3, "years", pd.to_numeric(df2["year"]), True)
df2 = df2.drop(['year'], axis=1)
df2 = df2.loc[df2.years < 2014]  # the filter must be assigned back to take effect
df2.head()
| Name | Disc. Date | Host Name | years |
| --- | --- | --- | --- |
| SN2011fe | 2011/08/24 | M101 | 2011 |
| SN1987A | 1987/02/24 | LMC | 1987 |
| SN2003dh | 2003/03/31 | A104450+2131 | 2003 |
| SN2013dy | 2013/07/10 | NGC 7250 | 2013 |
| SN2013ej | 2013/07/25 | NGC 628 | 2013 |
Let's plot a graph like the meteorite frequency plot to see whether supernovas also spike around the same years.
plt.xticks(
    rotation=45,
    horizontalalignment='right',
    fontweight='light',
    fontsize=13
)
plt.hist(df2['years'], bins=70, alpha=0.5, histtype='bar', ec='black')
plt.xlabel("Years")
plt.ylabel("Supernova Frequency")
plt.title("Supernova Frequency as Years Increase")
# save before show() so the file isn't blank
plt.savefig("Supernova_Frequency_Plot.pdf")
plt.show()
As we can see from the plot, there is a spike around the 2000s, similar to the spike in our meteorite data.
Let's form a hypothesis: meteorite landing frequency is positively correlated with supernova observations.
Let's create a linear regression model to test this out!
First, we create a new dataframe that houses the yearly meteorite landing frequencies alongside the yearly supernova frequencies.
We will then add a column for the years they correspond to.
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()

# count supernovas per year, then turn the counts into a (years, supernovas) table
df_1["supernovas"] = df2["years"].value_counts()
df_1 = df_1.reset_index()
df_1 = df_1.rename(columns={"index": "years"})
df_1 = df_1.sort_values(by="years")

# same for meteorites
df_2["meteorites"] = df["years"].value_counts()
df_2 = df_2.reset_index()
df_2 = df_2.rename(columns={"index": "years"})
df_2 = df_2.sort_values(by="years")

# inner-join on year, keeping only years present in both tables
df_major = pd.merge(df_1, df_2, on=["years"])
df_major
| | years | supernovas | meteorites |
| --- | --- | --- | --- |
| 0 | 1901 | 2 | 4 |
| 1 | 1907 | 1 | 5 |
| 2 | 1912 | 1 | 1 |
| 3 | 1914 | 1 | 5 |
| 4 | 1915 | 1 | 3 |
| ... | ... | ... | ... |
| 81 | 2009 | 882 | 74 |
| 82 | 2010 | 1985 | 63 |
| 83 | 2011 | 1734 | 35 |
| 84 | 2012 | 1691 | 4 |
| 85 | 2013 | 2486 | 1 |

86 rows × 3 columns
Note: meteorite landings were reported for every year from 1901 to 2013, but supernovas went unobserved in roughly 20 of those years, scattered across the timeline. Interestingly, meteorite frequencies were also low during those years.
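Before fitting anything, a quick complementary check is the raw Pearson correlation between the two frequency columns (a one-line sketch):

# Pearson correlation between yearly supernova and meteorite counts
print(df_major["supernovas"].corr(df_major["meteorites"]))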
# fitting a linear regression: meteorites as a function of supernovas
from statsmodels.formula.api import ols

slope, intercept = np.polyfit(df_major["supernovas"].astype(int), df_major["meteorites"], 1)
print('y = {:.2f} * supernovas + ({:.2f})'.format(slope, intercept))

result = ols(formula="meteorites ~ supernovas", data=df_major).fit()
print(result.summary())
y = 0.02 * supernovas + (37.96)

                            OLS Regression Results
==============================================================================
Dep. Variable:             meteorites   R-squared:                       0.017
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     1.442
Date:                Mon, 20 Dec 2021   Prob (F-statistic):              0.233
Time:                        13:10:39   Log-Likelihood:                -501.32
No. Observations:                  86   AIC:                             1007.
Df Residuals:                      84   BIC:                             1012.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     37.9595      9.777      3.882      0.000      18.516      57.403
supernovas     0.0191      0.016      1.201      0.233      -0.013       0.051
==============================================================================
Omnibus:                       59.433   Durbin-Watson:                   0.420
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              188.589
Skew:                           2.491   Prob(JB):                     1.12e-41
Kurtosis:                       8.274   Cond. No.                         669.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Looking at the p-values: the supernovas coefficient comes in at p = 0.233, well above 0.05, so while its estimate is positive, it is not significantly different from zero. Only the intercept (p < 0.001) is significant, and the R-squared of 0.017 tells us supernova counts explain very little of the year-to-year variation in meteorite counts.
result.params
Intercept     37.959496
supernovas     0.019121
dtype: float64
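With the fitted model in hand, we can also generate point predictions for hypothetical yearly supernova counts (the input values below are made up for illustration):

# predicted yearly meteorite counts for made-up supernova counts
new_data = pd.DataFrame({"supernovas": [100, 1000, 2500]})
print(result.predict(new_data))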
sns.set(rc={'figure.figsize': (20, 8)})
# lmplot returns a FacetGrid with the scatter points and the fitted line
grid = sns.lmplot(x="supernovas", y="meteorites", data=df_major, height=10)
grid.fig.suptitle("Scatter Plot of Meteorite Frequencies as Supernovas Occur")
plt.savefig("Scatter_Super_and_Meteor_Plot.pdf")
We can observe an increase in meteorites as supernovas increase: about 0.019121 additional landings per additional observed supernova, on average. A small, but positive, correlation!
Our hypothesis holds up directionally, as the fitted line does slope upward, though as noted above the relationship is weak and not statistically significant.
Here we learned some interesting statistics about meteorites and just how many of them land on our planet. We saw how they tend to cluster toward the middle of the map (keep that in mind if you plan to live there and fear getting hit by big rocks!), and we examined the correlation between something so out-of-this-world and a phenomenon recorded every year.
We used unsupervised and supervised learning methods (DBSCAN and linear regression, respectively) to analyze these events and probe the relationships that can be spotted between the landing trends and their possible origins.
Hopefully this tutorial has introduced you to some of Python's strengths and the many ways you can draw analysis out of data. From regex to reverse-geocoding coordinates with shapely and public datasets, we found that meteorites tend to land around the coasts of Africa and Europe as well as the US, and we explored whether supernovas might drive the formation of some of these meteorites.
Thanks for reading!