An Analysis of Meteorites: More Than Just Rocks in Space?

By Elizabeth Slesarev

Introduction


Don't confuse meteors with meteorites!

"Like meteorites, meteors are objects that enter Earth's atmosphere from space. But meteors—which are typically pieces of comet dust no larger than a grain of rice—burn up before reaching the ground. ... The term “meteorite” refers only to those bodies that survive the trip through the atmosphere and reach Earth's surface." -AMNH

In this tutorial, I will be going over some of Python's most useful libraries and tools for analyzing the coordinates of meteorite impact sites. We will also cover some ML methods for analyzing our chosen datasets so we can develop and test a hypothesis about any patterns or relationships we find.

We will visualize where meteorites land, and we will try to determine whether meteorites truly stem from supernovas.

All relevant files, images, and databases will be contained within this repo on GitHub.

Information on DBScan: https://towardsdatascience.com/dbscan-clustering-for-data-shapes-k-means-cant-handle-well-in-python-6be89af4e6ea

Information on KMeans: https://medium.datadriveninvestor.com/weighted-k-means-clustering-of-gps-coordinates-python-7c6270846163

Information on Linear Regression: https://machinelearningmastery.com/linear-regression-for-machine-learning/

Information on meteorites: https://www.amnh.org/explore/news-blogs/on-exhibit-posts/meteor-meteorite-asteroid

Information on databases, libraries and any other methods will be linked as introduced below.

Getting Started...

For this project, we will begin by importing the basics such as numpy, pandas, and matplotlib to name a few...
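A minimal import cell might look like this (shapely and statsmodels appear in later sections and are imported where they are used):

```python
# Core libraries for data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Used later in the tutorial: interactive maps, HTTP requests, and clustering
import folium
import requests
from sklearn.cluster import DBSCAN
```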

We will be collecting data from two sources. The first is NASA's Meteorite Landings Database, which holds data on meteorite landings dating back to the year 1409. This database is a smaller version (ending at the year 2013) of The Meteoritical Society's more up-to-date one, which records landings up to the present.

The second source, which we will parse later in the tutorial, is a database containing all identified supernovas, from triple-digit years up to the present. Hosted by SNE Space, this database contains relevant information about when supernovas occurred.

Both databases will be imported as CSV files for ease of parsing.
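Loading the first database is a one-liner with pandas (the filename below is a placeholder for wherever you saved the CSV):

```python
# Placeholder filename; point it at your downloaded copy of
# NASA's Meteorite Landings CSV
df = pd.read_csv("Meteorite_Landings.csv")
print(df.shape)
df.head()
```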

For this dataset, we will be handling these columns:

name: the meteorite's name

id: a unique meteorite ID tag

recclass: the classification of the meteorite, based on its composition and structure

mass (g): the mass of the meteorite in grams

fall: whether the meteorite was observed falling ("Fell") or found later ("Found")

year: the year the meteorite fell or was found (stored as a DateTime)

reclat: latitude of the impact site

reclong: longitude of the impact site

Cleaning Up Our Data

Here we will parse the day, month, and time out of the year column, because we do not care to look at impacts in such fine detail. We will worry about years only for now (which will also be useful when we import the supernova database).

We will then limit the data to the years 1901 and onward, as the more recent records are more relevant to our analysis.
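A minimal sketch of both steps, assuming the raw year column is stored as a date string (as in the NASA CSV):

```python
# Parse the 'year' column (a full timestamp in the raw CSV) down to
# just the year, coercing unparseable values to NaT
df["year"] = pd.to_datetime(df["year"], errors="coerce").dt.year

# Keep only rows with a known year from 1901 onward
df = df.dropna(subset=["year"])
df = df[df["year"] >= 1901].copy()
df["year"] = df["year"].astype(int)
```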

Now that we have cleaned up our first dataset, let's do some analysis on it!

For this set, we will display an interactive map using folium. Let's create a new column that will hold a description of each meteorite.

We will define the description as the meteorite's name and its mass in grams.
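A one-line sketch of that column (the column names match the dataset; the exact formatting is a stylistic choice):

```python
# Human-readable description: the meteorite's name plus its mass in grams
df["description"] = df["name"] + ", " + df["mass (g)"].astype(str) + " g"
```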

Creating df_fell

Since "Fell" marks meteorites that were actually observed falling and impacting Earth, we will only map out the data with these coordinates in mind.

We will create a sub-dataframe using this specified information (we will continue to use the main dataframe, with its "Found" meteorites, for future analysis).
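The sub-dataframe is a simple boolean filter (dropping rows with missing coordinates at the same time):

```python
# Keep only observed falls that have usable coordinates
df_fell = df[df["fall"] == "Fell"].dropna(subset=["reclat", "reclong"]).copy()
print(len(df_fell), "observed falls")
```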

Mapping Out Coordinates

Before we map anything, all we have are two columns of raw geo-coordinates. We know from experience that to place a location on a map, we need the full latitude and longitude pair, as well as a name for that location to display to readers.

To solve the first problem, we will simply create a new column housing the lat and long coordinates combined into single points.

To solve the second problem, I will introduce an API to which we will pass these combined points to get back the names of the countries we are dealing with.

For this portion of the project, we are introduced to shapely, as well as the requests library. These will be important because shapely will help us map and deal with all things geo, whereas requests will allow us to pull the necessary data from the API, which maps the points we provide from our dataset to country names.
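As a rough sketch, we can store each pair as a shapely Point and query a reverse-geocoding endpoint with requests. The URL and response field below are placeholders for whichever API you use, not the actual service from the repo:

```python
import requests
from shapely.geometry import Point

# Combine the two coordinate columns into single shapely Points
# (shapely uses (x, y), i.e., (longitude, latitude) order)
df_fell["point"] = [Point(lon, lat) for lat, lon in
                    zip(df_fell["reclat"], df_fell["reclong"])]

def country_for(lat, lon):
    """Reverse-geocode one coordinate pair to a country name.
    The endpoint and response field are hypothetical placeholders."""
    resp = requests.get(
        "https://example.com/reverse",           # placeholder endpoint
        params={"lat": lat, "lon": lon},
        timeout=10,
    )
    if resp.ok:
        return resp.json().get("country")        # placeholder field name
    return None

df_fell["country"] = [country_for(lat, lon) for lat, lon in
                      zip(df_fell["reclat"], df_fell["reclong"])]
```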

Another possibility is Google's Geocoding API (the same one Google Maps uses); however, I found it much, much slower than the raw dataset that the GitHub repo provided.

Looking at the output after dropping coordinates that could not be mapped to a country (likely points in the ocean or in unnamed areas of land), we see that from 1901 to 2013, Sudan had the most meteorite landings.

In contrast, Senegal comes in with the least frequent landings.

Getting Ready To Visualize

For this part of the tutorial, we will begin plotting graphs to better show the distribution of meteorite landings across named countries.

We drop rows that contain unknown locations, as they will not be of much use to us.

Here we are plotting the landing frequency of each country to check whether our earlier observation that Sudan has the highest number of landings holds.
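A quick value_counts bar plot is enough here (the country column comes from the geocoding step above):

```python
import matplotlib.pyplot as plt

# Landings per country, highest first
counts = df_fell["country"].value_counts()

counts.head(20).plot(kind="bar", figsize=(12, 5))
plt.xlabel("Country")
plt.ylabel("Meteorite landings, 1901-2013")
plt.title("Meteorite landing frequency by country")
plt.tight_layout()
plt.show()
```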

Our observation holds. We can see Sudan topping the other countries with landing counts up around 1,470, far higher than the competition. This raises the question: why?

We can also observe the other high-ranking countries and note that Ethiopia, Cameroon, and Poland also have high landing counts. Hmmm...I wonder what all these places have in common?

Creating our Map

Now we will display these features on an interactive map using folium, as mentioned earlier, along with scikit-learn for the ML part.

This part will be a little different from the rest, as we will use DBScan, an unsupervised learning technique that clusters points based on two main parameters: an epsilon value and a minimum sample size.

Hypothesis: just from knowing the geographical regions of Poland, Cameroon, Sudan, etc. (the countries with the highest landing frequencies), we can expect the majority of our points to be classified into clusters around those countries, i.e., toward the middle of the map.

Why DBScan? First off, DBScan stands for "density-based spatial clustering of applications with noise," and the main reason we will use it is that it can recognize clusters of complex shapes. This is important because our coordinates are not grouped by landing frequency beforehand (if they were, we could safely expect more distinct groups of points clustered around the high-landing countries). We are mapping all coordinates, so we will have a diverse map of points stretching from the North Pole to the South Pole with no obvious pattern. We will use DBScan's epsilon and minimum-points parameters to establish density-based clusters. This will leave some noisy points (outliers), since not every point is close to a group, which is perfectly fine for our hypothesis.

  1. Epsilon: the maximum "distance" between two points for them to be considered part of the same group. We could determine this through a variety of methods, such as an elbow plot or average-distance calculations from the model, which would require a more in-depth appraisal of the data. However, since we are using an old database that has not been updated since 2013, we can reasonably tune this value by trial and error until the map looks right. For this set, we will use 0.1.
  2. Min sample size: the minimum number of points that a group of epsilon-connected points must contain to be considered a "cluster." We will determine this by trial and error for the same reasons as above. Since we are dealing with a slightly larger dataset, our value will be set to 30 points. (A minimal sketch of the clustering step follows this list.)
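Under those choices, the clustering itself is short. A minimal sketch with scikit-learn (note that eps here is measured in raw coordinate degrees, matching the 0.1 above):

```python
from sklearn.cluster import DBSCAN

# Cluster the raw latitude/longitude pairs; eps is in coordinate degrees here
coords = df_fell[["reclat", "reclong"]].to_numpy()
db = DBSCAN(eps=0.1, min_samples=30).fit(coords)

# One label per point: 0, 1, 2, ... for clusters, -1 for outliers
df_fell["cluster"] = db.labels_
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters, "clusters found")
```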

DBScan is also favorable compared to KMeans, where the idea is to find cluster centers and re-evaluate them after each pass over the candidate points. KMeans would work for us if our data were cleaner and more favorably shaped into clusters beforehand. We will see after plotting these points that the clusters are not explicitly defined; zooming in on the map reveals more pronounced clusters in smaller batches, but drawing them all out would just make the map very messy and unreadable.

Note: DBScan assigns each point a cluster index; scikit-learn numbers the clusters starting from 0.

For example, if we have 4 defined clusters, points in the first cluster are labeled 0, points in the second cluster are labeled 1, and so on up to 3.

Also, if a point is considered an outlier, it is given the label -1 to distinguish it from the regular cluster groups.
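Putting it together with folium, a minimal version of the map might look like this (the colors and popup field are illustrative choices):

```python
import folium

m = folium.Map(location=[20, 10], zoom_start=2)

# One color per cluster label; black marks the -1 outliers
palette = ["red", "blue", "green", "purple", "orange", "cadetblue"]

for _, row in df_fell.iterrows():
    label = int(row["cluster"])
    color = "black" if label == -1 else palette[label % len(palette)]
    folium.CircleMarker(
        location=[row["reclat"], row["reclong"]],
        radius=3,
        color=color,
        fill=True,
        popup=str(row["description"]),  # click a point to see its metadata
    ).add_to(m)

m  # display inline in a notebook, or use m.save("map.html")
```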

As we can see, the data does not come pre-divided into nicely clustered groups, so an algorithm like KMeans would struggle to define dense populations like these.

We can also view 5 distinct clusters, and find that the majority of the points trend towards the center of the map.

We can also see that the largest landing frequencies occur along the coastlines of the African and European continents, supporting our earlier hypothesis that meteorite landings frequent this part of the world more than its edges (remember: Poland, Cameroon, and Sudan had the most frequent landings, and all of these countries sit within these clustered regions).

Note the black points, our outliers, and how they spread from Antarctica to Russia. They lie near clusters, but given our epsilon value they are still too far away to join one. Try zooming in to see just how far these outliers are from the clusters.

Also, try clicking on a data point! You will see the metadata for that meteorite. Isn't that cool?

Graphing Frequencies Over a Period of Time

Looking at all these occurrences, if we create a graph that displays meteorite landing frequency over time, we see a sudden spike around the 2000s.
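A sketch of that plot; the same approach is reused later for the supernova counts:

```python
import matplotlib.pyplot as plt

# Landings per year, in chronological order
per_year = df["year"].value_counts().sort_index()

per_year.plot(figsize=(12, 5))
plt.xlabel("Year")
plt.ylabel("Meteorite landings")
plt.title("Meteorite landing frequency, 1901-2013")
plt.show()
```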

This brought up the question: what other events were occurring at that time that could have caused this spike? I looked into what causes meteorites to form and found that they come from asteroids, which mainly come from supernovas.

Parsing our Second Database

Let's see if we can find a similar correlation with the frequency of supernovas.

Note: this is a record of all observed supernovas within and out of our solar system.

This dataset comes with a lot of dense scientific columns, so we will concern ourselves with only three.

Name: the names of the supernovas.

Disc. Date: the date the supernova was discovered.

Host Name: the name of the supernova's host galaxy.

We will similarly parse the date column into just years, as we do not care about the specific month and days.

We will then filter our data to the years 1901 through 2013 to match the meteorite range.

Note: ^(19|20)\d{2}$ is a regex pattern. We use it here because the supernova database records go back to triple-digit years, which would be annoying to parse out manually. The pattern matches four-digit years from 1900 to 2099 and discards anything else (the parentheses matter: without them, ^19|20\d{2}$ would match any value that merely starts with "19" or ends with a four-digit 20xx).
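A sketch of this step with pandas (sn stands for the supernova dataframe loaded from the SNE Space CSV; the exact date format is an assumption):

```python
# sn: the supernova dataframe loaded from the SNE Space CSV (assumed name)
# Pull a leading four-digit year (1900-2099) out of the discovery date
sn["year"] = sn["Disc. Date"].astype(str).str.extract(
    r"^((?:19|20)\d{2})", expand=False
)

# Drop triple-digit and otherwise malformed years, then match the meteorite range
sn = sn.dropna(subset=["year"])
sn["year"] = sn["year"].astype(int)
sn = sn[(sn["year"] >= 1901) & (sn["year"] <= 2013)].copy()
```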

For more info, look into regex and Python's built-in re library.

I recommend playing around with patterns using: https://regex101.com/

This will help you create a regex pattern and test it on values for whatever you need.

Let's plot a graph similar to the meteorite frequency plot to see if there is also a spike in supernovas around the same years.

As we can see from the plot, there is a spike around the 2000s, similar to the spike in our meteorite data.

Let's form a hypothesis: I think that meteorite landings are positively correlated with supernova occurrences.

Let's create a linear regression model to test this out!

Fitting a Regression Model

Let's create a new dataframe that will house the meteorite landing frequencies as well as the supernova frequencies.

We will then add a column for the years they correspond to.

Note: there were meteorite landings reported for every year from 1901 to 2013; however, I noticed that supernovas were not observed in some 20 years sprinkled throughout the timeline. This is interesting, as we will also see that meteorite frequencies were low during those years.
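A minimal sketch of the model with statsmodels, regressing meteorite counts on supernova counts (the exact model specification in the original analysis may differ, e.g., it may also include the year as a regressor):

```python
import pandas as pd
import statsmodels.api as sm

# Landing and supernova counts per year, aligned on the same year index;
# years with no observed supernovas are filled with 0
met_counts = df["year"].value_counts().sort_index()
sn_counts = sn["year"].value_counts().sort_index()

freq = pd.DataFrame({"Meteorites": met_counts, "Supernovas": sn_counts})
freq = freq.reindex(range(1901, 2014)).fillna(0)
freq["Year"] = freq.index

# Ordinary least squares: meteorite landings regressed on supernova counts
X = sm.add_constant(freq["Supernovas"])
model = sm.OLS(freq["Meteorites"], X).fit()
print(model.summary())
```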

All of the model's parameters are significantly different from zero, as their p-values fall below 0.05. The Supernovas coefficient comes closest to that threshold (expected, since supernovas occur at much higher frequencies than meteorite landings), but it is still significant.

We can observe an increase in meteorite landings as supernovas increase: roughly 0.019121 additional landings per additional supernova, on average. A small but positive correlation!

Therefore, our hypothesis holds: we can see a linear relationship between these two events.

Conclusion

Here we learned some interesting statistics about meteorites and just how many of them land on our planet. We saw how they tend to trend toward the center of the world (so keep that in mind if you plan to live there and fear getting hit by these big rocks!), and we saw a correlation between something so out-of-this-world (supernovas) and a phenomenon that occurs on our planet every year (meteorite landings).

We used unsupervised and supervised learning methods (DBScan and linear regression, respectively) to analyze these events and look more closely at the relationships that can be spotted between landing trends and their possible origins.

Hopefully, this tutorial has introduced you to some of Python's strengths and shown the many ways you can draw analysis from data. From regex to reverse-engineering geo-coordinates using shapely and APIs, we found that meteorites tend to land on the coasts of Africa and Europe as well as the US. We also found a correlation consistent with the idea that supernovas lead to the formation of some of these meteorites.

Thanks for reading!