import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
plt.rcParams.update({'font.size':16})
exo_df = pd.read_csv("./data/Exoplanet_catalog_2019.csv")
exo_df.rename(columns={'# name':'name'}, inplace=True)
planet_df = exo_df[['name', 'orbital_period', 'semi_major_axis', 'detection_type', 'star_mass']].dropna()
Questions
Objectives
In this section we'll be analysing the data in the cleaned planet_df
dataframe.
If you haven't already, set up a dataframe called planet_df
that contains the five columns described in the previous section with rows with missing data removed.
planet_df
First let's make some histograms. Everyone loves a histogram!
pandas
works well with matplotlib
, so we can make a histogram quite easily. To make a histogram of orbital periods we can just use
planet_df['orbital_period'].hist()
Using the hist
function built into dataframes is good for getting a quick look at our data, but it's not particularly pretty. matplotlib
made much nicer plots. We can make a histogram of the periods in the same way we would with a normal variable; this time we just pass it the dataframe column we're interested in.
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.hist(planet_df['orbital_period'], bins=20)
ax.set_xlabel('Orbital Period (days)')
ax.set_ylabel('Count')
ax.set_yscale('log') ## set the y-axis to log scale so we can see the details
plt.show()
But what if we want to make two histograms to compare the different detection methods? We can use the where
function in pandas
.
planet_df['orbital_period'].where(planet_df['detection_type']=='Radial Velocity')
Here we're using a logical expression on one column
planet_df['detection_type']=='Radial Velocity'
to determine which rows of another column we want to select.
Now we can make two histograms to compare the period distribtions from the different techniques.
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.hist(planet_df['orbital_period'].where(planet_df['detection_type']=='Radial Velocity'), label='Radial Velocity')
ax.hist(planet_df['orbital_period'].where(planet_df['detection_type']=='Primary Transit'), label='Primary Transit')
ax.set_xlabel('Orbital Period (days)')
ax.set_ylabel('Count')
ax.set_yscale('log') ## set the y-axis to log scale so we can see the details
ax.legend()
plt.show()
Getting there!
To do a meaningful comparison we should make sure the histogram bins are the same for the two plots.
We can check the minimum and maxmimum values of the period using the min()
and max()
functions
planet_df['orbital_period'].min()
planet_df['orbital_period'].max()
I reckon 30 bins will be enough, so I'll make an array for my bin edges that goes from 0 to the maxmimum value and has nbins+1
elements. Then I can use that for the bins
parameter in plt.hist
.
bins = np.linspace(0,planet_df['orbital_period'].max(), 31)
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.hist(planet_df['orbital_period'].where(planet_df['detection_type']=='Radial Velocity'), bins=bins, label='Radial Velocity', alpha=0.7, edgecolor='blue')
ax.hist(planet_df['orbital_period'].where(planet_df['detection_type']=='Primary Transit'), bins=bins, label='Primary Transit', alpha=0.6, edgecolor='red')
ax.set_xlabel('Orbital Period (days)')
ax.set_ylabel('Count')
ax.set_yscale('log') ## set the y-axis to log scale so we can see the details
ax.legend()
plt.show()
With the bins having the same width for each dataset we can be confident that there really are more planets in the first bin that were detected with the transit technique than for the radial velocity technique.
The other change I made to this plot was to make the bars semi-transparent. When you plot multiple datasets on the same plot with matplotlib
it plots them in the order you say with the second dataset is plotted over the top of the first. However, that means that some data will be hidden on the plot. If we didn't change the transparency we wouldn't be able to see how many radial velocity detections were in the first bin; all we'd know was that there must be less than (or equal to) the number of transit detections.
The transparency is controlled by the alpha
argument in the plotting function, where alpha=1
corresponds to completely opaque (the default value) and alpha=0
is completely transparent. You can play around with these values to until you have something that looks right.
Remember, the point of plots is to tell a convincing story that supports your conclusions. You can't tell a convincing story if your data can't be seen (or if your plot doesn't have axis labels, or units, or various other things that I will rant about over the next 3-4 years of your degree course....)
The last part of your coursework was to confirm Kepler's 3rd law:
$$ T^2 = \dfrac{4 \pi^2}{G} \dfrac{a^3}{M_*} $$where $T$ is the orbital period, $G$ is the gravitational constant, $a$ is the semi-major axis and $M_*$ is the mass of the host star. You were asked to plot $T^2$ as a function of $\dfrac{a^3}{M_*}$ and do a least-squares fit to find $G$. We're going to repeat that analysis here, taking advantage of the functionality of dataframes.
The first thing you were asked to do was to remove any objects that had periods greater than 10,000 days or host star masses below 0.7 $M_\odot$.
For the next step, plotting $T^2$ as a function of $\dfrac{a^3}{M_*}$ and doing a least-squares fit, we could just use the dataframe as it is. But writing all that out would get complicated. Instead, we can add some new columns to our dataframe to contain the $T^2$ and $\dfrac{a^3}{M_*}$ values. Creating a new column is straightforward.
For the $T^2$ column we just set the new column kepler_df['T_sq']
to be the orbital_period
column squared. Again, the same process we would use if we wanted to create an array called T_sq
that was the square of an orbital_period
array.
kepler_df['T_sq'] = kepler_df['orbital_period']**2
kepler_df
You should now have a dataframe that looks something like this:
kepler_df
Again, this is where the markdown cells come in handy. a_3_m
makes perfect sense to me now, but will I remember what that column corresponds to a couple of months (or hours...) from now? Probably not. So add a markdown cell with the details of what you're doing and why when you're adding new columns.
We can now make the $T^2$ vs $\dfrac{a^3}{M_*}$ plot very simply using our two new columns:
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.plot(kepler_df['a_3_m'], kepler_df['T_sq'], 'k.',ls='None', label='Data')
ax.set_ylabel('$T^2$ (days$^2$)')
ax.set_xlabel('$a^3 / M_*$ (AU$^3$ $M_{\odot}^{-1}$)')
ax.legend()
plt.show()
This looks kind of how we would expect. But the units are horrible and we still need to do a least-squares fit. Let's add some new columns to the dataframe with sensible units. We'll need to convert the semi-major axis to m, the masses to kg and the periods to seconds. We could probably just convert the new columns we've just made, but doing this step by step will save your sanity in the long run.
The conversions we need are:
If everything worked correctly you should end up with a figure that looks something like this:
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.plot(kepler_df['a_3_m'], kepler_df['T_sq'], 'k.',ls='None', label='Data')
ax.set_ylabel('$T^2$ (s$^2$)')
ax.set_xlabel('$a^3 / M_*$ (m$^3$ kg$^{-1}$)')
ax.legend()
plt.show()
The last thing to do is the least-squares fit to estimate $G$.
You should end up with something that looks like this:
fig