In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.rcParams.update({'font.size':16}) 
In [2]:
exo_df = pd.read_csv("./data/Exoplanet_catalog_2019.csv")
exo_df.rename(columns={'# name':'name'}, inplace=True)
planet_df = exo_df[['name', 'orbital_period', 'semi_major_axis', 'detection_type', 'star_mass']].dropna()

Analysing dataframes

Questions

  • How do you plot the data in a dataframe?
  • How do you operate on columns?
  • How do you add new data?

Objectives

  • Create some histograms of the exoplanet data
  • Add new columns to the dataframe
  • Confirm Kepler's 3rd law

In this section we'll be analysing the data in the cleaned planet_df dataframe.

If you haven't already, set up a dataframe called planet_df that contains the five columns described in the previous section with rows with missing data removed.

In [3]:
planet_df
Out[3]:
name orbital_period semi_major_axis detection_type star_mass
0 11 Com b 326.03000 1.2900 Radial Velocity 2.70
1 11 UMi b 516.22000 1.5400 Radial Velocity 1.80
2 14 And b 185.84000 0.8300 Radial Velocity 2.20
3 14 Her b 1773.40000 2.7700 Radial Velocity 0.90
4 16 Cyg B b 799.50000 1.6800 Radial Velocity 1.01
... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 Radial Velocity 2.30
3828 ups And b 4.61711 0.0590 Radial Velocity 1.27
3829 ups And c 240.93700 0.8610 Radial Velocity 1.27
3830 ups And d 1281.43900 2.5500 Radial Velocity 1.27
3831 ups And e 3848.86000 5.2456 Radial Velocity 1.27

2049 rows × 5 columns

Module imports

From this point forward any import statements will be at the top of these notebook pages rather than included as we go along. It's good practice to import any modules you are going to use at the top of a notebook. I tend to import pandas, numpy and matplotlib.pyplot in most notebooks I use, so I always put those at the top. If you find you need other modules later on you can always add them to the top cell and re-run it to import the new modules.

If we need to use any new modules later (for example, we may use astropy later) will be imported at the place they're first used.

As we're making plots we'll need matplotlib.pyplot, so go ahead and import it at the top of your notebook.

First let's make some histograms. Everyone loves a histogram!

pandas works well with matplotlib, so we can make a histogram quite easily. To make a histogram of orbital periods we can just use

In [4]:
planet_df['orbital_period'].hist()
Out[4]:
<AxesSubplot:>

Using the hist function built into dataframes is good for getting a quick look at our data, but it's not particularly pretty. matplotlib made much nicer plots. We can make a histogram of the periods in the same way we would with a normal variable; this time we just pass it the dataframe column we're interested in.

In [5]:
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.hist(planet_df['orbital_period'], bins=20)
ax.set_xlabel('Orbital Period (days)')
ax.set_ylabel('Count')
ax.set_yscale('log') ## set the y-axis to log scale so we can see the details
plt.show()

But what if we want to make two histograms to compare the different detection methods? We can use the where function in pandas.

In [6]:
planet_df['orbital_period'].where(planet_df['detection_type']=='Radial Velocity')
Out[6]:
0        326.03000
1        516.22000
2        185.84000
3       1773.40000
4        799.50000
           ...    
3827     305.50000
3828       4.61711
3829     240.93700
3830    1281.43900
3831    3848.86000
Name: orbital_period, Length: 2049, dtype: float64

Here we're using a logical expression on one column

planet_df['detection_type']=='Radial Velocity'

to determine which rows of another column we want to select.

Now we can make two histograms to compare the period distribtions from the different techniques.

In [7]:
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.hist(planet_df['orbital_period'].where(planet_df['detection_type']=='Radial Velocity'), label='Radial Velocity')
ax.hist(planet_df['orbital_period'].where(planet_df['detection_type']=='Primary Transit'), label='Primary Transit')
ax.set_xlabel('Orbital Period (days)')
ax.set_ylabel('Count')
ax.set_yscale('log') ## set the y-axis to log scale so we can see the details
ax.legend()
plt.show()

Getting there!

To do a meaningful comparison we should make sure the histogram bins are the same for the two plots.

We can check the minimum and maxmimum values of the period using the min() and max() functions

In [8]:
planet_df['orbital_period'].min()
Out[8]:
0.07943002
In [9]:
planet_df['orbital_period'].max()
Out[9]:
27400.0

I reckon 30 bins will be enough, so I'll make an array for my bin edges that goes from 0 to the maxmimum value and has nbins+1 elements. Then I can use that for the bins parameter in plt.hist.

In [10]:
bins = np.linspace(0,planet_df['orbital_period'].max(), 31)
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.hist(planet_df['orbital_period'].where(planet_df['detection_type']=='Radial Velocity'), bins=bins, label='Radial Velocity', alpha=0.7, edgecolor='blue')
ax.hist(planet_df['orbital_period'].where(planet_df['detection_type']=='Primary Transit'), bins=bins, label='Primary Transit', alpha=0.6, edgecolor='red')
ax.set_xlabel('Orbital Period (days)')
ax.set_ylabel('Count')
ax.set_yscale('log') ## set the y-axis to log scale so we can see the details
ax.legend()
plt.show()

With the bins having the same width for each dataset we can be confident that there really are more planets in the first bin that were detected with the transit technique than for the radial velocity technique.

The other change I made to this plot was to make the bars semi-transparent. When you plot multiple datasets on the same plot with matplotlib it plots them in the order you say with the second dataset is plotted over the top of the first. However, that means that some data will be hidden on the plot. If we didn't change the transparency we wouldn't be able to see how many radial velocity detections were in the first bin; all we'd know was that there must be less than (or equal to) the number of transit detections.

The transparency is controlled by the alpha argument in the plotting function, where alpha=1 corresponds to completely opaque (the default value) and alpha=0 is completely transparent. You can play around with these values to until you have something that looks right.

Remember, the point of plots is to tell a convincing story that supports your conclusions. You can't tell a convincing story if your data can't be seen (or if your plot doesn't have axis labels, or units, or various other things that I will rant about over the next 3-4 years of your degree course....)

Mass histograms

Create histograms showing the number of planets detected as a function of star mass for each detection method for stars with masses below the mass of the Sun (i.e. $M <= 1 M_\odot$). Both detection methods should be shown on the same plot. Your plot should have axis labels, units, a legend and a sensible title. In a markdown cell directly below your plot, write a caption summarising the key point that a reader should take away from your figure.

Repeat this (all of it, including the caption) for stars with masses between 1 $M_\odot$ and 5 $M_\odot$.

Hint: Remember you can create a new dataframe by selecting rows/columns from an existing dataframe.

solution

Kepler's 3rd law

The last part of your coursework was to confirm Kepler's 3rd law:

$$ T^2 = \dfrac{4 \pi^2}{G} \dfrac{a^3}{M_*} $$

where $T$ is the orbital period, $G$ is the gravitational constant, $a$ is the semi-major axis and $M_*$ is the mass of the host star. You were asked to plot $T^2$ as a function of $\dfrac{a^3}{M_*}$ and do a least-squares fit to find $G$. We're going to repeat that analysis here, taking advantage of the functionality of dataframes.

The first thing you were asked to do was to remove any objects that had periods greater than 10,000 days or host star masses below 0.7 $M_\odot$.

Clean your data

Starting from the planet_df dataframe, make a new dataframe with objects with periods greater than 10,000 days or host star masses below 0.7 $M_\odot$ removed. You can call the dataframe whatever you like; I'll call mine kepler_df.

solution

For the next step, plotting $T^2$ as a function of $\dfrac{a^3}{M_*}$ and doing a least-squares fit, we could just use the dataframe as it is. But writing all that out would get complicated. Instead, we can add some new columns to our dataframe to contain the $T^2$ and $\dfrac{a^3}{M_*}$ values. Creating a new column is straightforward.

For the $T^2$ column we just set the new column kepler_df['T_sq'] to be the orbital_period column squared. Again, the same process we would use if we wanted to create an array called T_sq that was the square of an orbital_period array.

In [20]:
kepler_df['T_sq'] = kepler_df['orbital_period']**2
In [21]:
kepler_df
Out[21]:
name orbital_period semi_major_axis detection_type star_mass T_sq
0 11 Com b 326.03000 1.2900 Radial Velocity 2.70 1.062956e+05
1 11 UMi b 516.22000 1.5400 Radial Velocity 1.80 2.664831e+05
2 14 And b 185.84000 0.8300 Radial Velocity 2.20 3.453651e+04
3 14 Her b 1773.40000 2.7700 Radial Velocity 0.90 3.144948e+06
4 16 Cyg B b 799.50000 1.6800 Radial Velocity 1.01 6.392002e+05
... ... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 Radial Velocity 2.30 9.333025e+04
3828 ups And b 4.61711 0.0590 Radial Velocity 1.27 2.131770e+01
3829 ups And c 240.93700 0.8610 Radial Velocity 1.27 5.805064e+04
3830 ups And d 1281.43900 2.5500 Radial Velocity 1.27 1.642086e+06
3831 ups And e 3848.86000 5.2456 Radial Velocity 1.27 1.481372e+07

1719 rows × 6 columns

Add another column

Add another new column to your kepler_df dataframe containing the values of $\dfrac{a^3}{M_*}$. Give the new column a sensible name. I'll call this column a_3_m.

solution

You should now have a dataframe that looks something like this:

In [24]:
kepler_df
Out[24]:
name orbital_period semi_major_axis detection_type star_mass T_sq a_3_m
0 11 Com b 326.03000 1.2900 Radial Velocity 2.70 1.062956e+05 0.795070
1 11 UMi b 516.22000 1.5400 Radial Velocity 1.80 2.664831e+05 2.029036
2 14 And b 185.84000 0.8300 Radial Velocity 2.20 3.453651e+04 0.259903
3 14 Her b 1773.40000 2.7700 Radial Velocity 0.90 3.144948e+06 23.615481
4 16 Cyg B b 799.50000 1.6800 Radial Velocity 1.01 6.392002e+05 4.694685
... ... ... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 Radial Velocity 2.30 9.333025e+04 0.696353
3828 ups And b 4.61711 0.0590 Radial Velocity 1.27 2.131770e+01 0.000162
3829 ups And c 240.93700 0.8610 Radial Velocity 1.27 5.805064e+04 0.502581
3830 ups And d 1281.43900 2.5500 Radial Velocity 1.27 1.642086e+06 13.056201
3831 ups And e 3848.86000 5.2456 Radial Velocity 1.27 1.481372e+07 113.653232

1719 rows × 7 columns

Again, this is where the markdown cells come in handy. a_3_m makes perfect sense to me now, but will I remember what that column corresponds to a couple of months (or hours...) from now? Probably not. So add a markdown cell with the details of what you're doing and why when you're adding new columns.

We can now make the $T^2$ vs $\dfrac{a^3}{M_*}$ plot very simply using our two new columns:

In [25]:
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.plot(kepler_df['a_3_m'], kepler_df['T_sq'], 'k.',ls='None', label='Data')
ax.set_ylabel('$T^2$ (days$^2$)')
ax.set_xlabel('$a^3 / M_*$ (AU$^3$ $M_{\odot}^{-1}$)')
ax.legend()
plt.show()

This looks kind of how we would expect. But the units are horrible and we still need to do a least-squares fit. Let's add some new columns to the dataframe with sensible units. We'll need to convert the semi-major axis to m, the masses to kg and the periods to seconds. We could probably just convert the new columns we've just made, but doing this step by step will save your sanity in the long run.

The conversions we need are:

  • 1 AU = $1.496 \times 10^{11}$ m
  • 1 $M_\odot = 1.989 \times 10^{30}$ kg
  • 1 d = ..... s (this is an exercise for the reader)

Convert to sensible units

Create three new columns for orbital_period_sec, a_metres and mass_kg. Use these columns to recalculate the T_sq and a_3_m columns. Replot these new columns, updating your figure to have the correct units on the axis labels.

solution

If everything worked correctly you should end up with a figure that looks something like this:

In [28]:
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
ax.plot(kepler_df['a_3_m'], kepler_df['T_sq'], 'k.',ls='None', label='Data')
ax.set_ylabel('$T^2$ (s$^2$)')
ax.set_xlabel('$a^3 / M_*$ (m$^3$ kg$^{-1}$)')
ax.legend()
plt.show()

The last thing to do is the least-squares fit to estimate $G$.

Estimate $G \pm \sigma_G$

Use curve_fit to find an estimate of $G$ and its uncertainty. Add the best-fit line to your plot (and your legend). Add your value for $G \pm \sigma_G$ to the plot title. Tidy up your plot and write a caption in a markdown cell next to it.

solution

You should end up with something that looks like this:

In [32]:
fig
Out[32]:

Key Points:

-

In [ ]: