%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
plt.rcParams.update({'font.size':15})
Sometimes we already have an idea of the function that we want to fit to our data. For the BoltData we know that $t = d/v$ so we know that we just want a linear relation between distance and time. But what about when we don't already know how the variables are related?
This is not unusual - this is science! Often we don't already know how different observed variables are related because that's what we're trying to find out. So where do we start?
The first thing you should always do is make a plot. Have a look at your data - does it look like a linear relationship? Does it look exponential or perhaps sinusoidal?
For this section we're going to use some data on Cepheid variable stars. If you've not come across Cepheids before, they are stars that change in brightness over time. The dataset we're using here is a sample of Cepheids in our nearest galaxy, the Large Magellanic Cloud.
Download the LMC_Cepheids.csv file and read it in to a pandas
dataframe. I'm calling mine lmc_df
. Have a look at the data
lmc_df = pd.read_csv('./data/LMC_Cepheids.csv')
lmc_df.head()
You should have a dataframe with 85 rows and 8 columns. The columns are:
star_ID
- Name of the starperiod
- the period of the star's changes in brightnessmag_1
and mag_2
- the apparent magnitude of the star at two different wavelengths. Here mag_1
corresponds to $\lambda = 3.6 \mu\text{m}$ and mag_2
is $\lambda = 4.5 \mu\text{m}$. err_mag_1
and err_mag_2
- uncertainties on the two magnitudescolour
- the star's colour; here it's defined as mag_1 - mag_2
err_colour
- the uncertainty on the colour.The periods are given in days and magnitudes, colours and their respective uncertainties are all given in units of magnitudes.
One of the cool things about Cepheids is that they pulsate. They're basically really big simple harmonic oscillators. The period we measure is the period of those pulsations. The really intersting thing is that their period is related to their intrinsic brightness. So let's have a look at that.
You should end up with a plot that looks something like this:
The first thing we need to do to this plot is to change the y
axis to go in the opposite direction. As described in the information box further up this page, magnitudes are backwards.
We can switch the direction of the y
axis using invert_yaxis
. This is where calling setting up our plot with
fig = plt.figure(figsize=(10,6))
ax = fig.add_subplot(1,1,1)
rather than just using plt.plot
comes in handy.
We don't need to go back and edit all our code. We can just use ax.invert_yaxis()
to flip the axis and call fig
to display our updated figure.
ax.invert_yaxis()
fig
Now we can start to think about the type of function we should use to fit the data. The brightness is increasing as a function of period but it doesn't look to be a linear relation.
Remember the relation between magnitude and flux given earlier: $M \propto - \log_{10} F$. So that's a good place to start. We'll try a function of the form
$$ M = a \log_{10} P + b $$When we define our function we want it to return $M$, using $P$ as the input data and find the values of $a$ and $b$:
def log_p_mag(p, a, b):
mag = a * np.log10(p) + b
return(mag)
We can now use curve_fit
to find the best values of a
and b
popt, pcov = curve_fit(log_p_mag, lmc_df['period'], lmc_df['mag_1'])
a = popt[0]
b = popt[1]
err_a = pcov[0][0]
err_b = pcov[1][1]
print("a = {0:.3f} +/- {1:.3f}".format(a, err_a))
print("b = {0:.3f} +/- {1:.3f}".format(b, err_b))
And of course we should make a plot and see how well it fits
fig2 = plt.figure(figsize=(10,6))
ax2 = fig2.add_subplot(1,1,1)
ax2.plot(lmc_df['period'], lmc_df['mag_1'], color='k', marker='o', linestyle='None', label='LMC Cepheids')
p = np.linspace(1, 150, 100)
ax2.plot(p, log_p_mag(p, a, b), color='hotpink', ls='--', label='best fit')
ax2.set_xlabel('Period (days)')
ax2.set_ylabel('Mag 1 (magnitudes)')
ax2.legend()
ax2.invert_yaxis()
plt.show()
You may have noticed in the cell above I defined a new variable, p = np.linspace(1, 150, 100)
. I can't just use the lmc_df['period']
values as the x
values to plot the line as they're not in order of increasing or decreasing period. plt.plot
plots the data in the order it sees it, so here the line would go all over the place. By using the p
variable as the x
values instead we get what we want - a nice line that shows the relation.
While this plot looks OK, we can change a couple of things to make it more convincing.
Look at where the points fall along the x-axis. They aren't evenly spaced, there's lots on the left hand side and fewer as we go to longer periods. Now we know that $M \propto \log_{10} P$ we can re-plot the data to take that into account
fig3 = plt.figure(figsize=(10,6))
ax3 = fig3.add_subplot(1,1,1)
ax3.plot(np.log10(lmc_df['period']), lmc_df['mag_1'], color='k', marker='o', linestyle='None', label='LMC Cepheids')
p = np.linspace(lmc_df['period'].min(), lmc_df['period'].max(), 100)
ax3.plot(np.log10(p), log_p_mag(p, a, b), color='hotpink', ls='--', label='best fit')
ax3.set_xlabel('$\log_{10}(P)$ (days)')
ax3.set_ylabel('Mag 1 (magnitudes)')
ax3.legend()
ax3.invert_yaxis()
plt.show()
This is a much more convincing plot! We can see the details more clearly now the values are more spaced out along the x-axis.