%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
plt.rcParams.update({'font.size':15})
What do we mean by "fitting a model to data"? A model is just something that describes your observations. It can be as simple as a straight-line fit with just one or two unknowns, or as complicated as a cosmological model with many unknown parameters.
When we're fitting a model to data we're using some kind of theory or prediction about how different observed variables are related. A common example is fitting a straight line, $y=mx + c$, to some data to try to find a linear relationship.
But what's actually happening when we do this? How are we determining $m$ and $c$?
We're going to look at the BoltData again to show how the process works. Find your file and read the data into a dataframe:
bolt_df = pd.read_csv('./data/BoltData.csv', names=('distance', 'err_distance', 'time', 'err_time'), header=0)
As always, have a quick look at the dataframe to make sure it's read in as expected. Don't skip this step! It will save you a lot of time in the long run to get in the habit of checking that it looks OK right at the start.
bolt_df
Previously we've used `curve_fit` to fit a best-fit line to our data. Here we're going to look at what `curve_fit` is actually doing.
When we worked with this data before we defined a function such as
def best_fit(x, m, c):
    return m*x + c
corresponding to $$ f(x) = m x + c $$
and used `curve_fit` to find the best values of $m$ and $c$ (the slope and intercept). But what is `curve_fit` actually doing? What do we mean by the "best" values?
`curve_fit` uses the least-squares method to find the best values.
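As a quick reminder of how we've used it before, here's a minimal sketch of a `curve_fit` call. The data here is made up purely for illustration (it lies close to the line $y = 2x + 1$); it is not the Bolt data:

```python
import numpy as np
from scipy.optimize import curve_fit

def best_fit(x, m, c):
    return m*x + c

# Made-up data lying close to y = 2x + 1 (purely illustrative)
x = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

popt, pcov = curve_fit(best_fit, x, y)
print(popt)  # best-fit values of m and c
```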
The figure below illustrates what's happening:
Our data has $N$ data points with the values $x_i$, $y_i$. The vertical distance between each point and the line (the red lines on the figure) is $$ \Delta_i = y_i - m x_i - c $$
Least squares finds the values of $m$ and $c$ that minimise the sum of the squared distances, i.e. it minimises $$ S = \sum_{i=1}^{N} \Delta_i^2 = \sum_{i=1}^{N} (y_i - m x_i - c)^2 $$
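To make this concrete, here's a short sketch that computes $S$ for two candidate $(m, c)$ pairs on made-up data (not the Bolt data): a line close to the points gives a much smaller $S$ than a poor one.

```python
import numpy as np

# Made-up data for illustration, roughly following y = 2x + 1
x = np.array([0., 1., 2., 3., 4.])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def sum_of_squares(m, c):
    residuals = y - m*x - c        # Delta_i for each point
    return np.sum(residuals**2)    # S = sum of Delta_i squared

print(sum_of_squares(2.0, 1.0))    # a line close to the data: small S
print(sum_of_squares(1.0, 1.0))    # a worse line: much larger S
```

Least squares is just a systematic way of searching over all possible $(m, c)$ for the pair with the smallest $S$.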
Why does least squares minimise $\sum \Delta_i^2$? Look at the figure above and think about why we want to minimise the squared values rather than just $\sum \Delta_i$.
To find the values of $m$ and $c$ that minimise this sum we have to differentiate it with respect to $m$ and $c$ and solve the equations
$$\dfrac{\partial S}{\partial m} = 0$$ and $$\dfrac{\partial S}{\partial c} = 0 $$
For our example of a straight line this is relatively straightforward.
For $\dfrac{\partial S}{\partial c}=0$ we find
$$ \dfrac{\partial S}{\partial c} = -2 \sum_{i=1}^{N}(y_i - m x_i - c) = 0 $$ (from here onwards $\sum = \sum_{i=1}^{N}$ to make it easier to read)
Rearranging gives
$$ \sum y_i - m \sum x_i - Nc = 0 $$
$$ c = \dfrac{\sum y_i - m \sum x_i}{N} $$

The same thing is done for $\dfrac{\partial S}{\partial m}=0$, substituting in the expression for $c$ found above. I haven't included all the steps here, but you can work through the maths to convince yourself that it works. You should end up with
$$ m = \dfrac{N \sum \left(x_i y_i\right) - \sum x_i \sum y_i}{N\sum\left(x_i^2\right) - \left(\sum x_i\right)^2} $$

We can now calculate the best values of $m$ and $c$ by finding all the different summation values that we need. For a straight-line fit we need $\sum x_i$, $\sum y_i$, $\sum x_i y_i$ and $\sum (x_i^2)$. We can find these using the `sum` function from `numpy`.
$\sum x_i$ and $\sum y_i$ are straightforward; they are just the sums of the $x$ (distance) and $y$ (time) values in our table. `numpy` can do that for us:
sum_x = np.sum(bolt_df.distance)
sum_y = np.sum(bolt_df.time)
print(sum_x, sum_y)
$\sum (x_i^2)$ is also easy to do:
sum_x_2 = np.sum((bolt_df.distance**2))
print(sum_x_2)
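We also need $\sum x_i y_i$, which is an element-wise product of the two columns followed by a sum. A sketch with a small made-up dataframe standing in for `bolt_df` (the column names match, but the numbers are invented):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for bolt_df -- not the real Bolt data
df = pd.DataFrame({'distance': [0., 10., 20., 30.],
                   'time':     [0., 1.1, 2.0, 3.1]})

# Element-wise product of the columns, then sum over all rows
sum_xy = np.sum(df.distance * df.time)
print(sum_xy)
```

With the real data you would do the same with `bolt_df.distance * bolt_df.time`.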
Now we can plug these into the expressions we have for $m$ and $c$.
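Putting it all together, here's a self-contained sketch on made-up $(x, y)$ values (standing in for the distance and time columns) that computes every sum and plugs them into the formulas for $m$ and $c$ derived above:

```python
import numpy as np

# Made-up (x, y) data for illustration -- not the real Bolt data
x = np.array([0., 10., 20., 30., 40.])
y = np.array([0., 1.0, 2.1, 2.9, 4.0])
N = len(x)

# All the summation terms needed for a straight-line fit
sum_x   = np.sum(x)
sum_y   = np.sum(y)
sum_xy  = np.sum(x*y)
sum_x_2 = np.sum(x**2)

# The least-squares formulas derived above
m = (N*sum_xy - sum_x*sum_y) / (N*sum_x_2 - sum_x**2)
c = (sum_y - m*sum_x) / N

print(m, c)
```

You can check the result against `curve_fit` (or `np.polyfit(x, y, 1)`); all three should agree, since they're minimising the same sum of squares.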
Hopefully you've ended up with a figure that looks something like this: