%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
plt.rcParams.update({'font.size':16})
pandas

This week we're going to look at how to combine different datasets using pandas.
So far we've been lucky that the data we've needed for our analyses has all been in a single file. But what if you have data from different sources? How do you combine your different sources into one dataframe?
We're going to be using data on a sample of Cepheid variables. Like the data you worked with last week, this is all real astronomical research data. We'll be using information from the European Space Agency's Gaia mission and NASA IPAC, along with some of my own data from optical, ground-based telescopes.
The first file we'll be working with is optical_data.csv, which contains photometric measurements of a sample of Cepheids in our own galaxy. First, we need to read the data into a dataframe, and take a look at it to see what's there.
optical_df = pd.read_csv('./data/optical_data.csv')
optical_df.head()
The data in this file is:

- Star_ID - the name of the star (these ones are nearby so they have actual names)
- logP - $\log_{10} P$, where $P$ is the period of the variable star (in days)
- mag_X - the apparent magnitude of the star in the B, V and I bands
- err_X - the uncertainty on the apparent magnitude in each band.

Given that I'm not displaying the whole dataframe, it's worth looking at some of its attributes so we know how much data there is. shape is an attribute of all dataframes, so that's a good place to start.
optical_df.shape
This tells us that optical_df has 59 rows and 8 columns - shape gives (rows, columns).
Dataframes also have methods associated with them (methods were covered in your semester 1 worksheets if you want a refresher). We've already used one method today - head(). Another useful method to use here is describe():
optical_df.describe()
describe() gives us a summary of the dataframe. The first row, count, tells us the number of rows in each column that contain a value. We can see from this that 2 rows are missing data in the mag_B and err_B columns, and 3 are missing data in the mag_I and err_I columns. That's OK for now, we'll keep those rows in.
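If you want to double-check where those gaps are, one quick way is to count the NaN values in each column directly:

# count how many missing (NaN) values each column of optical_df contains
optical_df.isna().sum()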
There are two other datafiles we're going to use: gaia_distances.csv and reddenings.csv.
The Gaia file contains parallax measurements in units of milliarcseconds (mas, or $10^{-3} \text{ arcseconds}$), for each star.
The reddenings file contains the extinction, $A_V$, and reddening, $E(B-V)$, at the position of each star. These are both given in magnitudes.
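Read each of these files into its own dataframe, called gaia_df and reddenings_df. Something like the following should work, assuming you've put both files in the same ./data/ directory as optical_data.csv:

# read the Gaia parallaxes and the reddenings into their own dataframes
gaia_df = pd.read_csv('./data/gaia_distances.csv')
reddenings_df = pd.read_csv('./data/reddenings.csv')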
If everything has worked correctly, your gaia_df dataframe should have 2 columns and 67 rows, and your reddenings_df dataframe should have 3 columns and 54 rows. Neither should have any NaN values. Your three dataframes should look like these:
optical_df.head()
gaia_df.head()
reddenings_df.head()
You may have noticed that while the Star_ID column exists in all the dataframes, the stars aren't in the same order. This is going to make it difficult to, for example, make a plot of parallax against mag_V. The star in row 1 in gaia_df is not the same star as row 1 in optical_df. What we need to do is combine all three of our dataframes into one big one.
We can combine dataframes using pd.merge(). We'll start by combining gaia_df and optical_df with pd.merge(). You can find the documentation for pd.merge() here.
merge() takes two dataframes as input. One is the left dataframe and one is the right dataframe. We also need to tell it which column the two dataframes have in common - in this case it's the Star_ID column. This is defined by the on keyword. The values in this column don't need to be in the same order; pandas can figure that out.
The final thing we need to provide is how we want the data to be merged. This is the how keyword.
There are several options for how:

- left - Uses the left dataframe as a reference, looking for matches in the right dataframe.
- right - The opposite of left.
- outer - Keeps all the rows from both dataframes; if it can't find a match it fills with NaNs.
- inner - Keeps the rows where the values in the on column appear in both dataframes.

(There is a 5th option - cross - which gives the "cartesian product" of the two dataframes. You can test this out if you want, but you won't need it for anything we're doing here.)
To see how the different how options work we're going to look quickly at two new dataframes. Copy the following code into a new cell in your notebook to create the df1 and df2 dataframes:
df1 = pd.DataFrame({'l_id': ['foo', 'bar', 'baz', 'foo'], 'number': [1, 2, 3, 5]})
df2 = pd.DataFrame({'r_id': ['foo', 'bar', 'baz', 'foo'], 'number': [5, 6, 7, 8]})
Take a look at each dataframe. We'll use df1 as the left and df2 as the right.
df1
df2
Both have a column called number, so we'll match on that column for now.
First we'll try a left join - where it uses the values in the number column in df1 as a reference and tries to find a match in df2:
pd.merge(left=df1, right=df2, on='number', how='left')
We can see that the only one it could find a match for was the last row, where number=5. It's kept the information from df1 and filled in NaNs in the r_id column for the rows that didn't have a match in df2.
If we switch to how='right' we get the opposite:
pd.merge(left=df1, right=df2, on='number', how='right')
The options that are going to be more useful for us are outer and inner.
When we merge df1 and df2 with how='inner' we get:
pd.merge(left=df1, right=df2, on='number', how='inner')
This is useful sometimes - we might only want to keep rows where we have all the data available. But the one we'll be using is how='outer':
pd.merge(left=df1, right=df2, on='number', how='outer')
This keeps all the rows, and fills in missing values with NaNs. This is what we want.
Now we know what merge is doing, we can merge the gaia_df and optical_df dataframes into one big one. We'll call the new dataframe cepheids_df.
cepheids_df = pd.merge(left=gaia_df, right=optical_df, on='Star_ID', how='outer')
cepheids_df
We now have a new dataframe with 67 rows and 9 columns - the same number of rows that we had in the larger of the two dataframes (gaia_df) but now with all the columns from both dataframes.
But optical_df only had 59 rows, so there must be some missing data. We should check...
cepheids_df[cepheids_df.isna().any(axis=1)]
So we've still got the two stars that were just missing some of the photometric data in optical_df -- delta Cep and TW Nor -- but now we've got some extra stars that don't have any photometric data. We'll keep them all in for now and tidy things up once we've finished merging.
You should now have a dataframe called cepheids_df that looks something like this:
cepheids_df
Now we have our nicely matched dataframe, we don't want to have to do all these steps every time we want to analyse the data. We can save the dataframe to a csv file using the to_csv() method.
This works in a very similar way to pd.read_csv(). There are a few arguments that we won't have used for read_csv that we'll want to use now.
Firstly, index:
index=False
index is the number you see in bold in the leftmost column when you display a dataframe. It's handy when you want to select a row while you're working with the data in pandas, but we don't need to save it to the file. Note: there may be times that you do want to write the index to the file, like if you wrote a script that had the index values it needed to use hard-coded, but generally you don't need them in the output file.
Second, float_format:
float_format='%.4f'
This tells to_csv how to format the numbers in the output file. float_format='%.4f' means 4 decimal places. If you don't set anything for float_format it will print all the decimal places. Sometimes that's what you want it to do, but it's more likely that you want to preserve the precision you have in your dataframe without all the noise that comes from how your computer stores floats.
Next, na_rep:
na_rep='NaN'
This tells it to print missing data as NaN in the file. If you don't set it, it will just print nothing (which can be OK...) but it's better to print something so when you come back to the file later you know that it's actually missing data, not just that you've accidentally deleted something from the file.
Finally, sep is the equivalent of delimiter in read_csv. The default is sep=',' (i.e. comma separated values) but you can change it to whatever you want. I find it easiest to leave it as the default so I don't have to remember what I used when I'm reading the file back in.
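If you want to see what these arguments do before writing the real file, one option is to call to_csv() without a filename - pandas then returns the csv text as a string you can print (shown here for just the first few rows of cepheids_df):

# preview the csv output for the first few rows without writing a file;
# to_csv() with no filename returns the text as a string
print(cepheids_df.head().to_csv(index=False, float_format='%.4f', na_rep='NaN'))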
Let's save cepheids_df to a file called big_cepheids_table.csv. Save it somewhere sensible!
cepheids_df.to_csv('./data/big_cepheids_table.csv', index=False, float_format='%.4f', na_rep='NaN')
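Next time you want to work with the merged data you can skip all the merging and read this file straight back in - a quick sketch, assuming you saved it to the path above:

# read the merged table back in, skipping all the merge steps
cepheids_df = pd.read_csv('./data/big_cepheids_table.csv')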
Now we have our data safe and sound we can go and have a cup of tea, watch Emmerdale and come back to this later.