%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
plt.rcParams.update({'font.size':16})
pandas

This week we're going to look at how to combine different datasets using pandas.
So far we've been lucky that the data we've needed for our analyses has all been in a single file. But what if you have data from different sources? How do you combine your different sources into one dataframe?
We're going to be using data on a sample of Cepheid variables. Like the data you worked with last week, this is all real astronomical research data. We'll be using information from the European Space Agency's Gaia mission and NASA IPAC, along with some of my own data from optical, ground-based telescopes.
The first file we'll be working with is optical_data.csv, which contains photometric measurements of a sample of Cepheids in our own galaxy. First, we need to read the data into a dataframe, and take a look at it to see what's there.
optical_df = pd.read_csv('./data/optical_data.csv')
optical_df.head()
The data in this file is:

- Star_ID - the name of the star (these ones are nearby so they have actual names)
- logP - $\log_{10} P$, where $P$ is the period of the variable star (in days)
- mag_X - the apparent magnitude of the star in the B, V and I bands
- err_X - the uncertainty on the apparent magnitude in each band.

Given that I'm not displaying the whole dataframe, it's worth looking at some of its attributes so we know how much data there is. shape is an attribute of all dataframes, so that's a good place to start.
optical_df.shape
This tells us that optical_df has 59 rows and 8 columns - shape gives (rows, columns).
Dataframes also have methods associated with them (methods were covered in your semester 1 worksheets if you want a refresher). We've already used one method today - head(). Another useful method to use here is describe():
optical_df.describe()
describe() gives us a summary of the dataframe. The first row, count, tells us the number of rows in each column that contain a value. We can see from this that 2 rows are missing data in the mag_B and err_B columns, and 3 are missing data in the mag_I and err_I columns. That's OK for now, we'll keep those rows in.
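If you want to double-check where those gaps are, one quick way is to count the NaN values in each column directly:

# count how many missing (NaN) values each column of optical_df contains
optical_df.isna().sum()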
There are two other datafiles we're going to use: gaia_distances.csv and reddenings.csv.
The Gaia file contains parallax measurements in units of milliarcseconds (mas, or $10^{-3} \text{ arcseconds}$), for each star.
The reddenings file contains the extinction, $A_V$, and reddening, $E(B-V)$, at the position of each star. These are both given in magnitudes.
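Read each of these files into its own dataframe, called gaia_df and reddenings_df. Something like the following should work, assuming you've put both files in the same ./data/ directory as optical_data.csv:

# read the Gaia parallaxes and the reddenings into their own dataframes
gaia_df = pd.read_csv('./data/gaia_distances.csv')
reddenings_df = pd.read_csv('./data/reddenings.csv')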
If everything has worked correctly, your gaia_df dataframe should have 2 columns and 67 rows, and your reddenings_df dataframe should have 3 columns and 54 rows. Neither should have any NaN values. Your three dataframes should look like these:
optical_df.head()
gaia_df.head()
reddenings_df.head()
You may have noticed that while the Star_ID column exists in all the dataframes, the stars aren't in the same order. This is going to make it difficult to, for example, make a plot of parallax against mag_V. The star in row 1 in gaia_df is not the same star as row 1 in optical_df. What we need to do is combine all three of our dataframes into one big one.
We can combine dataframes using pd.merge(). We'll start by combining gaia_df and optical_df with pd.merge(). You can find the documentation for pd.merge() here.
merge() takes two dataframes as input. One is the left dataframe and one is the right dataframe. We also need to tell it which column the two dataframes have in common - in this case it's the Star_ID column. This is defined by the on keyword. The values in this column don't need to be in the same order; pandas can figure that out.
The final thing we need to provide is how we want the data to be merged. This is the how keyword.
There are several options for how:

- left - Uses the left dataframe as a reference, looking for matches in the right dataframe.
- right - The opposite of left.
- outer - Keeps all the rows from both dataframes; if it can't find a match it fills with NaNs.
- inner - Keeps the rows where the values in the on column appear in both dataframes.

(There is a 5th option - cross - which gives the "cartesian product" of the two dataframes. You can test this out if you want, but you won't need it for anything we're doing here.)
To see how the different how options work we're going to look quickly at two new dataframes. Copy the following code into a new cell in your notebook to create the df1 and df2 dataframes:
df1 = pd.DataFrame({'l_id': ['foo', 'bar', 'baz', 'foo'], 'number': [1, 2, 3, 5]})
df2 = pd.DataFrame({'r_id': ['foo', 'bar', 'baz', 'foo'], 'number': [5, 6, 7, 8]})
Take a look at each dataframe. We'll use df1 as the left and df2 as the right.
df1
df2
Both have a column called number, so we'll match on that column for now.
First we'll try a left join - where it uses the values in the number column in df1 as a reference and tries to find a match in df2:
pd.merge(left=df1, right=df2, on='number', how='left')
We can see that the only one it could find a match for was the last row, where number=5. It's kept the information from df1 and filled in NaNs in the r_id column for the rows that didn't have a match in df2.
If we switch to how='right' we get the opposite:
pd.merge(left=df1, right=df2, on='number', how='right')
The options that are going to be more useful for us are outer and inner.
When we merge df1 and df2 with how='inner' we get:
pd.merge(left=df1, right=df2, on='number', how='inner')
This is useful sometimes - we might only want to keep rows where we have all the data available. But the one we'll be using is how='outer':
pd.merge(left=df1, right=df2, on='number', how='outer')
This keeps all the rows, and fills in missing values with NaNs. This is what we want.
Now we know what merge is doing, we can merge the gaia_df and optical_df dataframes into one big one. We'll call the new dataframe cepheids_df.
cepheids_df = pd.merge(left=gaia_df, right=optical_df, on='Star_ID', how='outer')
cepheids_df
We now have a new dataframe with 67 rows and 9 columns - the same number of rows that we had in the larger of the two dataframes (gaia_df) but now with all the columns from both dataframes.
But optical_df only had 59 rows, so there must be some missing data. We should check...
cepheids_df[cepheids_df.isna().any(axis=1)]
So we've still got the two stars that were just missing some of the photometric data in optical_df -- delta Cep and TW Nor -- but now we've got some extra stars that don't have any photometric data. We'll keep them all in for now and tidy things up once we've finished merging.
You should now have a dataframe called cepheids_df that looks something like this:
cepheids_df
Now we have our nicely matched dataframe, we don't want to have to do all these steps every time we want to analyse the data. We can save the dataframe to a csv file using the to_csv() method.
This works in a very similar way to pd.read_csv(). There are a few arguments that we won't have used for read_csv that we'll want to use now.
Firstly, index:
index=False
index is the number you see in bold in the leftmost column when you display a dataframe. It's handy when you want to select a row while you're working with the data in pandas, but we don't need to save it to the file. Note: there may be times that you do want to write the index to the file, like if you wrote a script that had the index values it needed to use hard-coded, but generally you don't need them in the output file.
Second, float_format:
float_format='%.4f'
This tells to_csv how to format the numbers in the output file. float_format='%.4f' means 4 decimal places. If you don't set anything for float_format it will print all the decimal places. Sometimes that's what you want it to do, but it's more likely that you want to preserve the precision you have in your dataframe without all the noise that comes from how your computer stores floats.
Next, na_rep:
na_rep='NaN'
This tells it to print missing data as NaN in the file. If you don't set it, it will just print nothing (which can be OK...) but it's better to print something so when you come back to the file later you know that it's actually missing data, not just that you've accidentally deleted something from the file.
Finally, sep is the equivalent of delimiter in read_csv. The default is sep=',' (i.e. comma separated values) but you can change it to whatever you want. I find it easiest to leave it as the default so I don't have to remember what I used when I'm reading the file back in.
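If you want to see what these arguments do before writing the real file, one option is to call to_csv() without a filename - pandas then returns the csv text as a string you can print (shown here for just the first few rows of cepheids_df):

# preview the csv output for the first few rows without writing a file;
# to_csv() with no filename returns the text as a string
print(cepheids_df.head().to_csv(index=False, float_format='%.4f', na_rep='NaN'))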
Let's save cepheids_df to a file called big_cepheids_table.csv. Save it somewhere sensible!
cepheids_df.to_csv('./data/big_cepheids_table.csv', index=False, float_format='%.4f', na_rep='NaN')
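Next time you want to work with the merged data you can skip all the merging and read this file straight back in - a quick sketch, assuming you saved it to the path above:

# read the merged table back in, skipping all the merge steps
cepheids_df = pd.read_csv('./data/big_cepheids_table.csv')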
Now we have our data safe and sound we can go and have a cup of tea, watch Emmerdale and come back to this later.