In [1]:
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
plt.rcParams.update({'font.size':16})

Combining datasets with pandas

Overview:

Questions

  • How do I work with multiple datasets?
  • How do I match tables?
  • How do I save a table?

Objectives

  • Combine data from different files into one dataframe
  • Save the combined data to a file

This week we're going to look at how to combine different datasets using pandas.

So far we've been lucky that the data we've needed for our analyses has all been in a single file. But what about if you have data from different sources? How do you combine your different sources into one dataframe?

We're going to be using data on a sample of Cepheid variables. Like the data you worked with last week, this is all real astronomical research data. We'll be using information from the European Space Agency's Gaia mission, NASA IPAC along with some of my own data from optical, ground-based telescopes.

Reading in the optical data

The first file we'll be working is optical_data.csv, which contains photometric measurements) of a sample of Cepheids in our own galaxy. First, we need to read the data into a dataframe, and take a look at it to see what's there.

In [2]:
optical_df = pd.read_csv('./data/optical_data.csv')
In [3]:
optical_df.head()
Out[3]:
Star_ID logP mag_B err_B mag_V err_V mag_I err_I
0 RT Aur 0.571 6.120 0.017 5.487 0.011 4.822 0.006
1 QZ Nor 0.578 9.782 0.007 8.875 0.004 7.871 0.003
2 SU Cyg 0.585 7.493 0.015 6.890 0.011 6.208 0.007
3 Y Lac 0.636 9.921 0.016 9.163 0.011 8.312 0.007
4 T Vul 0.647 6.444 0.014 5.772 0.009 5.087 0.006

.head()

In the cell above I've used

optical_df.head()

to display only the first 5 rows of the dataframe rather than the whole thing. You could also use

optical_df[:5]

to get the same output.

You can change the number of rows that head() displays by passing the number of rows as an argument, e.g.

optical_df.head(10)

The data in this file is:

  • Star_ID - the name of the star (these one's are nearby so they have actual names)
  • logP - $\log_{10} P$, where $P$ is the period of the variable star (in days)
  • mag_X - the apparent magnitude of the star in the B, V and I bands
  • err_X - the uncertainty on the apparent magnitude in each band.

Given that I'm not displaying the whole dataframe, it's worth looking at some of it's attributes so we know how much data there is. shape is an attribute of all dataframes, so that's a good place to start.

In [4]:
optical_df.shape
Out[4]:
(59, 8)

This tells us that optical_df has 8 columns and 59 rows.

Dataframes also have methods associated with them (methods were covered in your semester 1 worksheets if you want a refresher). We've already used one method today - head(). Another useful method to use here is describe():

In [5]:
optical_df.describe()
Out[5]:
logP mag_B err_B mag_V err_V mag_I err_I
count 59.000000 58.000000 58.000000 59.000000 59.000000 57.000000 57.000000
mean 1.037034 8.600448 0.017224 7.462017 0.011186 6.251632 0.007211
std 0.298511 2.092962 0.004974 1.889636 0.003355 1.662180 0.002111
min 0.571000 4.586000 0.004000 3.749000 0.002000 2.564000 0.002000
25% 0.796500 7.450250 0.014250 6.264500 0.009000 5.204000 0.006000
50% 1.007000 8.662000 0.017000 7.504000 0.011000 6.301000 0.007000
75% 1.255500 9.915750 0.021000 8.887500 0.014000 7.486000 0.009000
max 1.653000 13.071000 0.027000 11.728000 0.018000 10.080000 0.011000

describe() gives us a summary of the dataframe. The first row, count tells us the number of rows in each column that contain a value. We can see from this that 2 rows are missing data in the mag_B and err_B columns, and 3 are missing data in the mag_I and err_I columns. That's OK for now, we'll keep those rows in.

Checking which stars are missing data

If you haven't already, download the optical_data.csv file and save it somewhere sensible.

Read the data into a dataframe and use head(), shape and describe() to check your dataframe.

Some of the stars are missing some data. Check which stars are missing data in any of the columns. If you need reminding how to check for missing data you can look back at the cleaning data notebook from a couple of weeks ago.

solution

There are two other datafiles we're going to use. gaia_distances.csv and reddenings.csv.

The Gaia file contains parallax measurements in units of milliarcseconds (mas, or $10^{-3} \text{ arcseconds}$), for each star.

The reddenings file contains the extinction, $A_V$, and reddening, $E(B-V)$, at the position of each star. These are both given in magnitudes.

Read in the Gaia data and the reddening data

Download the gaia_distances.csv and reddenings.csv files and save them in your data folder.

Read these files into dataframes and use head(), shape and describe() to check your them.

For each of the three dataframes (optical data, Gaia data, reddenings) write some information about what the file contains (i.e. column names, units, number of objects, any missing data) in a markdown cell.

If everything has worked correctly, your gaia_df dataframe should have 2 columns and 67 rows, and your reddenings_df dataframe should have 3 columns and 54 rows. Neither should have any NaN values. Your three dataframes should look like these:

In [18]:
optical_df.head()
Out[18]:
Star_ID logP mag_B err_B mag_V err_V mag_I err_I
0 RT Aur 0.571 6.120 0.017 5.487 0.011 4.822 0.006
1 QZ Nor 0.578 9.782 0.007 8.875 0.004 7.871 0.003
2 SU Cyg 0.585 7.493 0.015 6.890 0.011 6.208 0.007
3 Y Lac 0.636 9.921 0.016 9.163 0.011 8.312 0.007
4 T Vul 0.647 6.444 0.014 5.772 0.009 5.087 0.006
In [19]:
gaia_df.head()
Out[19]:
Star_ID parallax_mas
0 XX Cen 0.564
1 T Mon 0.733
2 TW Nor 0.362
3 CV Mon 0.602
4 RY Sco 0.754
In [20]:
reddenings_df.head()
Out[20]:
Star_ID E_B_V A_V
0 RT Aur 0.1844 0.5717
1 QZ Nor 1.2364 3.8329
2 SU Cyg 0.9989 3.0967
3 Y Lac 0.3880 1.2028
4 T Vul 0.1702 0.5278

Combining the dataframes

You may have noticed that while the Star_ID column exists in all the dataframes, the stars aren't in the same order. This is going to make it difficult to, for example, make a plot of parallax against mag_V. The star in row 1 in gaia_df is not the same star as row 1 in optical_df. What we need to do is combine all three of our dataframes into one big one.

We can combine dataframes using pd.merge()

We'll start by combining gaia_df and optical_df with pd.merge(). You can find the documentation for pd.merge() here

merge() takes two dataframes as input. One is the left dataframe and one is the right dataframe. We also need to tell it which column the two dataframes have in common - in this case it's the Star_ID column. This is defined by the on keyword The values in this column don't need to be in the same order, pandas can figure that out.

The final thing we need to provide is how we want the data to be merged. This is the how keyword.

There are several options for how:

  • left - Uses the left dataframe as a reference, looking for matches in the right dataframe.
  • right - The opposite of left
  • outer - Keeps all the rows from both dataframes, if it can't find a match fills with NaNs
  • inner - Keeps the rows where the values in the on column appear in both dataframes.

(There is a 5th option - cross which gives the "cartesian product" of the two dataframes. You can test this out if you want, but you won't need it for anything we're doing here.)

To see how the different how options work we're going to look quickly at two new dataframes. Copy the following code into a new cell in your notebook to create the df1 and df2 dataframes:

df1 = pd.DataFrame({'l_id': ['foo', 'bar', 'baz', 'foo'],'number': [1, 2, 3, 5]})
df2 = pd.DataFrame({'r_id': ['foo', 'bar', 'baz', 'foo'],'number': [5, 6, 7, 8]})

Take a look at each dataframe. We'll use df1 as the left and df2 as the right.

In [21]:
df1
Out[21]:
l_id number
0 foo 1
1 bar 2
2 baz 3
3 foo 5
In [22]:
df2
Out[22]:
r_id number
0 foo 5
1 bar 6
2 baz 7
3 foo 8

Both have a column called number, so we'll match on that column for now.

First we'll try a left join - where it uses the values in the number column in df1 as a reference and tries to find a match in df2:

In [23]:
pd.merge(left=df1, right=df2, on='number', how='left')
Out[23]:
l_id number r_id
0 foo 1 NaN
1 bar 2 NaN
2 baz 3 NaN
3 foo 5 foo

We can see that the only one it could find a match for was the last row, where number=5. It's kept the information from df1 and filled in NaNs in the r_id column for the rows that didn't have a match in df2.

If we switch to how='right' we get the opposite:

In [24]:
pd.merge(left=df1, right=df2, on='number', how='right')
Out[24]:
l_id number r_id
0 foo 5 foo
1 NaN 6 bar
2 NaN 7 baz
3 NaN 8 foo

The options that are going to be more useful for us are outer and inner.

When we merge df1 and df2 with how='inner' we get

In [25]:
pd.merge(left=df1, right=df2, on='number', how='inner')
Out[25]:
l_id number r_id
0 foo 5 foo

This is useful sometimes - we might only want to keep rows where we have all the data available. But the one we'll be using is how='outer':

In [26]:
pd.merge(left=df1, right=df2, on='number', how='outer')
Out[26]:
l_id number r_id
0 foo 1 NaN
1 bar 2 NaN
2 baz 3 NaN
3 foo 5 foo
4 NaN 6 bar
5 NaN 7 baz
6 NaN 8 foo

This keeps all the rows, and fills in missing values with NaNs. This is what we want.

Now we know what merge is doing we can merge the gaia_df and optical_df dataframes into one big one. We'll call the new dataframe cepheids_df.

In [27]:
cepheids_df = pd.merge(left=gaia_df, right=optical_df, on='Star_ID', how='outer')
In [28]:
cepheids_df
Out[28]:
Star_ID parallax_mas logP mag_B err_B mag_V err_V mag_I err_I
0 XX Cen 0.564 1.040 8.882 0.019 7.855 0.012 6.754 0.008
1 T Mon 0.733 1.432 7.436 0.022 6.187 0.014 5.005 0.010
2 TW Nor 0.362 NaN NaN NaN NaN NaN NaN NaN
3 CV Mon 0.602 0.731 11.681 0.015 10.314 0.010 8.653 0.006
4 RY Sco 0.754 1.308 9.568 0.018 8.037 0.012 6.271 0.008
5 TT Aql 0.994 1.138 8.560 0.022 7.185 0.014 5.745 0.009
6 QZ Nor 0.471 0.578 9.782 0.007 8.875 0.004 7.871 0.003
7 Y Oph 1.349 1.234 7.573 0.011 6.161 0.007 4.543 0.005
8 VW Cen 0.256 1.177 11.754 0.022 10.306 0.014 8.802 0.009
9 V340 Nor 0.490 1.053 NaN NaN 8.407 0.005 NaN NaN
10 GY Sge 0.346 NaN NaN NaN NaN NaN NaN NaN
11 WZ Car 0.281 1.362 10.614 0.027 9.343 0.018 8.002 0.011
12 CD Cyg 0.393 1.232 10.422 0.025 9.023 0.016 7.525 0.010
13 FF Aql 1.920 0.650 6.159 0.007 5.383 0.005 4.503 0.004
14 Y Lac 0.430 0.636 9.921 0.016 9.163 0.011 8.312 0.007
15 BB Sgr 1.194 0.822 7.987 0.013 6.965 0.008 5.855 0.006
16 Z Lac 0.510 1.037 9.623 0.019 8.448 0.012 7.212 0.008
17 BG Lac 0.582 0.727 9.878 0.013 8.900 0.009 7.825 0.005
18 UU Mus 0.308 1.066 11.040 0.020 9.839 0.013 8.502 0.008
19 KN Cen 0.240 1.532 11.604 0.021 9.918 0.015 8.024 0.011
20 RU Sct 0.521 1.294 11.291 0.023 9.526 0.015 7.486 0.009
21 DL Cas 0.581 NaN NaN NaN NaN NaN NaN NaN
22 RS Pup 0.587 1.617 8.580 0.019 7.057 0.013 5.507 0.009
23 CE Cas B 0.331 NaN NaN NaN NaN NaN NaN NaN
24 SV Vul 0.405 1.653 8.810 0.018 7.267 0.012 5.719 0.009
25 VY Car 0.564 1.277 8.744 0.015 7.510 0.012 6.301 0.007
26 V367 Sct 0.470 NaN NaN NaN NaN NaN NaN NaN
27 U Nor 0.622 1.102 10.939 0.020 9.273 0.013 7.366 0.008
28 LS Pup 0.213 1.151 11.790 0.020 10.500 0.013 9.087 0.008
29 AQ Pup 0.290 1.479 10.197 0.022 8.756 0.015 7.178 0.010
... ... ... ... ... ... ... ... ... ...
37 CS Vel 0.274 0.771 13.071 0.017 11.728 0.011 10.080 0.007
38 V Cen 1.413 0.740 7.769 0.017 6.857 0.011 5.805 0.007
39 CF Cas 0.315 NaN NaN NaN NaN NaN NaN NaN
40 T Vel 0.936 0.667 9.001 0.014 8.047 0.009 6.971 0.006
41 X Cyg 0.909 1.214 7.659 0.023 6.434 0.014 5.254 0.009
42 VZ Cyg 0.541 0.687 9.900 0.015 8.983 0.010 7.970 0.006
43 CE Cas A 0.331 NaN NaN NaN NaN NaN NaN NaN
44 SW Vel 0.409 1.370 9.435 0.022 8.189 0.014 6.875 0.009
45 SZ Aql 0.521 1.234 10.227 0.025 8.697 0.016 7.103 0.010
46 S Vul 0.231 NaN NaN NaN NaN NaN NaN NaN
47 RZ Vel 0.660 1.310 8.408 0.026 7.170 0.017 5.898 0.011
48 GH Lup 0.864 0.967 8.851 0.004 7.632 0.002 6.360 0.002
49 RY Vel 0.377 1.449 9.872 0.021 8.421 0.014 6.846 0.009
50 X Sgr 2.822 0.846 5.334 0.013 4.569 0.009 3.663 0.006
51 V496 Aql 0.976 0.833 8.937 0.008 7.769 0.005 6.491 0.004
52 V350 Sgr 0.806 0.712 8.449 0.015 7.504 0.010 6.453 0.006
53 U Vul 1.291 0.903 8.492 0.016 7.149 0.011 5.609 0.007
54 W Sgr 2.354 0.881 5.472 0.017 4.694 0.011 3.863 0.007
55 SU Cyg 1.036 0.585 7.493 0.015 6.890 0.011 6.208 0.007
56 S Sge 1.685 0.923 6.488 0.016 5.641 0.011 4.782 0.006
57 beta Dor 2.917 0.993 4.586 0.013 3.758 0.009 2.946 0.006
58 U Car 0.559 1.589 7.625 0.025 6.342 0.016 5.076 0.010
59 l Car 1.942 1.551 5.048 0.017 3.749 0.011 2.564 0.007
60 zeta Gem 3.064 1.007 4.735 0.011 3.901 0.007 3.100 0.005
61 Y Sgr 2.003 0.761 6.657 0.016 5.766 0.010 4.801 0.006
62 delta Cep 3.556 0.730 4.684 0.018 3.990 0.012 NaN NaN
63 U Aql 1.748 0.847 7.536 0.016 6.457 0.011 5.279 0.007
64 eta Aql 3.674 0.856 4.744 0.017 3.918 0.011 3.036 0.007
65 RT Aur 1.841 0.571 6.120 0.017 5.487 0.011 4.822 0.006
66 SU Cru 0.211 1.109 11.613 0.015 9.802 0.008 7.658 0.003

67 rows × 9 columns

We now have a new dataframe with 67 rows and 9 columns - the same number of rows that we had in the larger of the two dataframes (gaia_df) but now with all the columns from both dataframes.

But optical_df only had 59 rows, so there must be some missing data. We should check...

In [29]:
cepheids_df[cepheids_df.isna().any(axis=1)]
Out[29]:
Star_ID parallax_mas logP mag_B err_B mag_V err_V mag_I err_I
2 TW Nor 0.362 NaN NaN NaN NaN NaN NaN NaN
9 V340 Nor 0.490 1.053 NaN NaN 8.407 0.005 NaN NaN
10 GY Sge 0.346 NaN NaN NaN NaN NaN NaN NaN
21 DL Cas 0.581 NaN NaN NaN NaN NaN NaN NaN
23 CE Cas B 0.331 NaN NaN NaN NaN NaN NaN NaN
26 V367 Sct 0.470 NaN NaN NaN NaN NaN NaN NaN
39 CF Cas 0.315 NaN NaN NaN NaN NaN NaN NaN
43 CE Cas A 0.331 NaN NaN NaN NaN NaN NaN NaN
46 S Vul 0.231 NaN NaN NaN NaN NaN NaN NaN
62 delta Cep 3.556 0.730 4.684 0.018 3.990 0.012 NaN NaN

So we've still got the two stars that were just missing some of the photometric data in optical_df -- delta Cep and TW Nor, but now we've got some extra stars that don't have any photometric data. We'll keep them all in for now and tidy things up once we've finished merging.

Add in the reddening columns

Use pd.merge() again to add the columns from reddening_df to the cepheids_df dataframe. Check what the new dataframe looks like and if there is any missing data.

Hint: You can run merge so that the resulting dataframe overwrites one of your existing ones. You don't have to create a new dataframe.

You should now have a dataframe called cepheids_df that looks something like this:

In [33]:
cepheids_df
Out[33]:
Star_ID parallax_mas logP mag_B err_B mag_V err_V mag_I err_I E_B_V A_V
0 XX Cen 0.564 1.040 8.882 0.019 7.855 0.012 6.754 0.008 0.5223 1.6191
1 T Mon 0.733 1.432 7.436 0.022 6.187 0.014 5.005 0.010 0.5997 1.8590
2 TW Nor 0.362 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 CV Mon 0.602 0.731 11.681 0.015 10.314 0.010 8.653 0.006 1.4373 4.4556
4 RY Sco 0.754 1.308 9.568 0.018 8.037 0.012 6.271 0.008 1.2228 3.7907
5 TT Aql 0.994 1.138 8.560 0.022 7.185 0.014 5.745 0.009 0.9672 2.9983
6 QZ Nor 0.471 0.578 9.782 0.007 8.875 0.004 7.871 0.003 1.2364 3.8329
7 Y Oph 1.349 1.234 7.573 0.011 6.161 0.007 4.543 0.005 0.8138 2.5227
8 VW Cen 0.256 1.177 11.754 0.022 10.306 0.014 8.802 0.009 1.3431 4.1638
9 V340 Nor 0.490 1.053 NaN NaN 8.407 0.005 NaN NaN 1.2831 3.9778
10 GY Sge 0.346 NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 WZ Car 0.281 1.362 10.614 0.027 9.343 0.018 8.002 0.011 NaN NaN
12 CD Cyg 0.393 1.232 10.422 0.025 9.023 0.016 7.525 0.010 NaN NaN
13 FF Aql 1.920 0.650 6.159 0.007 5.383 0.005 4.503 0.004 0.4996 1.5486
14 Y Lac 0.430 0.636 9.921 0.016 9.163 0.011 8.312 0.007 0.3880 1.2028
15 BB Sgr 1.194 0.822 7.987 0.013 6.965 0.008 5.855 0.006 0.2405 0.7455
16 Z Lac 0.510 1.037 9.623 0.019 8.448 0.012 7.212 0.008 0.6702 2.0775
17 BG Lac 0.582 0.727 9.878 0.013 8.900 0.009 7.825 0.005 0.3374 1.0461
18 UU Mus 0.308 1.066 11.040 0.020 9.839 0.013 8.502 0.008 0.9659 2.9944
19 KN Cen 0.240 1.532 11.604 0.021 9.918 0.015 8.024 0.011 1.5576 4.8286
20 RU Sct 0.521 1.294 11.291 0.023 9.526 0.015 7.486 0.009 NaN NaN
21 DL Cas 0.581 NaN NaN NaN NaN NaN NaN NaN 0.7936 2.4602
22 RS Pup 0.587 1.617 8.580 0.019 7.057 0.013 5.507 0.009 0.8197 2.5412
23 CE Cas B 0.331 NaN NaN NaN NaN NaN NaN NaN 0.8207 2.5440
24 SV Vul 0.405 1.653 8.810 0.018 7.267 0.012 5.719 0.009 NaN NaN
25 VY Car 0.564 1.277 8.744 0.015 7.510 0.012 6.301 0.007 1.0360 3.2115
26 V367 Sct 0.470 NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 U Nor 0.622 1.102 10.939 0.020 9.273 0.013 7.366 0.008 NaN NaN
28 LS Pup 0.213 1.151 11.790 0.020 10.500 0.013 9.087 0.008 0.7634 2.3665
29 AQ Pup 0.290 1.479 10.197 0.022 8.756 0.015 7.178 0.010 0.7818 2.4236
... ... ... ... ... ... ... ... ... ... ... ...
37 CS Vel 0.274 0.771 13.071 0.017 11.728 0.011 10.080 0.007 1.9023 5.8970
38 V Cen 1.413 0.740 7.769 0.017 6.857 0.011 5.805 0.007 1.1718 3.6327
39 CF Cas 0.315 NaN NaN NaN NaN NaN NaN NaN 0.8207 2.5440
40 T Vel 0.936 0.667 9.001 0.014 8.047 0.009 6.971 0.006 0.8183 2.5367
41 X Cyg 0.909 1.214 7.659 0.023 6.434 0.014 5.254 0.009 0.6879 2.1326
42 VZ Cyg 0.541 0.687 9.900 0.015 8.983 0.010 7.970 0.006 0.2543 0.7883
43 CE Cas A 0.331 NaN NaN NaN NaN NaN NaN NaN 0.8207 2.5440
44 SW Vel 0.409 1.370 9.435 0.022 8.189 0.014 6.875 0.009 0.9399 2.9137
45 SZ Aql 0.521 1.234 10.227 0.025 8.697 0.016 7.103 0.010 1.3786 4.2736
46 S Vul 0.231 NaN NaN NaN NaN NaN NaN NaN 1.9095 5.9195
47 RZ Vel 0.660 1.310 8.408 0.026 7.170 0.017 5.898 0.011 1.1769 3.6484
48 GH Lup 0.864 0.967 8.851 0.004 7.632 0.002 6.360 0.002 1.1761 3.6460
49 RY Vel 0.377 1.449 9.872 0.021 8.421 0.014 6.846 0.009 1.4953 4.6355
50 X Sgr 2.822 0.846 5.334 0.013 4.569 0.009 3.663 0.006 NaN NaN
51 V496 Aql 0.976 0.833 8.937 0.008 7.769 0.005 6.491 0.004 0.3182 0.9863
52 V350 Sgr 0.806 0.712 8.449 0.015 7.504 0.010 6.453 0.006 0.3277 1.0158
53 U Vul 1.291 0.903 8.492 0.016 7.149 0.011 5.609 0.007 NaN NaN
54 W Sgr 2.354 0.881 5.472 0.017 4.694 0.011 3.863 0.007 0.5578 1.7292
55 SU Cyg 1.036 0.585 7.493 0.015 6.890 0.011 6.208 0.007 0.9989 3.0967
56 S Sge 1.685 0.923 6.488 0.016 5.641 0.011 4.782 0.006 0.2888 0.8953
57 beta Dor 2.917 0.993 4.586 0.013 3.758 0.009 2.946 0.006 0.0539 0.1669
58 U Car 0.559 1.589 7.625 0.025 6.342 0.016 5.076 0.010 NaN NaN
59 l Car 1.942 1.551 5.048 0.017 3.749 0.011 2.564 0.007 0.2190 0.6788
60 zeta Gem 3.064 1.007 4.735 0.011 3.901 0.007 3.100 0.005 0.0699 0.2166
61 Y Sgr 2.003 0.761 6.657 0.016 5.766 0.010 4.801 0.006 1.7647 5.4707
62 delta Cep 3.556 0.730 4.684 0.018 3.990 0.012 NaN NaN 1.4914 4.6233
63 U Aql 1.748 0.847 7.536 0.016 6.457 0.011 5.279 0.007 0.3334 1.0337
64 eta Aql 3.674 0.856 4.744 0.017 3.918 0.011 3.036 0.007 0.1936 0.6001
65 RT Aur 1.841 0.571 6.120 0.017 5.487 0.011 4.822 0.006 0.1844 0.5717
66 SU Cru 0.211 1.109 11.613 0.015 9.802 0.008 7.658 0.003 NaN NaN

67 rows × 11 columns

Saving the data to a file

Now we have our nicely matched dataframe, we don't want to have to do all these steps every time we want to analyse the data. We can save the dataframe to a csv file using pd.to_csv().

This works in a very similar way to pd.read_csv(). There are a few arguments that we won't have used for read_csv that we'll want to use now.

Firstly, index:

index=False

index is the number you see in bold in the leftmost column when you display a dataframe. It's handy when you want to select a row while you're working with the data in pandas, but we don't need to save it to the file. Note: There may be times that you do want to write the index to the file, like if you wrote a script that had the index values it needed to use hard-coded, but generally you don't need them in the output file.

Second, float_format:

float_format='%.4f'

This tells to_csv how to format the numbers in the output file. float_format='%.4f' means 4 decimal places. If you don't set anything for float_format it will print all the decimal places. Sometimes that's what you want it to do, but it's more likely that you want to preserve the precision you have in your dataframe without all the noise that comes from your computer rounding floats.

Next, na_rep:

na_rep = 'NaN'

This tells it to print missing data as NaN in the file. If you don't set it it will just print nothing (which can be OK...) but it's better to print something so when you come back to the file later you know that it's actually missing data, not just that you've accidentally deleted something from the file.

Finally, sep is the equivalent of delimiter in read_csv. The default is sep=',' (i.e. comma separated variables) but you can change it to whatever you want. I find it easiest to leave it as the default so I don't have to remember what I used when I'm reading the file back in.

Let's save cepheids_df to a file called big_cepheids_table.csv. Save it somewhere sensible!

In [34]:
cepheids_df.to_csv('./data/big_cepheids_table.csv', index=False, float_format='%.4f', na_rep='NaN')

Now we have our data safe and sound we can go and have a cup of tea, watch Emmerdale and come back to this later.

Key Points:

  • Use pd.merge() to combine dataframes
  • There are different ways how you can join your dataframes:
    • outer keeps all the rows and fills with NaNs
    • inner only keeps the rows where the reference value appears in both frames
    • left and right keep the rows from the reference frame and fill the missing data with NaNs
  • Save a dataframe to a file with pd.to_csv()