Introduction to `pandas`¶

Overview:

Questions

What is pandas?
Why don't I get a real panda?
I'm disappointed. I was expecting a panda.

Objectives

Read in data from an excel file
pandas dataframes

What is `pandas`?¶

Unfortunately, this does not mean finding a panda to do our coding for us (although that would be amazing and if anyone knows a panda who could do this please let me know).

Pandas is the Python Data Analysis Library. It can read pretty much anything you throw at it. If you've been putting your lab data into an excel spreadsheet, pandas can read it in so you can analyse and plot it. pandas is used commonly in science and industries such as data science; the skills you're developing here will be useful well beyond the lab units. More information about pandas can be found on the webpage here.

Reading in data¶

Previously we've used functions like np.loadtxt to read in data from a file. This is fine for straightforward data files; if you know how your data is formatted and know that there's no missing data anywhere then np.loadtxt will work OK.

However, it's often the case that we have data that isn't perfect. We may be using data that we didn't take ourselves, from an online database for example. The file could contain missing values that could trip up loadtxt. We could be looking at a massive file and we might only want a few specific columns out of the hundreds in the file. pandas makes dealing with things like this a lot easier.

A good example of 'imperfect' data is the Exoplanet_catalog_2019.csv file you were given for your semester 1 coursework. We're going to work with that data here. Copy the data file (download here or use the file you used last semester) to the folder you're working in and start a new notebook.

`pandas` dataframes and reproducibility¶

One of the main features of pandas is the way it handles data. The main data type for pandas is a dataframe. You can think of a dataframe like a table or an Excel spreadsheet. All the data is contained inside the dataframe and you can reference individual rows, columns and cells within the dataframe.

If dataframes are like spreadsheets, why are we using them instead of Excel? Because we want to do reproducible science. When you write a lab report and describe your experimental setup and methodology you're making your work reproducible; another person should be able to take your lab report and reproduce the experiment you did. Historically, while scientists have been good about giving details of their experiments, there has been less emphasis on the reproducibility of the data analysis. However, the tide is turning and there is now much more importance placed on reproducible research and analysis.

Here is a nice blog post about why reproducible research is important.

So why aren't we doing our analysis in Excel and just making sure that it's reproducible? Because it's hard. Excel makes it very easy to get started, typing in your data and perhaps making a quick plot, but it doesn't keep a record of what you've done. It's very easy to accidently edit your data without realising it, or to make a mistake referencing a cell and completely change your results. This isn't just a problem for science. In 2012 the bank JPMorgan lost \$6.1 billion (yes, 6.1 billion dollars) due to an Excel error.

The first thing we need to do is import pandas. You'll often see the package name abbreviated to pd if you're looking online, so we'll stick with that convention.

import pandas as pd

Next we can read in the file. We know it's a csv file from the filename, so we can use the pd.read_csv function. Similar to how we assign the result of np.loadtxt to some variables, we assign the result of pd.read_csv to the name of our dataframe. I'm going to call our dataframe exo_df.

exo_df = pd.read_csv('./data/Exoplanet_catalog_2019.csv')

We can now take a look at the dataframe:

exo_df

Oooh, fancy!

One useful thing to check when you read in a file is what the names of the columns are. We can do this using

exo_df.columns

Index(['# name', 'orbital_period', 'semi_major_axis', 'eccentricity',
       'discovered', 'detection_type', 'star_name', 'star_distance',
       'star_metallicity', 'star_mass', 'star_sp_type'],
      dtype='object')

We can see above that the first column name looks a bit strange. It has a # at the start. This was most likely put there to denote this first line as a header line. Keeping that there is going to make things difficult later on, so we can change the name of this column using rename. We pass it the old and new names of the column as a dictionary (i.e. something of the format key:value). inplace=True tells it to edit the existing dataframe. If we didn't want to edit it we could create a new dataframe using

new_df = exo_df.rename(columns={'# name':'name'})

which would leave the exo_df dataframe untouched.

Once we've used rename we can check the column names again to make sure it's worked.

exo_df.rename(columns={'# name':'name'}, inplace=True)

exo_df.columns

Index(['name', 'orbital_period', 'semi_major_axis', 'eccentricity',
       'discovered', 'detection_type', 'star_name', 'star_distance',
       'star_metallicity', 'star_mass', 'star_sp_type'],
      dtype='object')

Your data is safe with dataframes

It might seem counterintuitive to edit the column names in a dataframe when we said at the start that editing data is bad practice. However, all we're doing here only affects the dataframe. The original csv file hasn't been touched. If we realise we've made a mistake we can always start again from our original data.

Dataframes work similar to how arrays do. We can access rows or columns by indexing and slicing, just as with arrays.

However, rather than remember which column corresponds to which variable, we can pass the column name. So, if we wanted to select the orbital_period column, we just use

exo_df['orbital_period']

0        326.03000
1        516.22000
2        185.84000
3       1773.40000
4        799.50000
           ...    
3827     305.50000
3828       4.61711
3829     240.93700
3830    1281.43900
3831    3848.86000
Name: orbital_period, Length: 3832, dtype: float64

We can pass several column names if we want to select multiple things. The column names can be in any order, as long as we pass an array of names inside the brackets. Remember that an array is a list enclosed in square brackets, so we'll need an extra set:

exo_df[['orbital_period', 'name', 'detection_type']]

Now try

exo_df['orbital_period', 'name', 'detection_type']

with just one set of brackets to select these columns. What does the error message say? Do you understand how the error message corresponds to what you've done? This may seem a slightly pointless exercise given that you know why the command doesn't work, but learning to interpret error messages is a crucial skill in coding. As you may have found already, error messages aren't always crystal clear and you'll sometimes have to dig a little to see what the real problem is.

Conditional selection¶

What if we were only interested in the objects in the table that we detected using the Radial Velocity technique? We should be able to select those rows using the same logical conditions we've used before.

exo_df['detection_type']=='Radial Velocity'

0       True
1       True
2       True
3       True
4       True
        ... 
3827    True
3828    True
3829    True
3830    True
3831    True
Name: detection_type, Length: 3832, dtype: bool

exo_df[exo_df['detection_type']=='Radial Velocity']

Now we have all the data from all of the rows where detection_type is Radial Velocity.

Selecting data

Select all the objects from the dataframe that have an orbital period longer than 250 days.

Select all the objects that have an eccentricity below 0.15.

Finally, select all the objects that have an orbital period longer than 250 days and have eccentricity below 0.15.

solution

Key Points:

Use the pandas package to work with data files such as spreadsheets without editing the original data.
Dataframes are similar to arrays. You can access the data within them using the same methods as you use for arrays.

	# name	orbital_period	semi_major_axis	eccentricity	discovered	detection_type	star_name	star_distance	star_metallicity	star_mass	star_sp_type
0	11 Com b	326.03000	1.2900	0.23100	2008.0	Radial Velocity	11 Com	110.60	-0.35	2.70	G8 III
1	11 UMi b	516.22000	1.5400	0.08000	2009.0	Radial Velocity	11 UMi	119.50	0.04	1.80	K4III
2	14 And b	185.84000	0.8300	0.00000	2008.0	Radial Velocity	14 And	76.40	-0.24	2.20	K0III
3	14 Her b	1773.40000	2.7700	0.36900	2002.0	Radial Velocity	14 Her	18.10	0.43	0.90	K0 V
4	16 Cyg B b	799.50000	1.6800	0.68900	1996.0	Radial Velocity	16 Cyg B	21.41	0.08	1.01	G2.5 V
...	...	...	...	...	...	...	...	...	...	...	...
3827	tau Gem b	305.50000	1.1700	0.03100	2004.0	Radial Velocity	tau Gem	98.40	0.14	2.30	K2III
3828	ups And b	4.61711	0.0590	0.01186	1996.0	Radial Velocity	ups And	13.47	0.09	1.27	F8 V
3829	ups And c	240.93700	0.8610	0.24450	1999.0	Radial Velocity	ups And	13.47	0.09	1.27	F8 V
3830	ups And d	1281.43900	2.5500	0.31600	1999.0	Radial Velocity	ups And	13.47	0.09	1.27	F8 V
3831	ups And e	3848.86000	5.2456	0.00536	2010.0	Radial Velocity	ups And	13.47	0.09	1.27	F8 V

	name	orbital_period	semi_major_axis	eccentricity	discovered	detection_type	star_name	star_distance	star_metallicity	star_mass	star_sp_type
0	11 Com b	326.03000	1.2900	0.23100	2008.0	Radial Velocity	11 Com	110.60	-0.35	2.70	G8 III
1	11 UMi b	516.22000	1.5400	0.08000	2009.0	Radial Velocity	11 UMi	119.50	0.04	1.80	K4III
2	14 And b	185.84000	0.8300	0.00000	2008.0	Radial Velocity	14 And	76.40	-0.24	2.20	K0III
3	14 Her b	1773.40000	2.7700	0.36900	2002.0	Radial Velocity	14 Her	18.10	0.43	0.90	K0 V
4	16 Cyg B b	799.50000	1.6800	0.68900	1996.0	Radial Velocity	16 Cyg B	21.41	0.08	1.01	G2.5 V
...	...	...	...	...	...	...	...	...	...	...	...
3827	tau Gem b	305.50000	1.1700	0.03100	2004.0	Radial Velocity	tau Gem	98.40	0.14	2.30	K2III
3828	ups And b	4.61711	0.0590	0.01186	1996.0	Radial Velocity	ups And	13.47	0.09	1.27	F8 V
3829	ups And c	240.93700	0.8610	0.24450	1999.0	Radial Velocity	ups And	13.47	0.09	1.27	F8 V
3830	ups And d	1281.43900	2.5500	0.31600	1999.0	Radial Velocity	ups And	13.47	0.09	1.27	F8 V
3831	ups And e	3848.86000	5.2456	0.00536	2010.0	Radial Velocity	ups And	13.47	0.09	1.27	F8 V

Introduction to pandas¶

Overview:

What is pandas?¶

Reading in data¶

pandas dataframes and reproducibility¶