Introduction to pandas

panda.jpeg

Overview:

Questions

  • What is pandas?
  • Why don't I get a real panda?
  • I'm disappointed. I was expecting a panda.

Objectives

  • Read in data from an excel file
  • pandas dataframes

What is pandas?

Unfortunately, this does not mean finding a panda to do our coding for us (although that would be amazing and if anyone knows a panda who could do this please let me know).

Pandas is the Python Data Analysis Library. It can read pretty much anything you throw at it. If you've been putting your lab data into an excel spreadsheet, pandas can read it in so you can analyse and plot it. pandas is used commonly in science and industries such as data science; the skills you're developing here will be useful well beyond the lab units. More information about pandas can be found on the webpage here.

Reading in data

Previously we've used functions like np.loadtxt to read in data from a file. This is fine for straightforward data files; if you know how your data is formatted and know that there's no missing data anywhere then np.loadtxt will work OK.

However, it's often the case that we have data that isn't perfect. We may be using data that we didn't take ourselves, from an online database for example. The file could contain missing values that could trip up loadtxt. We could be looking at a massive file and we might only want a few specific columns out of the hundreds in the file. pandas makes dealing with things like this a lot easier.

A good example of 'imperfect' data is the Exoplanet_catalog_2019.csv file you were given for your semester 1 coursework. We're going to work with that data here. Copy the data file (download here or use the file you used last semester) to the folder you're working in and start a new notebook.

pandas dataframes and reproducibility

One of the main features of pandas is the way it handles data. The main data type for pandas is a dataframe. You can think of a dataframe like a table or an Excel spreadsheet. All the data is contained inside the dataframe and you can reference individual rows, columns and cells within the dataframe.

If dataframes are like spreadsheets, why are we using them instead of Excel? Because we want to do reproducible science. When you write a lab report and describe your experimental setup and methodology you're making your work reproducible; another person should be able to take your lab report and reproduce the experiment you did. Historically, while scientists have been good about giving details of their experiments, there has been less emphasis on the reproducibility of the data analysis. However, the tide is turning and there is now much more importance placed on reproducible research and analysis.

Here is a nice blog post about why reproducible research is important.

So why aren't we doing our analysis in Excel and just making sure that it's reproducible? Because it's hard. Excel makes it very easy to get started, typing in your data and perhaps making a quick plot, but it doesn't keep a record of what you've done. It's very easy to accidently edit your data without realising it, or to make a mistake referencing a cell and completely change your results. This isn't just a problem for science. In 2012 the bank JPMorgan lost \$6.1 billion (yes, 6.1 billion dollars) due to an Excel error.

The first thing we need to do is import pandas. You'll often see the package name abbreviated to pd if you're looking online, so we'll stick with that convention.

In [1]:
import pandas as pd

Next we can read in the file. We know it's a csv file from the filename, so we can use the pd.read_csv function. Similar to how we assign the result of np.loadtxt to some variables, we assign the result of pd.read_csv to the name of our dataframe. I'm going to call our dataframe exo_df.

In [2]:
exo_df = pd.read_csv('./data/Exoplanet_catalog_2019.csv')

We can now take a look at the dataframe:

In [3]:
exo_df
Out[3]:
# name orbital_period semi_major_axis eccentricity discovered detection_type star_name star_distance star_metallicity star_mass star_sp_type
0 11 Com b 326.03000 1.2900 0.23100 2008.0 Radial Velocity 11 Com 110.60 -0.35 2.70 G8 III
1 11 UMi b 516.22000 1.5400 0.08000 2009.0 Radial Velocity 11 UMi 119.50 0.04 1.80 K4III
2 14 And b 185.84000 0.8300 0.00000 2008.0 Radial Velocity 14 And 76.40 -0.24 2.20 K0III
3 14 Her b 1773.40000 2.7700 0.36900 2002.0 Radial Velocity 14 Her 18.10 0.43 0.90 K0 V
4 16 Cyg B b 799.50000 1.6800 0.68900 1996.0 Radial Velocity 16 Cyg B 21.41 0.08 1.01 G2.5 V
... ... ... ... ... ... ... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 0.03100 2004.0 Radial Velocity tau Gem 98.40 0.14 2.30 K2III
3828 ups And b 4.61711 0.0590 0.01186 1996.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3829 ups And c 240.93700 0.8610 0.24450 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3830 ups And d 1281.43900 2.5500 0.31600 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3831 ups And e 3848.86000 5.2456 0.00536 2010.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V

3832 rows × 11 columns

Oooh, fancy!

One useful thing to check when you read in a file is what the names of the columns are. We can do this using

In [4]:
exo_df.columns
Out[4]:
Index(['# name', 'orbital_period', 'semi_major_axis', 'eccentricity',
       'discovered', 'detection_type', 'star_name', 'star_distance',
       'star_metallicity', 'star_mass', 'star_sp_type'],
      dtype='object')

We can see above that the first column name looks a bit strange. It has a # at the start. This was most likely put there to denote this first line as a header line. Keeping that there is going to make things difficult later on, so we can change the name of this column using rename. We pass it the old and new names of the column as a dictionary (i.e. something of the format key:value). inplace=True tells it to edit the existing dataframe. If we didn't want to edit it we could create a new dataframe using

new_df = exo_df.rename(columns={'# name':'name'})

which would leave the exo_df dataframe untouched.

Once we've used rename we can check the column names again to make sure it's worked.

In [5]:
exo_df.rename(columns={'# name':'name'}, inplace=True)
In [6]:
exo_df.columns
Out[6]:
Index(['name', 'orbital_period', 'semi_major_axis', 'eccentricity',
       'discovered', 'detection_type', 'star_name', 'star_distance',
       'star_metallicity', 'star_mass', 'star_sp_type'],
      dtype='object')

Your data is safe with dataframes

It might seem counterintuitive to edit the column names in a dataframe when we said at the start that editing data is bad practice. However, all we're doing here only affects the dataframe. The original csv file hasn't been touched. If we realise we've made a mistake we can always start again from our original data.

Dataframes work similar to how arrays do. We can access rows or columns by indexing and slicing, just as with arrays.

However, rather than remember which column corresponds to which variable, we can pass the column name. So, if we wanted to select the orbital_period column, we just use

In [7]:
exo_df['orbital_period']
Out[7]:
0        326.03000
1        516.22000
2        185.84000
3       1773.40000
4        799.50000
           ...    
3827     305.50000
3828       4.61711
3829     240.93700
3830    1281.43900
3831    3848.86000
Name: orbital_period, Length: 3832, dtype: float64

We can pass several column names if we want to select multiple things. The column names can be in any order, as long as we pass an array of names inside the brackets. Remember that an array is a list enclosed in square brackets, so we'll need an extra set:

In [8]:
exo_df[['orbital_period', 'name', 'detection_type']]
Out[8]:
orbital_period name detection_type
0 326.03000 11 Com b Radial Velocity
1 516.22000 11 UMi b Radial Velocity
2 185.84000 14 And b Radial Velocity
3 1773.40000 14 Her b Radial Velocity
4 799.50000 16 Cyg B b Radial Velocity
... ... ... ...
3827 305.50000 tau Gem b Radial Velocity
3828 4.61711 ups And b Radial Velocity
3829 240.93700 ups And c Radial Velocity
3830 1281.43900 ups And d Radial Velocity
3831 3848.86000 ups And e Radial Velocity

3832 rows × 3 columns

Now try

exo_df['orbital_period', 'name', 'detection_type']

with just one set of brackets to select these columns. What does the error message say? Do you understand how the error message corresponds to what you've done? This may seem a slightly pointless exercise given that you know why the command doesn't work, but learning to interpret error messages is a crucial skill in coding. As you may have found already, error messages aren't always crystal clear and you'll sometimes have to dig a little to see what the real problem is.

Conditional selection

What if we were only interested in the objects in the table that we detected using the Radial Velocity technique? We should be able to select those rows using the same logical conditions we've used before.

In [9]:
exo_df['detection_type']=='Radial Velocity'
Out[9]:
0       True
1       True
2       True
3       True
4       True
        ... 
3827    True
3828    True
3829    True
3830    True
3831    True
Name: detection_type, Length: 3832, dtype: bool

In [10]:
exo_df[exo_df['detection_type']=='Radial Velocity']
Out[10]:
name orbital_period semi_major_axis eccentricity discovered detection_type star_name star_distance star_metallicity star_mass star_sp_type
0 11 Com b 326.03000 1.2900 0.23100 2008.0 Radial Velocity 11 Com 110.60 -0.35 2.70 G8 III
1 11 UMi b 516.22000 1.5400 0.08000 2009.0 Radial Velocity 11 UMi 119.50 0.04 1.80 K4III
2 14 And b 185.84000 0.8300 0.00000 2008.0 Radial Velocity 14 And 76.40 -0.24 2.20 K0III
3 14 Her b 1773.40000 2.7700 0.36900 2002.0 Radial Velocity 14 Her 18.10 0.43 0.90 K0 V
4 16 Cyg B b 799.50000 1.6800 0.68900 1996.0 Radial Velocity 16 Cyg B 21.41 0.08 1.01 G2.5 V
... ... ... ... ... ... ... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 0.03100 2004.0 Radial Velocity tau Gem 98.40 0.14 2.30 K2III
3828 ups And b 4.61711 0.0590 0.01186 1996.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3829 ups And c 240.93700 0.8610 0.24450 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3830 ups And d 1281.43900 2.5500 0.31600 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3831 ups And e 3848.86000 5.2456 0.00536 2010.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V

863 rows × 11 columns

Now we have all the data from all of the rows where detection_type is Radial Velocity.

Selecting data

Select all the objects from the dataframe that have an orbital period longer than 250 days.

Select all the objects that have an eccentricity below 0.15.

Finally, select all the objects that have an orbital period longer than 250 days and have eccentricity below 0.15.

solution

Key Points:

  • Use the pandas package to work with data files such as spreadsheets without editing the original data.
  • Dataframes are similar to arrays. You can access the data within them using the same methods as you use for arrays.
In [ ]: