Cleaning data with pandas

Questions

  • How can we deal with imperfect data?
  • What is tidy data?

Objectives

  • Clean a dataset using pandas

In the previous section we learned how to read data in with pandas and how to access the data within a dataframe. Here we will be looking at how to "clean" our data while still staying within the bounds of reproducible science.

Read in the data

If you haven't already, import pandas and read the exoplanet data into a dataframe. I'm calling the dataframe exo_df, the same as in the last section. If you're reading in the data for the first time now, rename the first column to name rather than # name; you can check the previous section if you need a refresher on how to do this. Check that the data has been read in correctly by looking at your dataframe.

In [1]:
import pandas as pd
In [2]:
exo_df = pd.read_csv("./data/Exoplanet_catalog_2019.csv")
In [3]:
exo_df.rename(columns={'# name':'name'}, inplace=True)
In [4]:
exo_df
Out[4]:
name orbital_period semi_major_axis eccentricity discovered detection_type star_name star_distance star_metallicity star_mass star_sp_type
0 11 Com b 326.03000 1.2900 0.23100 2008.0 Radial Velocity 11 Com 110.60 -0.35 2.70 G8 III
1 11 UMi b 516.22000 1.5400 0.08000 2009.0 Radial Velocity 11 UMi 119.50 0.04 1.80 K4III
2 14 And b 185.84000 0.8300 0.00000 2008.0 Radial Velocity 14 And 76.40 -0.24 2.20 K0III
3 14 Her b 1773.40000 2.7700 0.36900 2002.0 Radial Velocity 14 Her 18.10 0.43 0.90 K0 V
4 16 Cyg B b 799.50000 1.6800 0.68900 1996.0 Radial Velocity 16 Cyg B 21.41 0.08 1.01 G2.5 V
... ... ... ... ... ... ... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 0.03100 2004.0 Radial Velocity tau Gem 98.40 0.14 2.30 K2III
3828 ups And b 4.61711 0.0590 0.01186 1996.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3829 ups And c 240.93700 0.8610 0.24450 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3830 ups And d 1281.43900 2.5500 0.31600 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3831 ups And e 3848.86000 5.2456 0.00536 2010.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V

3832 rows × 11 columns

As you may remember from your Semester 1 coursework, there are some gaps in the original CSV file: not every object has data available for every variable. This is a hassle when we're using something like np.loadtxt to read in the data, but pandas can handle missing values using the NaN value.

NaN stands for 'Not a Number' and is the value that pandas fills your empty cells with.
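One quirk worth knowing: NaN never compares equal to anything, not even itself, which is why we'll use the isna() method below rather than a straight == comparison. A quick sketch:

nan = float("nan")
nan == nan       # False! NaN is never equal to anything, even itself
pd.isna(nan)     # True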

NaN, null, NA...

NaN is probably the most common way to show missing data, but you might also come across NA or null. If you're using some data where the missing values are represented by something else (for example, 99.999 or ----), you can use the na_values argument of pd.read_csv to tell it which values correspond to missing data.
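For example, something along these lines (the filename here is just a placeholder) would tell pandas to treat those values as missing when reading the file:

other_df = pd.read_csv("./data/some_other_catalog.csv", na_values=[99.999, "----"])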

We can select all the rows in a dataframe that contain a NaN value using the isna() function. This is another logical operation; we're selecting the rows that have a NaN value. Because we want to select all the rows with a NaN and don't care which column it's in, we use .any(axis=1), where axis=1 says to do the operation on each row.
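If it helps, here is the same selection broken into stages (a sketch, assuming exo_df has been loaded as above):

nan_mask = exo_df.isna()               # boolean dataframe: True wherever a value is missing
rows_with_nan = nan_mask.any(axis=1)   # boolean series: True for rows containing at least one NaN
exo_df[rows_with_nan]                  # same result as the one-line version below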

In [5]:
exo_df[exo_df.isna().any(axis=1)]
Out[5]:
name orbital_period semi_major_axis eccentricity discovered detection_type star_name star_distance star_metallicity star_mass star_sp_type
6 1SWASP J1407 b 3725.000000 3.90000 NaN 2012.0 Primary Transit 1SWASP J1407 133.000 NaN 0.900 NaN
11 38 Vir b 825.900000 1.82000 0.030 2016.0 Primary Transit 38 Vir NaN 0.07 1.180 F6 V
30 75 Cet b 691.900000 2.10000 NaN 2012.0 Radial Velocity 75 Cet 81.500 0.00 2.490 G3III
34 AD 3116 b 1.982796 NaN 0.146 2017.0 Primary Transit AD 3116 186.540 NaN 0.276 M3.9
35 AD Leo b 2.225990 0.02500 0.030 2019.0 Radial Velocity AD Leo 4.966 NaN NaN M4V
... ... ... ... ... ... ... ... ... ... ... ...
3794 beta Pic c 1238.000000 2.72000 0.248 2019.0 Radial Velocity beta Pic 19.300 NaN 1.730 A6V
3798 eps Ind A b 16509.000000 11.55000 0.260 2018.0 Radial Velocity eps Ind A 3.620 NaN 0.762 K2V
3807 kepler-80 g 14.645580 0.14000 NaN 2017.0 Primary Transit Kepler-80 NaN -0.56 NaN NaN
3818 omi UMa b 1630.000000 3.90000 NaN 2012.0 Radial Velocity omi UMa 56.300 -0.09 3.090 G4II-III
3820 pi Men c 6.268340 0.06702 NaN 2018.0 Primary Transit pi Men 18.370 0.08 1.094 G0V

2866 rows × 11 columns

At the bottom of the output we can see the size of the dataframe that's returned - 2866 rows by 11 columns. Compare that to the original size of our dataframe... that's a lot of NaNs!

It's handy to check which rows have missing data, but what if we're only concerned with those that aren't missing data? Then we can do the inverse to return only the rows without NaN values.

We can get this with almost exactly the same command, just one character different. We change exo_df.isna() to ~exo_df.isna(). The ~ is another way of saying 'not'.

In [6]:
exo_df[~exo_df.isna().any(axis=1)]
Out[6]:
name orbital_period semi_major_axis eccentricity discovered detection_type star_name star_distance star_metallicity star_mass star_sp_type
0 11 Com b 326.03000 1.2900 0.23100 2008.0 Radial Velocity 11 Com 110.60 -0.35 2.70 G8 III
1 11 UMi b 516.22000 1.5400 0.08000 2009.0 Radial Velocity 11 UMi 119.50 0.04 1.80 K4III
2 14 And b 185.84000 0.8300 0.00000 2008.0 Radial Velocity 14 And 76.40 -0.24 2.20 K0III
3 14 Her b 1773.40000 2.7700 0.36900 2002.0 Radial Velocity 14 Her 18.10 0.43 0.90 K0 V
4 16 Cyg B b 799.50000 1.6800 0.68900 1996.0 Radial Velocity 16 Cyg B 21.41 0.08 1.01 G2.5 V
... ... ... ... ... ... ... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 0.03100 2004.0 Radial Velocity tau Gem 98.40 0.14 2.30 K2III
3828 ups And b 4.61711 0.0590 0.01186 1996.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3829 ups And c 240.93700 0.8610 0.24450 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3830 ups And d 1281.43900 2.5500 0.31600 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3831 ups And e 3848.86000 5.2456 0.00536 2010.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V

966 rows × 11 columns

That's a lot fewer rows. Oh dear. But what if we don't care about some of the columns with missing data?

In your coursework you were asked to look at only the planet name, period, semi-major axis, detection technique and star mass columns. So let's create a new dataframe that contains only these columns. I'll call this one planet_df.

First we need to check the names of the columns we want, then we can make our new dataframe using these names:

In [7]:
exo_df.columns
Out[7]:
Index(['name', 'orbital_period', 'semi_major_axis', 'eccentricity',
       'discovered', 'detection_type', 'star_name', 'star_distance',
       'star_metallicity', 'star_mass', 'star_sp_type'],
      dtype='object')
In [8]:
planet_df = exo_df[['name', 'orbital_period', 'semi_major_axis', 'detection_type', 'star_mass']]
In [9]:
planet_df
Out[9]:
name orbital_period semi_major_axis detection_type star_mass
0 11 Com b 326.03000 1.2900 Radial Velocity 2.70
1 11 UMi b 516.22000 1.5400 Radial Velocity 1.80
2 14 And b 185.84000 0.8300 Radial Velocity 2.20
3 14 Her b 1773.40000 2.7700 Radial Velocity 0.90
4 16 Cyg B b 799.50000 1.6800 Radial Velocity 1.01
... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 Radial Velocity 2.30
3828 ups And b 4.61711 0.0590 Radial Velocity 1.27
3829 ups And c 240.93700 0.8610 Radial Velocity 1.27
3830 ups And d 1281.43900 2.5500 Radial Velocity 1.27
3831 ups And e 3848.86000 5.2456 Radial Velocity 1.27

3832 rows × 5 columns

Now we can check how many of these rows contain NaN values. Rather than printing out the dataframe and reading the number underneath, we can use the shape attribute.

In [10]:
planet_df[planet_df.isna().any(axis=1)].shape
Out[10]:
(1783, 5)

This dataframe has fewer rows with missing data than the first, but it still has a lot. We only care about the rows without any missing data. We could use the same operation as before to return only the complete rows, but an easier way is to use dropna(), which removes any rows containing missing data. If we'd done this on the original dataframe we'd have lost a lot of rows that are perfectly fine for our analysis but have missing data in a column we aren't using; by working on our new dataframe we don't remove data unnecessarily. Setting inplace=True tells pandas to edit the dataframe itself rather than just show us the result.

In [11]:
planet_df.dropna(inplace=True)
/Users/vs522/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

SettingWithCopyWarning

You may have got a warning when you ran the dropna(inplace=True) command. First - don't panic! This is a warning, not an error. Here the warning is telling us that we're editing a copy of part of another dataframe, not the original dataframe. That's exactly what we intended, so it's not a problem. It becomes more important when you're doing more complex data analysis (more complex than we'll reach in this course), but for now we just need to understand what it's telling us and move on.
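If you'd rather avoid the warning entirely, one option (a sketch, not the only way) is to skip inplace=True: make planet_df an explicit copy when you create it, and assign the result of dropna back to the variable.

planet_df = exo_df[['name', 'orbital_period', 'semi_major_axis', 'detection_type', 'star_mass']].copy()
planet_df = planet_df.dropna()   # returns a new dataframe, so no warning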

Let's take another look at our planet_df dataframe:

In [12]:
planet_df
Out[12]:
name orbital_period semi_major_axis detection_type star_mass
0 11 Com b 326.03000 1.2900 Radial Velocity 2.70
1 11 UMi b 516.22000 1.5400 Radial Velocity 1.80
2 14 And b 185.84000 0.8300 Radial Velocity 2.20
3 14 Her b 1773.40000 2.7700 Radial Velocity 0.90
4 16 Cyg B b 799.50000 1.6800 Radial Velocity 1.01
... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 Radial Velocity 2.30
3828 ups And b 4.61711 0.0590 Radial Velocity 1.27
3829 ups And c 240.93700 0.8610 Radial Velocity 1.27
3830 ups And d 1281.43900 2.5500 Radial Velocity 1.27
3831 ups And e 3848.86000 5.2456 Radial Velocity 1.27

2049 rows × 5 columns

It definitely has fewer rows than before, but we should still check that they're all complete...

Check your data

Use the isna function to check that you have properly cleaned the planet_df dataframe. Check the shape of the dataframe returned when you use isna and its inverse.

Solution
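One way to do this (any equivalent check is fine):

planet_df[planet_df.isna().any(axis=1)].shape    # should be (0, 5): no rows with missing data remain
planet_df[~planet_df.isna().any(axis=1)].shape   # should be (2049, 5): every remaining row is complete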

Key Points:

  • Missing data is represented as NaN or 'Not a Number'
  • You can use the na_values argument of pd.read_csv to tell pandas which values correspond to missing data.
  • isna() can be used to find missing data
  • You can use ~ to return the inverse of a logical expression.