Cleaning data with pandas

Questions

  • How can we deal with imperfect data?
  • What is tidy data?

Objectives

  • Clean a dataset using pandas

In the previous section we learned how to read data in with pandas and how to access the data within a dataframe. Here we will be looking at how to "clean" our data while still staying within the bounds of reproducible science.

Read in the data

If you haven't already, import pandas and read the exoplanet data into a dataframe. I'm calling the dataframe exo_df, the same as in the last section. If you're reading in the data for the first time now, rename the first column to name rather than # name; you can check the previous section if you need a refresher on how to do this. Check that the data has been read in correctly by looking at your dataframe.

In [1]:
import pandas as pd
In [2]:
exo_df = pd.read_csv("./data/Exoplanet_catalog_2019.csv")
In [3]:
exo_df.rename(columns={'# name':'name'}, inplace=True)
In [4]:
exo_df
Out[4]:
name orbital_period semi_major_axis eccentricity discovered detection_type star_name star_distance star_metallicity star_mass star_sp_type
0 11 Com b 326.03000 1.2900 0.23100 2008.0 Radial Velocity 11 Com 110.60 -0.35 2.70 G8 III
1 11 UMi b 516.22000 1.5400 0.08000 2009.0 Radial Velocity 11 UMi 119.50 0.04 1.80 K4III
2 14 And b 185.84000 0.8300 0.00000 2008.0 Radial Velocity 14 And 76.40 -0.24 2.20 K0III
3 14 Her b 1773.40000 2.7700 0.36900 2002.0 Radial Velocity 14 Her 18.10 0.43 0.90 K0 V
4 16 Cyg B b 799.50000 1.6800 0.68900 1996.0 Radial Velocity 16 Cyg B 21.41 0.08 1.01 G2.5 V
... ... ... ... ... ... ... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 0.03100 2004.0 Radial Velocity tau Gem 98.40 0.14 2.30 K2III
3828 ups And b 4.61711 0.0590 0.01186 1996.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3829 ups And c 240.93700 0.8610 0.24450 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3830 ups And d 1281.43900 2.5500 0.31600 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3831 ups And e 3848.86000 5.2456 0.00536 2010.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V

3832 rows × 11 columns

As you may remember from your Semester 1 coursework, there are some gaps in the original CSV file: not every object has data available for every variable. This is a hassle when we're using something like np.loadtxt to read in the data, but pandas can handle missing values using the NaN value.

NaN stands for 'Not a Number' and is the value that pandas fills your empty cells with.
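One quirk worth knowing: NaN never compares equal to anything, not even itself, which is why we'll use the isna() method below rather than a straight == comparison. A quick sketch:

nan = float("nan")
nan == nan       # False! NaN is never equal to anything, even itself
pd.isna(nan)     # True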

NaN, null, NA...

NaN is probably the most common way to show missing data, but you might also come across NA or null. If you're using some data where the missing values are represented by something else (for example, 99.999 or ----), you can use the na_values argument of pd.read_csv to tell it which values correspond to missing data.
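For example, something along these lines (the filename here is just a placeholder) would tell pandas to treat those values as missing when reading the file:

other_df = pd.read_csv("./data/some_other_catalog.csv", na_values=[99.999, "----"])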

We can select all the rows in a dataframe that contain a NaN value using the isna() function. This is another logical operation; we're selecting the rows that have a NaN value. Because we want to select all the rows with a NaN and don't care which column it's in, we use .any(axis=1), where axis=1 says to do the operation on each row.
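If it helps, here is the same selection broken into stages (a sketch, assuming exo_df has been loaded as above):

nan_mask = exo_df.isna()               # boolean dataframe: True wherever a value is missing
rows_with_nan = nan_mask.any(axis=1)   # boolean series: True for rows containing at least one NaN
exo_df[rows_with_nan]                  # same result as the one-line version below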

In [5]:
exo_df[exo_df.isna().any(axis=1)]
Out[5]:
name orbital_period semi_major_axis eccentricity discovered detection_type star_name star_distance star_metallicity star_mass star_sp_type
6 1SWASP J1407 b 3725.000000 3.90000 NaN 2012.0 Primary Transit 1SWASP J1407 133.000 NaN 0.900 NaN
11 38 Vir b 825.900000 1.82000 0.030 2016.0 Primary Transit 38 Vir NaN 0.07 1.180 F6 V
30 75 Cet b 691.900000 2.10000 NaN 2012.0 Radial Velocity 75 Cet 81.500 0.00 2.490 G3III
34 AD 3116 b 1.982796 NaN 0.146 2017.0 Primary Transit AD 3116 186.540 NaN 0.276 M3.9
35 AD Leo b 2.225990 0.02500 0.030 2019.0 Radial Velocity AD Leo 4.966 NaN NaN M4V
... ... ... ... ... ... ... ... ... ... ... ...
3794 beta Pic c 1238.000000 2.72000 0.248 2019.0 Radial Velocity beta Pic 19.300 NaN 1.730 A6V
3798 eps Ind A b 16509.000000 11.55000 0.260 2018.0 Radial Velocity eps Ind A 3.620 NaN 0.762 K2V
3807 kepler-80 g 14.645580 0.14000 NaN 2017.0 Primary Transit Kepler-80 NaN -0.56 NaN NaN
3818 omi UMa b 1630.000000 3.90000 NaN 2012.0 Radial Velocity omi UMa 56.300 -0.09 3.090 G4II-III
3820 pi Men c 6.268340 0.06702 NaN 2018.0 Primary Transit pi Men 18.370 0.08 1.094 G0V

2866 rows × 11 columns

At the bottom of the output we can see the size of the dataframe that's returned - 2866 rows by 11 columns. Compare that to the original size of our dataframe... that's a lot of NaNs!

It's handy to check which rows have missing data, but what if we're only concerned with those that aren't missing data? Then we can do the inverse to return only the rows without NaN values.

We can get this with almost exactly the same command, just one character different. We change exo_df.isna() to ~exo_df.isna(). The ~ is another way of saying 'not'.

In [6]:
exo_df[~exo_df.isna().any(axis=1)]
Out[6]:
name orbital_period semi_major_axis eccentricity discovered detection_type star_name star_distance star_metallicity star_mass star_sp_type
0 11 Com b 326.03000 1.2900 0.23100 2008.0 Radial Velocity 11 Com 110.60 -0.35 2.70 G8 III
1 11 UMi b 516.22000 1.5400 0.08000 2009.0 Radial Velocity 11 UMi 119.50 0.04 1.80 K4III
2 14 And b 185.84000 0.8300 0.00000 2008.0 Radial Velocity 14 And 76.40 -0.24 2.20 K0III
3 14 Her b 1773.40000 2.7700 0.36900 2002.0 Radial Velocity 14 Her 18.10 0.43 0.90 K0 V
4 16 Cyg B b 799.50000 1.6800 0.68900 1996.0 Radial Velocity 16 Cyg B 21.41 0.08 1.01 G2.5 V
... ... ... ... ... ... ... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 0.03100 2004.0 Radial Velocity tau Gem 98.40 0.14 2.30 K2III
3828 ups And b 4.61711 0.0590 0.01186 1996.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3829 ups And c 240.93700 0.8610 0.24450 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3830 ups And d 1281.43900 2.5500 0.31600 1999.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V
3831 ups And e 3848.86000 5.2456 0.00536 2010.0 Radial Velocity ups And 13.47 0.09 1.27 F8 V

966 rows × 11 columns

That's a lot fewer rows. Oh dear. But what if we don't care about some of the columns with missing data?

In your coursework you were asked to look at only the planet name, period, semi-major axis, detection technique and star mass columns. So let's create a new dataframe that contains only these columns. I'll call this one planet_df.

First we need to check the names of the columns we want, then we can make our new dataframe using these names:

In [7]:
exo_df.columns
Out[7]:
Index(['name', 'orbital_period', 'semi_major_axis', 'eccentricity',
       'discovered', 'detection_type', 'star_name', 'star_distance',
       'star_metallicity', 'star_mass', 'star_sp_type'],
      dtype='object')
In [8]:
planet_df = exo_df[['name', 'orbital_period', 'semi_major_axis', 'detection_type', 'star_mass']]
In [9]:
planet_df
Out[9]:
name orbital_period semi_major_axis detection_type star_mass
0 11 Com b 326.03000 1.2900 Radial Velocity 2.70
1 11 UMi b 516.22000 1.5400 Radial Velocity 1.80
2 14 And b 185.84000 0.8300 Radial Velocity 2.20
3 14 Her b 1773.40000 2.7700 Radial Velocity 0.90
4 16 Cyg B b 799.50000 1.6800 Radial Velocity 1.01
... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 Radial Velocity 2.30
3828 ups And b 4.61711 0.0590 Radial Velocity 1.27
3829 ups And c 240.93700 0.8610 Radial Velocity 1.27
3830 ups And d 1281.43900 2.5500 Radial Velocity 1.27
3831 ups And e 3848.86000 5.2456 Radial Velocity 1.27

3832 rows × 5 columns

Now we can check how many of these rows contain NaN values. Rather than printing out the dataframe and reading the number underneath, we can use the shape attribute.

In [10]:
planet_df[planet_df.isna().any(axis=1)].shape
Out[10]:
(1783, 5)

This dataframe has fewer rows with missing data than the first, but it still has a lot. We only care about the rows without any missing data. We could use the same operation as before to return only the complete rows, but an easier way is to use dropna(), which removes any rows containing missing data. If we'd done this on the original dataframe we'd have lost a lot of rows that are perfectly fine for our analysis but have missing data in a column we aren't using; by working on our new dataframe we don't remove data unnecessarily. Setting inplace=True tells pandas to edit the dataframe itself rather than just show us the result.

In [11]:
planet_df.dropna(inplace=True)
/Users/vs522/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

SettingWithCopyWarning

You may have got a warning when you ran the dropna(inplace=True) command. First - don't panic! This is a warning, not an error. Here the warning is telling us that we're editing a copy of part of another dataframe, not the original dataframe. That's exactly what we intended, so it's not a problem. It becomes more important when you're doing more complex data analysis (more complex than we'll reach in this course), but for now we just need to understand what it's telling us and move on.
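If you'd rather avoid the warning entirely, one option (a sketch, not the only way) is to skip inplace=True: make planet_df an explicit copy when you create it, and assign the result of dropna back to the variable.

planet_df = exo_df[['name', 'orbital_period', 'semi_major_axis', 'detection_type', 'star_mass']].copy()
planet_df = planet_df.dropna()   # returns a new dataframe, so no warning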

Let's take another look at our planet_df dataframe:

In [12]:
planet_df
Out[12]:
name orbital_period semi_major_axis detection_type star_mass
0 11 Com b 326.03000 1.2900 Radial Velocity 2.70
1 11 UMi b 516.22000 1.5400 Radial Velocity 1.80
2 14 And b 185.84000 0.8300 Radial Velocity 2.20
3 14 Her b 1773.40000 2.7700 Radial Velocity 0.90
4 16 Cyg B b 799.50000 1.6800 Radial Velocity 1.01
... ... ... ... ... ...
3827 tau Gem b 305.50000 1.1700 Radial Velocity 2.30
3828 ups And b 4.61711 0.0590 Radial Velocity 1.27
3829 ups And c 240.93700 0.8610 Radial Velocity 1.27
3830 ups And d 1281.43900 2.5500 Radial Velocity 1.27
3831 ups And e 3848.86000 5.2456 Radial Velocity 1.27

2049 rows × 5 columns

It definitely has fewer rows than before, but we should still check that they're all complete...

Check your data

Use the isna function to check that you have properly cleaned the planet_df dataframe. Check the shape of the dataframe returned when you use isna and its inverse.

Solution
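One way to do this (any equivalent check is fine):

planet_df[planet_df.isna().any(axis=1)].shape    # should be (0, 5): no rows with missing data remain
planet_df[~planet_df.isna().any(axis=1)].shape   # should be (2049, 5): every remaining row is complete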

Key Points:

  • Missing data is represented as NaN or 'Not a Number'
  • You can use the na_values argument of pd.read_csv to tell pandas which values correspond to missing data.
  • isna() can be used to find missing data
  • You can use ~ to return the inverse of a logical expression.