pandas
¶Questions
Objectives
pandas
In the previous section we learned how to read data in with pandas
and how to access the data within a dataframe. Here we will be looking at how to "clean" our data while still staying within the bounds of reproducible science.
import pandas as pd
exo_df = pd.read_csv("./data/Exoplanet_catalog_2019.csv")
exo_df.rename(columns={'# name':'name'}, inplace=True)
exo_df
As you may remember from your Semester 1 coursework, there are some gaps in the original csv file. Not every object has data available for every variable. This is a hassle when we're using something like np.loadtxt
to read in the data, but pandas
can handle missing variables using the NaN
value.
NaN
stands for 'Not a Number' and is the value that pandas
fills your empty cells with.
We can select all the rows in a dataframe that contain a NaN
value using the isna()
function. This is another logical operation; we're selecting the rows that have a NaN
value. Because we want to select all the rows with a NaN
and don't care which column it's in, we use .any(axis=1)
, where axis=1
says to do the operation on each row.
exo_df[exo_df.isna().any(axis=1)]
At the bottom of the output we can see the size of the dataframe that's returned - 2866 rows by 11 columns. Compare that to the original size of our dataframe... that's a lot of NaN
s!
It's handy to check which rows have missing data, but what if we're only concerned with those that aren't missing data? Then we can do the inverse to return only the rows without NaN
values.
We can get this in almost exactly the same command, just one character different. We change
exo_df.isna()
to ~exo_df.isna()
. The ~
is another way of saying `not.
exo_df[~exo_df.isna().any(axis=1)]
That's a lot less rows. Oh dear. But what if we don't care about some of the columns with missing data?
In your coursework you were asked to only look at the planet name, period, semi-major axis, detection technique and star mass columns. So let's create a new dataframe that only contains these columns. I'll call this one planet_df
First we need to check the names of the columns we want, then we can make our new dataframe using these names:
exo_df.columns
planet_df = exo_df[['name', 'orbital_period', 'semi_major_axis', 'detection_type', 'star_mass']]
planet_df
Now we can check how many of these rows and columns contain NaN
values. Rather than print out the dataframe and read the number underneath, we can use the shape
attribute.
planet_df[planet_df.isna().any(axis=1)].shape
This dataframe has fewer rows with missing data than the first, but still has a lot. We only care about the rows that don't have any missing data. We could use the same operation as we did before to return only the complete rows, but an easier way to do it is to use dropna
. This removes any rows with missing data. If we'd done this on the original dataframe we'd have lost a lot of rows that are perfectly fine for our analysis but have missing data in another column. By using our new dataframe we don't remove data unnecessarily. Setting inplace=True
tells Python to edit the dataframe rather than just show us the results.
planet_df.dropna(inplace=True)
Let's take another look at our planet_df
dataframe:
planet_df
It definately has fewer rows than before, but we should still check that they're all complete...