pandas
¶pandas
?¶Unfortunately, this does not mean finding a panda to do our coding for us (although that would be amazing and if anyone knows a panda who could do this please let me know).
Pandas is the Python Data Analysis Library. It can read pretty much anything you throw at it. If you've been putting your lab data into an excel spreadsheet, pandas can read it in so you can analyse and plot it. pandas
is used commonly in science and industries such as data science; the skills you're developing here will be useful well beyond the lab units. More information about pandas
can be found on the webpage here.
Previously we've used functions like np.loadtxt
to read in data from a file. This is fine for straightforward data files; if you know how your data is formatted and know that there's no missing data anywhere then np.loadtxt
will work OK.
However, it's often the case that we have data that isn't perfect. We may be using data that we didn't take ourselves, from an online database for example. The file could contain missing values that could trip up loadtxt
. We could be looking at a massive file and we might only want a few specific columns out of the hundreds in the file. pandas
makes dealing with things like this a lot easier.
A good example of 'imperfect' data is the Exoplanet_catalog_2019.csv file you were given for your semester 1 coursework. We're going to work with that data here. Copy the data file (download here or use the file you used last semester) to the folder you're working in and start a new notebook.
pandas
dataframes and reproducibility¶One of the main features of pandas
is the way it handles data. The main data type for pandas
is a dataframe. You can think of a dataframe like a table or an Excel spreadsheet. All the data is contained inside the dataframe and you can reference individual rows, columns and cells within the dataframe.
If dataframes are like spreadsheets, why are we using them instead of Excel? Because we want to do reproducible science. When you write a lab report and describe your experimental setup and methodology you're making your work reproducible; another person should be able to take your lab report and reproduce the experiment you did. Historically, while scientists have been good about giving details of their experiments, there has been less emphasis on the reproducibility of the data analysis. However, the tide is turning and there is now much more importance placed on reproducible research and analysis.
Here is a nice blog post about why reproducible research is important.
So why aren't we doing our analysis in Excel and just making sure that it's reproducible? Because it's hard. Excel makes it very easy to get started, typing in your data and perhaps making a quick plot, but it doesn't keep a record of what you've done. It's very easy to accidently edit your data without realising it, or to make a mistake referencing a cell and completely change your results. This isn't just a problem for science. In 2012 the bank JPMorgan lost \$6.1 billion (yes, 6.1 billion dollars) due to an Excel error.
The first thing we need to do is import pandas
. You'll often see the package name abbreviated to pd
if you're looking online, so we'll stick with that convention.
import pandas as pd
Next we can read in the file. We know it's a csv
file from the filename, so we can use the pd.read_csv
function. Similar to how we assign the result of np.loadtxt
to some variables, we assign the result of pd.read_csv
to the name of our dataframe. I'm going to call our dataframe exo_df
.
exo_df = pd.read_csv('./data/Exoplanet_catalog_2019.csv')
We can now take a look at the dataframe:
exo_df
Oooh, fancy!
One useful thing to check when you read in a file is what the names of the columns are. We can do this using
exo_df.columns
We can see above that the first column name looks a bit strange. It has a #
at the start. This was most likely put there to denote this first line as a header line. Keeping that there is going to make things difficult later on, so we can change the name of this column using rename
. We pass it the old and new names of the column as a dictionary (i.e. something of the format key:value
). inplace=True
tells it to edit the existing dataframe. If we didn't want to edit it we could create a new dataframe using
new_df = exo_df.rename(columns={'# name':'name'})
which would leave the exo_df
dataframe untouched.
Once we've used rename
we can check the column names again to make sure it's worked.
exo_df.rename(columns={'# name':'name'}, inplace=True)
exo_df.columns
Dataframes work similar to how arrays do. We can access rows or columns by indexing and slicing, just as with arrays.
However, rather than remember which column corresponds to which variable, we can pass the column name. So, if we wanted to select the orbital_period
column, we just use
exo_df['orbital_period']
We can pass several column names if we want to select multiple things. The column names can be in any order, as long as we pass an array of names inside the brackets. Remember that an array is a list enclosed in square brackets, so we'll need an extra set:
exo_df[['orbital_period', 'name', 'detection_type']]
Now try
exo_df['orbital_period', 'name', 'detection_type']
with just one set of brackets to select these columns. What does the error message say? Do you understand how the error message corresponds to what you've done? This may seem a slightly pointless exercise given that you know why the command doesn't work, but learning to interpret error messages is a crucial skill in coding. As you may have found already, error messages aren't always crystal clear and you'll sometimes have to dig a little to see what the real problem is.
What if we were only interested in the objects in the table that we detected using the Radial Velocity technique? We should be able to select those rows using the same logical conditions we've used before.
exo_df['detection_type']=='Radial Velocity'
exo_df[exo_df['detection_type']=='Radial Velocity']
Now we have all the data from all of the rows where detection_type
is Radial Velocity.