By RaelK

Interesting what pandas do!

Updated: Apr 26, 2020

You are not lost! You are here!

Pandas is a Python software library for data manipulation and analysis. To keep things simple, we can think of pandas as an extremely powerful version of Excel, with a lot more features.


A very good friend of mine, Cate Gitau, says that learning never ends for a data scientist. Every day we are learning new techniques, every day developers are building new software, and we must stretch out of our comfort zone and advance with technology. That said, much as I would love to, I cannot cover every basic thing here, else this blog would never end, but I can promise to give as many of the necessary highlights as possible. I would highly recommend that you read up on the pandas data types: Series and DataFrames. (They are pretty easy to understand, I promise.)


As I promised in my last blog, we will start our analysis with a data set from Kaggle titled 'SF Salaries'. What I need to clearly establish is that there is no one particular or perfect way to code, especially in Python; it is about how you think through the problem and 'program' the computer to handle it. That means there could be a thousand correct ways, even more, to code a particular problem. (Don't feel limited to my lines of code.)

 

The very first and most basic thing to do is to import the libraries useful for our analysis.

NB. We may not necessarily need numpy for this analysis, but I always import it just in case.
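A minimal import cell for this walkthrough might look like this:

```python
# Core libraries for the analysis.
import pandas as pd
import numpy as np  # not strictly required here, but handy to have around
```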

Next we load the .csv (comma-separated values) data as a DataFrame.
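A sketch of the loading step, assuming the Kaggle download was saved as 'Salaries.csv'. So that the snippet runs on its own, a tiny inline sample stands in for the real file here; the rows are illustrative, not exact values from the data set:

```python
import io
import pandas as pd

# In practice: sal = pd.read_csv('Salaries.csv')  # the Kaggle download
# An inline stand-in for the real file:
sample_csv = io.StringIO(
    "Id,EmployeeName,JobTitle,BasePay,Benefits,Year\n"
    "1,NATHANIEL FORD,GENERAL MANAGER,167411.18,,2011\n"
    "2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,,2011\n"
)
sal = pd.read_csv(sample_csv)
```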

It is always good practice to check the head of our DataFrame, just to be certain we are working on the correct/intended data.
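A quick sketch, assuming the DataFrame is named `sal` as above (the rows here are a small illustrative stand-in for the Kaggle file):

```python
import pandas as pd

# Stand-in frame; in the walkthrough `sal` comes from pd.read_csv('Salaries.csv').
sal = pd.DataFrame({
    'EmployeeName': ['NATHANIEL FORD', 'GARY JIMENEZ', 'PATRICK GARDNER'],
    'JobTitle': ['GENERAL MANAGER', 'CAPTAIN III (POLICE DEPARTMENT)',
                 'DEPUTY CHIEF OF THE FIRE DEPARTMENT'],
    'BasePay': [167411.18, 155966.02, 134401.60],
})

print(sal.head())   # first five rows by default
print(sal.head(2))  # or pass the number of rows you want
```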


Also, it is important to understand the size of the data you are working with. To know the number of entries contained in our data, we use the .info() method, as below.
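A sketch of the call on a small stand-in frame:

```python
import pandas as pd

sal = pd.DataFrame({
    'EmployeeName': ['NATHANIEL FORD', 'GARY JIMENEZ', 'PATRICK GARDNER'],
    'BasePay': [167411.18, 155966.02, None],
})

# Prints the number of entries, each column's dtype, and its non-null count.
sal.info()
```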

But we can all agree that just knowing the number of entries in a given data set is of little use if we do not have a clear description of that data set. To accomplish this, we use .describe(). This gives the 'distribution' of the data based on the five-number summary (minimum, first quartile (25%), median (50%), third quartile (75%), and maximum), in addition to the count, mean and standard deviation.

(The transpose is optional; you can try it in your IDLE without the transpose and see what you get.)
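A sketch of describe-with-transpose on a stand-in frame:

```python
import pandas as pd

sal = pd.DataFrame({
    'BasePay': [167411.18, 155966.02, 134401.60],
    'Benefits': [None, None, 16000.00],
})

# One row per numeric column: count, mean, std, min, quartiles, max.
summary = sal.describe().transpose()
print(summary)
```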


We can perform some data operations like computing means, maximums, minimums and so on by slicing the column of interest and using the dot (.) operator. Let's see an example: computing the average 'BasePay' from our data set.
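A sketch with illustrative numbers:

```python
import pandas as pd

sal = pd.DataFrame({'BasePay': [110000.0, 95000.0, 72000.0]})

# Slice the column of interest, then call the aggregation via the dot operator.
avg_base = sal['BasePay'].mean()
print(avg_base)

# .max(), .min(), .sum() and friends work the same way:
print(sal['BasePay'].max())
```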


It is also possible to extract any detail pertaining to a particular individual from the data set with ease. For instance, we can get the job title of, say, PATRICK GARDNER from the data set.

NB. While doing this, it is important to remember that Python is case sensitive.
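A sketch of the lookup, using a boolean mask; the job title in this stand-in frame is illustrative, not necessarily the one in the real data:

```python
import pandas as pd

sal = pd.DataFrame({
    'EmployeeName': ['NATHANIEL FORD', 'PATRICK GARDNER'],
    'JobTitle': ['GENERAL MANAGER', 'DEPUTY CHIEF OF THE FIRE DEPARTMENT'],
})

# Boolean mask on the name, then slice out the column we want.
title = sal[sal['EmployeeName'] == 'PATRICK GARDNER']['JobTitle']
print(title)
# 'patrick gardner' (lowercase) would match nothing -- Python is case sensitive.
```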


Other operations we can do include grouping, merging and sorting columns in the data set.

For example: what was the average pay of all employees per year (2011-2014)?
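A sketch of the groupby on a stand-in frame (recent pandas versions need `numeric_only=True` to skip text columns):

```python
import pandas as pd

sal = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'BasePay': [100.0, 120.0, 130.0, 150.0],
    'Year': [2011, 2011, 2012, 2012],
})

# Mean of every numeric column per year -- note that Id gets averaged too.
per_year = sal.groupby('Year').mean(numeric_only=True)
print(per_year)
```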

But we notice that this is inaccurate, because the operation returns the mean of every numeric column in the data set, including the average of the Id column, which should not be the case: Id is supposed to be a unique identifier, so computing its average is meaningless. We require more specificity. For instance:

What was the average Benefits of all employees per year (2011-2014)?
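A sketch restricting the groupby to the single column of interest; the stand-in frame reproduces the missing 2011 Benefits values:

```python
import pandas as pd
import numpy as np

sal = pd.DataFrame({
    'Benefits': [np.nan, np.nan, 30000.0, 32000.0],
    'Year': [2011, 2011, 2012, 2012],
})

# Group, then slice out just the Benefits column before averaging.
benefits_per_year = sal.groupby('Year')['Benefits'].mean()
print(benefits_per_year)  # 2011 comes out NaN: every Benefits value that year is missing
```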

Notice that in 2011 we had missing data values in the Benefits. There are several ways to deal with missing values in data, we will see them as we proceed.


Another operation we can do is counting. We can very easily count the top 5 most common jobs.

What value_counts() does is count all the entries under the 'JobTitle' column pertaining to each particular job and sort them in descending order, most common first. To get the top 5 jobs, all we need to do is slice the first five rows as above. To get the least common jobs, the trick is to slice from the end with a negative index ([-5:]).
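A sketch on a stand-in frame:

```python
import pandas as pd

sal = pd.DataFrame({
    'JobTitle': ['TRANSIT OPERATOR', 'TRANSIT OPERATOR', 'TRANSIT OPERATOR',
                 'REGISTERED NURSE', 'REGISTERED NURSE', 'FIREFIGHTER'],
})

counts = sal['JobTitle'].value_counts()  # most common first (descending)
print(counts[:5])    # top 5 most common jobs
print(counts[-5:])   # least common jobs
```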


Most interestingly, we can also compute the relationship/correlation between different columns or distinct variables in the data set.

Is there a correlation between the length of the job title string and salary? Well, with Python that's easy-peasy. First we need to create a new column, 'title_len', and compute the length of each respective job title, as follows:
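A sketch of the two steps; 'TotalPay' is my assumption for the salary column, and the toy numbers here will of course give a different correlation than the full Kaggle data does:

```python
import pandas as pd

sal = pd.DataFrame({
    'JobTitle': ['TRANSIT OPERATOR', 'REGISTERED NURSE', 'CHIEF OF POLICE'],
    'TotalPay': [71000.0, 110000.0, 300000.0],
})

# New column: the character length of each job title.
sal['title_len'] = sal['JobTitle'].apply(len)

# Correlation matrix between title length and pay.
print(sal[['title_len', 'TotalPay']].corr())
```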


We observe a correlation of -0.036878. What this means is that there is a very weak negative correlation between the length of a job title and salary. In fact, one can claim that there is no correlation at all between these two variables. It is just as we would expect: that one has a very long job title does not imply that their salary is big, or vice versa.


Was that not interesting? I hope it was. Told you you would love pandas :)


Conclusion.

This is just a glimpse; a lot more can be done with this data set, which I would love for you to try out. I will be more than willing to help where need be.


Like I said, you can get the data set at Kaggle, or you can request it from me in the comment box. You can also give suggestions of what you would like us to discuss next. I will be glad to hear from you.


categitau.com has a good blog post, 'Must Haves: 10 Data Science Books', with very good book recommendations worth a glimpse. Like she says, 'learning never stops for a data scientist'.


Recommendations.

I would highly recommend that you try the code in your own Jupyter notebook. 'Even the longest journey starts with a single step.'
