Data Visualization with Pandas

Introduction

It is seldom a good idea to present your scientific or business data solely in rows and columns of numbers. Instead, we use various kinds of diagrams to visualize our data. This makes the communication of information more efficient and the information easier to grasp. In other words, it makes complex data more accessible and understandable. Numerical data can be graphically encoded with line charts, bar charts, pie charts, histograms, scatterplots and others.

We have already seen the powerful capabilities of Matplotlib for creating publication-quality plots. Matplotlib is a low-level tool for achieving this goal, because you have to construct your plots by adding up basic components, like legends, tick labels, contours and so on. Pandas provides various plotting possibilities, which make life a lot easier.

We will start with an example of a line plot.

Line Plot in Pandas

Series

Both the Pandas Series and DataFrame objects support a plot method.

The following is a simple example of a line plot for a Series object. We use a simple Python list "data" as the y values, or the range. The index will be used for the x values, or the domain.

import pandas as pd
data = [100, 120, 140, 180, 200, 210, 214]
s = pd.Series(data, index=range(len(data)))
s.plot()
The above Python code returned the following result:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc173ed1630>
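
The value shown above is the Matplotlib Axes object that plot returns. Outside of a notebook, this object can be used to get at the figure, for example to save it. The following is a minimal sketch under assumptions which are not taken from the text above: the non-interactive "Agg" backend is selected so that the code runs without a display, and the image is written to an in-memory buffer (png_buffer is an illustrative name) instead of a file:

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import pandas as pd

data = [100, 120, 140, 180, 200, 210, 214]
s = pd.Series(data, index=range(len(data)))

ax = s.plot()            # plot returns a Matplotlib Axes object
fig = ax.get_figure()    # the Figure that contains this Axes

# Write the figure as PNG into an in-memory buffer (illustrative name)
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png")
```

Passing a file name such as "series.png" to savefig instead of the buffer would write the image to disk.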

It is possible to suppress the usage of the index by setting the keyword parameter "use_index" to False. In our example this will give us the same result, because the index consists of the positions 0 to len(data) - 1 anyway:

s.plot(use_index=False)
This gets us the following output:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc173ea29e8>
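
The two calls only agree because the index of s consists of the positions 0 to 6. The following sketch uses a Series with a different numerical index to make the effect of use_index visible. The index values 10 to 70, the variable names and the "Agg" backend are assumptions chosen for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

data = [100, 120, 140, 180, 200, 210, 214]
# Illustrative index that differs from the positions 0..6
s2 = pd.Series(data, index=range(10, 80, 10))

fig, (ax_index, ax_position) = plt.subplots(1, 2)
s2.plot(ax=ax_index)                       # x values are taken from the index
s2.plot(ax=ax_position, use_index=False)   # x values are the positions

# Read back the x values of the plotted lines
x_with_index = [int(x) for x in ax_index.get_lines()[0].get_xdata()]
x_without_index = [int(x) for x in ax_position.get_lines()[0].get_xdata()]
print(x_with_index)
print(x_without_index)
```

The first list should contain the index values, the second one the positions starting at 0.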

We will now experiment with a Series whose index consists of strings.

fruits = ['apples', 'oranges', 'cherries', 'pears']
quantities = [20, 33, 52, 10]
S = pd.Series(quantities, index=fruits)
S.plot()
This gets us the following:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc173e6c7f0>

Line Plots in DataFrames

We will now introduce the plot method of a DataFrame. We define a dictionary with the population and area figures. This dictionary can be used to create the DataFrame which we want to use for plotting:

import pandas as pd
cities = {"name": ["London", "Berlin", "Madrid", "Rome", 
                   "Paris", "Vienna", "Bucharest", "Hamburg", 
                   "Budapest", "Warsaw", "Barcelona", 
                   "Munich", "Milan"],
          "population": [8615246, 3562166, 3165235, 2874038,
                         2273305, 1805681, 1803425, 1760433,
                         1754000, 1740119, 1602386, 1493900,
                         1350680],
          "area" : [1572, 891.85, 605.77, 1285, 
                    105.4, 414.6, 228, 755, 
                    525.2, 517, 101.9, 310.4, 
                    181.8]
}
city_frame = pd.DataFrame(cities,
                          columns=["population", "area"],
                          index=cities["name"])
print(city_frame)
           population     area
London        8615246  1572.00
Berlin        3562166   891.85
Madrid        3165235   605.77
Rome          2874038  1285.00
Paris         2273305   105.40
Vienna        1805681   414.60
Bucharest     1803425   228.00
Hamburg       1760433   755.00
Budapest      1754000   525.20
Warsaw        1740119   517.00
Barcelona     1602386   101.90
Munich        1493900   310.40
Milan         1350680   181.80

The following code plots our DataFrame city_frame. We multiply the area column by 1000, because otherwise the "area" line would not be visible; in other words, it would overlap with the x axis:

city_frame["area"] *= 1000
city_frame.plot()
The above code returned the following output:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc173e54588>

This plot does not meet our expectations, because not all the city names appear on the x axis. We can change this by defining the xticks explicitly with "range(len(city_frame.index))". Furthermore, we have to set use_index to True, so that we get city names and not numbers from 0 to len(city_frame.index) - 1:

city_frame.plot(xticks=range(len(city_frame.index)),
                use_index=True)
After having executed the Python code above we received the following:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc173db98d0>

Now we have a new problem: the city names overlap. There is a remedy at hand for this problem as well. We can rotate the labels by 90 degrees, so that the names are printed vertically:

city_frame.plot(xticks=range(len(city_frame.index)),
                use_index=True, 
                rot=90)
The code above returned the following:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc173db9898>

Using Twin Axes

We multiplied the area column by 1000 to get a proper output. Instead of this, we could have used twin axes. We will demonstrate this in the following example. We will recreate the city_frame DataFrame to get the original area column:

city_frame = pd.DataFrame(cities,
                          columns=["population", "area"],
                          index=cities["name"])
print(city_frame)
           population     area
London        8615246  1572.00
Berlin        3562166   891.85
Madrid        3165235   605.77
Rome          2874038  1285.00
Paris         2273305   105.40
Vienna        1805681   414.60
Bucharest     1803425   228.00
Hamburg       1760433   755.00
Budapest      1754000   525.20
Warsaw        1740119   517.00
Barcelona     1602386   101.90
Munich        1493900   310.40
Milan         1350680   181.80

To get a twin axes representation of our diagram, we need the function "subplots" from the module matplotlib.pyplot and the Axes method "twinx":

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
fig.suptitle("City Statistics")
ax.set_ylabel("Population")
ax.set_xlabel("Cities")
ax2 = ax.twinx()
ax2.set_ylabel("Area")
city_frame["population"].plot(ax=ax, 
                              style="b-",
                              xticks=range(len(city_frame.index)),
                              use_index=True, 
                              rot=90)
city_frame["area"].plot(ax=ax2, 
                        style="g-",
                        use_index=True, 
                        rot=90)
fig.legend()
After having executed the Python code above we received the following:
<matplotlib.legend.Legend at 0x7fc173c1ceb8>

We can also create twin axes without calling plt.subplots ourselves. The plot method of Pandas returns the axes it has drawn on, and we can call twinx on this object. We demonstrate this in the following program:

import matplotlib.pyplot as plt
ax1 = city_frame["population"].plot(style="b-",
                                    xticks=range(len(city_frame.index)),
                                    use_index=True,
                                    rot=90)
ax2 = ax1.twinx()
city_frame["area"].plot(ax=ax2,
                        style="g-",
                        use_index=True,
                        rot=90)
ax1.legend(loc=(.7, .9), frameon=False)
ax2.legend(loc=(.7, .85), frameon=False)
plt.show()
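
The plot method of Pandas can also create the second y axis itself by means of the keyword parameter secondary_y. The following sketch uses a shortened version of the city data. The shortened dictionary and the "Agg" backend are assumptions so that the example is self-contained and runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import pandas as pd

# Shortened version of the city data used above
cities = {"name": ["London", "Berlin", "Madrid", "Rome"],
          "population": [8615246, 3562166, 3165235, 2874038],
          "area": [1572, 891.85, 605.77, 1285]}
city_frame = pd.DataFrame(cities,
                          columns=["population", "area"],
                          index=cities["name"])

# "area" is drawn against a second y axis on the right-hand side
ax = city_frame.plot(secondary_y=["area"],
                     xticks=range(len(city_frame.index)),
                     use_index=True,
                     rot=90)
ax.set_ylabel("Population")
ax.right_ax.set_ylabel("Area")   # right_ax is the axes created by Pandas
```

The returned object is the left axes. The axes for the secondary column is available as its attribute right_ax.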

Multiple Y Axes

Let's add another column to our city_frame. We will add a column with the population density, i.e. the number of people per square kilometre:

city_frame["density"] = city_frame["population"] / city_frame["area"]
city_frame
The above code returned the following:
           population     area       density
London        8615246  1572.00   5480.436387
Berlin        3562166   891.85   3994.131300
Madrid        3165235   605.77   5225.143206
Rome          2874038  1285.00   2236.605447
Paris         2273305   105.40  21568.358634
Vienna        1805681   414.60   4355.236372
Bucharest     1803425   228.00   7909.758772
Hamburg       1760433   755.00   2331.699338
Budapest      1754000   525.20   3339.680122
Warsaw        1740119   517.00   3365.800774
Barcelona     1602386   101.90  15725.083415
Munich        1493900   310.40   4812.822165
Milan         1350680   181.80   7429.482948
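
Because the division is carried out element by element, each density value is simply the population of a city divided by its area. The following sketch checks this for three of the cities and looks up the most densely populated one among them. The shortened DataFrame and the names subset, london_density and densest are chosen for illustration only:

```python
import pandas as pd

# Three of the cities from the table above
subset = pd.DataFrame({"population": [8615246, 2874038, 2273305],
                       "area": [1572, 1285, 105.4]},
                      index=["London", "Rome", "Paris"])
subset["density"] = subset["population"] / subset["area"]

# Element-wise division: the value for London is 8615246 / 1572
london_density = subset.loc["London", "density"]

# Index label of the row with the highest density
densest = subset["density"].idxmax()
print(round(london_density, 6), densest)
```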

Now we have three columns to plot. For this purpose, we will create three axes for our values:

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
fig.suptitle("City Statistics")
ax.set_ylabel("Population")
ax.set_xlabel("Cities")
ax_area, ax_density = ax.twinx(), ax.twinx() 
ax_area.set_ylabel("Area")
ax_density.set_ylabel("Density")
rspine = ax_density.spines['right']
rspine.set_position(('axes', 1.25))
ax_density.set_frame_on(True)
ax_density.patch.set_visible(False)
fig.subplots_adjust(right=0.75)
city_frame["population"].plot(ax=ax, 
                              style="b-",
                              xticks=range(len(city_frame.index)),
                              use_index=True, 
                              rot=90)
city_frame["area"].plot(ax=ax_area, 
                        style="g-",
                        use_index=True, 
                        rot=90)
city_frame["density"].plot(ax=ax_density, 
                           style="r-",
                           use_index=True, 
                           rot=90)
This gets us the following output:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc173aa1470>

A More Complex Example

We use the previously gained knowledge in the following example. We use a file with visitor statistics from our website python-course.eu. The content of the file looks like this:

Month Year  "Unique visitors"   "Number of visits"  Pages   Hits    Bandwidth Unit
Jun 2010    11  13  42  290 2.63 MB
Jul 2010    27  39  232 939 9.42 MB
Aug 2010    75  87  207 1,096   17.37 MB
Sep 2010    171 221 480 2,373   39.63 MB
...
Nov 2016    234,518 374,641 832,244 4,378,623   167.68 GB
Dec 2016    209,367 323,845 598,081 3,627,830   145.41 GB
Jan 2017    219,153 346,011 633,984 3,827,909   158.36 GB
Feb 2017    255,869 409,503 752,516 4,630,365   189.43 GB
Mar 2017    284,557 467,802 891,505 5,306,521   221.30 GB
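
The following code reads the file, normalizes the bandwidth and month columns and plots the visitor numbers:

%matplotlib inline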
import pandas as pd
data_path = "data1/"
data = pd.read_csv(data_path + "python_course_monthly_history.txt", 
                   quotechar='"',
                   thousands=",",
                   delimiter=r"\s+")
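def unit_convert(x):
    # x is a row consisting of the fields "Bandwidth" and "Unit";
    # values given in MB or GB are scaled to kilobytes
    value, unit = x
    if unit == "MB":
        value *= 1024
    elif unit == "GB":
        value *= 1048576 # i.e. 1024 **2
    return value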
b_and_u= data[["Bandwidth", "Unit"]]
bandwidth = b_and_u.apply(unit_convert, axis=1)
del data["Unit"]
data["Bandwidth"] = bandwidth
month_year =  data[["Month", "Year"]]
month_year = month_year.apply(lambda x: x["Month"] + " " + str(x["Year"]),
                              axis=1)
data["Month"] = month_year
del data["Year"]
data.set_index("Month", inplace=True)
del data["Bandwidth"]
data[["Unique visitors", "Number of visits"]].plot(use_index=True, 
                                                   rot=90,
                                                   xticks=range(1, len(data.index),4))
The previous code returned the following:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc173a05400>
ratio = pd.Series(data["Number of visits"] / data["Unique visitors"],
                  index=data.index)
ratio.plot(use_index=True, 
           xticks=range(1, len(ratio.index),4),
           rot=90)
After having executed the Python code above we received the following result:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1739eb5c0>
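
The two apply calls above work row by row. The same conversions can also be written as column operations. The following is a sketch on a small hand-made sample which copies three rows from the file excerpt. The names sample and factors are illustrative, and leaving units other than MB and GB unchanged mirrors the behaviour of unit_convert:

```python
import pandas as pd

# Three rows copied from the file excerpt above
sample = pd.DataFrame({"Month": ["Jun", "Jul", "Nov"],
                       "Year": [2010, 2010, 2016],
                       "Bandwidth": [2.63, 9.42, 167.68],
                       "Unit": ["MB", "MB", "GB"]})

# Same factors as in unit_convert; unknown units get the factor 1
factors = {"MB": 1024, "GB": 1024 ** 2}
bandwidth = sample["Bandwidth"] * sample["Unit"].map(factors).fillna(1)

# Combine month and year into one label such as "Jun 2010"
month_year = sample["Month"] + " " + sample["Year"].astype(str)
print(month_year.tolist())
```

For larger files the column operations are usually faster than apply with axis=1, because they avoid a Python function call per row.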

Converting String Columns to Floats

In the folder "data1", we have a file called programming_language_usage.txt with a ranking of programming languages by usage. The data has been collected and created by TIOBE in March 2017.

The file looks like this:

"Mar 2017"  "Language"  Percentage
1       Java    16.384%
2       C   7.742%  
3       C++ 5.184%  
4       C#  4.409%  
5       Python  3.919%  
6       Visual Basic .NET   3.174%  
7       PHP 3.009%  
8       JavaScript  2.667%  
9       Delphi/Object Pascal    2.544%  

The percentage column contains strings with a percentage sign. We can get rid of this when we read in the data with read_csv. All we have to do is define a converter function, which we pass to read_csv via the converters dictionary. This dictionary contains column names as keys and references to functions as values.

def strip_percentage_sign(x):
    return float(x.strip('%'))
data_path = "data1/"
progs = pd.read_csv(data_path + "programming_language_usage.txt", 
                   quotechar='"',
                   thousands=",",
                   index_col=1,
                   converters={'Percentage':strip_percentage_sign},
                   delimiter=r"\s+")
del progs["Mar 2017"]
print(progs)
progs.plot(xticks=range(1, len(progs.index),4),
           use_index=True, rot=90)
                      Percentage
Language                        
Java                      16.384
C                          7.742
C++                        5.184
C#                         4.409
Python                     3.919
Visual Basic .NET          3.174
PHP                        3.009
JavaScript                 2.667
Delphi/Object Pascal       2.544
Swift                      2.268
Perl                       2.261
Ruby                       2.254
Assembly language          2.232
R                          2.016
Visual Basic               2.008
Objective-C                1.997
Go                         1.982
MATLAB                     1.854
PL/SQL                     1.672
Scratch                    1.472
The above Python code returned the following:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc17388f940>
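
The effect of the converters dictionary can be tried out without the file. The following sketch reads a small comma-separated sample from a string instead. The sample text, the comma as separator and the names progs_sample, raw and converted are assumptions for illustration and do not reflect the layout of programming_language_usage.txt. The second variant shows an alternative which strips the sign after reading:

```python
import io

import pandas as pd

def strip_percentage_sign(x):
    return float(x.strip('%'))

# Hypothetical sample with three entries of the ranking
text = "Language,Percentage\nJava,16.384%\nC,7.742%\nPython,3.919%\n"

# Variant 1: convert the column while reading
progs_sample = pd.read_csv(io.StringIO(text),
                           index_col=0,
                           converters={'Percentage': strip_percentage_sign})

# Variant 2: read the strings first, convert the column afterwards
raw = pd.read_csv(io.StringIO(text), index_col=0)
converted = raw["Percentage"].str.rstrip("%").astype(float)
print(progs_sample["Percentage"].tolist())
```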

Bar Plots in Pandas

Creating bar plots with Pandas is as easy as creating line plots. All we have to do is add the keyword parameter "kind" to the plot method and set it to "bar".

A Simple Example

import pandas as pd
data = [100, 120, 140, 180, 200, 210, 214]
s = pd.Series(data, index=range(len(data)))
s.plot(kind="bar")
We received the following result:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1734d1390>

Bar Plot for Programming Language Usage

Let's get back to our programming language ranking. We will now print a bar plot of the six most used programming languages:

progs[:6].plot(kind="bar")
After having executed the Python code above we received the following output:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc173845ef0>

Now the whole chart with all programming languages:

progs.plot(kind="bar")
The previous code returned the following result:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1737c5860>

Colorize A Bar Plot

It is possible to colorize the bars individually by assigning a list to the keyword parameter color:

my_colors = ['b', 'r', 'c', 'y', 'g', 'm']
progs[:6].plot(kind="bar",
               color=my_colors)
The previous Python code returned the following:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc173867da0>
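
How the color list is applied can depend on the Pandas version. When plot is called on a DataFrame, the list may be interpreted as one color per column, so that all bars of our single column get the first color. A sketch which should color each bar individually in either case is to plot the column as a Series. The hand-made DataFrame progs_top and the "Agg" backend are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import pandas as pd

# Top six entries of the ranking shown above
progs_top = pd.DataFrame(
    {"Percentage": [16.384, 7.742, 5.184, 4.409, 3.919, 3.174]},
    index=["Java", "C", "C++", "C#", "Python", "Visual Basic .NET"])
progs_top.index.name = "Language"

my_colors = ['b', 'r', 'c', 'y', 'g', 'm']
# Selecting the column yields a Series; the list is then used bar by bar
ax = progs_top["Percentage"].plot(kind="bar", color=my_colors)

# Collect the face color of every bar to check the result
bar_colors = [bar.get_facecolor() for bar in ax.patches]
```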

Pie Chart Diagrams in Pandas

A simple example:

import pandas as pd
fruits = ['apples', 'pears', 'cherries', 'bananas']
series = pd.Series([20, 30, 40, 10], 
                   index=fruits, 
                   name='series')
series.plot.pie(figsize=(6, 6))
The Python code above returned the following:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc173651710>
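
With the keyword parameter "explode" we can offset each wedge outwards by a fraction of the radius:

fruits = ['apples', 'pears', 'cherries', 'bananas']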
series = pd.Series([20, 30, 40, 10], 
                   index=fruits, 
                   name='series')
explode = [0, 0.10, 0.40, 0.7]
series.plot.pie(figsize=(6, 6),
                explode=explode)
The above code returned the following output:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc17362a630>

We will now plot the programming language ranking as a pie chart:

import matplotlib.pyplot as plt
progs.plot.pie(subplots=True,
               legend=False)
The Python code above returned the following:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fc1735e0710>],
      dtype=object)

The y label "Percentage" inside our pie plot looks ugly. We can remove it by calling "plt.ylabel('')":

import matplotlib.pyplot as plt
progs.plot.pie(subplots=True,
               legend=False)
plt.ylabel('')
The above code returned the following:
Text(0,0.5,'')
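
plt.ylabel acts on the current axes. Since plot.pie with subplots=True returns an array of axes, the label can also be removed on the returned object itself. The following sketch uses a shortened, hand-made version of the ranking. The data, the names progs_sample and pie_ax and the "Agg" backend are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import pandas as pd

# Shortened version of the ranking shown above
progs_sample = pd.DataFrame({"Percentage": [16.384, 7.742, 5.184]},
                            index=["Java", "C", "C++"])

# With subplots=True one axes object per column is returned
axes = progs_sample.plot.pie(subplots=True, legend=False)
pie_ax = axes[0]
label_before = pie_ax.get_ylabel()   # the column name
pie_ax.set_ylabel('')                # remove the label on this axes only
```

This variant does not depend on which axes is currently active, which can matter when a figure contains several plots.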