Python advanced: data visualization with Matplotlib

Brief introduction

Images are easily lost when an article is posted on CSDN. Follow "Lazy Programming" for a better reading experience.

In this article we continue to use Matplotlib to visualize data. Reply "data2" to the official account to get the code and data for this article.

Box plot

A box plot, also known as a box-and-whisker plot, describes the data with five statistics: minimum, first quartile, median, third quartile and maximum, as shown in the following figure:

[Image: https://raw.githubusercontent.com/ayuLiao/images/master/20190826100715.png]

1. Minimum and maximum: the smallest and largest values in the data. When outliers are drawn separately, the whiskers end at the smallest and largest values that are not outliers.

2. Median: sort the data and take the value in the middle position. If two values share the middle position, add them and divide by 2. For the numbers 1, 2, 4, 5, 7, 7, 8, 9, the median is (5 + 7) / 2 = 6.

3. Lower quartile, also known as the first quartile: the median of the data to the left of the median after sorting. For the numbers 1, 2, 4, 5, 7, 7, 8, 9, the first quartile is the median of 1, 2, 4, 5, which is (2 + 4) / 2 = 3.

4. Upper quartile, also known as the third quartile: the median of the data to the right of the median after sorting. For the numbers 1, 2, 4, 5, 7, 7, 8, 9, the third quartile is the median of 7, 7, 8, 9, which is (7 + 8) / 2 = 7.5.

5. IQR (interquartile range): the distance from the first quartile to the third quartile, which covers the middle 50% of the data. The figure above does not show the IQR.

6. Outlier: a value less than (first quartile - 1.5*IQR) or greater than (third quartile + 1.5*IQR).
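The definitions above can be sketched in plain Python. This is only an illustration of the median-of-halves rule described in the list; the function names are made up for this example, and libraries such as NumPy interpolate percentiles by default, so their quartiles can differ slightly.

```python
import statistics


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max using the median-of-halves rule.

    Needs at least two values, otherwise the halves are empty.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]  # values to the left of the median
    # values to the right of the median (skip the median itself when n is odd)
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return {
        "min": data[0],
        "q1": statistics.median(lower),
        "median": statistics.median(data),
        "q3": statistics.median(upper),
        "max": data[-1],
    }


def find_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    s = five_number_summary(values)
    iqr = s["q3"] - s["q1"]
    low = s["q1"] - 1.5 * iqr
    high = s["q3"] + 1.5 * iqr
    return [v for v in values if v < low or v > high]


numbers = [1, 2, 4, 5, 7, 7, 8, 9]
summary = five_number_summary(numbers)
print(summary)                        # q1 = 3.0, median = 6.0, q3 = 7.5
print(find_outliers(numbers + [30]))  # [30]
```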

With a box plot we can roughly judge whether the data is symmetric and how widely it is spread. Let's draw one.

As before, we start by reading the data and doing some simple processing:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

exam_data = pd.read_csv('datasets/exams.csv')
# Extract only the columns related to test scores
exam_scores = exam_data[['math score', 'reading score', 'writing score']]
exam_scores.head()

To make the box plot easier to draw, convert the data to a NumPy array:

exam_scores_array = np.array(exam_scores)

Draw the box plot with Matplotlib:

colors = ['blue', 'grey', 'lawngreen']

bp = plt.boxplot(exam_scores_array,  # data
                 patch_artist=True,  # draw the boxes as patches so they can be filled with colors later
                 notch=True)         # draw notched boxes

for i in range(len(bp['boxes'])):
    bp['boxes'][i].set(facecolor=colors[i])   # fill color; requires patch_artist=True
    bp['caps'][2*i + 1].set(color=colors[i])  # color one of the two whisker caps of each box
    
plt.xticks([1, 2, 3], ['Math', 'Reading', 'Writing'])

plt.show()

[Image: https://raw.githubusercontent.com/ayuLiao/images/master/20190826105404.png]

Violin plot (ViolinPlot)

A violin plot displays the distribution and probability density of the data. It combines the characteristics of a box plot and a density plot.

A violin plot has the following elements:

95% confidence interval: the thin black line extending from the center
Density plot
Median
Interquartile range
Split densities by category

Draw a violin plot using the same data as the box plot:

vp = plt.violinplot(exam_scores_array,
                    showmedians=True)

plt.xticks([1, 2, 3], ['Math', 'Reading', 'Writing'])

for i in range(len(vp['bodies'])):
    vp['bodies'][i].set(facecolor=colors[i])

plt.show()

The figure shows that each violin is widest in the middle, where the density is highest, which indicates that most students' scores are close to the average.
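The width of a violin at a given score is a kernel density estimate of the data at that score. The following is a minimal sketch of a Gaussian kernel density estimate in plain Python, not Matplotlib's own implementation; the scores are hypothetical and the bandwidth uses Scott's rule of thumb, which is an assumption made for this example.

```python
import math
import statistics


def gaussian_kde(samples, bandwidth=None):
    """Return a function that estimates the probability density of `samples`."""
    n = len(samples)
    if bandwidth is None:
        # Assumption: Scott's rule of thumb for one-dimensional data
        bandwidth = statistics.stdev(samples) * n ** (-1 / 5)

    def density(x):
        # Average of Gaussian bumps centered on every sample
        total = sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )
        return total / (n * bandwidth * math.sqrt(2 * math.pi))

    return density


# Hypothetical scores, clustered around 70
scores = [52, 61, 65, 68, 69, 70, 71, 72, 74, 78, 85, 93]
density = gaussian_kde(scores)

for x in (40, 70, 100):
    print(x, round(density(x), 4))
```

The density is highest near the cluster of scores and falls off toward both ends, which is why the violin is widest in the middle.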

Twin-axis plot

A twin-axis plot, as the name implies, is a plot with two y axes. When two data sets share the same x axis, we can consider drawing a twin-axis plot.

A twin-axis plot gives an intuitive sense of how two kinds of data relate. For example, population data and GDP data share the same time axis (x axis), so a twin-axis plot can help us judge whether changes in the two are correlated.
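As a minimal sketch of the idea, the following draws two made-up series against one x axis with twinx(). The population and GDP numbers are hypothetical, and the Agg backend and output file name are chosen only so that the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data: population (millions) and GDP (billions) per year
years = [2015, 2016, 2017, 2018, 2019]
population = [10.1, 10.3, 10.6, 10.8, 11.1]
gdp = [420, 445, 470, 510, 540]

fig, ax_left = plt.subplots()
ax_left.plot(years, population, color="red", label="Population")
ax_left.set_ylabel("Population (millions)", color="red")

# twinx() creates a second axes that shares the x axis but has its own y axis
ax_right = ax_left.twinx()
ax_right.plot(years, gdp, color="blue", label="GDP")
ax_right.set_ylabel("GDP (billions)", color="blue")

fig.legend(loc="upper left")
fig.savefig("twin_axis_sketch.png")
```

Each y axis gets its own scale, so two series with very different magnitudes can be compared over the same x range.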

Here we use weather data for Austin, the capital of Texas, to draw a twin-axis plot. We mainly use two columns, average temperature and average wind speed, to see whether there is any connection between them.

As before, first read the data and select the columns we need:

austin_weather = pd.read_csv('datasets/austin_weather.csv')
austin_weather.head()
# Date: date of the observation
# TempAvgF: average temperature, in degrees Fahrenheit
# WindAvgMPH: average wind speed, in miles per hour
austin_weather = austin_weather[['Date', 'TempAvgF', 'WindAvgMPH']].head(5)
print(austin_weather)

[Image: https://raw.githubusercontent.com/ayuLiao/images/master/20190826132919.png]

Use this data to draw a twin-axis plot:

# Create the figure and the left-hand axes
fig, ax_tempF = plt.subplots()

# plt.subplots(figsize=(12, 6)) would achieve the same effect
fig.set_figwidth(12)
fig.set_figheight(6)

# Set x label
ax_tempF.set_xlabel('Date')

ax_tempF.tick_params(axis = 'x',
                    bottom=False, # Disable ticks
                    labelbottom=False # Disable x-axis labels
                    ) 

# Set left Y-axis label
ax_tempF.set_ylabel('Temp (F)', 
                    color='red',
                    size='x-large')

# Set labelcolor and labelsize for the left Y-axis label
ax_tempF.tick_params(axis='y', 
                     labelcolor='red', 
                     labelsize='large')

# Draw TempAvgF against the left y axis
# (the label is what fig.legend() shows for this line)
ax_tempF.plot(austin_weather['Date'],
              austin_weather['TempAvgF'],
              color='red',
              label='TempAvgF')

# Create a second axes that shares the same x axis
ax_precip = ax_tempF.twinx()

# Set the label of the right y axis
ax_precip.set_ylabel('Avg Wind Speed (MPH)',
                     color='blue',
                     size='x-large')

# Set labelcolor and labelsize for the right y axis
ax_precip.tick_params(axis='y',
                      labelcolor='blue',
                      labelsize='large')

# Draw WindAvgMPH against the right y axis
ax_precip.plot(austin_weather['Date'],
               austin_weather['WindAvgMPH'],
               color='blue',
               label='WindAvgMPH')

fig.legend(loc=1, bbox_to_anchor=(1,1), bbox_transform=ax_tempF.transAxes)

plt.show()

The figure suggests that there is some relationship between the two, but average temperature is clearly not determined by average wind speed alone.
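A visual impression like this can be backed up with a number. The following is a minimal sketch of the Pearson correlation coefficient in plain Python; the values are hypothetical stand-ins for the TempAvgF and WindAvgMPH columns, not the real data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values standing in for the TempAvgF and WindAvgMPH columns
temp_avg_f = [60, 48, 45, 46, 50]
wind_avg_mph = [4, 6, 3, 4, 2]

r = pearson(temp_avg_f, wind_avg_mph)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 a weak one. With pandas, austin_weather['TempAvgF'].corr(austin_weather['WindAvgMPH']) computes the same coefficient on the real columns.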

Stack plot

A stack plot is a special kind of area plot that can be used to compare multiple variables over an interval. Unlike an ordinary area plot, each data area in a stack plot starts from the top of the previous data area.
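To make "starts from the top of the previous area" concrete, here is a small sketch in plain Python that computes the upper edge of each layer as a running sum. The three series are hypothetical and the variable names are chosen only for this example.

```python
from itertools import accumulate

# Hypothetical values for three series at four x positions
series = {
    "A": [1, 2, 3, 4],
    "B": [2, 2, 2, 2],
    "C": [1, 3, 1, 3],
}

# Upper edge of each layer = running sum of the layers so far, per x position
columns = zip(*series.values())                      # one tuple per x position
tops_per_x = [list(accumulate(col)) for col in columns]
tops = [list(layer) for layer in zip(*tops_per_x)]   # back to one list per layer

for name, top in zip(series, tops):
    print(name, top)
# A [1, 2, 3, 4]
# B [3, 4, 5, 6]
# C [4, 7, 6, 9]
```

The lower edge of each layer is the upper edge of the layer before it (zero for the first layer), so the top of the last layer is the total at each x position.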

Here we use the national parks data to draw a stack plot:

np_data = pd.read_csv('datasets/national_parks.csv')
print(np_data.head())

The national parks data covers three parks: Badlands, Grand Canyon and Bryce Canyon.

To draw the stack plot, we first need to combine the data for the three parks into a two-dimensional array. NumPy's vstack() function does this directly. A simple example of vstack():

import numpy as np
a=[1,2,3]
b=[4,5,6]
print(np.vstack((a,b)))

# Output:
[[1 2 3]
 [4 5 6]]

Then we use the national parks data to draw the stack plot:

x = np_data['Year']
y = np.vstack([np_data['Badlands'], 
               np_data['GrandCanyon'], 
               np_data['BryceCanyon']])
               
# Label for each area
labels = ['Badlands',
          'GrandCanyon',
          'BryceCanyon']

# Color for each area
colors = ['sandybrown',
          'tomato',
          'skyblue']

# stackplot() creates the stack plot,
# similar to pandas's df.plot.area()
plt.stackplot(x, y,
              labels=labels,
              colors=colors,
              edgecolor='black')

# Draw the legend
plt.legend(loc=2)

plt.show()

[Image: https://raw.githubusercontent.com/ayuLiao/images/master/20190826140727.png]

Percentage stack plot

A percentage stack plot is similar to an ordinary stack plot, except that each value is first converted to its share of the total at that x position and then drawn. We again use the national parks data to draw a percentage stack plot:

plt.figure(figsize=(10,7))

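# Keep only the park columns so that 'Year' is not included in the row totals
park_data = np_data[['Badlands', 'GrandCanyon', 'BryceCanyon']]
# divide(): element-wise true division; axis=0 divides each row by its own total
data_perc = park_data.divide(park_data.sum(axis=1), axis=0)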

plt.stackplot(x,
              data_perc["Badlands"],data_perc["GrandCanyon"],data_perc["BryceCanyon"],
              edgecolor='black',
              colors=colors,
              labels=labels)

plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))

plt.show()

The result is as follows:

[Image: https://raw.githubusercontent.com/ayuLiao/images/master/20190826141312.png]
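The row-wise normalization behind a percentage stack plot can also be sketched without pandas. The numbers below are hypothetical, and to_shares is a helper name made up for this example.

```python
# Hypothetical values for three parks over three years
rows = [
    {"Badlands": 10, "GrandCanyon": 60, "BryceCanyon": 30},
    {"Badlands": 20, "GrandCanyon": 50, "BryceCanyon": 30},
    {"Badlands": 25, "GrandCanyon": 50, "BryceCanyon": 25},
]


def to_shares(row):
    """Divide every value in the row by the row total."""
    total = sum(row.values())
    return {name: value / total for name, value in row.items()}


shares = [to_shares(row) for row in rows]

for share in shares:
    print(share, "sum =", round(sum(share.values()), 6))
```

Because every row sums to 1 after the conversion, the stacked areas always fill the full height of the plot, and only the proportions change along the x axis.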

Ending

This article introduced some of the ways to visualize data with Matplotlib. The next article will cover how to draw other kinds of plots with Matplotlib. Follow HackPython for more.

Lazy programming - er Liang

Posted by netman182 on Wed, 19 Feb 2020 18:53:39 -0800
