Data analysis with Python (matplotlib)

Keywords: Python

What data analysis means: collecting and organizing a large amount of data, drawing conclusions from it, and providing data to support later decisions.

Why learn matplotlib?

1. It lets you visualize data and present it more intuitively

2. It makes the data more objective and persuasive

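I. Displaying Chinese characters

import matplotlib as mpl
mpl.rcParams["font.sans-serif"]=["SimHei"] #Use SimHei, a font with Chinese glyphs (it must be installed)
mpl.rcParams["axes.unicode_minus"]=False #Draw the minus sign as a plain hyphen so it renders in this font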
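
The setting above only works if the named font is installed. A minimal sketch of checking for it first, assuming a preference list of font names (the list and the helper pick_font are hypothetical, chosen only for illustration):

```python
import matplotlib as mpl
from matplotlib import font_manager


def pick_font(candidates, available):
    """Return the first candidate font name that is installed, else None."""
    for name in candidates:
        if name in available:
            return name
    return None


# Names of all fonts matplotlib has discovered on this machine.
installed = {f.name for f in font_manager.fontManager.ttflist}

# Hypothetical preference list; adjust it to the fonts you actually have.
candidates = ["SimHei", "Microsoft YaHei", "Noto Sans CJK SC"]

chosen = pick_font(candidates, installed)
if chosen is not None:
    mpl.rcParams["font.sans-serif"] = [chosen]
    mpl.rcParams["axes.unicode_minus"] = False
else:
    print("No candidate font found; Chinese text may render as empty boxes.")
```

If none of the candidates is installed, the defaults are left untouched, so the rest of a script still runs.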

II. Drawing a line chart

Using the pyplot functions in matplotlib

import random
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rcParams["font.sans-serif"]=["SimHei"]
mpl.rcParams["axes.unicode_minus"]=False
x=range(120) #One point per minute, from 10:00 to 11:59
y=[random.randint(20,30) for i in x] #Random temperatures between 20 and 30
plt.figure(figsize=(20,8),dpi=80) #Set the figure size (in inches) and resolution
plt.plot(x,y,label="Qingdao") #Draw the line chart
plt.xlabel("Time") #Label of the x axis
plt.ylabel("Temperature") #Label of the y axis
x_label=["10:{:02d}".format(i) for i in range(60)]
x_label+=["11:{:02d}".format(i) for i in range(60)]
plt.xticks(list(x)[::4],x_label[::4],rotation=45) #Show every 4th tick; rotation turns the tick text
plt.legend() #Show the legend
plt.grid() #Show the grid
plt.title("Temperature over time") #Set the title
plt.savefig("./picture.png") #Save to the current directory
plt.show()
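
The tick labels in the line chart are built by hand for two fixed hours. As a rough sketch, the same labels could come from a small helper; the name minute_labels and the H:MM format are assumptions made for this example, not part of matplotlib:

```python
def minute_labels(start_hour, count):
    """Build 'H:MM' labels for `count` consecutive minutes starting at start_hour:00."""
    labels = []
    for i in range(count):
        hours, minutes = divmod(i, 60)
        labels.append("{}:{:02d}".format(start_hour + hours, minutes))
    return labels


labels = minute_labels(10, 120)
step = 4  # keep every 4th tick so the labels do not overlap
positions = list(range(120))[::step]

print(labels[::step][:3])  # ['10:00', '10:04', '10:08']
print(labels[-1])          # 11:59
```

The positions and labels[::step] lists have the same length, so they can be passed together to plt.xticks(positions, labels[::step], rotation=45).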
 

About image drawing style
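
A minimal sketch of the style options that a line plot accepts. The data are made up and the specific values are only illustrative:

```python
import matplotlib.pyplot as plt

x = range(10)
y = [v * v for v in x]

fig, ax = plt.subplots(figsize=(8, 4), dpi=80)
(line,) = ax.plot(
    x, y,
    color="orange",   # line color: a name or a hex string such as "#ff7f0e"
    linestyle="--",   # "-" solid, "--" dashed, "-." dash-dot, ":" dotted
    linewidth=2,      # line thickness in points
    alpha=0.7,        # transparency, from 0 (invisible) to 1 (opaque)
    marker="o",       # draw a circle at each data point
    label="y = x squared",
)
ax.grid(alpha=0.3)    # a fainter grid
ax.legend()
```

Call plt.show() afterwards to display the figure. The same keyword arguments can be passed to plt.plot directly.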

 

III. Comparison of common statistical charts

Line chart: a statistical chart that shows how a quantity increases or decreases through the rise and fall of a line.

Features: it shows the trend of the data and reflects how things change.

Histogram: a series of vertical bars of different heights that represents how the data are distributed. Generally, the horizontal axis represents the data range (split into intervals) and the vertical axis represents how many values fall in each interval.

Features: it plots continuous data and shows the distribution of one or more groups of data (statistics).

Bar chart: data arranged in the columns or rows of a worksheet can be drawn as a bar chart.

Features: it plots discrete data, shows the size of each value at a glance, and makes the differences between values easy to compare (statistics).

Scatter plot: two groups of data form a set of coordinate points. By examining how the points are distributed, you can judge whether the two variables are correlated, or summarize the pattern the points follow.

Features: it shows whether there is a quantitative trend between variables and makes outliers visible (distribution pattern).
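
As a rough illustration of the four chart types described here, the sketch below draws one of each in a 2x2 grid. All data are small made-up lists used only for this comparison:

```python
import matplotlib.pyplot as plt

# Made-up data, used only to show the four chart types side by side.
days = [1, 2, 3, 4, 5, 6, 7]
temps = [21, 23, 22, 25, 27, 26, 24]
durations = [98, 101, 105, 110, 112, 117, 118, 121, 124, 125, 131, 139]
films = ["A", "B", "C"]
box_office = [56.0, 26.9, 17.5]
humidity = [60, 55, 58, 50, 45, 47, 52]

fig, axes = plt.subplots(2, 2, figsize=(10, 8), dpi=80)

axes[0, 0].plot(days, temps)          # line chart: trend over time
axes[0, 0].set_title("Line chart")

axes[0, 1].hist(durations, bins=5)    # histogram: distribution of raw values
axes[0, 1].set_title("Histogram")

axes[1, 0].bar(films, box_office)     # bar chart: compare discrete categories
axes[1, 0].set_title("Bar chart")

axes[1, 1].scatter(temps, humidity)   # scatter plot: relation between two variables
axes[1, 1].set_title("Scatter plot")

fig.tight_layout()
```

Call plt.show() afterwards to display the figure.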

IV. Drawing a scatter plot

import matplotlib as mpl
from matplotlib import pyplot as plt
mpl.rcParams["font.sans-serif"]=["SimHei"]
mpl.rcParams["axes.unicode_minus"]=False
a = [11,17,16,11,12,11,12,6,6,7,8,9,12,15,14,17,18,21,16,17,20,14,15,15,15,19,21,22,22,22,23]
b = [26,26,28,19,21,17,16,19,18,20,20,19,22,23,17,20,21,20,22,15,11,15,5,13,17,10,11,13,12,13,6]
x_a=list(range(1,32)) #Days 1 to 31 of March
x_b=list(range(51,82)) #Days 1 to 31 of October, shifted right by 50 to separate the two groups
x=x_a+x_b
plt.figure(figsize=(20,8),dpi=80)
plt.scatter(x_a,a,label="March temperature")
plt.scatter(x_b,b,label="October temperature")
x_label=["March {}".format(i) for i in x_a]
x_label+=["October {}".format(i-50) for i in x_b]
plt.xticks(x[::3],x_label[::3],rotation=45) #Show every 3rd tick
plt.legend(loc="upper left") #Set the location of the legend
plt.show()
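
A scatter plot suggests a correlation visually; a number can back that up. The sketch below computes the Pearson correlation coefficient in plain Python for the March temperatures (list a) against the day of the month. The helper pearson_r is written for this example and is not a matplotlib function:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# March temperatures from the scatter example, paired with the day of the month.
a = [11, 17, 16, 11, 12, 11, 12, 6, 6, 7, 8, 9, 12, 15, 14, 17,
     18, 21, 16, 17, 20, 14, 15, 15, 15, 19, 21, 22, 22, 22, 23]
days = list(range(1, 32))

r = pearson_r(days, a)
print("correlation between day and temperature: {:.2f}".format(r))
```

A value near +1 or -1 indicates a strong linear relation, and a value near 0 indicates little or none.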

 

V. Drawing a bar chart

A simple bar chart

import matplotlib as mpl
from matplotlib import pyplot as plt
mpl.rcParams["font.sans-serif"]=["SimHei"]
mpl.rcParams["axes.unicode_minus"]=False
a = ["Warwolf 2","Speed and passion 8","Kung Fu Yoga","Journey to the West subdues demons chapter","Transformers 5: The Last Knight",\
    "Wrestle, Dad","Pirates of the Caribbean 5: dead without proof","King Kong: Skeleton Island","Extreme agent: ultimate return","Biochemical crisis 6: final chapter",\
"ride the wind and waves","Stealing daddy 3","Outwit Weihu mountain","Make trouble in Tianzhu","Wolverine 3: battle to death","Spider Man: Hero returns","Wukong biography","Galaxy guard 2",\
     "Love saint","The Mummy ",]
b=[56.01,26.94,17.53,16.49,15.45,12.96,11.8,11.61,11.28,11.12,10.49,10.3,8.75,7.55,7.32,6.99,6.88,6.86,6.58,6.23]
plt.figure(figsize=(20,8),dpi=80)
x=range(len(a))
plt.bar(x,b,width=0.2,color="orange") #Draw a vertical bar chart
plt.xticks(x,a,rotation=270) #Label each bar with its film title, rotated to avoid overlap
plt.show()
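
Bars are easier to compare when each one carries its value. A minimal sketch, using the first five values of b and short placeholder names instead of the film titles (the placeholders are assumptions for brevity):

```python
import matplotlib.pyplot as plt

# First five values of b from the example; placeholder names stand in for the titles.
names = ["film 1", "film 2", "film 3", "film 4", "film 5"]
values = [56.01, 26.94, 17.53, 16.49, 15.45]

fig, ax = plt.subplots(figsize=(8, 4), dpi=80)
bars = ax.bar(range(len(values)), values, width=0.5, color="orange")
ax.set_xticks(range(len(values)))
ax.set_xticklabels(names)

# Write each value just above its bar, centered horizontally.
for bar, value in zip(bars, values):
    ax.text(
        bar.get_x() + bar.get_width() / 2,  # horizontal center of the bar
        bar.get_height(),                   # top of the bar
        "{:.2f}".format(value),
        ha="center", va="bottom",
    )
```

Call plt.show() afterwards to display the figure.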

Horizontal bar chart

import matplotlib as mpl
from matplotlib import pyplot as plt
mpl.rcParams["font.sans-serif"]=["SimHei"]
mpl.rcParams["axes.unicode_minus"]=False
a=["Warwolf 2","Speed and passion 8","Kung Fu Yoga","Journey to the West subdues demons chapter","Transformers 5: The Last Knight","Wrestle, Dad","Pirates of the Caribbean 5: dead without proof","King Kong: Skeleton Island","Extreme agent: ultimate return","Biochemical crisis 6: final chapter","ride the wind and waves","Stealing daddy 3","Outwit Weihu mountain","Make trouble in Tianzhu","Wolverine 3: battle to death","Spider Man: Hero returns","Wukong biography","Galaxy guard 2","Love saint","The Mummy ",]
b=[56.01,26.94,17.53,16.49,15.45,12.96,11.8,11.61,11.28,11.12,10.49,10.3,8.75,7.55,7.32,6.99,6.88,6.86,6.58,6.23]
plt.barh(range(len(a)),b,height=0.2) #Draw a horizontal bar chart
plt.yticks(range(len(a)),a) #Label each bar with its film title
#No legend here: no plotted element has a label, so plt.legend() would have nothing to show
plt.show()
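
In a horizontal bar chart the first item is drawn at the bottom, so a list sorted from largest to smallest ends up upside down. A small sketch of flipping the axis, again with placeholder names and the first five values of b:

```python
import matplotlib.pyplot as plt

names = ["film 1", "film 2", "film 3", "film 4", "film 5"]
values = [56.01, 26.94, 17.53, 16.49, 15.45]

fig, ax = plt.subplots(figsize=(8, 4), dpi=80)
ax.barh(range(len(values)), values, height=0.5)
ax.set_yticks(range(len(values)))
ax.set_yticklabels(names)

# barh puts index 0 at the bottom; invert the y axis so the first item is on top.
ax.invert_yaxis()
```

Call plt.show() afterwards to display the figure.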

Example

Suppose you know the box office of the films in list a on September 14, 2017 (b_14), September 15, 2017 (b_15), and September 16, 2017 (b_16). How can you present the data so that each film's box office, and how it compares with the other films, is easy to see?

a = ["the rise of the scarlet ball 3: the ultimate battle", "Dunkirk", "spider man: the return of heroes", "warwolf 2"]

b_16 = [15746,312,4497,319]

b_15 = [12357,156,2045,168]

b_14 = [2358,399,2358,362]

import matplotlib as mpl
from matplotlib import pyplot as plt
mpl.rcParams["font.sans-serif"]=["SimHei"]
mpl.rcParams["axes.unicode_minus"]=False
a = ["The rise of the scarlet ball 3: the ultimate battle","Dunkirk","Spider Man: Hero returns","Warwolf 2"]

b_16 = [15746,312,4497,319]
b_15 = [12357,156,2045,168]
b_14 = [2358,399,2358,362]
bar_width=0.1
x_14=list(range(len(a))) #One group per film, starting at 0, 1, 2, 3
x_15=[i+bar_width for i in x_14] #Shift each later day right by one bar width
x_16=[i+bar_width*2 for i in x_14]
plt.figure(figsize=(20,8),dpi=80)
plt.bar(x_14,b_14,width=bar_width,label="September 14")
plt.bar(x_15,b_15,width=bar_width,label="September 15")
plt.bar(x_16,b_16,width=bar_width,label="September 16")
plt.xticks(x_15,a) #One label per film, centered under the middle bar of its group
plt.legend()
plt.grid()
plt.xlabel("Film")
plt.ylabel("Box office")
plt.title("Beijing cinema statistics")
plt.show()

Each film appears once in a, and its tick label sits under the middle (September 15) bar of its group.
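
One way to place the bars and labels of a grouped bar chart is to compute the positions separately from the plotting. The helper below is a sketch written for this example (grouped_positions is a hypothetical name): each series is shifted right by one bar width, and each tick sits at the center of its group.

```python
def grouped_positions(n_groups, n_series, bar_width):
    """Return (series_positions, tick_positions) for a grouped bar chart.

    series_positions[s][g] is the x center of series s in group g;
    tick_positions[g] is the center of group g, where its label belongs.
    """
    series_positions = [
        [g + s * bar_width for g in range(n_groups)]
        for s in range(n_series)
    ]
    half_span = (n_series - 1) * bar_width / 2
    tick_positions = [g + half_span for g in range(n_groups)]
    return series_positions, tick_positions


series, ticks = grouped_positions(n_groups=4, n_series=3, bar_width=0.2)
print(series[0])  # the first series sits at 0, 1, 2, 3
print(ticks)      # one tick per group, under the middle bar
```

With four films and three days, series[0], series[1], and series[2] would be passed to three plt.bar calls, and ticks together with the four film names to plt.xticks.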

VI. Drawing a histogram

A histogram counts how many values fall into each interval. Use it only for raw data that has not already been counted.

Suppose you have the durations of 250 movies (in list a) and want to see how they are distributed (for example, the number and frequency of movies that run between 100 and 120 minutes). How should you present the data?

import matplotlib as mpl
from matplotlib import pyplot as plt
mpl.rcParams["font.sans-serif"]=["SimHei"]
mpl.rcParams["axes.unicode_minus"]=False
a=[131,  98, 125, 131, 124, 139, 131, 117, 128, 108, 135, 138, 131, 102, 107, 114, 119, 128, 121, 142, 127, 130, 124, 101, 110, 116, 117, 110, 128, 128, 115,  99, 136, 126, 134,  95, 138, 117, 111,78, 132, 124, 113, 150, 110, 117,  86,  95, 144, 105, 126, 130,126, 130, 126, 116, 123, 106, 112, 138, 123,  86, 101,  99, 136,123, 117, 119, 105, 137, 123, 128, 125, 104, 109, 134, 125, 127,105, 120, 107, 129, 116, 108, 132, 103, 136, 118, 102, 120, 114,105, 115, 132, 145, 119, 121, 112, 139, 125, 138, 109, 132, 134,156, 106, 117, 127, 144, 139, 139, 119, 140,  83, 110, 102,123,107, 143, 115, 136, 118, 139, 123, 112, 118, 125, 109, 119, 133,112, 114, 122, 109, 106, 123, 116, 131, 127, 115, 118, 112, 135,115, 146, 137, 116, 103, 144,  83, 123, 111, 110, 111, 100, 154,136, 100, 118, 119, 133, 134, 106, 129, 126, 110, 111, 109, 141,120, 117, 106, 149, 122, 122, 110, 118, 127, 121, 114, 125, 126,114, 140, 103, 130, 141, 117, 106, 114, 121, 114, 133, 137,  92,121, 112, 146,  97, 137, 105,  98, 117, 112,  81,  97, 139, 113,134, 106, 144, 110, 137, 137, 111, 104, 117, 100, 111, 101, 110,105, 129, 137, 112, 120, 113, 133, 112,  83,  94, 146, 133, 101,131, 116, 111,  84, 137, 115, 122, 106, 144, 109, 123, 116, 111,111, 133, 150]
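plt.figure(figsize=(20,8),dpi=80)
d=3 #Bin width in minutes
num_bins=(max(a)-min(a))//d #Number of bins; the range should divide evenly by d so the ticks line up with the bin edges
plt.hist(a,num_bins,density=True) #density=True plots frequency density instead of counts; it replaces the older normed argument, which newer matplotlib versions no longer accept
plt.xticks(range(min(a),max(a)+d,d)) #Put a tick on every bin edge
plt.grid()
plt.show()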
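
To make clear what a histogram computes, the sketch below counts values per bin in plain Python and then converts the counts to densities. It uses the first ten durations from list a and assumed bin edges of 90 to 140 in steps of 10; bin_counts is a helper written for this example.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each of n_bins equal-width bins."""
    counts = [0] * n_bins
    last_edge = start + width * n_bins
    for v in values:
        if v < start or v > last_edge:
            continue  # outside all bins
        # Each bin includes its left edge; the last bin also includes its right edge.
        index = min((v - start) // width, n_bins - 1)
        counts[index] += 1
    return counts


sample = [131, 98, 125, 131, 124, 139, 131, 117, 128, 108]
counts = bin_counts(sample, start=90, width=10, n_bins=5)
print(counts)  # [1, 1, 1, 3, 4]

# Density divides each count by (number of values * bin width),
# so the bar areas add up to 1 (up to floating-point rounding).
total = len(sample)
density = [c / (total * 10) for c in counts]
print(sum(d * 10 for d in density))
```

The density values are therefore not percentages; multiplying each by the bin width gives the fraction of values in that bin.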

So here comes the question...

In the United States, the 2004 census found that 124 million people work away from home. A sample was grouped by the time needed to travel from home to work, giving the data below (the counts are in quantity). Can these data be drawn as a histogram?

Strictly speaking, no: these data have already been counted into intervals, and a histogram is drawn from raw, uncounted values. A bar chart can be used in place of a histogram.

interval = [0,5,10,15,20,25,30,35,40,45,60,90]

width = [5,5,5,5,5,5,5,5,5,15,30,60]

quantity = [836,2737,3723,3926,3596,1438,3273,642,824,613,215,47]

import matplotlib as mpl
from matplotlib import pyplot as plt
mpl.rcParams["font.sans-serif"]=["SimHei"]
mpl.rcParams["axes.unicode_minus"]=False
interval = [0,5,10,15,20,25,30,35,40,45,60,90] #Left edge of each interval
width = [5,5,5,5,5,5,5,5,5,15,30,60] #Width of each interval
quantity = [836,2737,3723,3926,3596,1438,3273,642,824,613,215,47] #Count in each interval
plt.figure(figsize=(20,9),dpi=80)
plt.bar(range(len(quantity)),quantity,width=1) #width=1 makes adjacent bars touch, like a histogram
x=[i-0.5 for i in range(13)] #Put the ticks on the bar edges rather than the bar centers
plt.xticks(x,interval+[150]) #Label each edge with its interval boundary; 150 closes the last interval
plt.show()
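
The bar chart above draws every interval with the same width, so the width list goes unused. As an alternative sketch, plt.bar can draw each bar at its true position and width. The axis labels, including the unit of minutes, are assumptions made for this example:

```python
import matplotlib.pyplot as plt

interval = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 60, 90]
width = [5, 5, 5, 5, 5, 5, 5, 5, 5, 15, 30, 60]
quantity = [836, 2737, 3723, 3926, 3596, 1438, 3273, 642, 824, 613, 215, 47]

fig, ax = plt.subplots(figsize=(10, 4.5), dpi=80)

# align="edge" starts each bar at its left boundary; width=width gives each bar its real span.
bars = ax.bar(interval, quantity, width=width, align="edge", edgecolor="black")

ax.set_xticks(interval + [150])          # a tick on every interval boundary
ax.set_xlabel("Commute time (minutes)")  # assumed unit
ax.set_ylabel("Count")
```

Call plt.show() afterwards to display the figure. Because the intervals differ in width, dividing each count by its width before plotting would make the bar areas, rather than the heights, proportional to the counts.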

Posted by stefharley on Fri, 01 Oct 2021 16:16:54 -0700