Task 1: Data Collection
The web page http://pm25.in/beijing contains air quality monitoring data from 12 monitoring points in Beijing. Write a program that scrapes the monitoring point, AQI, and air quality index category from the page (sample copies of the page are saved in the src1 directory under the source material folder), as shown in Table H2-1-1. Save the scraped data to a file named bj20200721.csv.
Table H2-1-1 Air quality monitoring data for Beijing
Monitoring point      AQI      Air quality index category
Longevity Palace      57       Good
......                ......   ......
1) Use PyCharm to create the project task030101 under the examinee folder, and create a Python file named task1.py in the project. Copy the src1 directory from the source material folder into the project task030101.
2) Analyze the problem and write comments according to the actual project requirements.
3) Write a program that correctly scrapes the data from the web page; when run, it should display "crawled".
4) Write a program that saves the scraped data under the project task030101 in a file named bj20200721.csv, using the tab character as the separator; when run, it should display "saved".
5) Save a screenshot of the running result in the examinee folder as task1.jpg.
answer:
import bs4
from bs4 import BeautifulSoup
from urllib import request
import pandas


# Parse the web page content.
# ulist: list that collects the parsed rows; html: the page content passed in
def jiexi(ulist, html):
    # Create a BeautifulSoup object that uses the built-in html.parser
    soup = BeautifulSoup(html, 'html.parser')
    # Iterate over the children of <tbody>, i.e. the <tr> rows
    for tr in soup.find('tbody').children:
        # Skip children that are not tags (for example whitespace strings)
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')  # All <td> cells in this row
            # Keep the first three cells: monitoring point, AQI, category
            ulist.append([tds[0].string, tds[1].string, tds[2].string])


# Turn the parsed rows into a two-dimensional table and write it to a file.
def tofile(ulist):
    data = pandas.DataFrame(ulist)  # Build the two-dimensional table
    data.columns = ['Monitoring point', 'AQI', 'Air quality index category']  # Set the column names
    # Write a tab-separated csv file with a header row and without the index
    data.to_csv('./bj20200721.csv', header=True, sep='\t', index=False)


# Main function
def main():
    uinfo = []
    url = 'http://pm25.in/beijing'  # Address of the web page
    # Open the page, read it, and decode it as UTF-8
    html = request.urlopen(url).read().decode('utf-8')
    jiexi(uinfo, html)  # Call the jiexi function
    print("Crawled")
    tofile(uinfo)  # Call the tofile function
    print('Saved')


# Program entry point
if __name__ == '__main__':
    main()
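Because the live page may not always be reachable, the parsing and saving steps can also be checked offline. The following is a minimal sketch and not the answer above: it swaps in the standard library (html.parser and csv) for BeautifulSoup and pandas, and the inline SAMPLE_HTML is a made-up stand-in for the saved page in src1, whose real markup may differ. The names RowParser and rows_to_tsv are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical stand-in for the saved page; the real markup may differ.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Monitoring point</th><th>AQI</th><th>Category</th></tr></thead>
  <tbody>
    <tr><td>Station A</td><td>57</td><td>Good</td><td>PM2.5</td></tr>
    <tr><td>Station B</td><td>42</td><td>Excellent</td><td>PM10</td></tr>
  </tbody>
</table>
"""


class RowParser(HTMLParser):
    """Collect the text of the first three <td> cells of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None   # cells of the row being read, or None outside a row
        self._in_td = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._in_td = True
            self._cells.append("")

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they leave no cells here.
            if self._cells:
                self.rows.append(self._cells[:3])
            self._cells = None


def rows_to_tsv(rows):
    """Return the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["Monitoring point", "AQI", "Air quality index category"])
    writer.writerows(rows)
    return buffer.getvalue()


parser = RowParser()
parser.feed(SAMPLE_HTML)
tsv_text = rows_to_tsv(parser.rows)
print(tsv_text)
```

To try this against the real sample, the inline string could be replaced with the contents of the saved file, adjusting the parser if the page structure differs.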
Task 2: Data Processing and Analysis
Use the pandas library to read the bj20200721.csv file (saved in the task030102\src2 directory under the source material folder) and complete the data processing and analysis.
1) Copy the task030102 project from the source material folder to the examinee folder, open the project task030102 with PyCharm, and open task2.py.
2) Analyze the problem and write comments according to the actual project requirements.
3) Complete the definition of the readFile(filename) function, which uses pandas to read the data in the bj20200721.csv file and returns a DataFrame.
4) Complete the definition of the printInfo(filename) function, which displays the contents of the bj20200721.csv file, as shown in Figure H2-1-1.
Figure H2-1-1 Contents of the bj20200721.csv file
5) Complete the definition of the insertDate(filename) function, which adds the date 2020-07-21 as a new column named "monitoring date" and saves the result as an Excel file named bj20200721.xlsx under the project task030102.
6) Complete the definition of the aqi(filename) function, which displays the records whose air quality index category is excellent.
7) Import the module and call the user-defined readFile, printInfo, insertDate and aqi functions to carry out the steps above, taking care to pass the correct parameters.
8) Save a screenshot of the running result in the examinee folder as task2.jpg.
answer:
import pandas  # pandas provides functions and methods for fast, convenient data processing


# Read data
def readFile(filename):
    file = pandas.read_csv(filename, sep='\t', encoding='utf-8')  # Read the tab-separated csv file
    return file


# Show contents
def printInfo(filename):
    file = readFile(filename)  # Read data
    print(file)


# Insert data
def insertDate(filename):
    file = readFile(filename)
    # Insert the monitoring date required by the task as a new column
    file['Monitoring date'] = pandas.to_datetime('2020-07-21')
    file.to_excel('./bj20200721.xlsx')  # Write an Excel file


# Filter data
def aqi(filename):
    file = readFile(filename)
    # Select the rows whose air quality index category is 'excellent'
    print(file.loc[file['Air quality index category'] == 'excellent'])


if __name__ == '__main__':
    filename = './bj20200721.csv'
    readFile(filename)
    printInfo(filename)
    print('----------------------------------------')
    insertDate(filename)
    print('Insert time succeeded')
    print('-----------------------------------')
    aqi(filename)
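The read, insert and filter steps can be tried without the exam files. This is a minimal sketch that assumes pandas is installed and uses made-up rows held in an in-memory tab-separated string; the station names and values are illustrative only. Writing the .xlsx file is left out here because to_excel additionally needs an Excel engine such as openpyxl.

```python
import io

import pandas as pd

# Made-up tab-separated data standing in for bj20200721.csv.
TSV = (
    "Monitoring point\tAQI\tAir quality index category\n"
    "Station A\t57\tgood\n"
    "Station B\t42\texcellent\n"
    "Station C\t35\texcellent\n"
)

# Read the tab-separated text into a DataFrame.
df = pd.read_csv(io.StringIO(TSV), sep="\t")

# Add the same date to every row as a new column.
df["Monitoring date"] = pd.to_datetime("2020-07-21")

# Keep only the rows whose category is 'excellent'.
excellent = df.loc[df["Air quality index category"] == "excellent"]
print(excellent)
```

Assigning a single timestamp to a new column broadcasts it to every row, which is why no loop is needed to fill in the date.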
Task 3: Data Visualization
Use Python to read the data in the bj20200721.csv file (saved in the src3 directory under the source material folder), and use matplotlib to draw a bar chart showing the four monitoring points with the lowest AQI.
1) Use PyCharm to create the project task030103 under the examinee folder, and create a Python file named task3.py in the project. Copy the src3 directory from the source material folder into the project task030103.
2) Analyze the problem and write comments according to the actual project requirements.
3) Write a program that uses pandas to read the data in the bj20200721.csv file and matplotlib to draw a bar chart showing the four monitoring points with the lowest AQI, as shown in Figure H2-1-2. Save the generated image under the project task030103 as aqi.png.
answer:
import pandas as pd
import matplotlib.pyplot as plt

# Configure matplotlib to display Chinese characters and minus signs
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False
# Read the file
data = pd.read_csv('./src3/bj20200721.csv', sep='\t', encoding='utf-8')
# Set the canvas size and resolution
plt.figure(figsize=(8, 7), dpi=100)
# Set the title
plt.title('20200721 Beijing AQI Minimum 4 monitoring points')
# Sort by the AQI column in ascending order and keep the first four rows
file = data.sort_values(by='AQI', ascending=True)[:4]
print(file)
# Draw the bar chart: x values, y values, bar width
plt.bar(file['Monitoring point'], file['AQI'], width=0.8)
plt.xlabel('Monitoring point')  # Add the x-axis label
plt.ylabel('AQI')  # Add the y-axis label
plt.savefig('./aqi.png')  # Save the figure locally
plt.show()
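Selecting the four lowest values can also be written with DataFrame.nsmallest, which is an alternative to sorting and slicing rather than the method used in the answer. This is a minimal sketch that assumes pandas and matplotlib are installed. It uses made-up data, selects the non-interactive Agg backend so that it runs without a display, and writes to a temporary directory instead of the project folder.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # Non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data standing in for bj20200721.csv.
data = pd.DataFrame({
    "Monitoring point": ["A", "B", "C", "D", "E", "F"],
    "AQI": [57, 42, 35, 61, 48, 39],
})

# Pick the four rows with the smallest AQI, ordered from lowest to highest.
lowest = data.nsmallest(4, "AQI")

# Draw one bar per selected monitoring point.
fig, ax = plt.subplots(figsize=(8, 7), dpi=100)
ax.bar(lowest["Monitoring point"], lowest["AQI"], width=0.8)
ax.set_title("Four monitoring points with the lowest AQI")
ax.set_xlabel("Monitoring point")
ax.set_ylabel("AQI")

# Save into a temporary directory (an assumption made for this sketch).
out_path = os.path.join(tempfile.mkdtemp(), "aqi.png")
fig.savefig(out_path)
plt.close(fig)
print(lowest)
```

With the English labels used here no Chinese font setting is needed; the SimHei setting in the answer matters only when the labels contain Chinese text.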