Acquisition and analysis of air quality monitoring data in Beijing

Keywords: Python crawler

Task 1: data collection

The web page http://pm25.in/beijing contains air quality monitoring data for 12 monitoring points in Beijing. Write a program that captures the monitoring point, AQI, and air quality index category from the page (sample copies of the page are saved in the src1 directory under the source material folder), as shown in Table H2-1-1. Save the captured data to a file named bj20200721.csv.
Table H2-1-1 Air quality monitoring data in Beijing
Monitoring point    AQI       Air quality index category
Longevity Palace    57        Liang (Good)
......              ......    ......
1) Use PyCharm to create the project task030101 under the examinee folder, and create a Python file named task1.py in the project. Copy the src1 directory from the source material folder into the project task030101.
2) Analyze the problem and add code comments as the project requires.
3) Write a program that correctly crawls the data from the web page; when run, it should display "crawled".
4) Write a program that saves the captured data in the project task030101 as bj20200721.csv, using the tab character as the separator; when run, it should display "saved".
5) Save a screenshot of the run result in the examinee folder as task1.jpg.

answer:

import bs4
from bs4 import BeautifulSoup
from urllib import request
import pandas
## Parse the web page content
# ulist: list that collects the parsed rows; html: the page source as a string
def jiexi(ulist,html):
    soup = BeautifulSoup(html, 'html.parser') ## Build the parse tree with Python's built-in html.parser
    for tr in soup.find('tbody').children: ## Iterate over the direct children of the first tbody tag
        if isinstance(tr,bs4.element.Tag): ## Skip whitespace text nodes; keep only real tags (the tr rows)
            tds = tr('td')  ## Shorthand for tr.find_all('td'): every cell in this row
            ulist.append([tds[0].string,tds[1].string,tds[2].string]) ## Keep the first three cells: monitoring point, AQI, category

## Arrange the data into a two-dimensional table and write it to a file
def tofile(ulist):
    data = pandas.DataFrame(ulist)  ## Build a DataFrame (a two-dimensional table) from the rows
    data.columns =['Monitoring point','AQI','Air quality index category']  ## Set the column names
    data.to_csv('./bj20200721.csv',header=True,sep='\t',index=False)  ## Write a tab-separated file without the index column
## Main function
def main():
    uinfo=[]
    url='http://pm25.in/beijing'  ## Address of the page to crawl (no trailing space, which urlopen rejects)
    html = request.urlopen(url).read().decode('utf-8') ##Open the web page, read the web page, and set the encoding format
    jiexi(uinfo,html) ## Calling the jiexi function
    print("Crawled")
    tofile(uinfo)   ##Call the tofile function
    print('Saved')
# Program entry point
if __name__ == '__main__':
    main()
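
The answer above depends on the live site and on BeautifulSoup. As a rough, dependency-free sketch of the same two steps (pull the first three cells of each table row, then write them tab-separated), the following uses only the standard library. It is not the answer's method: the TableRows class and the SAMPLE_HTML markup are made up for illustration, and the real page layout is assumed to be a plain table with td cells.

```python
import csv
import io
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Hypothetical helper: collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:   # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None

# Made-up markup standing in for the saved sample page (assumption).
SAMPLE_HTML = """
<table>
  <thead><tr><th>Monitoring point</th><th>AQI</th><th>Category</th><th>PM2.5</th></tr></thead>
  <tbody>
    <tr><td>Station A</td><td>57</td><td>good</td><td>30</td></tr>
    <tr><td>Station B</td><td>42</td><td>excellent</td><td>21</td></tr>
  </tbody>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
rows = [r[:3] for r in parser.rows]   # keep the first three columns only

# Write tab-separated text; an in-memory buffer stands in for the .csv file.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t', lineterminator='\n')
writer.writerow(['Monitoring point', 'AQI', 'Air quality index category'])
writer.writerows(rows)
print(buf.getvalue())
```

To try it on a saved sample page instead, read that file's text and pass it to parser.feed in place of SAMPLE_HTML.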

Task 2: data processing and analysis

Use Python's pandas to read the bj20200721.csv file (the bj20200721.csv file is saved in the task030102\src2 directory under the source material folder) to complete data processing and analysis.
1) Copy the task030102 project from the source material folder to the examinee folder, open the project task030102 in PyCharm, and open task2.py.
2) Analyze the problem and add code comments as the project requires.
3) Complete the definition of the readFile(filename) function, which uses pandas to read the data in the bj20200721.csv file and returns a DataFrame.
4) Complete the definition of the printInfo(filename) function, which displays the contents of the bj20200721.csv file, as shown in Figure H2-1-1.

Figure H2-1-1 shows the contents of bj20200721.csv file
5) Complete the definition of the insertDate(filename) function, which adds the date 2020-07-21 as a new column named "monitoring date" and saves the result as an Excel file, bj20200721.xlsx, in the project task030102.
6) Complete the definition of the aqi(filename) function, which displays the records whose air quality index category is "excellent".
7) Write a program that imports the module and calls the user-defined readFile, printInfo, insertDate, and aqi functions to carry out the steps above; take care to pass the correct parameters.
8) Save the screenshot of the running result in the examinee folder and name it with task2.jpg file.

answer:

import pandas  ## pandas provides functions and methods that enable us to process data quickly and conveniently

# Read data
def readFile(filename):
    file = pandas.read_csv(filename, sep='\t', encoding='utf-8') ##Read csv file
    return file


# show contents
def printInfo(filename):
    file = readFile(filename)  ## Read data
    print(file)

# insert data
def insertDate(filename):
    file = readFile(filename)
    file['Monitoring date'] = pandas.to_datetime('2020-07-21') ## Insert the date column required by the task
    file.to_excel('./bj20200721.xlsx') ## Write to an Excel file

# Filter data
def aqi(filename):
    file = readFile(filename)
    print(file.loc[file['Air quality index category'] == 'excellent'])  ## Select the data with excellent air quality index category


if __name__ == '__main__':
    filename = './bj20200721.csv'
    readFile(filename)
    printInfo(filename)
    print('----------------------------------------')
    insertDate(filename)
    print('Insert time succeeded')
    print('-----------------------------------')
    aqi(filename)
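
Each function in the answer re-reads the file from disk. A possible variation, sketched here under assumptions, reads the data once and passes the DataFrame around. The helper names add_date and excellent_only are hypothetical, the rows are made up for illustration, and the 'excellent' label follows the translated answer (the real file may use a different label).

```python
import io
import pandas as pd

# Made-up rows for illustration only; not real monitoring data.
SAMPLE = (
    "Monitoring point\tAQI\tAir quality index category\n"
    "Station A\t57\tgood\n"
    "Station B\t42\texcellent\n"
    "Station C\t38\texcellent\n"
)

def add_date(df, date='2020-07-21'):
    """Return a copy of df with a constant 'Monitoring date' column."""
    out = df.copy()
    out['Monitoring date'] = pd.to_datetime(date)
    return out

def excellent_only(df):
    """Return only the rows whose category is 'excellent'."""
    return df[df['Air quality index category'] == 'excellent']

# StringIO stands in for the tab-separated bj20200721.csv file.
df = pd.read_csv(io.StringIO(SAMPLE), sep='\t')
dated = add_date(df)
best = excellent_only(dated)
print(best)
```

Working on a copy in add_date leaves the original DataFrame unchanged, so the same data can still be printed or filtered without the extra column.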

Task 3: Data Visualization

Use Python to read the data in the bj20200721.csv file (saved in the src3 directory under the source material folder), and use matplotlib to draw a bar chart of the four monitoring points with the lowest AQI.
1) Use PyCharm to create the project task030103 under the examinee folder, and create a Python file named task3.py in the project. Copy the src3 directory from the source material folder into the project task030103.
2) Analyze the problem and add code comments as the project requires.
3) Write a program that uses pandas to read the data in the bj20200721.csv file and uses matplotlib to draw a bar chart of the four monitoring points with the lowest AQI, as shown in Figure H2-1-2. Save the generated image in the project task030103 as aqi.png.
answer:

import pandas as pd
import matplotlib.pyplot as plt
# Use a font that can render Chinese characters, and keep the minus sign displaying correctly
plt.rcParams['font.sans-serif'] = 'SimHei'
plt.rcParams['axes.unicode_minus'] = False
# read file
data = pd.read_csv('./src3/bj20200721.csv', sep='\t', encoding='utf-8')
# Set the figure size and resolution
plt.figure(figsize=(8, 7), dpi=100)
# Set the title
plt.title('20200721 Beijing AQI Minimum 4 monitoring points')
# Sort the rows by AQI in ascending order and keep the first four
file = data.sort_values(by='AQI', ascending=True)[:4]
print(file)
# Draw the bar chart: monitoring points on the x-axis, AQI on the y-axis, bar width 0.8
plt.bar(file['Monitoring point'], file['AQI'], width=0.8)
plt.xlabel('Monitoring point')  # Add x-axis name
plt.ylabel('AQI')    # Add y axis name
plt.savefig('./aqi.png')  # Save the figure as aqi.png, as the task requires
plt.show()
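
Sorting by AQI only gives the right answer when the column is numeric. If the crawled file ever held a non-numeric placeholder for a station with no reading (an assumption, not something the task states), pandas would load the column as text. The sketch below shows one way to guard against that before picking the four lowest values; the lowest_aqi helper and the sample rows are made up for illustration.

```python
import io
import pandas as pd

# Made-up rows for illustration only; "-" marks a hypothetical missing reading.
SAMPLE = (
    "Monitoring point\tAQI\tAir quality index category\n"
    "Station A\t57\tgood\n"
    "Station B\t42\texcellent\n"
    "Station C\t-\tunknown\n"
    "Station D\t101\tlight pollution\n"
    "Station E\t38\texcellent\n"
    "Station F\t63\tgood\n"
)

def lowest_aqi(df, n=4):
    """Return the n rows with the smallest numeric AQI."""
    out = df.copy()
    # Non-numeric entries become NaN instead of raising an error.
    out['AQI'] = pd.to_numeric(out['AQI'], errors='coerce')
    # Drop rows without a usable reading, then take the n smallest.
    out = out.dropna(subset=['AQI'])
    return out.nsmallest(n, 'AQI')

df = pd.read_csv(io.StringIO(SAMPLE), sep='\t')
lowest = lowest_aqi(df)
print(lowest)
```

The resulting frame can be passed to plt.bar in the same way as the sorted slice in the answer.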

Posted by SpectralDesign.Net on Thu, 16 Sep 2021 10:35:35 -0700