Climbing Time of 001PyQuery Library

Keywords: Attribute encoding Python JQuery

Project requirements:

Because of the statistical analysis of mathematical modeling competition data, it is necessary to get the actual departure and landing time of all flights in an airport one day.

Code analysis:

Take the airport as an example, the time to arrive at the airport is the same.

from pyquery import PyQuery as pq

# Open the "depart.txt" file to facilitate data storage
fout = open('E:/depart.txt', 'w', encoding='utf8')

# Here the url must write a link to the full https://flights.ctrip.com/actualtime/depart-bjs.p1/.
#There were 59 pages of data on that day.
for i in range(1,59):
    cpage = 'https://flights.ctrip.com/actualtime/depart-bjs.p' + str(i) + '/'
    # Get the page structure of the current link to d
    d = pq(url=cpage)
    for i in range(20):                            #Each page has 20 rows of data
        datarow = d('.bg_0').eq(i).html() # Obtain the structure of line i from the. bg_0 attribute
        ddata=pq(datarow)
        #The time of arrival is the fourth column, and the replace() function is formatted for the file.
        fout.write(ddata('td').eq(3).text().replace("\n"," ")) # Write file content
        fout.write("\n")
        
# Close file
fout.close()

Test results:

Knowledge expansion

Introduction to PyQuery Library

PyQuery library is also a very powerful and flexible web page parsing library. PyQuery is a strict implementation of Python that imitates jQuery.
Official documentation: https://pythonhosted.org/pyquery/index.html#

PyQuery library use

Initialization (can load an HTML string, or an HTML file, or a url address)

from pyquery import PyQuery as pq
#String initialization
doc = pq("<html><title>hello</title></html>")
#URL initialization, where the URL must be fully written
doc = pq(url="http://www.baidu.com",encoding='utf-8')
#File initialization
d = pq(filename='path_to_html_file')

html() and text() - Get the corresponding HTML block or text block

p = pq("<head><title>hello</title></head>")
p('head').html()  # Return < title > Hello </title >
p('head').text()  # Return to hello

Get elements from HTML Tags

d = pq('<div><p>test 1</p><p>test 2</p></div>')   
d('p')    # Return [<p>, <p>]
print d('p')  # Return < p > Test 1 </p > < p > test 2</p>
print d('p').html()  # Return to test 1
#Note: When more than one element is retrieved, the html() and text() methods only return the corresponding content block of the first element.

eq(index) - Get the specified element based on the given index number

#For example, if you want to get the content of the second p tag, you can:
print d('p').eq(1).html()   # Return to test 2

Get elements directly by class name and id name

d = pq("<div><p id='1'>test 1</p><p class='2'>test 2</p></div>")
d('#1').html() # Return to test 1
d('.2').html() # Return to test 2

Getting and Modifying Attribute Values

#Get attribute values:
d = pq("<p id='my_id'><a href='http://hello.com'>hello</a></p>")
d('a').attr('href')  # Return to http://hello.com
d('p').attr('id')  # Return to my_id

#Modify attribute values:
d('a').attr('href', 'http://baidu.com')

Returns all subsequent element blocks: nextAll(selector=None)

d = pq("<p id='1'>hello</p><p id='2'>world</p><img scr='' />")
d('p:first').nextAll()   # Return[<p#2>, <img>]
d('p:last').nextAll()   # Return [<img>]

Get child and parent elements

# children(selector=None) -- Getting child elements
d = pq("<span><p id='1'>hello</p><p id='2'>world</p></span>")
d.children()   # Return[<p#1>, <p#2>]
d.children('#2')   # Return[<p#2>]

#parents(selector=None) - Get the parent element
d = pq("<span><p id='1'>hello</p><p id='2'>world</p></span>")
d('p').parents()    # Return [<span>]
d('#1').parents('span')   # Return [<span>]
d('#1').parents('p')   # Return []

Posted by fmpros on Mon, 30 Sep 2019 10:44:26 -0700

Programmer Group

Climbing Time of 001PyQuery Library

Project requirements:

Code analysis:

Knowledge expansion

Hot Keywords