An entry-level introduction to XPath

Keywords: Attribute Python

Brief introduction

Our project needed data, so I learned Python to write crawlers for "some East", "some Cat" and "some Ning". The first crawler framework I used was Scrapy, which selects data with XPath, so I quickly learned the basics and am recording them here. The data comes from the project and is real.

Walkthrough

Because I am also a beginner, I practiced on some anime-style content crawled from the Internet.

The corresponding source code is as follows:

<li class="js-smallCards _box" data-since="25781.481">
    <a href="/item/detail/6589006507025629453" class="db posr ovf" target="_blank" title=" Noctilucent-Night">
    <img class="cardImage"
             src="https://img5.bcyimg.com/user/792056/item/c0jm5/7e198b2677f740348b60b276990cb0f1.jpg/2X3"> </a>
    <footer class="l-clearfix">
        <a href="/u/792056" target="_blank" class="_avatar _avatar--user _avatar--xxxsm mr5 vam">
        <img src="https://user.bcyimg.com/Public/Upload/avatar/792056/512d027c1b494f9a9351668fb0b4ddd7/fat.jpg/amiddle"></a>
        <a href="/u/792056" target="_blank" class="name">
            <span class="fz12 lh18 username cut dib vam">Noctilucent-Night</span>
        </a>
        <div class="l-right">
            <i class="i-liked-gray"></i>
            <span class="like">385</span>
        </div>
    </footer>
</li>

<li class="js-smallCards _box" data-since="25781.476">
    <a href="/item/detail/6590220468408549645" class="db posr ovf" target="_blank" title=" Decline and decline">
    <img class="cardImage" src="https://img9.bcyimg.com/user/3135480/item/c0jm8/85e3bda51d0c4856bada0adc29b6e993.jpg/2X3"> </a>
    <footer class="l-clearfix">
        <a href="/u/3135480" target="_blank" class="_avatar _avatar--user _avatar--xxxsm mr5 vam">
            <img src="https://user.bcyimg.com/Public/Upload/avatar/3135480/f6342ccf044e465a8c4d84c807fef29b/fat.jpg/amiddle">
        </a>
        <a href="/u/3135480" target="_blank" class="name">
            <span class="fz12 lh18 username cut dib vam">Decline and decline</span>
        </a>
        <div class="l-right">
            <i class="i-liked-gray"></i>
            <span class="like">127</span>
        </div>
    </footer>
</li>

Simplified, the structure looks like this:

<li class="js-smallCards _box" data-since="29834.23">
    <a class="detail_url" href="" title="">
        <img class="cardImage" src=""/>
    </a>

    <footer class="l-clearfix">    
        <a class="_avatar" href=""> 
            <img src="" />
        </a>

        <a class="name">
            <span>Decline and decline</span>
        </a>

        <div class="l-right">
            <span class="like">127</span>
        </div>
    </footer>
</li>

Get relevant data

1. Get all li data in the list:

# Select every li anywhere in the document [no condition: any li matches]
response.xpath("//li")

# Select every li whose class attribute is exactly "js-smallCards _box" [match by class]
response.xpath('//li[@class="js-smallCards _box"]')

# Select every li whose id attribute is "id" [match by id]
response.xpath('//li[@id="id"]')

The results are as follows:

As you can see, the result is returned as a list (a Scrapy SelectorList), because a query can match more than one element.
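The queries above can be tried offline with a small sketch. It uses Python's standard-library xml.etree.ElementTree in place of Scrapy's response.xpath, so it runs without a crawl. ElementTree supports only a limited subset of XPath. The <ul> wrapper and the trimmed markup are assumptions based on the simplified snippet above. The point it illustrates is that @class='...' compares the whole attribute value, so giving only one of the class names matches nothing.

```python
import xml.etree.ElementTree as ET

# Assumption: the cards sit inside a <ul>; the article does not show the parent element.
SAMPLE = """
<ul>
  <li class="js-smallCards _box" data-since="25781.481">
    <a class="detail_url" href="/item/detail/6589006507025629453"/>
  </li>
  <li class="js-smallCards _box" data-since="25781.476">
    <a class="detail_url" href="/item/detail/6590220468408549645"/>
  </li>
</ul>
"""

root = ET.fromstring(SAMPLE)

# Every li below the root, no condition.
all_items = root.findall(".//li")

# The predicate compares the complete attribute string.
by_class = root.findall(".//li[@class='js-smallCards _box']")

# Only part of the class value: the comparison fails, so the list is empty.
partial = root.findall(".//li[@class='js-smallCards']")

print(len(all_items), len(by_class), len(partial))  # prints: 2 2 0
```

Each call returns a plain Python list here. In Scrapy the return value is a SelectorList, which is indexed and iterated in the same way.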

2. Get the data of the li in the list

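# Get the first li (strictly: every li that is the first li child of its parent)
response.xpath("//li[1]")

# Get the last li (the last li child of its parent)
response.xpath("//li[last()]")

# Get the li whose data-since attribute equals 25781.481
response.xpath("//li[@data-since='25781.481']")

# Get the last li together with the li whose data-since is 25781.481 (union)
response.xpath("//li[last()] | //li[@data-since='25781.481']")

# Get the data-since attribute of the first li (extract() returns a list of strings)
response.xpath("//li[1]/@data-since").extract()
# or collect every data-since value and take the first one (Python lists are 0-indexed)
response.xpath("//li/@data-since").extract()[0]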
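One detail that is easy to trip over is that XPath positions start at 1 while Python list indexes start at 0. The sketch below shows both side by side. It again uses the standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors, with an assumed <ul> wrapper around two li elements that carry the data-since values from the source above. ElementTree cannot select attributes with /@name, so the values are read with Element.get().

```python
import xml.etree.ElementTree as ET

# Assumption: a <ul> wrapper, which the article's snippet does not show.
SAMPLE = """
<ul>
  <li data-since="25781.481"/>
  <li data-since="25781.476"/>
</ul>
"""

root = ET.fromstring(SAMPLE)

# XPath predicates count from 1.
first = root.findall(".//li[1]")
last = root.findall(".//li[last()]")
by_attr = root.findall(".//li[@data-since='25781.481']")

# Python lists count from 0, so the first collected value is values[0].
values = [li.get("data-since") for li in root.findall(".//li")]

print(first[0].get("data-since"))  # 25781.481
print(last[0].get("data-since"))   # 25781.476
print(by_attr[0] is first[0])      # True: both queries found the same element
print(values[0], values[1])        # 25781.481 25781.476
```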

3. Get the child elements of li

# Get all a tags directly under the first li
response.xpath("//li[1]/a")

# Get the href attribute of the first a tag under the first li
response.xpath("//li[1]/a[1]/@href")

# Get the img tag under the first a tag under the first li
response.xpath("//li[1]/a[1]/img")

# Get the src attribute of the img tag under the first a tag under the first li
response.xpath("//li[1]/a[1]/img/@src")

# Get the span tag inside the first div of the footer of the first li
response.xpath("//li[1]/footer[1]/div[1]/span")

# Get the text of every span that is the first span child of its parent
response.xpath("//span[1]/text()")
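In practice the child queries are usually run once per li rather than only on the first one. The sketch below puts the steps together on the simplified markup, filled in with the values from the source above. It uses the standard-library xml.etree.ElementTree instead of Scrapy, and both the <ul> wrapper and the helper name parse_card are hypothetical. ElementTree does not support /@attr or text(), so attributes are read with .get() and text with .text.

```python
import xml.etree.ElementTree as ET

# Assumption: the cards sit inside a <ul>; markup follows the simplified snippet.
SAMPLE = """
<ul>
  <li class="js-smallCards _box" data-since="25781.481">
    <a class="detail_url" href="/item/detail/6589006507025629453">
      <img class="cardImage" src="https://img5.bcyimg.com/user/792056/item/c0jm5/7e198b2677f740348b60b276990cb0f1.jpg/2X3"/>
    </a>
    <footer class="l-clearfix">
      <a class="name"><span>Noctilucent-Night</span></a>
      <div class="l-right"><span class="like">385</span></div>
    </footer>
  </li>
  <li class="js-smallCards _box" data-since="25781.476">
    <a class="detail_url" href="/item/detail/6590220468408549645">
      <img class="cardImage" src="https://img9.bcyimg.com/user/3135480/item/c0jm8/85e3bda51d0c4856bada0adc29b6e993.jpg/2X3"/>
    </a>
    <footer class="l-clearfix">
      <a class="name"><span>Decline and decline</span></a>
      <div class="l-right"><span class="like">127</span></div>
    </footer>
  </li>
</ul>
"""


def parse_card(li):
    """Collect the fields of one card using paths relative to its li element."""
    link = li.find("a")  # first a that is a direct child of this li
    return {
        "href": link.get("href"),                              # like a[1]/@href
        "image": link.find("img").get("src"),                  # like a[1]/img/@src
        "user": li.find("footer/a[@class='name']/span").text,  # like .../span/text()
        "likes": int(li.find("footer/div/span[@class='like']").text),
    }


root = ET.fromstring(SAMPLE)
cards = [parse_card(li) for li in root.findall(".//li")]

for card in cards:
    print(card["user"], card["likes"])
```

In a Scrapy callback the same pattern would loop over response.xpath("//li") and call li.xpath() on each item with a relative path that starts with "./", for example li.xpath("./a[1]/@href"). A path that starts with "//" searches the whole document again, even when it is called on a single li.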

That covers the basics. I will add more as I come across it.

Posted by gdogfunk on Sun, 05 Jan 2020 02:44:03 -0800