Puppeteer project practice

Reprint: https://zhuanlan.zhihu.com/p/76237595

Case 1: screenshot

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //Set visible area size
    await page.setViewport({width: 1920, height: 800});
    await page.goto('https://youdata.163.com');
    //Screenshot of the entire page
    await page.screenshot({
        path: './files/capture.png',  //Picture saving path
        type: 'png',
        fullPage: true //Capture the full scrollable page, not just the viewport
        // clip: {x: 0, y: 0, width: 1920, height: 800}
    });
    //Screenshot of an element of the page
    let [element] = await page.$x('/html/body/section[4]/div/div[2]');
    await element.screenshot({
        path: './files/element.png'
    });
    await page.close();
    await browser.close();
})();

How do we get an element in the page?
page.$('#uniqueId'): get the first element matching a selector
page.$$('div'): get all elements matching a selector
page.$x('//img'): get all elements matching an XPath expression
page.waitForXPath('//img'): wait for an element matching an XPath expression to appear
page.waitForSelector('#uniqueId'): wait for an element matching a selector to appear

Case 2: simulate user login

(async () => {
    const browser = await puppeteer.launch({
        slowMo: 100,    //Slow down
        headless: false,
        defaultViewport: {width: 1440, height: 780},
        ignoreHTTPSErrors: false, //Set to true to ignore HTTPS errors
        args: ['--start-fullscreen'] //Open page in full screen
    });
    const page = await browser.newPage();
    await page.goto('https://demo.youdata.com');
    //Enter the account and password
    const uniqueIdElement = await page.$('#uniqueId');
    await uniqueIdElement.type('admin@admin.com', {delay: 20});
    const passwordElement = await page.$('#password');
    await passwordElement.type('123456', {delay: 20});
    //Click OK to log in
    let okButtonElement = await page.$('#btn-ok');
    //Clicking the button triggers a navigation, so wait for page.waitForNavigation() to resolve to know that the jump has completed
    await Promise.all([
        okButtonElement.click(),
        page.waitForNavigation()  
    ]);
    console.log('admin Login succeeded');
    await page.close();
    await browser.close();
})();

What functions does ElementHandle provide for operating on elements?
elementHandle.click(): click an element
elementHandle.tap(): simulate a finger tap on an element
elementHandle.focus(): focus on an element
elementHandle.hover(): hover the mouse over an element
elementHandle.type('hello'): enter text in an input box

Case 3: request interception

Request interception is necessary in some scenarios, for example blocking unnecessary requests to improve performance. We can listen for the request event on Page and intercept requests, provided that interception has been enabled with page.setRequestInterception(true).

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const blockTypes = new Set(['image', 'media', 'font']);
    await page.setRequestInterception(true); //Turn on request interception
    page.on('request', request => {
        const type = request.resourceType();
        const shouldBlock = blockTypes.has(type);
        if(shouldBlock){
            //Block requests directly
            return request.abort();
        }else{
            //Rewrite request
            return request.continue({
                //You can override the url, method, postData, and headers
                headers: Object.assign({}, request.headers(), {
                    'puppeteer-test': 'true'
                })
            });
        }
    });
    await page.goto('https://demo.youdata.com');
    await page.close();
    await browser.close();
})();
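One way to keep the interception rule easy to check without launching a browser is to pull the handler out into a plain function. The sketch below applies the same blocking rule as the example above. The names handleRequest and fakeRequest are hypothetical, introduced only for illustration, and the fake object merely imitates the four request methods that the handler calls.

```javascript
// Resource types to block, taken from the example above.
const BLOCK_TYPES = new Set(['image', 'media', 'font']);

// Decide what to do with one intercepted request.
// `request` only needs resourceType(), headers(), abort() and continue().
function handleRequest(request, blockTypes = BLOCK_TYPES) {
    if (blockTypes.has(request.resourceType())) {
        // Drop the request entirely.
        return request.abort();
    }
    // Let the request through with one extra header.
    return request.continue({
        headers: { ...request.headers(), 'puppeteer-test': 'true' }
    });
}

// Hypothetical stand-in for a Puppeteer request; it records which
// method the handler called so the rule can be checked in isolation.
function fakeRequest(type, headers = {}) {
    const calls = [];
    return {
        calls,
        resourceType: () => type,
        headers: () => headers,
        abort: () => { calls.push(['abort']); },
        continue: (overrides) => { calls.push(['continue', overrides]); }
    };
}
```

With a real page, the handler would be registered as page.on('request', request => handleRequest(request)) after calling page.setRequestInterception(true).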

What events are provided on the page?

page.on('close'): the page is closed
page.on('console'): a console API method is called in the page
page.on('error'): the page crashes
page.on('load'): the page has finished loading
page.on('request'): the page issues a request
page.on('requestfailed'): a request fails
page.on('requestfinished'): a request finishes successfully
page.on('response'): a response is received
page.on('workercreated'): a WebWorker is created
page.on('workerdestroyed'): a WebWorker is destroyed

Case 4: get WebSocket response

Currently, Puppeteer does not provide a native API for handling WebSockets, but we can reach them through the lower-level Chrome DevTools Protocol (CDP).

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //Create CDP session
    let cdpSession = await page.target().createCDPSession();
    //Enable Network debugging and listen for Network related events in Chrome DevTools Protocol
    await cdpSession.send('Network.enable');
    //Listen to the webSocketFrameReceived event and get the corresponding data
    cdpSession.on('Network.webSocketFrameReceived', frame => {
        let payloadData = frame.response.payloadData;
        if(payloadData.includes('push:query')){
            //Parse payloadData and get the data pushed by the server
            let res = JSON.parse(payloadData.match(/\{.*\}/)[0]);
            if(res.code !== 200){
                console.log(`call websocket Interface error:code=${res.code},message=${res.message}`);
            }else{
                console.log('Get websocket Interface data:', res.result);
            }
        }
    });
    await page.goto('https://netease.youdata.163.com/dash/142161/reportExport?pid=700209493');
    await page.waitForFunction('window.renderdone', {polling: 20});
    await page.close();
    await browser.close();
})();
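The match(...)[0] call in the listener above throws if a frame contains no {...} span, and JSON.parse throws if that span is not valid JSON. A more defensive parser might look like the sketch below. The name parsePushPayload is hypothetical, and the frame layout shown in the usage comment is an assumption for illustration, not the documented format of this service.

```javascript
// Extract the JSON object embedded in a WebSocket text frame.
// Returns null when the frame does not mention `eventName`, contains no
// {...} span, or the span is not valid JSON.
function parsePushPayload(payloadData, eventName = 'push:query') {
    if (typeof payloadData !== 'string' || !payloadData.includes(eventName)) {
        return null;
    }
    // Greedy match from the first "{" to the last "}", newlines included.
    const match = payloadData.match(/\{[\s\S]*\}/);
    if (!match) {
        return null;
    }
    try {
        return JSON.parse(match[0]);
    } catch (err) {
        return null;
    }
}

// Usage inside the CDP listener (assumed frame layout):
//   const res = parsePushPayload(frame.response.payloadData);
//   if (res && res.code === 200) { console.log(res.result); }
```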

Case 5: embedding JavaScript code

The most powerful feature of Puppeteer is that you can execute any JavaScript code you want in the browser. The following example comes from crawling the inbox user list of the 188 mailbox. I found that an extra iframe is left behind every time an inbox is opened and closed. As more inboxes are opened, the iframes accumulate until the browser freezes, so I added a script to the crawler code that deletes the useless iframes:

const crypto = require('crypto');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://webmail.vip.188.com');
    //Register a Node.js function and run it in the browser
    await page.exposeFunction('md5', text =>
        crypto.createHash('md5').update(text).digest('hex')
    );
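    //Run code in the browser through page.evaluate to delete useless iframes
    await page.evaluate(async () =>  {
        //Copy the live collection into an array so that removing iframes does not shift the indexes
        let iframes = Array.from(document.getElementsByTagName('iframe'));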
        for(let i = 3; i <  iframes.length - 1; i++){
            let iframe = iframes[i];
            if(iframe.name.includes("frameBody")){
                iframe.src = 'about:blank';
                try{
                    iframe.contentWindow.document.write('');
                    iframe.contentWindow.document.clear();
                }catch(e){}
                //Remove iframe from page
                iframe.parentNode.removeChild(iframe);
            }
        }
        //Call the function registered from the Node.js environment inside the page
        const myString = 'PUPPETEER';
        const myHash = await window.md5(myString);
        console.log(`md5 of ${myString} is ${myHash}`);
    });
    await page.close();
    await browser.close();
})();

What functions can execute code in a browser environment?

page.evaluate(pageFunction[, ...args]): execute a function in the browser environment
page.evaluateHandle(pageFunction[, ...args]): execute a function in the browser environment and return a JSHandle object
page.$$eval(selector, pageFunction[, ...args]): pass all elements matching the selector into the function and execute it in the browser environment
page.$eval(selector, pageFunction[, ...args]): pass the first element matching the selector into the function and execute it in the browser environment
page.evaluateOnNewDocument(pageFunction[, ...args]): whenever a new document is created, execute the function in the browser environment before any of the page's own scripts run
page.exposeFunction(name, puppeteerFunction): register a function on the window object. The function itself runs in the Node.js environment, which makes it possible to call Node.js libraries from the browser environment

Case 6: how to grab elements in an iframe

A frame contains an execution context, and functions cannot be executed across frames. A page can contain multiple frames, mainly created by embedding iframe tags. Most of the functions on Page are actually shorthand for page.mainFrame().xx. Frames form a tree, and we can traverse all of them through frame.childFrames(). To execute a function in another frame, we must first obtain that frame and then operate on it.

When you log in to the 188 mailbox, the login form is embedded in an iframe. The following code obtains that iframe and logs in through it:

(async () => {
    const browser = await puppeteer.launch({headless: false, slowMo: 50});
    const page = await browser.newPage();
    await page.goto('https://www.188.com');
    //Click login with password
    let passwordLogin = await page.waitForXPath('//*[@id="qcode"]/div/div[2]/a');
    await passwordLogin.click();
    for (const frame of page.mainFrame().childFrames()){
        //Find the iframe corresponding to the login page according to the url
        if (frame.url().includes('passport.188.com')){
            await frame.type('.dlemail', 'admin@admin.com');
            await frame.type('.dlpwd', '123456');
            await Promise.all([
                frame.click('#dologin'),
                page.waitForNavigation()
            ]);
            break;
        }
    }
    await page.close();
    await browser.close();
})();
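The loop above only looks at the direct children of the main frame. Because frames form a tree, a nested iframe would need a recursive search. A minimal sketch, assuming only the url() and childFrames() methods used above (findFrame and fakeFrame are hypothetical names introduced for illustration):

```javascript
// Depth-first search of a frame tree.
// Returns the first frame for which `predicate` is true, or null.
function findFrame(frame, predicate) {
    if (predicate(frame)) {
        return frame;
    }
    for (const child of frame.childFrames()) {
        const found = findFrame(child, predicate);
        if (found) {
            return found;
        }
    }
    return null;
}

// Hypothetical stand-in for a frame, so the search can be tried
// without a browser.
function fakeFrame(url, children = []) {
    return { url: () => url, childFrames: () => children };
}

// Usage with a real page:
//   const loginFrame = findFrame(page.mainFrame(),
//       f => f.url().includes('passport.188.com'));
```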

Case 7: page performance analysis

Puppeteer provides a tool for page performance analysis. At present the feature is still fairly weak: we can only obtain the trace data of a page's execution, and how to analyze that data is up to us. It is said that a major revision will come in version 2.0.
- A browser can only run one trace at a time
- In the DevTools Performance panel, you can upload the resulting JSON file and view the analysis results
- We can write a script that parses the data in trace.json for automatic analysis
- Through tracing, we can obtain the page loading speed and script execution performance

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.tracing.start({path: './files/trace.json'});
    await page.goto('https://www.google.com');
    await page.tracing.stop();
    /*
        continue analysis from 'trace.json'
    */
    await browser.close();
})();
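As a starting point for parsing trace.json in a script, the sketch below totals the duration of each event name. It assumes the file follows Chrome's Trace Event Format: a traceEvents array whose complete events have ph set to 'X' and a dur field in microseconds. The function names are hypothetical. Nested events are counted in full, so the totals are inclusive times rather than self times.

```javascript
const fs = require('fs');

// Sum the duration of complete ("X" phase) events per event name and
// return the `topN` largest totals, in milliseconds.
function summarizeTrace(trace, topN = 10) {
    const events = Array.isArray(trace) ? trace : (trace.traceEvents || []);
    const totals = new Map();
    for (const event of events) {
        if (event.ph !== 'X' || typeof event.dur !== 'number') {
            continue; // skip instant, metadata and other non-duration events
        }
        totals.set(event.name, (totals.get(event.name) || 0) + event.dur);
    }
    return [...totals.entries()]
        .map(([name, micros]) => ({ name, ms: micros / 1000 }))
        .sort((a, b) => b.ms - a.ms)
        .slice(0, topN);
}

// Load a trace file written by page.tracing.stop() and summarize it.
function summarizeTraceFile(path, topN) {
    return summarizeTrace(JSON.parse(fs.readFileSync(path, 'utf8')), topN);
}

// Usage: console.table(summarizeTraceFile('./files/trace.json'));
```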

Case 8: file upload and download

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //Set download path through CDP session
    const cdp = await page.target().createCDPSession();
    await cdp.send('Page.setDownloadBehavior', {
        behavior: 'allow', //Allow all download requests
        downloadPath: 'path/to/download'  //Set download path
    });
    //Click the button to trigger the download
    await (await page.waitForSelector('#someButton')).click();
    //Wait for the file to appear by polling (waitForFile is a user-defined helper, not a Puppeteer API)
    await waitForFile('path/to/download/filename');

    //When uploading, the corresponding inputElement must be an <input> element
    let inputElement = await page.waitForXPath('//input[@type="file"]');
    await inputElement.uploadFile('/path/to/file');
    await browser.close();
})();
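The waitForFile helper called above is not part of Puppeteer and is not shown in the original. A minimal polling sketch could look like the following. It assumes the download is complete once the final file name exists, since Chrome keeps in-progress downloads under a temporary .crdownload name. The option names timeoutMs and intervalMs are assumptions.

```javascript
const fs = require('fs');

// Poll until `filePath` exists; reject if it has not appeared
// after `timeoutMs` milliseconds.
function waitForFile(filePath, { timeoutMs = 30000, intervalMs = 200 } = {}) {
    return new Promise((resolve, reject) => {
        const deadline = Date.now() + timeoutMs;
        const check = () => {
            if (fs.existsSync(filePath)) {
                resolve(filePath);
            } else if (Date.now() >= deadline) {
                reject(new Error(`Timed out waiting for ${filePath}`));
            } else {
                setTimeout(check, intervalMs);
            }
        };
        check();
    });
}
```

A timeout matters here because a download that never starts would otherwise leave the script waiting forever.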

Case 9: handling a jump to a new tab

When clicking a button opens a new tab, a new Page is created. How do we get the Page instance for that new tab? Listen for the targetcreated event on the Browser, which indicates that a new target (here, the new page) has been created:

let page = await browser.newPage();
await page.goto(url);
let btn = await page.waitForSelector('#btn');
//Before clicking the button, define a Promise in advance to return the Page object of the new tab
const newPagePromise = new Promise(res => 
  browser.once('targetcreated', 
    target => res(target.page())
  )
);
await btn.click();
//After clicking the button, wait for the new tab object
let newPage = await newPagePromise;
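The targetcreated event also fires for targets that are not tabs, such as workers, and the promise above never settles if no tab opens. A slightly more defensive sketch filters on target.type() and adds a timeout. The name waitForNewPage and the timeoutMs parameter are assumptions, and the code relies only on the on/off emitter methods plus target.type() and target.page().

```javascript
// Resolve with the Page of the next tab opened in `browser`.
// Targets whose type is not 'page' are ignored.
function waitForNewPage(browser, timeoutMs = 30000) {
    return new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
            browser.off('targetcreated', onTarget);
            reject(new Error('Timed out waiting for a new tab'));
        }, timeoutMs);

        function onTarget(target) {
            if (target.type() !== 'page') {
                return; // keep listening for a real tab
            }
            clearTimeout(timer);
            browser.off('targetcreated', onTarget);
            // target.page() returns a promise; resolving with it
            // makes the outer promise yield the Page itself.
            resolve(target.page());
        }

        browser.on('targetcreated', onTarget);
    });
}

// Usage: start waiting before the click so the event is not missed.
//   const newPagePromise = waitForNewPage(browser);
//   await btn.click();
//   const newPage = await newPagePromise;
```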

Case 10: simulate different devices

Puppeteer can emulate different devices. The puppeteer.devices object defines the configuration of many devices, mainly the viewport and userAgent, and the page.emulate function applies one of these configurations to a page.

const puppeteer = require('puppeteer');
const iPhone = puppeteer.devices['iPhone 6'];
puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.google.com');
  await browser.close();
});

Puppeteer vs PhantomJS
Operates a real browser, supporting all Chrome features
Can provide browser environments for different Chrome versions
Maintained by the Chrome team, with better compatibility and prospects
The headless parameter can be configured dynamically to facilitate debugging, and you can open the debugging interface through --remote-debugging-port=9222
Supports the latest JS syntax, such as async/await
A complete event-driven mechanism, so few sleep calls are needed
The PhantomJS environment is complex to install and its API is not friendly to call
The main difference between the two is that PhantomJS uses an older version of WebKit as its rendering engine
It is faster and performs better than PhantomJS, according to performance comparisons of Puppeteer and PhantomJS published by others

Posted by dgx on Fri, 29 Oct 2021 02:42:51 -0700
