Puppeteer project practice

Keywords: JavaScript Node.js Front-end

Reprint: https://zhuanlan.zhihu.com/p/76237595

Case 1: screenshot

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //Set visible area size
    await page.setViewport({width: 1920, height: 800});
    await page.goto('https://youdata.163.com');
    //Screenshot of the entire page
    await page.screenshot({
        path: './files/capture.png',  //Picture saving path
        type: 'png',
        fullPage: true //Screenshot while scrolling
        // clip: {x: 0, y: 0, width: 1920, height: 800}
    });
    //Screenshot of an element of the page
    let [element] = await page.$x('/html/body/section[4]/div/div[2]');
    await element.screenshot({
        path: './files/element.png'
    });
    await page.close();
    await browser.close();
})();

How do we get an element in the page?

  • page.$('#uniqueId'): get the first element matching a selector
  • page.$$('div'): get all elements matching a selector
  • page.$x('//img'): get all elements matching an XPath
  • page.waitForXPath('//img'): wait for an element matching an XPath to appear
  • page.waitForSelector('#uniqueId'): wait for an element matching a selector to appear
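
As a small illustration of choosing between the selector and XPath variants, here is a minimal sketch of a helper that waits for an element by either kind of locator. The names isXPath and waitForElement and the 5000 ms default timeout are assumptions made for this example, and it relies on page.waitForXPath being available, as it is in the Puppeteer versions this article uses.

```javascript
// Hypothetical helper: treat locators starting with "/", "./" or "(" as XPath,
// everything else as a CSS selector.
function isXPath(locator) {
  return locator.startsWith('/') || locator.startsWith('./') || locator.startsWith('(');
}

// Wait for the element to appear, then resolve with its ElementHandle.
// `page` is a Puppeteer Page; the default timeout is an arbitrary choice.
async function waitForElement(page, locator, timeout = 5000) {
  return isXPath(locator)
    ? page.waitForXPath(locator, { timeout })
    : page.waitForSelector(locator, { timeout });
}
```

With this in place, waitForElement(page, '//img') and waitForElement(page, '#uniqueId') can be called the same way.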

Case 2: simulate user login

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        slowMo: 100,    //Slow each operation down by 100 ms
        headless: false,
        defaultViewport: {width: 1440, height: 780},
        ignoreHTTPSErrors: false, //Set to true to ignore HTTPS errors
        args: ['--start-fullscreen'] //Open page in full screen
    });
    const page = await browser.newPage();
    await page.goto('https://demo.youdata.com');
    //Enter account and password
    const uniqueIdElement = await page.$('#uniqueId');
    await uniqueIdElement.type('admin@admin.com', {delay: 20});
    const passwordElement = await page.$('#password');
    await passwordElement.type('123456', {delay: 20});
    //Click OK to log in
    let okButtonElement = await page.$('#btn-ok');
    //Clicking the button triggers a page jump. Start page.waitForNavigation() together
    //with the click so the jump cannot finish before we begin waiting for it
    await Promise.all([
        okButtonElement.click(),
        page.waitForNavigation()  
    ]);
    console.log('admin Login succeeded');
    await page.close();
    await browser.close();
})();

What functions does ElementHandle provide for operating on elements?

  • elementHandle.click(): click an element
  • elementHandle.tap(): simulate a finger tap on an element
  • elementHandle.focus(): focus on an element
  • elementHandle.hover(): hover the mouse over an element
  • elementHandle.type('hello'): type text into an input box
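
To show a few of these ElementHandle functions working together, here is a minimal sketch that hovers over a menu and then types into an input. The function name searchFromMenu and both selectors are made up for illustration; only the ElementHandle calls themselves come from the list above.

```javascript
// Sketch: hover a (hypothetical) menu, then type into a field it reveals.
// `page` is a Puppeteer Page; both selectors are supplied by the caller.
async function searchFromMenu(page, menuSelector, inputSelector, text) {
  const menu = await page.waitForSelector(menuSelector);
  await menu.hover();                      // move the mouse over the menu
  const input = await page.waitForSelector(inputSelector);
  await input.focus();                     // give the input keyboard focus
  await input.type(text, { delay: 20 });   // 20 ms pause between key presses
  return input;
}
```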

Case 3: request interception

In some scenarios it is necessary to intercept requests, for example to block unnecessary requests and improve performance. We can listen to the request event of Page and intercept requests, provided that request interception has been enabled with page.setRequestInterception(true).

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const blockTypes = new Set(['image', 'media', 'font']);
    await page.setRequestInterception(true); //Turn on request interception
    page.on('request', request => {
        const type = request.resourceType();
        const shouldBlock = blockTypes.has(type);
        if(shouldBlock){
            //Block requests directly
            return request.abort();
        }else{
            //Rewrite request
            return request.continue({
                //You can override url, method, postData, and headers
                headers: Object.assign({}, request.headers(), {
                    'puppeteer-test': 'true'
                })
            });
        }
    });
    await page.goto('https://demo.youdata.com');
    await page.close();
    await browser.close();
})();
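
The blocking logic can also be pulled out into a reusable handler. The following is only a sketch under that assumption: makeInterceptor is a name invented here, and the handler uses nothing beyond the request.resourceType(), request.headers(), request.abort() and request.continue() calls already shown.

```javascript
// Build a 'request' handler that aborts the given resource types and lets
// everything else through with extra headers. The name is illustrative.
function makeInterceptor(blockTypes, extraHeaders = {}) {
  const blocked = new Set(blockTypes);
  return request => {
    if (blocked.has(request.resourceType())) {
      return request.abort();
    }
    return request.continue({
      headers: { ...request.headers(), ...extraHeaders },
    });
  };
}

// Usage, assuming `page` is an open Puppeteer Page:
//   await page.setRequestInterception(true);
//   page.on('request', makeInterceptor(['image', 'media', 'font']));
```

Keeping the decision in a plain function makes it possible to try it against stub request objects without launching a browser.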

What events are provided on the page?

  • page.on('close'): the page is closed
  • page.on('console'): a console API such as console.log is called in the page
  • page.on('error'): the page crashes
  • page.on('load'): the page's load event fires
  • page.on('request'): the page issues a request
  • page.on('requestfailed'): a request fails
  • page.on('requestfinished'): a request finishes successfully
  • page.on('response'): a response is received
  • page.on('workercreated'): a WebWorker is created
  • page.on('workerdestroyed'): a WebWorker is destroyed
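
As an example of using several of these events together, here is a minimal sketch that gathers console output, failed requests and HTTP error responses into one array. The function name attachPageLogging and the log line format are assumptions made for this example.

```javascript
// Sketch: collect page events into an array of log lines.
// `page` is a Puppeteer Page (an event emitter); `log` is any array.
function attachPageLogging(page, log) {
  page.on('console', msg => log.push(`console: ${msg.text()}`));
  page.on('requestfailed', request => {
    const failure = request.failure();     // null or { errorText }
    const reason = failure ? ` ${failure.errorText}` : '';
    log.push(`failed: ${request.url()}${reason}`);
  });
  page.on('response', response => {
    if (response.status() >= 400) {
      log.push(`http ${response.status()}: ${response.url()}`);
    }
  });
  return log;
}
```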

Case 4: get WebSocket response

Currently, Puppeteer does not provide a native API for handling WebSockets, but we can get at the frames through the lower-level Chrome DevTools Protocol (CDP):

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //Create CDP session
    let cdpSession = await page.target().createCDPSession();
    //Enable Network debugging and listen for Network related events in Chrome DevTools Protocol
    await cdpSession.send('Network.enable');
    //Listen to the webSocketFrameReceived event and get the corresponding data
    cdpSession.on('Network.webSocketFrameReceived', frame => {
        let payloadData = frame.response.payloadData;
        if(payloadData.includes('push:query')){
            //Parse payloadData and get the data pushed by the server
            let res = JSON.parse(payloadData.match(/\{.*\}/)[0]);
            if(res.code !== 200){
                console.log(`call websocket Interface error:code=${res.code},message=${res.message}`);
            }else{
                console.log('Get websocket Interface data:', res.result);
            }
        }
    });
    await page.goto('https://netease.youdata.163.com/dash/142161/reportExport?pid=700209493');
    await page.waitForFunction('window.renderdone', {polling: 20});
    await page.close();
    await browser.close();
})();
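
The listener above calls match(...)[0] directly, which throws when a frame contains no JSON object. A more defensive version of the parsing step might look like the sketch below. The function name parsePushPayload is an assumption, and the real payload format depends entirely on the server.

```javascript
// Extract the first {...} JSON object from a WebSocket text payload.
// Returns null when the payload has no JSON object or it fails to parse.
function parsePushPayload(payloadData) {
  const match = payloadData.match(/\{.*\}/s);
  if (!match) return null;
  try {
    return JSON.parse(match[0]);
  } catch (err) {
    return null;
  }
}
```

Inside the Network.webSocketFrameReceived listener, the result can then be checked for null before reading res.code.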

Case 5: embedding JavaScript code

The most powerful feature of Puppeteer is that it can run any JavaScript code you like inside the browser. The example below comes from crawling the inbox user list of the 188 mailbox. I found that every time an inbox was opened and closed, another iframe was left behind. As more inboxes were opened, the iframes piled up until the browser froze, so I added a script to the crawler code that deletes the useless iframes:

const crypto = require('crypto');
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://webmail.vip.188.com');
    //Register a Node.js function and run it in the browser
    await page.exposeFunction('md5', text =>
        crypto.createHash('md5').update(text).digest('hex')
    );
    //Execute and delete useless iframe code in the browser through page.evaluate
    await page.evaluate(async () =>  {
        //Take a static copy: the live HTMLCollection shrinks as iframes are removed
        let iframes = Array.from(document.getElementsByTagName('iframe'));
        for(let i = 3; i <  iframes.length - 1; i++){
            let iframe = iframes[i];
            if(iframe.name.includes("frameBody")){
                iframe.src = 'about:blank';
                try{
                    iframe.contentWindow.document.write('');
                    iframe.contentWindow.document.clear();
                }catch(e){}
                //Remove iframe from page
                iframe.parentNode.removeChild(iframe);
            }
        }
        //Call the function registered from the Node.js environment inside the page
        const myString = 'PUPPETEER';
        const myHash = await window.md5(myString);
        console.log(`md5 of ${myString} is ${myHash}`);
    });
    await page.close();
    await browser.close();
})();

What functions can execute code in a browser environment?

  • page.evaluate(pageFunction[, ...args]): execute a function in the browser environment
  • page.evaluateHandle(pageFunction[, ...args]): execute a function in the browser environment and return a JSHandle object
  • page.$$eval(selector, pageFunction[, ...args]): pass all elements matching the selector into the function and execute it in the browser environment
  • page.$eval(selector, pageFunction[, ...args]): pass the first element matching the selector into the function and execute it in the browser environment
  • page.evaluateOnNewDocument(pageFunction[, ...args]): executed in the browser environment whenever a new document is created, before any of the page's own scripts run
  • page.exposeFunction(name, puppeteerFunction): register a function on the window object. The function itself runs in the Node.js environment, so code in the browser can call into Node.js libraries through it
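
One detail worth illustrating is the args parameter: the page function runs in the browser, so it cannot see variables from the Node.js side unless they are passed in. Here is a minimal sketch using page.$$eval. The function name collectLinks and the idea of filtering links by keyword are assumptions made for this example.

```javascript
// Sketch: collect the href of every link whose text contains `keyword`.
// `page` is a Puppeteer Page. The callback runs in the browser, so `keyword`
// is passed as an extra argument rather than captured from Node.js scope.
async function collectLinks(page, keyword) {
  return page.$$eval(
    'a',
    (anchors, kw) => anchors
      .filter(a => a.textContent.includes(kw))
      .map(a => a.href),
    keyword
  );
}
```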

Case 6: how to grab elements in an iframe

Each frame has its own execution context, and a function cannot be executed across frames. A page can contain multiple frames, mostly created by embedding iframe tags. Most of the functions on Page are actually shorthand for page.mainFrame().xx. Frames form a tree, which we can traverse with frame.childFrames(). To execute a function in another frame, we must first obtain that frame and then call the function on it.
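
Because frames form a tree, a nested iframe will not show up in page.mainFrame().childFrames() directly. A recursive search is one way to handle that. The following is a minimal sketch, where findFrame is a name invented for this example and only frame.childFrames() and frame.url() are assumed from Puppeteer.

```javascript
// Depth-first search of the frame tree, starting from `frame`.
// Returns the first frame for which `predicate` is true, or null.
function findFrame(frame, predicate) {
  if (predicate(frame)) return frame;
  for (const child of frame.childFrames()) {
    const found = findFrame(child, predicate);
    if (found) return found;
  }
  return null;
}

// Usage, assuming `page` is an open Puppeteer Page:
//   const loginFrame = findFrame(page.mainFrame(),
//     f => f.url().includes('passport.188.com'));
```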

When you log in to the 188 mailbox, the login form is embedded in an iframe. The following code finds that iframe and logs in through it:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({headless: false, slowMo: 50});
    const page = await browser.newPage();
    await page.goto('https://www.188.com');
    //Click login with password
    let passwordLogin = await page.waitForXPath('//*[@id="qcode"]/div/div[2]/a');
    await passwordLogin.click();
    for (const frame of page.mainFrame().childFrames()){
        //Find the iframe corresponding to the login page according to the url
        if (frame.url().includes('passport.188.com')){
            await frame.type('.dlemail', 'admin@admin.com');
            await frame.type('.dlpwd', '123456');
            await Promise.all([
                frame.click('#dologin'),
                page.waitForNavigation()
            ]);
            break;
        }
    }
    await page.close();
    await browser.close();
})();

Case 7: page performance analysis

Puppeteer provides a tracing tool for page performance analysis. At present it is still fairly weak: it only records the raw performance data for one page run, and how to analyze that data is left to us. It is said that a major revision will be made in version 2.0. Some points to note:

  • A browser can only run one trace at a time
  • The resulting JSON file can be loaded into the DevTools Performance panel to view the analysis
  • We can write a script that parses the data in trace.json for automated analysis
  • Through tracing we can obtain the page loading speed and script execution performance

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.tracing.start({path: './files/trace.json'});
    await page.goto('https://www.google.com');
    await page.tracing.stop();
    /*
        continue analysis from 'trace.json'
    */
    await browser.close();
})();
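
As a starting point for the automated analysis mentioned above, here is a minimal sketch that totals the time spent per event name. It assumes the file follows the Chrome Trace Event Format, with a traceEvents array whose complete events have ph equal to 'X' and a dur in microseconds. The function names summarizeTrace and topTraceEvents are invented for this example.

```javascript
const fs = require('fs');

// Sum the duration of complete events (ph === 'X') per event name.
// Durations in the Trace Event Format are in microseconds; convert to ms.
function summarizeTrace(trace) {
  const events = Array.isArray(trace) ? trace : trace.traceEvents;
  const totals = new Map();
  for (const event of events) {
    if (event.ph !== 'X' || typeof event.dur !== 'number') continue;
    totals.set(event.name, (totals.get(event.name) || 0) + event.dur / 1000);
  }
  // Longest total first.
  return [...totals.entries()]
    .map(([name, ms]) => ({ name, ms }))
    .sort((a, b) => b.ms - a.ms);
}

// Load a trace file written by page.tracing.stop() and return the top entries.
function topTraceEvents(tracePath, count = 10) {
  const trace = JSON.parse(fs.readFileSync(tracePath, 'utf8'));
  return summarizeTrace(trace).slice(0, count);
}
```

For example, topTraceEvents('./files/trace.json', 5) would list the five event names with the largest total duration.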

Case 8: file upload and download

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //Set download path through CDP session
    const cdp = await page.target().createCDPSession();
    await cdp.send('Page.setDownloadBehavior', {
        behavior: 'allow', //Allow all download requests
        downloadPath: 'path/to/download'  //Set download path
    });
    //Click the button to trigger the download
    await (await page.waitForSelector('#someButton')).click();
    //Wait for the file to appear by polling (waitForFile is a user-defined helper, not part of Puppeteer)
    await waitForFile('path/to/download/filename');

    //When uploading, the corresponding inputElement must be an <input> element
    let inputElement = await page.waitForXPath('//input[@type="file"]');
    await inputElement.uploadFile('/path/to/file');
    await browser.close();
})();
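
The code above calls waitForFile without defining it. One possible implementation is sketched here, assuming simple polling with fs.existsSync. The option names timeout and interval and their default values are arbitrary choices for this example.

```javascript
const fs = require('fs');

// Poll until `filePath` exists, checking every `interval` ms.
// Rejects if the file has not appeared after `timeout` ms.
function waitForFile(filePath, { timeout = 30000, interval = 200 } = {}) {
  const deadline = Date.now() + timeout;
  return new Promise((resolve, reject) => {
    const check = () => {
      if (fs.existsSync(filePath)) return resolve(filePath);
      if (Date.now() >= deadline) {
        return reject(new Error(`Timed out waiting for ${filePath}`));
      }
      setTimeout(check, interval);
    };
    check();
  });
}
```

Note that this only checks that the file exists, not that the download has finished writing it.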

Case 9: handling a jump to a new tab

Clicking a button that jumps to a new tab opens a new Page. How do we get the Page instance for that new tab? We can listen for the targetcreated event on the Browser, which signals that a new target (here, the new tab) has been created:

let page = await browser.newPage();
await page.goto(url);
let btn = await page.waitForSelector('#btn');
//Before clicking the button, define a Promise in advance to return the Page object of the new tab
const newPagePromise = new Promise(res => 
  browser.once('targetcreated', 
    target => res(target.page())
  )
);
await btn.click();
//After clicking the button, wait for the new tab object
let newPage = await newPagePromise;
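
The same pattern can be wrapped in a helper. The sketch below also skips targets that are not pages, since targetcreated fires for other target types such as workers as well. The function name openInNewTab is an assumption made for this example.

```javascript
// Run `action` (for example a click) and resolve with the Page of the next
// page target the browser creates. `browser` is a Puppeteer Browser.
async function openInNewTab(browser, action) {
  const newPagePromise = new Promise(resolve => {
    const onTarget = target => {
      if (target.type() !== 'page') return;   // skip workers and other targets
      browser.off('targetcreated', onTarget);
      resolve(target.page());
    };
    browser.on('targetcreated', onTarget);
  });
  await action();
  return newPagePromise;
}

// Usage, assuming `browser` and `btn` exist as in the snippet above:
//   const newPage = await openInNewTab(browser, () => btn.click());
```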

Case 10: simulate different devices

Puppeteer can simulate different devices. The puppeteer.devices object defines the configuration of many devices, mainly a viewport and a userAgent, and the page.emulate function applies one of these configurations to a page.

const puppeteer = require('puppeteer');
const iPhone = puppeteer.devices['iPhone 6'];
puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.google.com');
  await browser.close();
});
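
If the device you need is not in puppeteer.devices, page.emulate also accepts a hand-written object with a viewport and a userAgent. Here is a minimal sketch of building one. The function name makeDevice, the default scale factor and the example values are all assumptions, not a real device.

```javascript
// Build a descriptor with the same shape as the entries in puppeteer.devices.
// All values come from the caller; nothing here describes a real device.
function makeDevice(name, width, height, userAgent, deviceScaleFactor = 2) {
  return {
    name,
    userAgent,
    viewport: {
      width,
      height,
      deviceScaleFactor,
      isMobile: true,
      hasTouch: true,
      isLandscape: width > height,
    },
  };
}

// Usage, assuming `page` is an open Puppeteer Page:
//   await page.emulate(makeDevice('My Phone', 375, 667, 'MyPhone/1.0'));
```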

Puppeteer vs PhantomJS

  • Drives a real browser, so all Chrome features are supported
  • Can provide different versions of the Chrome browser environment
  • Maintained by the Chrome team, with better compatibility and prospects
  • The headless option can be toggled to make debugging easier, and you can open the debugging interface with --remote-debugging-port=9222
  • Supports the latest JS syntax, such as async/await
  • Complete event-driven mechanism, so far fewer sleep calls are needed
  • Installing the PhantomJS environment is complicated, and its API is unfriendly
  • The main difference between the two is that PhantomJS uses an older version of WebKit as its rendering engine
  • Puppeteer is faster and performs better than PhantomJS. The following are performance comparison results for Puppeteer and PhantomJS, measured by others:

Posted by dgx on Fri, 29 Oct 2021 02:42:51 -0700