Reprint: https://zhuanlan.zhihu.com/p/76237595
Case 1: screenshot
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Set the size of the visible area
  await page.setViewport({width: 1920, height: 800});
  await page.goto('https://youdata.163.com');
  // Screenshot of the entire page
  await page.screenshot({
    path: './files/capture.png', // Where to save the image
    type: 'png',
    fullPage: true // Capture the full scrollable page
    // clip: {x: 0, y: 0, width: 1920, height: 800}
  });
  // Screenshot of a single element on the page
  let [element] = await page.$x('/html/body/section[4]/div/div[2]');
  await element.screenshot({
    path: './files/element.png'
  });
  await page.close();
  await browser.close();
})();
How do we get an element in the page?
- page.$('#uniqueId'): get the first element matching a selector
- page.$$('div'): get all elements matching a selector
- page.$x('//img'): get all elements matching an XPath
- page.waitForXPath('//img'): wait for an element matching an XPath to appear
- page.waitForSelector('#uniqueId'): wait for an element matching a selector to appear
Case 2: simulate user login
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    slowMo: 100, // Slow down each operation by 100 ms
    headless: false,
    defaultViewport: {width: 1440, height: 780},
    ignoreHTTPSErrors: false, // Set to true to ignore HTTPS errors
    args: ['--start-fullscreen'] // Open the browser in full screen
  });
  const page = await browser.newPage();
  await page.goto('https://demo.youdata.com');
  // Enter the account and password
  const uniqueIdElement = await page.$('#uniqueId');
  await uniqueIdElement.type('admin@admin.com', {delay: 20});
  const passwordElement = await page.$('#password');
  await passwordElement.type('123456', {delay: 20});
  // Click OK to log in
  let okButtonElement = await page.$('#btn-ok');
  // Wait for the page navigation to complete. When a click triggers a
  // navigation, wait on page.waitForNavigation() together with the click
  // so that the navigation is not missed.
  await Promise.all([
    okButtonElement.click(),
    page.waitForNavigation()
  ]);
  console.log('admin login succeeded');
  await page.close();
  await browser.close();
})();
So what functions does ElementHandle provide to operate on elements?
- elementHandle.click(): click an element
- elementHandle.tap(): simulate a finger tap
- elementHandle.focus(): focus on an element
- elementHandle.hover(): hover the mouse over an element
- elementHandle.type('hello'): type text into an input box
Case 3: request interception
Intercepting requests is necessary in some scenarios, for example blocking unnecessary requests to improve performance. Once interception is enabled with page.setRequestInterception(true), we can listen to the request event of Page and intercept each request.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const blockTypes = new Set(['image', 'media', 'font']);
  await page.setRequestInterception(true); // Turn on request interception
  page.on('request', request => {
    const type = request.resourceType();
    const shouldBlock = blockTypes.has(type);
    if (shouldBlock) {
      // Block the request directly
      return request.abort();
    } else {
      // Rewrite the request
      return request.continue({
        // You can override url, method, postData, and headers
        headers: Object.assign({}, request.headers(), {
          'puppeteer-test': 'true'
        })
      });
    }
  });
  await page.goto('https://demo.youdata.com');
  await page.close();
  await browser.close();
})();
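The blocking decision in the handler above can be pulled out into a small function that is easy to check without launching a browser. This is only a sketch: it assumes nothing more than a request object exposing resourceType() and headers(), as in the example above, and the names BLOCKED_TYPES, planRequest and attachInterceptor are made up for illustration.

```javascript
// Resource types to block; the same set as in the example above.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

// Hypothetical helper: decide what to do with an intercepted request.
// Returns a plain object so the decision can be inspected on its own.
function planRequest(request) {
  if (BLOCKED_TYPES.has(request.resourceType())) {
    return { action: 'abort' };
  }
  return {
    action: 'continue',
    overrides: {
      headers: { ...request.headers(), 'puppeteer-test': 'true' },
    },
  };
}

// Hypothetical wiring: enable interception and apply the plan to each request.
async function attachInterceptor(page) {
  await page.setRequestInterception(true);
  page.on('request', request => {
    const plan = planRequest(request);
    return plan.action === 'abort'
      ? request.abort()
      : request.continue(plan.overrides);
  });
}
```

Keeping the decision separate from the abort/continue calls means the blocking rules can be changed and tested with plain fake objects.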
What events are provided on the page?
- page.on('close'): the page is closed
- page.on('console'): a console API is called
- page.on('error'): the page crashes
- page.on('load'): the page has loaded
- page.on('request'): the page issues a request
- page.on('requestfailed'): a request fails
- page.on('requestfinished'): a request finishes successfully
- page.on('response'): a response is received
- page.on('workercreated'): a WebWorker is created
- page.on('workerdestroyed'): a WebWorker is destroyed
Case 4: get WebSocket response
Currently, Puppeteer does not provide a native API for handling WebSockets, but we can get at them through the lower-level Chrome DevTools Protocol (CDP).
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Create a CDP session
  let cdpSession = await page.target().createCDPSession();
  // Enable Network debugging so that Network events of the
  // Chrome DevTools Protocol are emitted
  await cdpSession.send('Network.enable');
  // Listen to the webSocketFrameReceived event and read the data
  cdpSession.on('Network.webSocketFrameReceived', frame => {
    let payloadData = frame.response.payloadData;
    if (payloadData.includes('push:query')) {
      // Parse payloadData to get the data pushed by the server
      let res = JSON.parse(payloadData.match(/\{.*\}/)[0]);
      if (res.code !== 200) {
        console.log(`websocket call failed: code=${res.code}, message=${res.message}`);
      } else {
        console.log('websocket data received:', res.result);
      }
    }
  });
  await page.goto('https://netease.youdata.163.com/dash/142161/reportExport?pid=700209493');
  await page.waitForFunction('window.renderdone', {polling: 20});
  await page.close();
  await browser.close();
})();
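The parsing step above throws if a frame contains no JSON object or the JSON is malformed. A more defensive version could look like the sketch below. It assumes, as the example does, that the payload is text with a single JSON object embedded somewhere in it; the function name extractJson is made up for illustration.

```javascript
// Hypothetical helper: extract the JSON object embedded in a WebSocket
// frame payload. Takes everything from the first "{" to the last "}".
// Returns null if there is no such span or it is not valid JSON.
function extractJson(payloadData) {
  const match = payloadData.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    return JSON.parse(match[0]);
  } catch (err) {
    return null;
  }
}
```

Inside the webSocketFrameReceived handler, the result would be checked for null before reading res.code, so one unexpected frame does not crash the listener.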
Case 5: embedding JavaScript code
The most powerful feature of Puppeteer is that you can execute any JavaScript code you want in the browser. The example below comes from crawling the inbox user list of the 188 mailbox. I found that every time an inbox was opened and closed, one more iframe was left behind. As more inboxes were opened, the iframes kept accumulating until the browser froze, so I added a script to the crawler that deletes the useless iframes:
const puppeteer = require('puppeteer');
const crypto = require('crypto');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://webmail.vip.188.com');
  // Register a Node.js function so that it can be called from the browser
  await page.exposeFunction('md5', text =>
    crypto.createHash('md5').update(text).digest('hex')
  );
  // Use page.evaluate to run the iframe cleanup code in the browser
  await page.evaluate(async () => {
    let iframes = document.getElementsByTagName('iframe');
    for (let i = 3; i < iframes.length - 1; i++) {
      let iframe = iframes[i];
      if (iframe.name.includes("frameBody")) {
        iframe.src = 'about:blank';
        try {
          iframe.contentWindow.document.write('');
          iframe.contentWindow.document.clear();
        } catch (e) {}
        // Remove the iframe from the page
        iframe.parentNode.removeChild(iframe);
      }
    }
    // Call the function that runs in the Node.js environment from the page
    const myString = 'PUPPETEER';
    const myHash = await window.md5(myString);
    console.log(`md5 of ${myString} is ${myHash}`);
  });
  await page.close();
  await browser.close();
})();
What functions can execute code in a browser environment?
- page.evaluate(pageFunction[, ...args]): execute a function in the browser environment
- page.evaluateHandle(pageFunction[, ...args]): execute a function in the browser environment and return a JSHandle object
- page.$$eval(selector, pageFunction[, ...args]): pass all elements matching the selector into the function and execute it in the browser environment
- page.$eval(selector, pageFunction[, ...args]): pass the first element matching the selector into the function and execute it in the browser environment
- page.evaluateOnNewDocument(pageFunction[, ...args]): execute the function in the browser environment whenever a new document is created, before any of the page's own scripts run
- page.exposeFunction(name, puppeteerFunction): register a function on the window object. The function itself runs in the Node.js environment, which makes it possible to call Node.js libraries from the browser environment
Case 6: how to grab elements in an iframe
Each frame has its own execution context, and a function cannot be executed across frames. A page can contain multiple frames, mostly created by embedding iframe tags. Most functions on page are actually shorthand for page.mainFrame().xx. Frames form a tree, which we can traverse with frame.childFrames(). To execute a function in another frame, we must first obtain that frame and then operate on it.
The following example is the login window of the 188 mailbox, which is embedded in an iframe. The code below gets that iframe and logs in through it.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({headless: false, slowMo: 50});
  const page = await browser.newPage();
  await page.goto('https://www.188.com');
  // Click "login with password"
  let passwordLogin = await page.waitForXPath('//*[@id="qcode"]/div/div[2]/a');
  await passwordLogin.click();
  for (const frame of page.mainFrame().childFrames()) {
    // Find the iframe of the login page by its url
    if (frame.url().includes('passport.188.com')) {
      await frame.type('.dlemail', 'admin@admin.com');
      await frame.type('.dlpwd', '123456');
      await Promise.all([
        frame.click('#dologin'),
        page.waitForNavigation()
      ]);
      break;
    }
  }
  await page.close();
  await browser.close();
})();
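The loop above only looks at the direct children of the main frame. Since frames form a tree, a nested iframe needs a recursive search. The sketch below is one way to do that; it relies only on frame.childFrames() and frame.url(), and the helper name findFrame is made up for illustration.

```javascript
// Hypothetical helper: depth-first search of a frame tree.
// Returns the first frame for which predicate(frame) is truthy, or null.
function findFrame(root, predicate) {
  if (predicate(root)) return root;
  for (const child of root.childFrames()) {
    const found = findFrame(child, predicate);
    if (found) return found;
  }
  return null;
}

// Usage with Puppeteer (not executed here):
// const loginFrame = findFrame(
//   page.mainFrame(),
//   frame => frame.url().includes('passport.188.com')
// );
```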
Case 7: page performance analysis
Puppeteer provides a tool for page performance analysis. The feature is still fairly limited: it only captures the performance data of a page run, and how to interpret that data is left to us. A major revision is reportedly planned for version 2.0.
- A browser can only run one trace at a time
- The resulting JSON file can be loaded in the DevTools Performance panel to view the analysis
- We can write a script that parses the data in trace.json for automated analysis
- Through tracing we can obtain the page loading speed and script execution performance
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.tracing.start({path: './files/trace.json'});
  await page.goto('https://www.google.com');
  await page.tracing.stop();
  /* continue analysis from 'trace.json' */
  await browser.close();
})();
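As a starting point for a script that parses trace.json, the sketch below totals the time spent per event name. It assumes the file follows the Chrome Trace Event Format, where "complete" events have ph set to 'X' and a dur field in microseconds, and where the events sit either in a top-level array or under a traceEvents key. The function names are made up for illustration.

```javascript
const fs = require('node:fs');

// Sum the duration of complete events (ph === 'X') per event name.
// Returns [name, totalMilliseconds] pairs, slowest first.
function summarizeTrace(trace) {
  const events = Array.isArray(trace) ? trace : trace.traceEvents || [];
  const totals = new Map();
  for (const ev of events) {
    if (ev.ph !== 'X' || typeof ev.dur !== 'number') continue;
    // dur is in microseconds; convert to milliseconds
    totals.set(ev.name, (totals.get(ev.name) || 0) + ev.dur / 1000);
  }
  return [...totals.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical convenience wrapper that reads a trace file from disk.
function summarizeTraceFile(path) {
  return summarizeTrace(JSON.parse(fs.readFileSync(path, 'utf8')));
}

// Usage (not executed here):
// console.table(summarizeTraceFile('./files/trace.json').slice(0, 10));
```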
Case 8: file upload and download
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Set the download path through a CDP session
  const cdp = await page.target().createCDPSession();
  await cdp.send('Page.setDownloadBehavior', {
    behavior: 'allow', // Allow all download requests
    downloadPath: 'path/to/download' // Set the download path
  });
  // Click the button to trigger the download
  await (await page.waitForSelector('#someButton')).click();
  // Wait for the file to appear by polling.
  // waitForFile is a user-defined helper, not part of Puppeteer.
  await waitForFile('path/to/download/filename');
  // For uploads, the element must be an <input> element
  let inputElement = await page.waitForXPath('//input[@type="file"]');
  await inputElement.uploadFile('/path/to/file');
  await browser.close();
})();
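The waitForFile function used above is not defined in the example and is not part of Puppeteer. A minimal polling implementation might look like the sketch below; the option names intervalMs and timeoutMs are assumptions made for illustration.

```javascript
const fs = require('node:fs');

// Hypothetical implementation of the waitForFile helper used above:
// poll until the file exists, or reject once timeoutMs has elapsed.
function waitForFile(filePath, { intervalMs = 100, timeoutMs = 30000 } = {}) {
  return new Promise((resolve, reject) => {
    const deadline = Date.now() + timeoutMs;
    const check = () => {
      if (fs.existsSync(filePath)) return resolve(filePath);
      if (Date.now() >= deadline) {
        return reject(new Error(`Timed out waiting for ${filePath}`));
      }
      setTimeout(check, intervalMs);
    };
    check();
  });
}
```

This only checks that the path exists. If the browser creates the final file name before the download has finished, an extra check (for example, waiting until the file size stops changing) would be needed.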
Case 9: handling a jump to a new tab
When clicking a button opens a new tab, a new Page is created. How do we get the Page instance for that new tab? We can listen for the targetcreated event on the Browser, which indicates that a new target has been created:
let page = await browser.newPage();
await page.goto(url);
let btn = await page.waitForSelector('#btn');
// Before clicking the button, define a Promise in advance that
// resolves to the Page object of the new tab
const newPagePromise = new Promise(res =>
  browser.once('targetcreated',
    target => res(target.page())
  )
);
await btn.click();
// After clicking the button, wait for the new tab's Page object
let newPage = await newPagePromise;
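The snippet above waits forever if no tab opens, and it resolves on the first target of any kind. The sketch below is one possible variant that adds a timeout and skips targets that are not pages. It uses only the targetcreated event, target.type() and target.page(); the helper name waitForNewPage and the default timeout are assumptions made for illustration.

```javascript
// Hypothetical helper: resolve with the Page of the next tab opened in
// `browser`, or reject if none appears within timeoutMs.
function waitForNewPage(browser, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    const onTarget = target => {
      if (target.type() !== 'page') return; // skip workers and other targets
      clearTimeout(timer);
      browser.off('targetcreated', onTarget);
      resolve(target.page());
    };
    const timer = setTimeout(() => {
      browser.off('targetcreated', onTarget);
      reject(new Error('No new page was opened in time'));
    }, timeoutMs);
    browser.on('targetcreated', onTarget);
  });
}

// Usage (not executed here): create the promise before clicking.
// const newPagePromise = waitForNewPage(browser);
// await btn.click();
// const newPage = await newPagePromise;
```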
Case 10: simulate different devices
Puppeteer can emulate different devices. The puppeteer.devices object defines the configuration of many devices, mainly consisting of viewport and userAgent; the page.emulate function then applies a configuration to emulate that device.
const puppeteer = require('puppeteer');
const iPhone = puppeteer.devices['iPhone 6'];

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.emulate(iPhone);
  await page.goto('https://www.google.com');
  await browser.close();
});
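Since a device configuration is essentially a viewport plus a userAgent, a device that is not in the built-in list can be described by hand. The sketch below builds such an object. The helper name makeDevice, the sample dimensions and the user agent string are placeholders made up for illustration, not values taken from Puppeteer.

```javascript
// Hypothetical helper: build a descriptor containing the two parts
// described above, a userAgent and a viewport.
function makeDevice({ width, height, userAgent, deviceScaleFactor = 1, mobile = true }) {
  return {
    userAgent,
    viewport: {
      width,
      height,
      deviceScaleFactor,
      isMobile: mobile,
      hasTouch: mobile,
      isLandscape: width > height,
    },
  };
}

const customPhone = makeDevice({
  width: 375,
  height: 667,
  deviceScaleFactor: 2,
  userAgent: 'Mozilla/5.0 (placeholder user agent for illustration)',
});

// Usage with Puppeteer (not executed here):
// await page.emulate(customPhone);
```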
Puppeteer vs PhantomJS
- Operates a fully real browser and supports all Chrome features
- Can provide browser environments for different versions of Chrome
- Maintained by the Chrome team, with better compatibility and prospects
- The headless option can be toggled to make debugging easier, and passing --remote-debugging-port=9222 opens a debugging interface
- Supports the latest JS syntax, such as async/await
- A complete event-driven mechanism, so there is no need for many sleep calls
- Installing the PhantomJS environment is complex and its API is not friendly to use
- The main difference between the two is that PhantomJS uses an older version of WebKit as its rendering engine
- Faster and better performance than PhantomJS. The following are performance comparison results of Puppeteer and PhantomJS measured by others: