https://www.cnblogs.com/summer1019/p/10364469.html
A few days ago I wrote a post about "multi-process crawling of good-looking beauty streamers". Today, with nothing better to do, I went to crawl some new pretty streamers -_-
But the crawler fetched nothing. Opening the page source, I found that the attribute holding the saved image had changed to 'src'. After updating the code, it could crawl again.
But the site's engineers had apparently gotten a little annoyed: most of the pictures it crawled were the same black-and-white placeholder image.
Still, I studied the page source carefully and found that the Douyu engineers had hidden all the real image information near the end of the page...
Remarks:
I'm a tech newbie,
self-taught and self-practicing,
doing this purely for fun.
Corrections are welcome.
Explanation:
(1) In the previous version, simply changing the attribute name "data-original" to "src" let the crawler pick up a small number of real streamer photos. I suspect that was a little freebie Douyu left behind; most of the pictures were the black-and-white Douyu placeholder. (A sketch of this lazy-load pattern follows this list.)
(2) In version 2.0, each picture comes in two resolutions, large and small, so there are twice as many links as before, 480 in total. The two resolutions look different.
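To illustrate point (1), here is a minimal sketch of the lazy-load pattern, using a made-up HTML fragment (the file names and URL are invented for illustration). Pages like this put a placeholder in src and keep the real URL in data-original, which is why swapping the attribute name changed what got downloaded:

import re

# Hypothetical fragment in the shape of a lazy-loaded <img> tag.
html = '<img src="black_white_placeholder.png" data-original="https://example.com/real_photo.jpg">'

# The src attribute only holds the placeholder...
print(re.search(r'src="(.*?)"', html).group(1))            # black_white_placeholder.png
# ...while data-original carries the real image URL.
print(re.search(r'data-original="(.*?)"', html).group(1))  # https://example.com/real_photo.jpg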
import requests
import re
# from bs4 import BeautifulSoup
from urllib import request
# import threading
import gevent
from gevent import monkey

monkey.patch_all()


def get_html_text(url):
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(len(r.text))
        return r.text
    except Exception as result:
        print('Error type:', result)


def html_text_parser(img_list, html):
    # This is the key change: a regular expression instead of bs4.
    # The real image information is stored near the end of the page source,
    # not in the usual HTML tags. I haven't learned any front-end knowledge
    # and don't know how to use bs4 for this, so I use a regex instead.
    img_pat = r'"rs\w+":"(.*?)"'
    links = re.compile(img_pat, re.S).findall(html)
    print(len(links))
    print(links)
    for link in links:
        if link:
            img_list.append(link)
    return img_list


def get_douyu_img(Img_list):
    for i, j in enumerate(Img_list):
        # name = j.split('.')[-1]
        r = request.urlopen(j)
        img_content = r.read()
        path = str(i) + '.jpg'  # assume JPEG; the original saved with no extension
        with open(path, 'wb') as f:
            f.write(img_content)


def main():
    url = 'https://www.douyu.com/g_yz'
    html = get_html_text(url)
    img_list = list()
    Img_list = html_text_parser(img_list, html)
    # print(Img_list)
    # t1 = threading.Thread(target=get_html_text, args=(url,))
    # t2 = threading.Thread(target=html_text_parser, args=(img_list, html))
    # t3 = threading.Thread(target=get_douyu_img, args=(Img_list,))
    # t1.start()
    # t2.start()
    # t3.start()
    # The page has already been fetched and parsed above; spawning those two
    # steps again would duplicate the link list, so only the download step
    # runs in a greenlet here.
    gevent.joinall([
        gevent.spawn(get_douyu_img, Img_list)
    ])


if __name__ == '__main__':
    main()
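To see what the regex in html_text_parser matches, here is a toy fragment in the shape the embedded page data seems to take. The key names rs1/rs2 and the URLs are assumptions for illustration, with the rs-prefixed keys holding the two resolutions of one picture, which would explain why the link count doubled:

import re

# Made-up fragment mimicking the JSON embedded near the end of the page;
# rs1/rs2 are hypothetical keys for the large and small version of one picture.
snippet = '{"rs1":"https:\\/\\/img.example.com\\/big.jpg","rs2":"https:\\/\\/img.example.com\\/small.jpg"}'

links = re.compile(r'"rs\w+":"(.*?)"', re.S).findall(snippet)
# JSON embedded in HTML often escapes "/" as "\/"; if the raw matches keep
# those escapes, unescape them before downloading.
links = [link.replace('\\/', '/') for link in links]
print(links)  # ['https://img.example.com/big.jpg', 'https://img.example.com/small.jpg']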
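As a design note, the gevent payoff grows if each download gets its own greenlet, since one greenlet can run while another waits on the network. A sketch of that shape, reusing the download logic from the script above (download_one and download_all are names I made up):

import gevent
from gevent import monkey

monkey.patch_all()  # make blocking network calls cooperative

from urllib import request

def download_one(i, link):
    # One greenlet per image: while this download waits on I/O,
    # the other greenlets can make progress.
    data = request.urlopen(link).read()
    with open(str(i) + '.jpg', 'wb') as f:  # assume JPEG, as above
        f.write(data)

def download_all(links):
    gevent.joinall([gevent.spawn(download_one, i, link)
                    for i, link in enumerate(links)])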