Python crawler (19): crawling a forum site for the GIF animations commonly seen around the web

Keywords: database, Selenium, GitHub, network


Original article: http://blog.csdn.net/qiqiyingse/article/details/78501034 (author's blog: http://blog.csdn.net/qiqiyingse)
I haven't written a crawler-related article for some time, so today I am taking the time to share a program I wrote a while ago.

People who often visit AcFun and Bilibili (the "A" and "B" sites) are certainly familiar with the collection <Common GIF animations on the network>.

Today I'm going to share how to automatically save these GIFs to my computer with a crawler (the program was actually written back in May; I only got around to sharing it now).

I. Analysis of the approach

Following the basic workflow of a crawler:
1. Find a target
2. Fetch the target
3. Process the fetched content and extract the useful information

1. First, our goal is: http://gifcc.com/forum.php.


This website is a forum divided into several sections, each full of all kinds of GIFs.
Our goal is to find the (hidden) addresses of these GIFs.

2. Look at the URL of each section and see what the pattern is.

'http://gifcc.com/forum-37-1.html', # other miscellaneous GIF sources
'http://gifcc.com/forum-38-1.html', # beauty GIF sources
'http://gifcc.com/forum-47-1.html', # sci-fi / fantasy movie GIF sources
'http://gifcc.com/forum-48-1.html', # comedy movie GIF sources
'http://gifcc.com/forum-49-1.html', # action / adventure movie GIF sources
'http://gifcc.com/forum-50-1.html'  # horror / thriller movie GIF sources
Sure enough: when visited as a guest, each section's URL has the form http://gifcc.com/forum-XX-1.html.
So what is the pattern of the content inside each section? Let's look at the page itself:



What we care about are the current page's URL and the number of pages. When we jump to the second page, the URL becomes http://gifcc.com/forum-38-2.html.
In other words, the URL pattern is http://gifcc.com/forum-XX-YY.html, where XX is the section id and YY is the page number; a quick sketch of this pattern is shown below.
Note that the images on these pages are loaded lazily: only as you scroll down do the images further down appear. Keep that in mind for now.
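As a quick illustration of that pattern, here is a small sketch that assembles the page URLs for one section; the section id 38 and the page count of 3 are just example values (the real page count is read from the page later).

section_id = 38      # example: the beauty GIF section
page_count = 3       # example value
for page in range(1, page_count + 1):
    print('http://gifcc.com/forum-%d-%d.html' % (section_id, page))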


3. The pattern of each GIF's own page


Strictly speaking there is no pattern here, but as long as we can find the address of each individual image, there is nothing difficult to deal with; the helper that later extracts that address from a detail page is sketched below.
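To preview how this is handled later: the complete code further down includes a small helper, get_final_gif_url, that pulls the image address out of a rendered detail page with PyQuery. As a standalone sketch (html being the detail page source returned by webdriver):

from pyquery import PyQuery as pq

def get_final_gif_url(html):
    #The post body sits in a <td class="t_f">; the gif is the <img> inside it
    doc = pq(html)
    image_content = doc('td[class="t_f"]')
    return image_content('img').attr('src')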

II. Getting started


1. Get the content of the entry page
That is, given the incoming URL, fetch the full source of the page.
#Get only the content of the page
def get_html_Pages(self,url):
    try:
        #browser = webdriver.PhantomJS(executable_path=r'C:\Python27\Scripts\phantomjs.exe')
        browser = webdriver.PhantomJS()
        browser.get(url)
        html = browser.execute_script("return document.documentElement.outerHTML")
        browser.close()
        html = HTMLParser.HTMLParser().unescape(html).decode('utf-8')
        return html
    #Catch exceptions to prevent the program from dying outright
    except Exception,e:
        print u"Connection failure, error cause:",e
        return None
Here we use webdriver together with the PhantomJS module. Why? Because the pages are rendered dynamically, a plain HTTP request only captures part of the data, while a headless browser sees the rendered DOM.
That raises another question: this function never scrolls, so how is the rest of the data loaded? It isn't; get_html_Pages only grabs what is rendered initially, and the scrolling is done by a separate function, get_all_page (see the complete code below). A condensed sketch of that scrolling loop follows.
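For contrast, here is a condensed, self-contained sketch of the scrolling variant; it mirrors get_all_page in the complete code further down (the example URL, the 30 iterations and the 1-second pause are taken from there).

from selenium import webdriver
import time

url = 'http://gifcc.com/forum-38-1.html'   # example section page
browser = webdriver.PhantomJS()
browser.get(url)
#Scroll to the bottom repeatedly so the lazily loaded images get added to the DOM
js = "var q=document.body.scrollTop=100000"
for i in range(30):
    browser.execute_script(js)
    time.sleep(1)   # give the page a moment to append new content
html = browser.execute_script("return document.documentElement.outerHTML")
browser.close()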

2. Get the number of pages
#Get the page number
def get_page_num(self,html):

    doc = pq(html)
    print u'Start getting the total page number'
    #print doc('head')('title').text()#Get the current title
    try:
        #If there are many pages (more than eight), use the "last page" link to get the page count
        if doc('div[class="pg"]')('[class="last"]'):
            num_content= doc('div[class="pg"]')('[class="last"]').attr('href')
            print num_content.split('-')[1].split('.')[0]
            return num_content.split('-')[1].split('.')[0]
        else:
            num_content= doc('div[class="pg"]')('span')
            return filter(str.isdigit,str(num_content.text()))[0]
    #If getting the page number fails, return 1, i.e. only fetch one page of content
    except Exception,e:
        print u'Failed to get page number:',e
        return '1'

Here a module called pq (PyQuery) is used to handle the page number.
 from pyquery import PyQuery as pq 
Using PyQuery to find the elements we need feels cleaner and more convenient.
One small detail: if you look at the page, you will notice that each section shows the pager both above and below the list, so I only take one value here, since we only need the page count once. A minimal example of this kind of lookup follows.
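As a minimal, self-contained illustration of the PyQuery calls used above (the HTML fragment below is made up, shaped roughly like the forum's pager block):

from pyquery import PyQuery as pq

html = '''<div>
  <div class="pg"><span>13 pages</span><a class="last" href="forum-38-13.html">last page</a></div>
</div>'''
doc = pq(html)
print(doc('div[class="pg"]')('[class="last"]').attr('href'))   # href of the "last page" link
print(doc('div[class="pg"]')('span').text())                   # text of the pager span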


Steps 3 to 6
In fact, these steps simply iterate over the pages according to the page count and fetch the content of each page,
then collect all the picture addresses found on each page.

print u'All in all %d pages of content' % int(page_num)
#3. Iterate over each page
for num in range(1,int(page_num)):
    #4. Assemble the new URL
    new_url = self.url.replace( self.url.split('-')[2],(str(num)+'.html') )
    print u'The upcoming page is:',new_url
    #5. Load each page and get the gif list on it
    items=self.parse_items_by_html(self.get_all_page(new_url))
    print u'On page %d, found %d picture items' % (num,len(items))
    #6. Process each element
    self.get_items_url(items,num)
When fetching the content of each page, the page URL has to be reassembled:
#4. Assemble the new URL
new_url = self.url.replace( self.url.split('-')[2],(str(num)+'.html') )
print u'The upcoming page is:',new_url
With the new URL, we can fetch the current page's content and process it into the list of picture entries.

#5. Load each page and get the gif list on it
items=self.parse_items_by_html(self.get_all_page(new_url))
print u'On page %d, found %d picture items' % (num,len(items))

#Parse the page content to get the list of gif entries
def parse_items_by_html(self, html):
    doc = pq(html)
    print u'Start looking for content msg'
    return doc('div[class="c cl"]')
After getting the list of entries, parse each one again to get the page URL of each picture.
#Iterate over the gif list and process each gif entry
def get_items_url(self,items,num):
    i=1
    for article in items.items():
        print u'Start processing data (%d/%d)' % (i, len(items))
        #print article
        self.get_single_item(article,i,num)
        i +=1

#Process a single gif entry: get its page address (the gif's final address comes later)
def get_single_item(self,article,num,page_num):
    gif_dict={}
    #Address of the entry's page
    gif_url= 'http://gifcc.com/'+article('a').attr('href')
    #Title of the entry's page
    gif_title= article('a').attr('title')

    #The specific address of each image
    #html=self.get_html_Pages(gif_url)
    #gif_final_url=self.get_final_gif_url(html)

    gif_dict['num']=num
    gif_dict['page_num']=page_num
    gif_dict['gif_url']=gif_url
    gif_dict['gif_title']=gif_title
    self.gif_list.append(gif_dict)
    data=u'Page '+str(page_num)+'|\t'+str(num)+'|\t'+gif_title+'|\t'+gif_url+'\n'
    self.file_flag.write(data)
Here the data for each entry is collected into a dict, ready to be written to the database later; a sketch of the resulting record is shown below.
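For orientation, a minimal sketch of what one of these records looks like once it reaches MongoDB; the client setup mirrors set_db in the complete code, and the concrete values here are made up for illustration.

import pymongo

client = pymongo.MongoClient()                                 # same default local connection as in __init__
table = client.GifDB[u'Beauty GIF Origin of Dynamic Graph']    # one collection per section, as in set_db

#A record in the same shape as gif_dict (values are made up)
gif_dict = {
    'num': 3,                                   # position of the entry on its forum page
    'page_num': 1,                              # which forum page it was found on
    'gif_url': 'http://gifcc.com/some-thread.html',
    'gif_title': u'example title',
}
table.insert_one(gif_dict)                      # step 7 later adds 'gif_final_url' before inserting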

7. Store pictures locally and write data to the database

#Use urllib2 to get the final address of the image
def get_final_gif_url_use_urllib2(self,url):
    try:
        html= urllib2.urlopen(url).read()
        gif_pattern=re.compile('<div align="center.*?<img id=.*?src="(.*?)" border.*?>',re.S)
        return re.search(gif_pattern,html).group(1)
    except Exception,e:
        print u'Error getting page content:',e

#Final processing and storage of the data
def get_gif_url_and_save_gif(self):
    def save_gif(url,name):
        try:
            urllib.urlretrieve(url, name)
        except Exception,e:
            print 'Storage failure due to:',e
    for i in range(0,len(self.gif_list)):
        gif_dict=self.gif_list[i]
        gif_url=gif_dict['gif_url']
        gif_title=gif_dict['gif_title']

        #Still use webdriver to get the final gif address
        final_html=self.get_html_Pages(gif_url)
        gif_final_url=self.get_final_gif_url(final_html)
        #Alternatively, use urllib2 to get the final address
        #gif_final_url=self.get_final_gif_url_use_urllib2(gif_url)

        gif_dict['gif_final_url']=gif_final_url
        print u'Start writing page %d item %d to the database and saving the picture locally' % (gif_dict['page_num'],gif_dict['num'])
        self.BookTable.insert_one(gif_dict)
        gif_name=self.dir_name+'/'+gif_title+'.gif'
        save_gif(gif_final_url,gif_name)
At this point the main work is done.

We can save all the GIFs from every section of this forum locally and, at the same time, write the data into the database.

III. De-duplicating the database

After putting the data into the database, I figured I could drive the picture saving directly from the database.
(Why? Because I found that saving the pictures inside the main program makes it run too slowly, so it is better to put all the data into the database first and then have a dedicated pass that reads the database and saves the pictures.)
But there is a problem here: there is a lot of data, and much of it is duplicated, so the database needs to be de-duplicated.
As for how to de-duplicate, I already covered it in an earlier article (the crawler was already finished when that article was written).
The main idea is to work from the count of a given field value: pymongo can count the documents matching a value, and for each value we delete count-1 copies, so a value that currently occurs only once is left untouched.
The core code is as follows:

for url in collection.distinct('name'):#distinct gives a list of each unique value
    num= collection.count({"name":url})#count how many documents have this value
    print num
    for i in range(1,num):#delete num-1 times, so a value that occurs only once is not deleted
        print 'delete %s %d times '% (url,i)
        #Note the second parameter: strangely, in the mongo shell passing 1 removes a single element, but here passing 0 removes a single element
        collection.remove({"name":url},0)
    for i in  collection.find({"name":url}):#print the remaining documents with this value
        print i
print collection.distinct('name')#print the de-duplicated values again
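For comparison, here is a minimal alternative sketch of the same de-duplication (my own variant, not the article's code, assuming pymongo 3.x and the gif_title field used above): keep the first document for each title and delete the rest by _id, which avoids relying on the behaviour of remove's second parameter.

import pymongo

client = pymongo.MongoClient('localhost', 27017)
collection = client.GifDB[u'Beauty GIF Origin of Dynamic Graph']   # collection name assumed

for title in collection.distinct('gif_title'):
    #All documents sharing this title; keep the first returned, drop the rest
    ids = [doc['_id'] for doc in collection.find({'gif_title': title})]
    for dup_id in ids[1:]:
        collection.delete_one({'_id': dup_id})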

IV. Read the database and save the pictures

With the data de-duplicated, saving the pictures again is much more convenient.
After that, if the local images are deleted, or they are simply taking up too much space, there is no need to crawl everything again: the data kept in the database is enough to re-download them.
The core code is as follows:

def save_gif(url,name):
    try:
        urllib.urlretrieve(url, name)
    except Exception,e:
        print u'Storage failure due to:',e

client = pymongo.MongoClient('localhost', 27017)
print client.database_names()


db = client.GifDB
for table in  db.collection_names():
    print 'table name is ',table
    collection=db[table]

    for item in  collection.find():
        try:
            if item['gif_final_url']:
                url,url_title= item['gif_final_url'],item['gif_title']
                gif_filename=table+'/'+url_title+'.gif'
                print 'start save %s, %s' % (url,gif_filename)
                save_gif(url,gif_filename)
        except Exception,e:
            print u'Reasons for error:',e

V. Complete code

01_get_gif_url.py
#coding: utf-8
from pyquery import PyQuery as pq
from selenium import webdriver
import HTMLParser,urllib2,urllib,re,os

import pymongo

import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

class download_gif:
    def __init__(self):
        self.url='http://gifcc.com/forum-38-1.html'
        self.url_list=['http://gifcc.com/forum-37-1.html',#Other miscellaneous GIF sources
        'http://gifcc.com/forum-38-1.html',#Beauty GIF sources
        'http://gifcc.com/forum-47-1.html',#Sci-fi / fantasy movie GIF sources
        'http://gifcc.com/forum-48-1.html',#Comedy movie GIF sources
        'http://gifcc.com/forum-49-1.html',#Action / adventure movie GIF sources
        'http://gifcc.com/forum-50-1.html'#Horror / thriller movie GIF sources
        ]
        self.choices={'1':u'Other kinds GIF Origin of Dynamic Graph',
        '2':u'Beauty GIF Origin of Dynamic Graph',
        '3':u'Science fiction fantasy movies GIF Origin of Dynamic Graph',
        '4':u'Comedy Funny Film GIF Origin of Dynamic Graph',
        '5':u'Action Adventure Film GIF Origin of Dynamic Graph',
        '6':u'Horror thriller movies GIF Origin of Dynamic Graph'
        }

        self.dir_name=u'gif Source'
        self.gif_list=[]

        self.connection = pymongo.MongoClient()

        #BookTable.insert_one(dict_data)#Insert a single document
        #BookTable.insert(dict_data)#Insert a list of documents

    #Get the content of the page; execute JS and scroll so that more elements of the page are loaded
    def get_all_page(self,url):
        try:
            #browser = webdriver.PhantomJS(executable_path=r'C:\Python27\Scripts\phantomjs.exe')
            browser = webdriver.PhantomJS()
            browser.get(url)
            #time.sleep(3)
            #Page scroll
            js = "var q=document.body.scrollTop=100000"
            #for i in range(5):  #Debugging value, do not scroll too many times for now
            for i in range(30):
                #Scroll down in a loop
                browser.execute_script(js)
                #After each scroll, rest a moment
                time.sleep(1)
                print u'Scrolling the page, pass %d' % i
            #Execute JS to get the whole page content
            html = browser.execute_script("return document.documentElement.outerHTML")
            browser.close()
            html=HTMLParser.HTMLParser().unescape(html)
            return html
        except Exception,e:
            print u'Errors occurred:',e

    #Parse the page content to get the list of gif entries
    def parse_items_by_html(self, html):
        doc = pq(html)
        print u'Start looking for content msg'
        return doc('div[class="c cl"]')

    #Iterate over the gif list and process each gif entry
    def get_items_url(self,items,num):
        i=1
        for article in items.items():
            print u'Start processing data (%d/%d)' % (i, len(items))
            #print article
            self.get_single_item(article,i,num)
            i +=1

    #Process a single gif entry: get its page address (the gif's final address comes later)
    def get_single_item(self,article,num,page_num):
        gif_dict={}
        #Address of the entry's page
        gif_url= 'http://gifcc.com/'+article('a').attr('href')
        #Title of the entry's page
        gif_title= article('a').attr('title')

        #The specific address of each image
        #html=self.get_html_Pages(gif_url)
        #gif_final_url=self.get_final_gif_url(html)

        gif_dict['num']=num
        gif_dict['page_num']=page_num
        gif_dict['gif_url']=gif_url
        gif_dict['gif_title']=gif_title
        self.gif_list.append(gif_dict)
        data=u'Page '+str(page_num)+'|\t'+str(num)+'|\t'+gif_title+'|\t'+gif_url+'\n'
        self.file_flag.write(data)

    #After getting the entry page's content through webdriver, extract the final gif address
    def get_final_gif_url(self,html):
        doc = pq(html)
        image_content= doc('td[class="t_f"]')
        gif_url= image_content('img').attr('src')
        return gif_url

    #Use urllib2 to get the final address of the image
    def get_final_gif_url_use_urllib2(self,url):
        try:
            html= urllib2.urlopen(url).read()
            gif_pattern=re.compile('<div align="center.*?<img id=.*?src="(.*?)" border.*?>',re.S)
            return re.search(gif_pattern,html).group(1)
        except Exception,e:
            print u'Error getting page content:',e

    #Final processing and storage of the data
    def get_gif_url_and_save_gif(self):
        def save_gif(url,name):
            try:
                urllib.urlretrieve(url, name)
            except Exception,e:
                print 'Storage failure due to:',e
        for i in range(0,len(self.gif_list)):
            gif_dict=self.gif_list[i]
            gif_url=gif_dict['gif_url']
            gif_title=gif_dict['gif_title']

            #Still use webdriver to get the final gif address
            final_html=self.get_html_Pages(gif_url)
            gif_final_url=self.get_final_gif_url(final_html)
            #Alternatively, use urllib2 to get the final address
            #gif_final_url=self.get_final_gif_url_use_urllib2(gif_url)

            gif_dict['gif_final_url']=gif_final_url
            print u'Start writing page %d item %d to the database and saving the picture locally' % (gif_dict['page_num'],gif_dict['num'])
            self.BookTable.insert_one(gif_dict)
            gif_name=self.dir_name+'/'+gif_title+'.gif'
            save_gif(gif_final_url,gif_name)

    #Get only the content of the page
    def get_html_Pages(self,url):
        try:
            #browser = webdriver.PhantomJS(executable_path=r'C:\Python27\Scripts\phantomjs.exe')
            browser = webdriver.PhantomJS()
            browser.get(url)
            html = browser.execute_script("return document.documentElement.outerHTML")
            browser.close()
            html=HTMLParser.HTMLParser().unescape(html).decode('utf-8')
            return html
        #Catch exceptions to prevent the program from dying outright
        except Exception,e:
            print u"Connection failure, error cause:",e
            return None

    #Get the page number
    def get_page_num(self,html):

        doc = pq(html)
        print u'Start getting the total page number'
        #print doc('head')('title').text()#Get the current title
        try:
            #If there are many pages (more than eight), use the "last page" link to get the page count
            if doc('div[class="pg"]')('[class="last"]'):
                num_content= doc('div[class="pg"]')('[class="last"]').attr('href')
                print num_content.split('-')[1].split('.')[0]
                return num_content.split('-')[1].split('.')[0]
            else:
                num_content= doc('div[class="pg"]')('span')
                return filter(str.isdigit,str(num_content.text()))[0]
        #If getting the page number fails, return 1, i.e. only fetch one page of content
        except Exception,e:
            print u'Failed to get page number:',e
            return '1'

        # filter(str.isdigit,num_content)#Extract digits from a string

    #Create folders
    def mk_dir(self,path):
        if not os.path.exists(path):
            os.makedirs(path)

    def set_db(self,tablename):
        self.BookDB = self.connection.GifDB         #Name of the database (db)
        self.BookTable =self.BookDB[tablename]      #Name of the table (collection)

    #Main function
    def run(self):
        choice_type=5
        if choice_type:
        #for choice_type in range(len(self.choices)):
            if  choice_type+1:

                self.dir_name=self.choices[str(choice_type+1)].strip()
                self.url=self.url_list[int(choice_type)]

                print self.dir_name,self.url

            #0. Create the folder that stores the pictures and the text file
            self.mk_dir(self.dir_name)
            self.filename=self.dir_name+'/'+self.dir_name+'.txt'
            print self.filename
            self.file_flag=open(self.filename,'w')

            self.set_db(self.dir_name)
            self.BookTable.insert({'filename':self.dir_name})

            print self.url
            #1. Get the content of the entry page
            html=self.get_html_Pages(self.url)

            #2. Get the number of pages
            page_num=self.get_page_num(html)

            print u'All in all %d pages of content' % int(page_num)
            #3. Iterate over each page

            #page_num=3#Debugging value, temporarily keep the page count small
            for num in range(1,int(page_num)):
                #4. Assemble the new URL
                new_url = self.url.replace( self.url.split('-')[2],(str(num)+'.html') )
                print u'The upcoming page is:',new_url
                #5. Load each page and get the gif list on it
                items=self.parse_items_by_html(self.get_all_page(new_url))
                print u'On page %d, found %d picture items' % (num,len(items))
                #6. Process each element
                self.get_items_url(items,num)

            #7. After all the data has been collected, start processing it
            self.get_gif_url_and_save_gif()
            print 'success'

            self.file_flag.close()


if __name__ == '__main__':
    print u'''
            **************************************************
            **    Welcome to Spider of  GIF Source picture  **
            **         Created on 2017-05-21                **
            **         @author: Jimy _Fengqi                **
            **************************************************
    '''
    print u''' Select the type of gif picture you want to download:
        1:'Other kinds GIF Origin of Dynamic Graph'
        2:'Beauty GIF Origin of Dynamic Graph'
        3:'Science fiction fantasy movies GIF Origin of Dynamic Graph'
        4:'Comedy Funny Film GIF Origin of Dynamic Graph'
        5:'Action Adventure Film GIF Origin of Dynamic Graph'
        6:'Horror thriller movies GIF Origin of Dynamic Graph'
        '''
    #Select the type to download
    mydownload=download_gif()
    html=mydownload.run()
02_delete_repeat_url_in_mongodb.py

#coding: utf-8
from pyquery import PyQuery as pq
from selenium import webdriver
import HTMLParser,urllib2,urllib,re,os

import pymongo

import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def save_gif(url,name):
    try:
        urllib.urlretrieve(url, name)
    except Exception,e:
        print 'Storage failure due to:',e

def print_database_and_table_name():
    client = pymongo.MongoClient('localhost', 27017)
    print client.database_names()

    for database in client.database_names():
        for table in  client[database].collection_names():
            print 'table  [%s]  is in database [%s]' % (table,database)

def delete_single_database_repeat_data():
    client = pymongo.MongoClient('localhost', 27017)
    db=client.GifDBtemptemp2#The name of the database whose data needs cleaning
    for table in  db.collection_names():
        print 'table name is ',table
        collection=db[table]
        for url in collection.distinct('gif_title'):#distinct gives a list of each unique value
            num= collection.count({"gif_title":url})#count how many documents have this value
            print num
            for i in range(1,num):#delete num-1 times, so a value that occurs only once is not deleted
                print 'delete %s %d times '% (url,i)
                #Note the second parameter: strangely, in the mongo shell passing 1 removes a single element, but here passing 0 removes a single element
                collection.remove({"gif_title":url},0)
            for i in  collection.find({"gif_title":url}):#print the remaining documents with this value
                print i

def delete_repeat_data():
    client = pymongo.MongoClient('localhost', 27017)
    db = client.local
    collection = db.person

    for url in collection.distinct('name'):#distinct gives a list of each unique value
        num= collection.count({"name":url})#count how many documents have this value
        print num
        for i in range(1,num):#delete num-1 times, so a value that occurs only once is not deleted
            print 'delete %s %d times '% (url,i)
            #Note the second parameter: strangely, in the mongo shell passing 1 removes a single element, but here passing 0 removes a single element
            collection.remove({"name":url},0)
        for i in  collection.find({"name":url}):#print the remaining documents with this value
            print i
    print collection.distinct('name')#print the de-duplicated values again

delete_single_database_repeat_data()

03_from_mongodb_save_pic.py

#coding: utf-8
from pyquery import PyQuery as pq
from selenium import webdriver
import HTMLParser,urllib2,urllib,re,os

import pymongo

import time
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def save_gif(url,name):
    try:
        urllib.urlretrieve(url, name)
    except Exception,e:
        print u'Storage failure due to:',e

client = pymongo.MongoClient('localhost', 27017)
print client.database_names()


db = client.GifDB
for table in  db.collection_names():
    print 'table name is ',table
    collection=db[table]

    for item in  collection.find():
        try:
            if item['gif_final_url']:
                url,url_title= item['gif_final_url'],item['gif_title']
                gif_filename=table+'/'+url_title+'.gif'
                print 'start save %s, %s' % (url,gif_filename)
                save_gif(url,gif_filename)
        except Exception,e:
            print u'Reasons for error:',e
GitHub address: https://github.com/JimyFengqi/Gif_Spider
