Downloading Jinjiang novels with a Python 3.x crawler (using a cookie to get VIP chapters)

Preface

As a girl, I like reading novels on Jinjiang. However, the Jinjiang website is rather peculiar, and the only way I found to save a novel was to write a crawler program.

At first I saved novels as txt files. Later I switched to epub, because for the same novel the epub format not only offers richer features and a nicer reading interface, but also produces a much smaller file than txt. (A txt file saved in ASCII encoding is also much smaller, but some characters are lost.)

I will also publish the source code of the txt version, but that version hasn't been updated for a long time, so it has many bugs.

The project is hosted on GitHub, and I plan to rewrite the program in C.

(I pieced this program together. It was originally written for my own use, and I never expected to blog about it. It runs fine, but the naming is very inconsistent. Please don't imitate it.)

https://github.com/7325156/jjdown

Python libraries

The following libraries need to be installed:

pip install requests
pip install lxml
pip install selenium    #used to obtain cookies; optional
pip install opencc-python-reimplemented    #Traditional/Simplified Chinese conversion; optional
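
Since two of the libraries are optional, it can help to check which ones are actually available before running the crawler. The following is only a small sketch and not part of the original program; the helper name is_installed is made up, and it assumes the opencc-python-reimplemented package is imported under the module name opencc.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# requests and lxml are required; selenium and opencc are optional.
for name in ("requests", "lxml", "selenium", "opencc"):
    print(name, "installed" if is_installed(name) else "missing")
```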

Get cookie

The cookie stores the user's login information. To crawl VIP chapters you have purchased, you must add the cookie to the request headers.

There are many ways to get the cookie. The one I recommend: log in to the Jinjiang website, press F12 to open the developer tools, type document.cookie in the Console, and press Enter.

Note: Jinjiang cookies sometimes stop working. If that happens, try getting the cookie again with IE or Edge. If that still fails, delete the "timeoffset" item from the cookie.
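
Deleting the "timeoffset" item can be done by hand or with a small helper. The following is a minimal sketch, not part of the original program; the function name remove_cookie_item and the sample cookie string are made up for illustration.

```python
def remove_cookie_item(cookie_string, name):
    """Return cookie_string without the item whose key is `name`."""
    kept = []
    for item in cookie_string.split(";"):
        item = item.strip()
        if not item:
            continue
        # Each item looks like "key=value"; compare only the key part.
        key = item.split("=", 1)[0].strip()
        if key != name:
            kept.append(item)
    return "; ".join(kept)


# Hypothetical cookie string, shaped like the output of document.cookie
sample = "token=abc123; timeoffset=-480; theme=dark"
print(remove_cookie_item(sample, "timeoffset"))
# prints token=abc123; theme=dark
```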

Here is a program for obtaining the cookie. I don't think it's very useful: it's cumbersome to use and not well optimized. Have a look, and use it if you want.

Before using it, you need to download chromedriver and put it in the Python path.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

#You need to install chromedriver and put it in the Python path
driver=webdriver.Chrome()

driver.delete_all_cookies()
driver.get("http://my.jjwxc.net/login.php")

username=input("Please enter the user name:")
passwd=input("Please enter the password:")
#find_element_by_* was removed in recent Selenium 4 releases; use find_element(By..., ...) instead
driver.find_element(By.ID,"loginname").send_keys(username)
driver.find_element(By.ID,"loginpassword").send_keys(passwd)
driver.find_element(By.XPATH,"//*[@id='login_submit_tr']/input").click()
#Give the login request time to finish before reading the cookies (5 seconds is a guess)
time.sleep(5)

#Join the cookies into a single "name=value; name=value" string
cookies = driver.get_cookies()
cookies_list= []
for cookie_dict in cookies:
    cookie =cookie_dict['name']+'='+cookie_dict['value']
    cookies_list.append(cookie)
header_cookie = '; '.join(cookies_list)
   
headers = {
    'cookie':header_cookie,
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
}

print(headers)

driver.quit()

Add cookies to the header to get all chapters and volume labels

Paste the cookie you obtained between the quotes after 'cookie':

headerss={'cookie': ' ',
          'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}

The ID (ids in the code below) is the novel number. Every novel on Jinjiang has one: it is the number at the end of the novel's URL. For example, if the novel's URL is http://www.jjwxc.net/onebook.php?novelid=2710871 , then 2710871 is the novel number.
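
If you would rather paste the whole URL than pick the number out by hand, the standard library can extract it. This is a minimal sketch, not part of the original program; the function name novel_id_from_url is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def novel_id_from_url(url):
    """Return the value of the novelid query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("novelid")
    return values[0] if values else None


print(novel_id_from_url("http://www.jjwxc.net/onebook.php?novelid=2710871"))
# prints 2710871
```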

Use requests to send a request with these headers to the URL, then decode the response. If you skip the decoding step, the output is a pile of garbled characters.

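import requests
from lxml import etree

#Base URL of a novel page; ids is the novel number as a string
req_url_base='http://www.jjwxc.net/onebook.php?novelid='
ids='2710871'
#Build the novel URL
req_url=req_url_base+ids
#Request the page with the cookie in the headers
res=requests.get(req_url,headers=headerss).content
#The page is GB18030-encoded; decode it before parsing
ress=etree.HTML(res.decode("GB18030","ignore"))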
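
To see why the decoding step matters, here is a small self-contained sketch that needs no network access. The sample text is made up; it only stands in for the bytes a GB18030-encoded page would return.

```python
text = "晋江文学城"
raw = text.encode("gb18030")  # bytes as a GB18030 page would send them

# Decoding with the wrong codec yields replacement characters or mojibake.
wrong = raw.decode("utf-8", "replace")
# Decoding with the matching codec restores the original text.
right = raw.decode("gb18030", "ignore")

print(wrong == text)
print(right == text)
```

The "ignore" argument simply drops any bytes that cannot be decoded, so one bad byte in a page does not abort the whole download.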

Next, use XPath to get the novel's information. If you are interested you can read up on the XPath rules, but I use the simplest and crudest method here.

In Chrome, hover over the element you want and right-click > Inspect. Then hover over the highlighted tag and right-click > Copy > Copy XPath. Done.

Paste the copied XPath into the right place in the code and replace its double quotes with single quotes.

Note: an XPath copied from Chrome contains tbody tags, which must be removed.
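
Both clean-up steps (dropping tbody and switching the quotes) can be automated. This is a minimal sketch, not part of the original program; the function name clean_copied_xpath is made up, and the sample XPath is only shaped like what Chrome copies. The tbody steps need to go because the browser adds them to its own view of the page, while lxml parses the HTML as it was sent.

```python
import re


def clean_copied_xpath(xpath):
    """Drop tbody steps and replace double quotes with single quotes."""
    # Remove every "/tbody" step, with or without an index such as tbody[1].
    without_tbody = re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)
    return without_tbody.replace('"', "'")


copied = '//*[@id="oneboolt"]/tbody/tr[4]/td[2]/span/div[1]/a'
print(clean_copied_xpath(copied))
# prints //*[@id='oneboolt']/tr[4]/td[2]/span/div[1]/a
```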

The XPath expressions below extract the information. Adjust them if you want more. Links to non-VIP and VIP chapters have to be fetched separately.

#Get the synopsis
intro=ress.xpath("//html/body/table/tr/td[1]/div[2]/div[@id='novelintro']//text()")
#Get Tags
info=ress.xpath("string(/html/body/table[1]/tr/td[1]/div[3])")
#Get article information
infox=[]
for i in range(1,7):
    infox.append(ress.xpath("string(/html/body/table[1]/tr/td[3]/div[2]/ul/li["+str(i)+"])"))
#Get cover
cover=ress.xpath("string(/html/body/table[1]/tr/td[1]/div[2]/img/@src)")
#Get title
titlem=ress.xpath("//html/head/title/text()")

#Get all chapter links
#Non vip
href_list=ress.xpath("//html/body/table[@id='oneboolt']//tr/td[2]/span/div[1]/a/@href")
#vip
hhr=ress.xpath("//html/body/table[@id='oneboolt']//tr/td[2]/span/div[1]/a[1]/@rel")

Because chapters on Jinjiang are often locked, the chapter summaries and volume labels need extra processing after they are fetched. Personally, I don't think this part is handled well enough yet. I will keep optimizing it later.

#Summary of each chapter
#loc: locked chapter
loc=ress.xpath("//*[@id='oneboolt']//tr/td[2]/span/div[1]/span/ancestor-or-self::tr/td[3]/text()")
Summary=ress.xpath("//*[@id='oneboolt']//tr/td[3]/text()")
    
#Drop the locked-chapter placeholders and the summaries of locked chapters.
#Build a new list instead of deleting while looping, because deleting skips elements.
Summary=[i for i in Summary if i.strip()!='[This chapter is locked]' and i not in loc]
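
Deleting items from a list while looping over it makes the loop skip the element right after each deletion, so a single pass can leave matches behind. The sketch below uses made-up data to show the effect next to a filter that builds a new list.

```python
items = ["a", "x", "x", "b", "x", "x", "c"]

# Deleting inside the loop: the element after each deleted one is skipped.
in_place = list(items)
for i in in_place:
    if i == "x":
        del in_place[in_place.index(i)]
print(in_place)
# prints ['a', 'b', 'x', 'x', 'c']

# Filtering into a new list removes every match in one pass.
filtered = [i for i in items if i != "x"]
print(filtered)
# prints ['a', 'b', 'c']
```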

#Get volume label name
rollSign=ress.xpath("//*[@id='oneboolt']//tr/td/b[@class='volumnfont']/text()")
#Get label location
rollSignPlace=ress.xpath("//*[@id='oneboolt']//tr/td/b/ancestor-or-self::tr/following-sibling::tr[1]/td[2]/span/div[1]/a[1]/@href")
rollSignPlace+=ress.xpath("//*[@id='oneboolt']//tr/td/b/ancestor-or-self::tr/following-sibling::tr[1]/td[2]/span/div[1]/a[1]/@rel")
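
One possible way to use these two lists is to look up where each volume label belongs in the chapter list. The following is only a sketch under assumptions and not the original program's code: the function name volume_titles_by_chapter and the sample data are made up, and it assumes that rollSign and rollSignPlace line up item by item and that each entry in rollSignPlace is the link of the first chapter of a volume.

```python
def volume_titles_by_chapter(chapter_links, roll_sign, roll_sign_place):
    """Map the index of each volume's first chapter to the volume name."""
    first_chapter_to_volume = dict(zip(roll_sign_place, roll_sign))
    result = {}
    for index, link in enumerate(chapter_links):
        if link in first_chapter_to_volume:
            result[index] = first_chapter_to_volume[link]
    return result


# Made-up data shaped like the chapter links, rollSign and rollSignPlace
chapters = ["c1", "c2", "c3", "c4"]
names = ["Volume 1", "Volume 2"]
places = ["c1", "c3"]
print(volume_titles_by_chapter(chapters, names, places))
# prints {0: 'Volume 1', 2: 'Volume 2'}
```

With a mapping like this, a volume title can be written out just before the chapter at the matching index while the chapters are saved.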

Next, the acquired information is processed and stored.

(That's all for now. The rest will follow later.)

OldHillSmokeRelease

Posted by maxic0 on Mon, 13 Jan 2020 07:48:52 -0800