The Dockerfile is as follows:
FROM python
RUN pip install -i http://pypi.douban.com/simple \
    requests selenium retrying --trusted-host pypi.douban.com
The docker-compose.yaml reads as follows:
version: "3.7"
services:
myspider:
build: .
volumes: # Data Volume Mapping
- /root/mycode:/root/mycode
command: python /root/mycode/1.py
# Depending on the selenium service below, note that this dependency can only do so
# selenium service starts first, myspider service starts later (some service internal programs start fast, some slow)
# Basically, it can not solve the problem of complete dependence, so we can use delay processing and other methods.
depends_on:
- selenium
selenium:
image: selenium/standalone-chrome # Draw mirror to complete automatic configuration
ports:
- "4444:4444"
shm_size: 2g # Setting Host Shared Memory 2g
hostname: selenium # Other containers can use this name to access eg: http://selenium:4444/
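Because of the "4444:4444" ports mapping, the same service is also published on the cloud server host itself. A minimal sketch (not part of the original project) to confirm that from the host; note the /wd/hub/status path assumes a Selenium 3 standalone image, while Selenium 4 serves /status instead:

# quick_check.py -- a minimal sketch, run on the cloud server host.
import requests

# "4444:4444" publishes the container port on the host, so from the host
# the service answers at localhost. The /wd/hub/status path assumes a
# Selenium 3 standalone image; Selenium 4 exposes /status instead.
resp = requests.get("http://localhost:4444/wd/hub/status", timeout=5)
print(resp.status_code, resp.json())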
The crawler script 1.py is as follows:
import requests
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from retrying import retry
# Note: as mentioned for depends_on in docker-compose.yaml, it only orders
# container startup; it cannot guarantee the selenium service is ready
# before this script runs (which one finishes starting first is down to
# luck when the startup speeds are comparable). One workaround is a fixed
# delay:
# import time
# time.sleep(3)  # a fixed sleep is always a bit off; the retrying module below is a better fit
# retrying usage, for reference: https://segmentfault.com/a/1190000019301761#articleHeader17
@retry(
    stop_max_attempt_number=10000,  # give up after 10000 attempts...
    stop_max_delay=10 * 1000,       # ...or after 10 seconds, whichever comes first
)
def verify_request():
    # probe the Selenium service until it answers; a connection error raises
    # an exception, which triggers the next retry
    response = requests.get("http://selenium:4444", timeout=0.5)
    print(response)

verify_request()
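# An alternative sketch (not in the original): instead of only probing the
# port, poll Selenium's status endpoint and retry until it reports ready.
# The /wd/hub/status path assumes a Selenium 3 standalone image (Selenium 4
# exposes /status), and the name wait_until_ready is made up here.
# @retry(stop_max_delay=10 * 1000, wait_fixed=500)
# def wait_until_ready():
#     status = requests.get("http://selenium:4444/wd/hub/status", timeout=0.5).json()
#     assert status["value"]["ready"]
# wait_until_ready()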
# What follows is essentially the standard boilerplate for connecting to a
# Dockerized Selenium service; it can be reused as a template
with webdriver.Remote(
    command_executor='http://selenium:4444/wd/hub',  # selenium is the hostname set in docker-compose.yaml
    desired_capabilities=DesiredCapabilities.CHROME
) as driver:
    driver.get('http://www.baidu.com')
    # Use an absolute path here, or the file will not land in the mapped
    # data volume (the mapping is in the volumes section of
    # docker-compose.yaml above)
    with open('/root/mycode/test.html', 'w') as f:
        f.write(driver.page_source)
    print('Written successfully')
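One caveat if you upgrade: desired_capabilities only exists in Selenium 3; Selenium 4 removed it in favor of options. A minimal sketch of the equivalent connection, assuming a Selenium 4 client and image (not the original post's setup):

# Selenium 4 sketch: pass ChromeOptions instead of DesiredCapabilities.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

with webdriver.Remote(
    command_executor='http://selenium:4444/wd/hub',
    options=Options(),  # replaces desired_capabilities=DesiredCapabilities.CHROME
) as driver:
    driver.get('http://www.baidu.com')
    print(driver.title)

With both files in place, docker-compose up --build builds the myspider image, starts the pair, and runs the script.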
Pitfalls
Selenium runs as a server-side program, so it can be deployed in a Docker container on a remote cloud server.
After deploying the container, though, the service was only accessible from the cloud server itself, not from my remote machine.
(
There was actually no need for remote access at all; it was an idle idea that sent me on a detour. The crawler code is also deployed in a container, and container-to-container access works perfectly well. But if you do try to reach it remotely, it simply doesn't work.
)
(
My reasoning was: the cloud server host can reach the server program started inside the container, but a remote machine cannot, so it must be a connection configuration problem between container and host. With that train of thought I searched for a long time and found exactly nothing.
)
With no better option, I went through a proxy over the Great Firewall to search for solutions to this problem.
Then I noticed, by accident, that the client could suddenly connect.
Further testing confirmed it: wow, reaching this server remotely really did require going through the proxy...
But I still don't understand why the cloud server host can reach the container's server program without the proxy.
(Though the doubt is moot anyway.)