Conclusion
Let's start with the conclusion
- urlparse and urlsplit are generally used to break a URL into its components, so that you can quickly extract parts such as the protocol, domain name, path, and query string
- The difference between urlparse and urlsplit is that urlsplit does not split params out of the path
Usage of urlparse and urlsplit, and how they differ
First, let's look at the standard URL format
scheme://username:password@hostname:port/path;params?query#fragment
The meaning of each parameter is as follows:
- scheme: protocol
- username:password: the user name and password used for authentication; rarely used in practice
- hostname: host (IP / domain name)
- port: port
- path: path
- params: parameters (introduced by ;)
- query: query string (pairs separated by &)
- fragment: anchor, used to locate a position within the web page
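The &-separated query mentioned in the list can be broken down further with parse_qs from the same module. The following is only a minimal sketch; the URL and its query keys are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL, chosen only to show a query with several pairs.
url = "https://www.abc.com/search?filename=python3.docx&page=2&tag=a&tag=b"

query = urlparse(url).query   # the raw text between "?" and "#"
pairs = parse_qs(query)       # maps each key to a list of its values

print(query)   # filename=python3.docx&page=2&tag=a&tag=b
print(pairs)   # {'filename': ['python3.docx'], 'page': ['2'], 'tag': ['a', 'b']}
```

Every value is a list because a key may appear more than once in a query, as tag does here.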
Now that the components of a URL are clear, let's take a look at the following example:
    from urllib.parse import urlparse

    url = "https://root:123456@www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
    result = urlparse(url)  # returns a ParseResult object

    # Print each component of the parsed URL
    print("scheme:", result.scheme)
    print("username:", result.username)
    print("password:", result.password)
    print("host:", result.hostname)
    print("port:", result.port)
    print("path:", result.path)
    print("params:", result.params)
    print("query:", result.query)
    print("fragment:", result.fragment)
Output:
    scheme: https
    username: root
    password: 123456
    host: www.abc.com
    port: 8083
    path: /uploads/
    params: type=docx
    query: filename=python3.docx
    fragment: urllib
You can see that urlparse splits the URL into its components. All of them are returned as strings except port, which is an int. This is the basic and main use of urlparse. So what is the difference between urlsplit and urlparse?
The answer is that urlsplit does not separate the params field. Let's change the code above to use urlsplit and look at the result:
    from urllib.parse import urlsplit

    url = "https://root:123456@www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
    result = urlsplit(url)  # returns a SplitResult object

    print(result)
    print("scheme:", result.scheme)
    print("username:", result.username)
    print("password:", result.password)
    print("host:", result.hostname)
    print("port:", result.port)
    print("path:", result.path)
    # SplitResult has no params attribute, so this line would raise AttributeError
    # print("params:", result.params)
    print("query:", result.query)
    print("fragment:", result.fragment)
Output:
    SplitResult(scheme='https', netloc='root:123456@www.abc.com:8083', path='/uploads/;type=docx', query='filename=python3.docx', fragment='urllib')
    scheme: https
    username: root
    password: 123456
    host: www.abc.com
    port: 8083
    path: /uploads/;type=docx
    query: filename=python3.docx
    fragment: urllib
You can see that the resulting SplitResult object has no params attribute, and that the params text stays in the path instead: /uploads/;type=docx. If you want to double-check, you can inspect both objects with the built-in dir function, which confirms it:
    >>> from urllib.parse import urlsplit, urlparse
    >>> url = "https://www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
    >>> dir(urlsplit(url))
    ['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_asdict', '_encoded_counterpart', '_field_defaults', '_fields', '_hostinfo', '_make', '_replace', '_userinfo', 'count', 'encode', 'fragment', 'geturl', 'hostname', 'index', 'netloc', 'password', 'path', 'port', 'query', 'scheme', 'username']
    >>> dir(urlparse(url))
    ['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_asdict', '_encoded_counterpart', '_field_defaults', '_fields', '_hostinfo', '_make', '_replace', '_userinfo', 'count', 'encode', 'fragment', 'geturl', 'hostname', 'index', 'netloc', 'params', 'password', 'path', 'port', 'query', 'scheme', 'username']
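A shorter way to make the same comparison is to look at the _fields attribute that both result classes inherit from namedtuple. This is only a small sketch of that check; it needs no URL at all.

```python
from urllib.parse import ParseResult, SplitResult

# _fields lists the tuple fields of each result class, in order.
print(ParseResult._fields)
print(SplitResult._fields)

# The only field that urlparse results have and urlsplit results lack.
extra = set(ParseResult._fields) - set(SplitResult._fields)
print(extra)   # {'params'}
```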
Therefore, if the URLs you handle may carry params, urlparse is the recommended choice, since it separates them from the path for you
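One caveat is worth knowing before relying on the params field: in CPython, urlparse only separates the params of the last path segment. The sketch below uses a made-up URL with params on two segments to show this; params attached to earlier segments stay in the path for both functions.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with params on two different path segments.
url = "https://www.abc.com/a;x=1/b;y=2?q=1"

parsed = urlparse(url)
split = urlsplit(url)

print(parsed.path)     # /a;x=1/b      (params of the first segment stay in the path)
print(parsed.params)   # y=2           (only the last segment's params are separated)
print(split.path)      # /a;x=1/b;y=2  (urlsplit leaves the path untouched)
```

If you need per-segment params, it may be simpler to start from the urlsplit path and split each segment on ; yourself.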
Common attributes and methods of the objects returned by urlparse (ParseResult) and urlsplit (SplitResult)
Attributes
- scheme: protocol
- username: the user name used for authentication
- password: the password used for authentication
- hostname: host name
- port: port
- netloc: equivalent to username:password@hostname:port
- path: path
- params: parameters (not available on SplitResult objects)
- query: query string
- fragment: anchor, used to locate a position within the web page
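The attributes above behave slightly differently when a component is missing from the URL. The following is a rough sketch of that behaviour; the second URL is a shortened, made-up variant of the one used earlier.

```python
from urllib.parse import urlparse

full = urlparse("https://root:123456@www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib")
bare = urlparse("https://www.abc.com/uploads/")  # hypothetical URL without the optional parts

# netloc is the raw authority text; the other attributes are derived from it.
print(full.netloc)                 # root:123456@www.abc.com:8083
print(isinstance(full.port, int))  # True

# Missing username, password and port come back as None ...
print(bare.username, bare.password, bare.port)   # None None None

# ... while missing params, query and fragment come back as empty strings.
print(repr(bare.params), repr(bare.query), repr(bare.fragment))   # '' '' ''
```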
Methods
- geturl: reassembles the components and returns the URL of the current object as a string
- encode(encoding='ascii', ...): encodes every field and returns the bytes counterpart of the object (ParseResultBytes or SplitResultBytes). Compared with the original object, the new object has a decode method instead of an encode method
- count/index: not commonly used. Both are inherited from tuple: count returns how many fields are equal to the given string, and index returns the position of the first field equal to it. Note that they compare against whole fields of the object's tuple form, not against substrings of the URL. For the ParseResult in the example above, the tuple form is ('https', 'root:123456@www.abc.com:8083', '/uploads/', 'type=docx', 'filename=python3.docx', 'urllib')
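The methods above can be tried out on the example URL from earlier. This is a minimal sketch rather than a complete tour; the expected values in the comments hold for this particular URL.

```python
from urllib.parse import urlparse

url = "https://root:123456@www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
result = urlparse(url)

# geturl reassembles the components into a URL string.
print(result.geturl() == url)             # True

# encode returns the bytes counterpart; decode converts it back.
encoded = result.encode("ascii")
print(type(encoded).__name__)             # ParseResultBytes
print(encoded.scheme)                     # b'https'
print(encoded.decode("ascii") == result)  # True

# count and index work on whole fields of the tuple form.
print(tuple(result))
print(result.count("https"))              # 1
print(result.count("abc"))                # 0, because no field is exactly "abc"
print(result.index("type=docx"))          # 3, the position of the params field
```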