[Python 3] The use of urlparse and urlsplit and the difference between them

Keywords: Python

conclusion

Let's start with the conclusion

urlparse and urlsplit are generally used to break a web page URL into its components, so that parts such as the protocol, domain name, path and query string can be extracted quickly
The difference between urlparse and urlsplit is that urlsplit does not split params out of the path

Use and difference of urlparse and urlsplit

First, let's look at a standard url format

scheme://username:password@hostname:port/path;params?query#fragment

The meaning of each parameter is as follows:

scheme: protocol
username:password: the user name and password used for authentication; rarely used in practice
hostname: host (IP / domain name)
port: port
path: path
params: parameters (separated by ;)
query: query (pairs separated by &)
fragment: anchor, used to jump to a position within the web page

After understanding the url parameters, let's take a look at the following example:

from urllib.parse import urlparse

url = "https://root:123456@www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
result = urlparse(url)

# Each component is available as an attribute of the ParseResult
print("scheme:", result.scheme)
print("username:", result.username)
print("password:", result.password)
print("host:", result.hostname)
print("port:", result.port)
print("path:", result.path)
print("params:", result.params)
print("query:", result.query)
print("fragment:", result.fragment)

Output:

scheme: https
username: root
password: 123456
host: www.abc.com
port: 8083
path: /uploads/
params: type=docx
query: filename=python3.docx
fragment: urllib

You can see that urlparse extracts each component (as a string, except port, which is an int). This is the basic and main use of urlparse. So what is the difference between urlsplit and urlparse?
The answer is that urlsplit does not split out the params field. Let's change the code above to use urlsplit and look at the results

from urllib.parse import urlsplit

url = "https://root:123456@www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
result = urlsplit(url)

print(result)

print("scheme:", result.scheme)
print("username:", result.username)
print("password:", result.password)
print("host:", result.hostname)
print("port:", result.port)
print("path:", result.path)
# print("params:", result.params)  # would raise AttributeError: SplitResult has no params
print("query:", result.query)
print("fragment:", result.fragment)

Output:

SplitResult(scheme='https', netloc='root:123456@www.abc.com:8083', path='/uploads/;type=docx', query='filename=python3.docx', fragment='urllib')
scheme: https
username: root
password: 123456
host: www.abc.com
port: 8083
path: /uploads/;type=docx
query: filename=python3.docx
fragment: urllib

You can see that the generated SplitResult object has no params attribute. Instead, the params text stays in the path: /uploads/;type=docx. If you want to double-check, you can also inspect both objects with the built-in dir function, and the result is the same

>>> from urllib.parse import urlsplit, urlparse
>>> url = "https://www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
>>> dir(urlsplit(url))
['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_asdict', '_encoded_counterpart', '_field_defaults', '_fields', '_hostinfo', '_make', '_replace', '_userinfo', 'count', 'encode', 'fragment', 'geturl', 'hostname', 'index', 'netloc', 'password', 'path', 'port', 'query', 'scheme', 'username']
>>> dir(urlparse(url))
['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_asdict', '_encoded_counterpart', '_field_defaults', '_fields', '_hostinfo', '_make', '_replace', '_userinfo', 'count', 'encode', 'fragment', 'geturl', 'hostname', 'index', 'netloc', 'params', 'password', 'path', 'port', 'query', 'scheme', 'username']
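
A shorter check than reading the dir output is the _fields attribute, which both result types inherit from namedtuple. The following is a minimal sketch that reuses the url from the session above; the variable names are only for illustration:

```python
from urllib.parse import urlsplit, urlparse

url = "https://www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"

# _fields lists the tuple fields of each result type
split_fields = urlsplit(url)._fields
parse_fields = urlparse(url)._fields

print(split_fields)
print(parse_fields)

# Fields that exist only on the urlparse result
extra = set(parse_fields) - set(split_fields)
print(extra)
```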

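Therefore, use urlparse if you need params as a separate field. Note that urlparse only splits off the params of the last path segment, and the Python documentation suggests urlsplit when parameters may appear on every path segment (RFC 2396)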
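
To see how the two functions behave when more than one path segment carries parameters, here is a minimal sketch. The URL is a made-up example, not one from the article:

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with parameters on two path segments
url = "https://www.abc.com/a;x=1/b;y=2?q=1"

parsed = urlparse(url)
split = urlsplit(url)

# urlparse only separates the params of the last segment
print("urlparse path:  ", parsed.path)
print("urlparse params:", parsed.params)

# urlsplit leaves every ;param inside the path
print("urlsplit path:  ", split.path)
```

With urlparse, only y=2 ends up in params while ;x=1 remains in the path, so code that needs per-segment parameters would have to split the path itself in either case.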

Common attributes and methods of urlparse (ParseResult object) and urlsplit (SplitResult object)

attribute

scheme: protocol
username: the user name used for authentication
password: the password used for authentication
hostname: host name
port: port
netloc: the network location, equivalent to username:password@hostname:port
path: path
params: parameters (not available on SplitResult objects)
query: query string
fragment: anchor, used to jump to a position within the web page
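
When a component is missing from the URL, these attributes do not all behave the same way. The sketch below uses a shortened, hypothetical variant of the example URL to show which attributes come back as None and which as empty strings:

```python
from urllib.parse import urlparse

# Hypothetical URL without credentials, port, params, query or fragment
result = urlparse("https://www.abc.com/uploads/")

# netloc only contains what was actually written in the URL
print("netloc:", repr(result.netloc))

# Derived attributes are None when the part is absent
print("username:", result.username)
print("password:", result.password)
print("port:", result.port)

# Tuple fields are empty strings when the part is absent
print("params:", repr(result.params))
print("query:", repr(result.query))
print("fragment:", repr(result.fragment))
```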

method

geturl(): recombines the fields into a URL string. The result may differ slightly from the original URL (the scheme is lowercased and empty components may be dropped) but is equivalent
encode(encoding='ascii', ...): encodes every field and returns a new bytes-based result object (ParseResultBytes or SplitResultBytes). Compared with the original object, the new one has a decode method instead of an encode method
count/index: rarely used. Inherited from tuple, they return the number of occurrences and the position of the first occurrence of a value in the tuple form of the object. Note that they compare against whole fields, and that the tuple form of a urlparse result looks like ('https', 'www.abc.com:8083', '/uploads/', 'type=docx', 'filename=python3.docx', 'urllib')
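
The methods above can be tried out with a short sketch like the following, which reuses the example URL from earlier in the article; the variable names are only for illustration:

```python
from urllib.parse import urlparse

url = "https://root:123456@www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
result = urlparse(url)

# geturl() puts the fields back together
rebuilt = result.geturl()
print(rebuilt == url)

# The result is a named tuple, so it can be converted to a plain tuple
fields = tuple(result)
print(fields)

# encode() returns a bytes-based counterpart with bytes fields
encoded = result.encode("ascii")
print(type(encoded).__name__, encoded.scheme)

# index/count compare against whole fields, not substrings
print(result.index("https"))
print(result.count("urllib"))
print(result.count("abc"))
```

The last call prints 0 because "abc" is only a substring of the netloc field, not a field of its own.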

Posted by pjoshi on Wed, 24 Nov 2021 03:25:20 -0800
