[Python 3] Using urlparse and urlsplit, and the difference between them

Keywords: Python

conclusion

Let's start with the conclusion

  • urlparse and urlsplit are used to break a URL into its components, so that you can quickly extract the scheme (protocol), domain name, path, query string and so on
  • The difference between urlparse and urlsplit is that urlsplit does not split params out of the path

Use and difference of urlparse and urlsplit

First, let's look at a standard url format

scheme://username:password@hostname:port/path;params?query#fragment

The meaning of each parameter is as follows:

  • scheme: Protocol
  • username:password: the account and password used for authentication; rarely used in practice
  • hostname: host (IP / domain name)
  • port: port
  • path: path
  • params: parameters (separated by ;)
  • query: query string (fields separated by &)
  • fragment: anchor, used to jump to a position within the page
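To make the two separators concrete, here is a minimal sketch. The URL is a made-up variant of the one used in this post, with a second param (lang=en) and a second query field (page=2) added purely for illustration. urlparse returns params as one raw string, so the split on ";" is done by hand, while parse_qs from the same module handles the "&"-separated query.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical URL (illustration only): two params and two query fields
url = "https://www.abc.com/uploads/;type=docx;lang=en?filename=python3.docx&page=2#urllib"
result = urlparse(url)

# params come back as a single string; the individual items are separated by ";"
params = result.params.split(";")

# query fields are separated by "&"; parse_qs returns a dict mapping names to lists of values
query = parse_qs(result.query)

print(params)  # ['type=docx', 'lang=en']
print(query)   # {'filename': ['python3.docx'], 'page': ['2']}
```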

After understanding the url parameters, let's take a look at the following example:

from urllib.parse import urlparse

url = "https://root:123456@www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
result = urlparse(url)

print("scheme:", result.scheme)
print("username:", result.username)
print("password:", result.password)
print("host:", result.hostname)
print("port:", result.port)
print("path:", result.path)
print("params:", result.params)
print("query:", result.query)
print("fragment:", result.fragment)

Output:

scheme: https
username: root
password: 123456
host: www.abc.com
port: 8083
path: /uploads/
params: type=docx
query: filename=python3.docx
fragment: urllib

You can see that urlparse splits the URL into its components (all strings, except port, which is an int). This is the main use of urlparse. So what is the difference between urlsplit and urlparse?
The answer is that urlsplit does not separate out the params field. Let's change the code above to use urlsplit and look at the result

from urllib.parse import urlsplit

url = "https://root:123456@www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
result = urlsplit(url)

print(result)

print("scheme:", result.scheme)
print("username:", result.username)
print("password:", result.password)
print("host:", result.hostname)
print("port:", result.port)
print("path:", result.path)
# SplitResult has no params attribute, so this line would raise AttributeError:
# print("params:", result.params)
print("query:", result.query)
print("fragment:", result.fragment)

Output:

SplitResult(scheme='https', netloc='root:123456@www.abc.com:8083', path='/uploads/;type=docx', query='filename=python3.docx', fragment='urllib')
scheme: https
username: root
password: 123456
host: www.abc.com
port: 8083
path: /uploads/;type=docx
query: filename=python3.docx
fragment: urllib

You can see that the resulting SplitResult object has no params field. The params text has not disappeared, though: it stays at the end of path as ;type=docx. If you want to double-check, you can use the built-in dir function, which confirms that params exists only on the urlparse result
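The two results can also be compared side by side. The sketch below reuses the URL from this post. The second URL (with /a;x=1/b;y=2) is a made-up example, added to show that urlparse only splits params off the last path segment.

```python
from urllib.parse import urlparse, urlsplit

url = "https://root:123456@www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
parsed = urlparse(url)
split = urlsplit(url)

print(parsed.path, "|", parsed.params)  # /uploads/ | type=docx
print(split.path)                       # /uploads/;type=docx

# Every field the two results share, apart from path, is identical
same = all(
    getattr(parsed, name) == getattr(split, name)
    for name in ("scheme", "netloc", "query", "fragment")
)
print(same)  # True

# Hypothetical URL: only the params of the LAST path segment are split off
multi = urlparse("https://www.abc.com/a;x=1/b;y=2")
print(multi.path, "|", multi.params)    # /a;x=1/b | y=2
```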

>>> from urllib.parse import urlsplit, urlparse
>>> url = "https://www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
>>> dir(urlsplit(url))
['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_asdict', '_encoded_counterpart', '_field_defaults', '_fields', '_hostinfo', '_make', '_replace', '_userinfo', 'count', 'encode', 'fragment', 'geturl', 'hostname', 'index', 'netloc', 'password', 'path', 'port', 'query', 'scheme', 'username']
>>> dir(urlparse(url))
['__add__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_asdict', '_encoded_counterpart', '_field_defaults', '_fields', '_hostinfo', '_make', '_replace', '_userinfo', 'count', 'encode', 'fragment', 'geturl', 'hostname', 'index', 'netloc', 'params', 'password', 'path', 'port', 'query', 'scheme', 'username']
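The dir output is long and hard to compare by eye. Since both result classes are named tuples, a shorter check is to look at their _fields attribute (it appears in the dir listings above). A minimal sketch:

```python
from urllib.parse import ParseResult, SplitResult

print(SplitResult._fields)  # ('scheme', 'netloc', 'path', 'query', 'fragment')
print(ParseResult._fields)  # ('scheme', 'netloc', 'path', 'params', 'query', 'fragment')

# The only field that ParseResult has and SplitResult lacks
extra = set(ParseResult._fields) - set(SplitResult._fields)
print(extra)                # {'params'}
```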

Therefore, if the URLs you handle may contain params and you want them separated from the path, it is recommended to use urlparse

Common attributes and methods of the results of urlparse (a ParseResult object) and urlsplit (a SplitResult object)

Attributes

  • scheme: protocol
  • username: the user name used for authentication (None if absent)
  • password: the password used for authentication (None if absent)
  • hostname: host name, converted to lower case
  • port: port as an int (None if absent)
  • netloc: network location, i.e. the raw username:password@hostname:port part of the URL
  • path: path
  • params: parameters (ParseResult only; SplitResult has no such attribute)
  • query: query string
  • fragment: anchor, used to jump to a position within the page
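A short sketch of how netloc relates to the attributes derived from it. The upper-case host name and the second URL without credentials or port are made-up inputs, used only to show the lower-casing and the None values.

```python
from urllib.parse import urlsplit

# Hypothetical URL with an upper-case host, to show that hostname is lower-cased
result = urlsplit("https://root:123456@WWW.ABC.com:8083/uploads/")

print(result.netloc)    # root:123456@WWW.ABC.com:8083  (kept exactly as written)
print(result.username)  # root
print(result.password)  # 123456
print(result.hostname)  # www.abc.com  (lower-cased)
print(result.port)      # 8083  (an int, not a string)

# Parts that are missing from the URL come back as None
bare = urlsplit("https://www.abc.com/uploads/")
print(bare.username, bare.password, bare.port)  # None None None
```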

Methods

  • geturl: reassembles the fields of the current object into a URL string and returns it
  • encode(encoding='ascii', ...): returns a bytes counterpart of the current object (ParseResultBytes or SplitResultBytes) in which every field is bytes instead of str. The new object has a decode method in place of encode, which converts it back
  • count/index: rarely used. They are inherited from tuple and return the number of occurrences of a value and the position of its first occurrence in the tuple form of the object. Note that they compare whole elements, not substrings, and that the tuple form holds only the main fields, e.g. ('https', 'www.abc.com:8083', '/uploads/', 'type=docx', 'filename=python3.docx', 'urllib') for a ParseResult
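The methods above can be tried out as follows. This is a minimal sketch that reuses the credential-free URL from the dir example. The variable names are only for illustration.

```python
from urllib.parse import urlparse

url = "https://www.abc.com:8083/uploads/;type=docx?filename=python3.docx#urllib"
result = urlparse(url)

# geturl puts the fields back together
print(result.geturl() == url)     # True

# encode returns a bytes version of the result; decode reverses it
encoded = result.encode()
print(encoded.scheme)             # b'https'
print(encoded.decode() == result) # True

# count/index work on the 6-element tuple form and match whole elements only
print(tuple(result))
print(result.index("type=docx"))  # 3
print(result.count("https"))      # 1
print(result.count("abc"))        # 0  ('abc' is only a substring of netloc)
```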

Posted by pjoshi on Wed, 24 Nov 2021 03:25:20 -0800