Commit 9fc443c

Issue 9873: the URL parsing functions now accept ASCII encoded byte sequences in addition to character strings
1 parent 43f0c27 commit 9fc443c

File tree

5 files changed

+603
-137
lines changed

‎Doc/library/urllib.parse.rst‎

Lines changed: 170 additions & 61 deletions
Original file line number | Diff line number | Diff line change
@@ -24,7 +24,15 @@ following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
2424
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
2525
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.
2626

27-
The :mod:`urllib.parse` module defines the following functions:
27+
The :mod:`urllib.parse` module defines functions that fall into two broad
28+
categories: URL parsing and URL quoting. These are covered in detail in
29+
the following sections.
30+
31+
URL Parsing
32+
-----------
33+
34+
The URL parsing functions focus on splitting a URL string into its components,
35+
or on combining URL components into a URL string.
2836

2937
.. function:: urlparse(urlstring, scheme='', allow_fragments=True)
3038

@@ -242,6 +250,161 @@ The :mod:`urllib.parse` module defines the following functions:
242250
string. If there is no fragment identifier in *url*, return *url* unmodified
243251
and an empty string.
244252

253+
The return value is actually an instance of a subclass of :class:`tuple`. This
254+
class has the following additional read-only convenience attributes:
255+
256+
+------------------+-------+-------------------------+----------------------+
257+
| Attribute | Index | Value | Value if not present |
258+
+==================+=======+=========================+======================+
259+
| :attr:`url` | 0 | URL with no fragment | empty string |
260+
+------------------+-------+-------------------------+----------------------+
261+
| :attr:`fragment` | 1 | Fragment identifier | empty string |
262+
+------------------+-------+-------------------------+----------------------+
263+
264+
See section :ref:`urlparse-result-object` for more information on the result
265+
object.
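As a brief illustration of the attributes in the table above, the following sketch unpacks a :func:`urldefrag` result both by position and by name. The URL is an arbitrary example value, not one taken from the commit.

```python
from urllib.parse import urldefrag

# Example URL chosen for illustration only.
result = urldefrag('http://www.python.org/doc/#intro')

# The result still behaves like a plain 2-tuple...
url, fragment = result

# ...and also exposes the same values as named attributes.
print(result.url)       # URL with the fragment removed
print(result.fragment)  # fragment identifier without the leading '#'
```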
266+
267+
.. versionchanged:: 3.2
268+
Result is a structured object rather than a simple 2-tuple
269+
270+
271+
Parsing ASCII Encoded Bytes
272+
---------------------------
273+
274+
The URL parsing functions were originally designed to operate on character
275+
strings only. In practice, it is useful to be able to manipulate properly
276+
quoted and encoded URLs as sequences of ASCII bytes. Accordingly, the
277+
URL parsing functions in this module all operate on :class:`bytes` and
278+
:class:`bytearray` objects in addition to :class:`str` objects.
279+
280+
If :class:`str` data is passed in, the result will also contain only
281+
:class:`str` data. If :class:`bytes` or :class:`bytearray` data is
282+
passed in, the result will contain only :class:`bytes` data.
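A minimal sketch of this type-preserving behaviour, using :func:`urlsplit` and an example URL chosen purely for illustration:

```python
from urllib.parse import urlsplit

# str in, str out
text_result = urlsplit('http://www.python.org/doc/?q=1')

# bytes in, bytes out
bytes_result = urlsplit(b'http://www.python.org/doc/?q=1')

print(type(text_result).__name__, repr(text_result.netloc))
print(type(bytes_result).__name__, repr(bytes_result.netloc))
```

Every component of each result has the same type as the input, so the two results cannot be compared or combined without an explicit conversion.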
283+
284+
Attempting to mix :class:`str` data with :class:`bytes` or
285+
:class:`bytearray` in a single function call will result in a
286+
:exc:`TypeError` being raised, while attempting to pass in non-ASCII
287+
byte values will trigger :exc:`UnicodeDecodeError`.
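Both failure modes can be sketched as follows. The URLs are illustrative values, and :func:`urljoin` is used only because it takes two URL arguments, which makes mixing types easy to demonstrate.

```python
from urllib.parse import urljoin, urlsplit

mixed_error = None
non_ascii_error = None

# Mixing str and bytes arguments in a single call is rejected.
try:
    urljoin('http://www.python.org/doc/', b'faq.html')
except TypeError as exc:
    mixed_error = exc

# Bytes outside the ASCII range cannot be decoded by the parser.
try:
    urlsplit(b'http://www.python.org/caf\xe9')
except UnicodeDecodeError as exc:
    non_ascii_error = exc

print(type(mixed_error).__name__)
print(type(non_ascii_error).__name__)
```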
288+
289+
To support easier conversion of result objects between :class:`str` and
290+
:class:`bytes`, all return values from URL parsing functions provide
291+
either an :meth:`encode` method (when the result contains :class:`str`
292+
data) or a :meth:`decode` method (when the result contains :class:`bytes`
293+
data). The signatures of these methods match those of the corresponding
294+
:class:`str` and :class:`bytes` methods (except that the default encoding
295+
is ``'ascii'`` rather than ``'utf-8'``). Each produces a value of a
296+
corresponding type that contains either :class:`bytes` data (for
297+
:meth:`encode` methods) or :class:`str` data (for
298+
:meth:`decode` methods).
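The conversion methods can be sketched as a round trip between the two result types. The URL below is an example value only.

```python
from urllib.parse import urlsplit, SplitResult, SplitResultBytes

text_result = urlsplit('http://www.python.org/doc/#intro')

# encode() defaults to ASCII and yields the bytes counterpart.
bytes_result = text_result.encode()

# decode() converts back to the str-based result type.
round_trip = bytes_result.decode()

print(type(bytes_result).__name__)
print(round_trip == text_result)
```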
299+
300+
Applications that need to operate on potentially improperly quoted URLs
301+
that may contain non-ASCII data will need to do their own decoding from
302+
bytes to characters before invoking the URL parsing functions.
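One possible approach is sketched below. It assumes the application already knows the incoming bytes are UTF-8 encoded; that encoding is an assumption made for this example, not something the parsing functions can detect.

```python
from urllib.parse import urlsplit

# Raw bytes containing a non-ASCII character (assumed to be UTF-8 here).
raw = 'http://www.python.org/caf\u00e9'.encode('utf-8')

# Decode explicitly first, then parse the resulting str.
text = raw.decode('utf-8')
result = urlsplit(text)

print(result.path)
```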
303+
304+
The behaviour described in this section applies only to the URL parsing
305+
functions. The URL quoting functions use their own rules when producing
306+
or consuming byte sequences as detailed in the documentation of the
307+
individual URL quoting functions.
308+
309+
.. versionchanged:: 3.2
310+
URL parsing functions now accept ASCII encoded byte sequences
311+
312+
313+
.. _urlparse-result-object:
314+
315+
Structured Parse Results
316+
------------------------
317+
318+
The result objects from the :func:`urlparse`, :func:`urlsplit` and
319+
:func:`urldefrag` functions are subclasses of the :class:`tuple` type.
320+
These subclasses add the attributes listed in the documentation for
321+
those functions, the encoding and decoding support described in the
322+
previous section, as well as an additional method:
323+
324+
.. method:: urllib.parse.SplitResult.geturl()
325+
326+
Return the re-combined version of the original URL as a string. This may
327+
differ from the original URL in that the scheme may be normalized to lower
328+
case and empty components may be dropped. Specifically, empty parameters,
329+
queries, and fragment identifiers will be removed.
330+
331+
For :func:`urldefrag` results, only empty fragment identifiers will be removed.
332+
For :func:`urlsplit` and :func:`urlparse` results, all noted changes will be
333+
made to the URL returned by this method.
334+
335+
The result of this method remains unchanged if passed back through the original
336+
parsing function:
337+
338+
>>> from urllib.parse import urlsplit
339+
>>> url = 'HTTP://www.Python.org/doc/#'
340+
>>> r1 = urlsplit(url)
341+
>>> r1.geturl()
342+
'http://www.Python.org/doc/'
343+
>>> r2 = urlsplit(r1.geturl())
344+
>>> r2.geturl()
345+
'http://www.Python.org/doc/'
346+
347+
348+
The following classes provide the implementations of the structured parse
349+
results when operating on :class:`str` objects:
350+
351+
.. class:: DefragResult(url, fragment)
352+
353+
Concrete class for :func:`urldefrag` results containing :class:`str`
354+
data. The :meth:`encode` method returns a :class:`DefragResultBytes`
355+
instance.
356+
357+
.. versionadded:: 3.2
358+
359+
.. class:: ParseResult(scheme, netloc, path, params, query, fragment)
360+
361+
Concrete class for :func:`urlparse` results containing :class:`str`
362+
data. The :meth:`encode` method returns a :class:`ParseResultBytes`
363+
instance.
364+
365+
.. class:: SplitResult(scheme, netloc, path, query, fragment)
366+
367+
Concrete class for :func:`urlsplit` results containing :class:`str`
368+
data. The :meth:`encode` method returns a :class:`SplitResultBytes`
369+
instance.
370+
371+
372+
The following classes provide the implementations of the parse results when
373+
operating on :class:`bytes` or :class:`bytearray` objects:
374+
375+
.. class:: DefragResultBytes(url, fragment)
376+
377+
Concrete class for :func:`urldefrag` results containing :class:`bytes`
378+
data. The :meth:`decode` method returns a :class:`DefragResult`
379+
instance.
380+
381+
.. versionadded:: 3.2
382+
383+
.. class:: ParseResultBytes(scheme, netloc, path, params, query, fragment)
384+
385+
Concrete class for :func:`urlparse` results containing :class:`bytes`
386+
data. The :meth:`decode` method returns a :class:`ParseResult`
387+
instance.
388+
389+
.. versionadded:: 3.2
390+
391+
.. class:: SplitResultBytes(scheme, netloc, path, query, fragment)
392+
393+
Concrete class for :func:`urlsplit` results containing :class:`bytes`
394+
data. The :meth:`decode` method returns a :class:`SplitResult`
395+
instance.
396+
397+
.. versionadded:: 3.2
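Because the result classes are concrete, they can also be instantiated directly from their components. The sketch below builds a :class:`ParseResult` by hand with illustrative component values and converts it to its bytes counterpart.

```python
from urllib.parse import ParseResult, ParseResultBytes

# Component values are examples only.
text_result = ParseResult(scheme='http', netloc='www.python.org',
                          path='/doc/', params='', query='q=1',
                          fragment='')

# geturl() recombines the components into a URL string.
text_url = text_result.geturl()

# encode() yields the bytes-based class, whose geturl() returns bytes.
bytes_result = text_result.encode()
bytes_url = bytes_result.geturl()

print(text_url)
print(bytes_url)
```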
398+
399+
400+
URL Quoting
401+
-----------
402+
403+
The URL quoting functions focus on taking program data and making it safe
404+
for use as URL components by quoting special characters and appropriately
405+
encoding non-ASCII text. They also support reversing these operations to
406+
recreate the original data from the contents of a URL component if that
407+
task isn't already covered by the URL parsing functions above.
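As a short sketch of quoting and its reversal, the following round-trips an example path containing a space and a non-ASCII character through :func:`quote` and :func:`unquote`, relying on their default arguments.

```python
from urllib.parse import quote, unquote

# Example path chosen for illustration only.
original = '/docs/El Ni\u00f1o.html'

# With the defaults, '/' is left alone and other special or
# non-ASCII characters are percent-encoded as UTF-8 bytes.
quoted = quote(original)

# unquote() reverses the operation.
restored = unquote(quoted)

print(quoted)
print(restored == original)
```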
245408

246409
.. function:: quote(string, safe='/', encoding=None, errors=None)
247410

@@ -322,8 +485,7 @@ The :mod:`urllib.parse` module defines the following functions:
322485
If it is a :class:`str`, unescaped non-ASCII characters in *string*
323486
are encoded into UTF-8 bytes.
324487

325-
Example: ``unquote_to_bytes('a%26%EF')`` yields
326-
``b'a&\xef'``.
488+
Example: ``unquote_to_bytes('a%26%EF')`` yields ``b'a&\xef'``.
327489

328490

329491
.. function:: urlencode(query, doseq=False, safe='', encoding=None, errors=None)
@@ -340,12 +502,13 @@ The :mod:`urllib.parse` module defines the following functions:
340502
the optional parameter *doseq* evaluates to *True*, individual
341503
``key=value`` pairs separated by ``'&'`` are generated for each element of
342504
the value sequence for the key. The order of parameters in the encoded
343-
string will match the order of parameter tuples in the sequence. This module
344-
provides the functions :func:`parse_qs` and :func:`parse_qsl` which are used
345-
to parse query strings into Python data structures.
505+
string will match the order of parameter tuples in the sequence.
346506

347507
When the *query* parameter is a :class:`str`, the *safe*, *encoding* and *errors*
348-
parameters are sent the :func:`quote_plus` for encoding.
508+
parameters are passed down to :func:`quote_plus` for encoding.
509+
510+
To reverse this encoding process, :func:`parse_qs` and :func:`parse_qsl` are
511+
provided in this module to parse query strings into Python data structures.
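A minimal sketch of encoding a query and then reversing it. The parameter names and values are invented for the example.

```python
from urllib.parse import urlencode, parse_qs, parse_qsl

# A sequence of two-element tuples keeps its order in the output.
pairs = [('q', 'url parsing'), ('lang', 'en'), ('lang', 'fr')]
query = urlencode(pairs)

# parse_qsl() returns the pairs as a list, preserving order.
as_list = parse_qsl(query)

# parse_qs() groups repeated keys into lists of values.
as_dict = parse_qs(query)

print(query)
print(as_list)
print(as_dict)
```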
349512

350513
.. versionchanged:: 3.2
351514
Query parameter supports bytes and string objects.
@@ -376,57 +539,3 @@ The :mod:`urllib.parse` module defines the following functions:
376539

377540
:rfc:`1738` - Uniform Resource Locators (URL)
378541
This specifies the formal syntax and semantics of absolute URLs.
379-
380-
381-
.. _urlparse-result-object:
382-
383-
Results of :func:`urlparse` and :func:`urlsplit`
384-
------------------------------------------------
385-
386-
The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
387-
subclasses of the :class:`tuple` type. These subclasses add the attributes
388-
described in those functions, as well as provide an additional method:
389-
390-
.. method:: ParseResult.geturl()
391-
392-
Return the re-combined version of the original URL as a string. This may differ
393-
from the original URL in that the scheme will always be normalized to lower case
394-
and empty components may be dropped. Specifically, empty parameters, queries,
395-
and fragment identifiers will be removed.
396-
397-
The result of this method is a fixpoint if passed back through the original
398-
parsing function:
399-
400-
>>> import urllib.parse
401-
>>> url = 'HTTP://www.Python.org/doc/#'
402-
403-
>>> r1 = urllib.parse.urlsplit(url)
404-
>>> r1.geturl()
405-
'http://www.Python.org/doc/'
406-
407-
>>> r2 = urllib.parse.urlsplit(r1.geturl())
408-
>>> r2.geturl()
409-
'http://www.Python.org/doc/'
410-
411-
412-
The following classes provide the implementations of the parse results:
413-
414-
.. class:: BaseResult
415-
416-
Base class for the concrete result classes. This provides most of the
417-
attribute definitions. It does not provide a :meth:`geturl` method. It is
418-
derived from :class:`tuple`, but does not override the :meth:`__init__` or
419-
:meth:`__new__` methods.
420-
421-
422-
.. class:: ParseResult(scheme, netloc, path, params, query, fragment)
423-
424-
Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
425-
overridden to support checking that the right number of arguments are passed.
426-
427-
428-
.. class:: SplitResult(scheme, netloc, path, query, fragment)
429-
430-
Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
431-
overridden to support checking that the right number of arguments are passed.
432-

‎Doc/whatsnew/3.2.rst‎

Lines changed: 8 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -573,6 +573,14 @@ New, Improved, and Deprecated Modules
573573
(Contributed by Rodolpho Eckhardt and Nick Coghlan, :issue:`10220`.)
574574

575575
.. XXX: Mention inspect.getattr_static (Michael Foord)
576+
.. XXX: Mention urllib.parse changes
577+
Issue 9873 (Nick Coghlan):
578+
- ASCII byte sequence support in URL parsing
579+
- named tuple for urldefrag return value
580+
Issue 5468 (Dan Mahn) for urlencode:
581+
- bytes input support
582+
- non-UTF8 percent encoding of non-ASCII characters
583+
Issue 2987 for IPv6 (RFC2732) support in urlparse
576584
577585
Multi-threading
578586
===============
