@@ -7,11 +7,14 @@ \chapter{Lexical analysis\label{lexical}}
77\index {parser}
88\index {token}
99
10- Python uses the 7-bit \ASCII {} character set for program text and string
11- literals. 8-bit characters may be used in string literals and comments
12- but their interpretation is platform dependent; the proper way to
13- insert 8-bit characters in string literals is by using octal or
14- hexadecimal escape sequences.
10+ Python uses the 7-bit \ASCII {} character set for program text.
11+ \versionadded [An encoding declaration can be used to indicate that
12+ string literals and comments use an encoding different from \ASCII.]{2.3}
13+ For compatibility with older versions, Python only issues a warning if it
14+ finds 8-bit characters; such warnings should be resolved either by declaring
15+ an explicit encoding or, if the bytes are binary data rather than
16+ characters, by using escape sequences.
17+
1518
1619The run-time character set depends on the I/O devices connected to the
1720program but is generally a superset of \ASCII .
@@ -69,6 +72,37 @@ \subsection{Comments\label{comments}}
6972\index {hash character}
7073
7174
75+ \subsection{Encoding declarations\label{encodings}}
76+
77+ If a comment in the first or second line of the Python script matches
78+ the regular expression "coding[=:]\s*([-\w.]+)", this comment is
79+ processed as an encoding declaration; the first group of this
80+ expression names the encoding of the source code file. The recommended
81+ forms of this expression are
82+
83+ \begin{verbatim}
84+ # -*- coding: <encoding-name> -*-
85+ \end{verbatim}
86+
87+ which is recognized also by GNU Emacs, and
88+
89+ \begin{verbatim}
90+ # vim:fileencoding=<encoding-name>
91+ \end{verbatim}
92+
93+ which is recognized by Bram Moolenaar's VIM. In addition, if the first
94+ bytes of the file are the UTF-8 signature ($'\xef\xbb\xbf'$), the
95+ declared file encoding is UTF-8 (this is supported, among others, by
96+ Microsoft's notepad.exe).
97+
98+ If an encoding is declared, the encoding name must be recognized by
99+ Python. % XXX there should be a list of supported encodings.
100+ The encoding is used for all lexical analysis, in particular to find
101+ the end of a string, and to interpret the contents of Unicode literals.
102+ String literals are converted to Unicode for syntactical analysis,
103+ then converted back to their original encoding before interpretation
104+ starts.
105+
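The detection rules described in this subsection can be sketched in Python as follows. This is a minimal illustration, not the tokenizer's actual implementation: the function name `declared_encoding` is hypothetical, the comment is located by a naive search for the first `#` on the line, and the character class is written with the hyphen first, `[-\w.]`, so that it is read as a literal hyphen. The sketch also assumes that a UTF-8 signature takes precedence and does not check for a conflicting declaration.

```python
import codecs
import re

# Pattern from the text; the hyphen comes first so it is a literal character.
CODING_RE = re.compile(rb"coding[=:]\s*([-\w.]+)")
UTF8_SIGNATURE = b"\xef\xbb\xbf"


def declared_encoding(source: bytes):
    """Return the declared source encoding, or None if none is declared.

    Hypothetical helper: a simplified reading of the rules in the text.
    """
    # A leading UTF-8 signature declares the file encoding as UTF-8.
    if source.startswith(UTF8_SIGNATURE):
        return "utf-8"
    # Only comments on the first or second physical line are considered.
    for line in source.splitlines()[:2]:
        hash_pos = line.find(b"#")  # naive: ignores '#' inside string literals
        if hash_pos == -1:
            continue
        match = CODING_RE.search(line, hash_pos)
        if match:
            name = match.group(1).decode("ascii")
            # The name must be recognized by Python; LookupError otherwise.
            codecs.lookup(name)
            return name
    return None
```

Both recommended forms are accepted by this sketch, because the pattern also matches the `coding=` suffix of `fileencoding=`. A declaration on the third line or later is ignored, and an unknown encoding name raises `LookupError` from `codecs.lookup`.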
72106\subsection{Explicit line joining\label{explicit-joining}}
73107
74108Two or more physical lines may be joined into logical lines using