|
8 | 8 |
|
9 | 9 | This is used to optimize dict and attribute lookups, among other things. |
10 | 10 |
|
11 | | -Python uses three different mechanisms to intern strings: |
| 11 | +Python uses two different mechanisms to intern strings: singletons and |
| 12 | +dynamic interning. |
12 | 13 |
|
13 | | -- Singleton strings marked in C source with `_Py_STR` and `_Py_ID` macros. |
14 | | - These are statically allocated, and collected using `make regen-global-objects` |
15 | | - (`Tools/build/generate_global_objects.py`), which generates code |
16 | | - for declaration, initialization and finalization. |
| 14 | +## Singletons |
17 | 15 |
|
18 | | - The difference between the two kinds is not important. (A `_Py_ID` string is |
19 | | - a valid C name, with which we can refer to it; a `_Py_STR` may e.g. contain |
20 | | - non-identifier characters, so it needs a separate C-compatible name.) |
| 16 | +The 256 possible one-character latin-1 strings, which can be retrieved with |
| 17 | +`_Py_LATIN1_CHR(c)`, are stored in statically allocated arrays, |
| 18 | +`_PyRuntime.static_objects.strings.ascii` and |
| 19 | +`_PyRuntime.static_objects.strings.latin1`. |
21 | 20 |
|
22 | | - The empty string is in this category (as `_Py_STR(empty)`). |
| 21 | +Longer singleton strings are marked in C source with `_Py_ID` (if the string |
| 22 | +is a valid C identifier fragment) or `_Py_STR` (if it needs a separate |
| 23 | +C-compatible name).
| 24 | +These are also stored in statically allocated arrays. |
| 25 | +They are collected from CPython sources using `make regen-global-objects` |
| 26 | +(`Tools/build/generate_global_objects.py`), which generates code |
| 27 | +for declaration, initialization and finalization. |
23 | 28 |
|
24 | | - These singletons are interned in a runtime-global lookup table, |
25 | | - `_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`), |
26 | | - at runtime initialization. |
| 29 | +The empty string is one of the singletons: `_Py_STR(empty)`. |
27 | 30 |
|
28 | | -- The 256 possible one-character latin-1 strings are singletons, |
29 | | - which can be retrieved with `_Py_LATIN1_CHR(c)`, are stored in runtime-global |
30 | | - arrays, `_PyRuntime.static_objects.strings.ascii` and |
31 | | - `_PyRuntime.static_objects.strings.latin1`. |
| 31 | +The three sets of singletons (`_Py_LATIN1_CHR`, `_Py_ID`, `_Py_STR`) |
| 32 | +are disjoint. |
| 33 | +If you have such a singleton, it (and no other copy) will be interned. |
32 | 34 |
|
33 | | - These are NOT interned at startup in the normal build. |
34 | | - In the free-threaded build, they are; this avoids modifying the |
35 | | - global lookup table after threads are started. |
| 35 | +These singletons are interned in a runtime-global lookup table, |
| 36 | +`_PyRuntime.cached_objects.interned_strings` (`INTERNED_STRINGS`), |
| 37 | +at runtime initialization; the table is immutable until it's torn down
| 38 | +at runtime finalization. |
| 39 | +It is shared across threads and interpreters without any synchronization. |
36 | 40 |
|
37 | | - Interning a one-char latin-1 string will always intern the corresponding |
38 | | - singleton. |
39 | 41 |
|
40 | | -- All other strings are allocated dynamically, and have their |
41 | | - `_PyUnicode_STATE(s).statically_allocated` flag set to zero. |
42 | | - When interned, such strings are added to an interpreter-wide dict, |
43 | | - `PyInterpreterState.cached_objects.interned_strings`. |
| 42 | +## Dynamically allocated strings |
44 | 43 |
|
45 | | - The key and value of each entry in this dict reference the same object. |
| 44 | +All other strings are allocated dynamically, and have their |
| 45 | +`_PyUnicode_STATE(s).statically_allocated` flag set to zero. |
| 46 | +When interned, such strings are added to an interpreter-wide dict, |
| 47 | +`PyInterpreterState.cached_objects.interned_strings`. |
46 | 48 |
|
47 | | -The three sets of singletons (`_Py_STR`, `_Py_ID`, `_Py_LATIN1_CHR`) |
48 | | -are disjoint. |
49 | | -If you have such a singleton, it (and no other copy) will be interned. |
| 49 | +The key and value of each entry in this dict reference the same object. |
50 | 50 |
|
51 | 51 |
|
52 | 52 | ## Immortality and reference counting |
53 | 53 |
|
54 | | -Invariant: Every immortal string is interned, *except* the one-char latin-1 |
55 | | -singletons (which might but might not be interned). |
| 54 | +Invariant: Every immortal string is interned. |
56 | 55 |
|
57 | 56 | In practice, this means that you must not use `_Py_SetImmortal` on |
58 | 57 | a string. (If you know it's already immortal, don't immortalize it; |
@@ -115,8 +114,5 @@ The valid transitions between these states are: |
115 | 114 | Using `_PyUnicode_InternStatic` on these is an error; the other cases |
116 | 115 | don't change the state. |
117 | 116 |
|
118 | | -- One-char latin-1 singletons can be interned (0 -> 3) using any interning |
119 | | - function; after that the functions don't change the state. |
120 | | - |
121 | | -- Other statically allocated strings are interned (0 -> 3) at runtime init; |
| 117 | +- Singletons are interned (0 -> 3) at runtime init; |
122 | 118 | after that all interning functions don't change the state. |
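
The interning behaviour described in this diff can be observed from the Python level. The following is a minimal sketch, not part of the patch: it relies only on the public `sys.intern` function, and the variable names are illustrative. It does not inspect the C-level tables (`INTERNED_STRINGS` or the per-interpreter dict) and only shows that interning equal strings yields a single canonical object, both for a dynamically allocated string and for a one-character latin-1 string.

```python
import sys

# Build equal strings at runtime so the compiler cannot merge them as constants.
parts = ["dynamically", "built", "key"]
first = "_".join(parts)
second = "_".join(parts)

# Interning maps every equal string to one canonical object.
canonical_first = sys.intern(first)
canonical_second = sys.intern(second)
same_object = canonical_first is canonical_second

# One-character latin-1 strings: interning yields one object per character,
# no matter how the one-character string was produced.
e_acute_a = sys.intern(chr(0xE9))
e_acute_b = sys.intern(bytes([0xE9]).decode("latin-1"))
same_char_object = e_acute_a is e_acute_b

print(same_object, same_char_object)
```

Whether the canonical object for the one-character case is the statically allocated singleton cannot be checked from Python code; the sketch only demonstrates the identity guarantee that interning provides.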