
By looking at the CPython implementation, it seems the return value of a string's split() is a list of newly allocated strings. However, since strings are immutable, it seems one could have made substrings out of the original string by pointing at offsets into it.

Am I understanding the current behavior of CPython correctly? Are there reasons for not opting for this space optimization? One reason I can think of is that the parent string cannot be freed until all of its substrings are.
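A quick, CPython-specific check confirms the first part of the question: split() copies characters into brand-new string objects rather than returning views into the parent.

```python
import sys

s = "alpha beta gamma"
parts = s.split()

# Each part is a distinct object, not a view into s.
print([p is s for p in parts])   # all False

# Each part carries its own full object header plus its own character
# data, which is what you would expect from independent allocations.
print(all(sys.getsizeof(p) >= sys.getsizeof("") for p in parts))
```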

3 Comments

  • What does the CPython string look like? Is it NULL terminated? Commented May 8, 2017 at 6:29
  • @Kasramvd Read it as offsets. Editing out the redundancy Commented May 8, 2017 at 6:32
  • GvR has rejected view slices for strings because of the problem of space recovery. Numerical python does use view slices because there is usually just a few master arrays which needs to be kept around awhile anyway. If you are in a similar position with a few master strings and many slices, you can write a substring view class. There may even be one on PyPI. Commented May 13, 2017 at 23:23
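Along the lines of the last comment, a substring-view class is easy to sketch in pure Python. The `StrView` name and API below are made up for illustration; this is not CPython's machinery, just a (parent, start, stop) triple that materializes a real str only on demand.

```python
class StrView:
    __slots__ = ("_s", "_start", "_stop")

    def __init__(self, s, start=0, stop=None):
        self._s = s                      # strong reference pins the parent
        self._start = start
        self._stop = len(s) if stop is None else stop

    def __str__(self):                   # copy characters only when asked
        return self._s[self._start:self._stop]

    def __len__(self):
        return self._stop - self._start

    def split_views(self, sep=" "):
        # Yield views over the parent rather than newly allocated strings.
        i = self._start
        while i < self._stop:
            j = self._s.find(sep, i, self._stop)
            if j == -1:
                j = self._stop
            yield StrView(self._s, i, j)
            i = j + len(sep)

log = "GET /a HTTP/1.1"
views = list(StrView(log).split_views())
print([str(v) for v in views])   # ['GET', '/a', 'HTTP/1.1']
```

Note that every view holds a strong reference to `log`, which is exactly the space-recovery problem the comment mentions.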

3 Answers


Without a crystal ball I can't tell you why CPython does it that way. However, there are some reasons why you might choose to do it that way.

The problem is that a small string might hold a reference to a much larger backing array. For example, suppose I read in an 8 GB HTTP access log file to analyze which user agents access my site the most, and I do that just by fp.read() and then running a regex on the whole file at once, rather than going line by line.

I want to know about the top 10 most common user agents, so I keep this around in a list.

Then I want to do the same analysis for 100 other files, to see how the top 10 user agents have changed over time. Boom! My program is trying to use 800 GB of memory and gets killed. Why? How do I debug this?
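Python's bytes/memoryview pair already exhibits exactly this retention hazard, so it makes a convenient demonstration: a tiny view pins the entire backing buffer, much as shared substrings would. (The 10 MB buffer below stands in for the 8 GB log file.)

```python
big = bytes(10_000_000)           # stand-in for an 8 GB log file
tiny = memoryview(big)[:16]       # a 16-byte "substring" of it

del big                           # the original name is gone...
print(tiny.obj is not None)       # ...but the buffer is still alive,
print(len(tiny.obj))              # pinned by the 16-byte view: 10000000
```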

Java used this sharing technique prior to Java 7, so the same reasoning applies. See Java 7 String - substring complexity and JDK-4513622: (str) keeping a substring of a field prevents GC for object.

Also note that having strings share memory would require you to follow a pointer from the string object to the string data. In CPython, the string data is usually placed directly after a header in memory, so you don't need to follow a pointer. This reduces the number of allocations required and reduces data dependencies when reading strings.
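The inline layout is visible from Python, albeit indirectly (this is a CPython-specific observation): because the characters sit right after the object header, an ASCII string's reported size grows by exactly one byte per character, with no separately allocated buffer to point at.

```python
import sys

empty = sys.getsizeof("")
print(sys.getsizeof("abcd") - empty)   # 4: one byte per ASCII character,
                                       # stored inline after the header
```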


8 Comments

Yeah, this is probably the reason (the one I anticipated in my question).
Interesting, never thought of that. Seems like a problem that should/could be fixed at the GC level by just splitting the string into new logical memory blocks though.
@monoceres: Perhaps you could write a GC implementation for us that would do that... it sounds like no small task.
@monoceres For reference, when Oracle removed string sharing from Java, they had GC experts working on improving the GC for almost 20 years and had what is in my mind the best GC on the planet, bar none.
@Kasramvd: That kind of misses the point of the example... it illustrates how a problem could arise. But there are a couple of other reasons why the example is reasonable. You may want to do multiple passes of analysis, so using fp as an iterator would make you do twice as much IO. Additionally, even though file iteration is fast, doing a regex or string search on a contiguous buffer is faster than doing it line-by-line.

In the current CPython implementation, strings are reference-counted; it is assumed that a string cannot hold references to other objects because a string is not a container. This means that garbage collection does not need to inspect or trace over string objects (because they're entirely covered by the reference counting). But it's actually worse than that: Old versions of Python did not have a tracing garbage collector at all; GC was new in 2.0. Before that, any cyclic garbage would simply leak.

A competently-implemented substring-to-offset algorithm should not form cycles. So in theory, a cyclic garbage collector is not a prerequisite for this. However, because we're doing reference counting instead of tracing, the child objects become responsible for Py_DECREF()ing their parent objects at end-of-life. Otherwise the parent leaks. This means you cannot just chuck the whole string into the free list when it reaches end-of-life; you have to check whether it's a substring, and branching is potentially expensive. Python was historically designed to do string processing (like Perl, but with nicer syntax), which means creating and destroying a lot of strings. Furthermore, all variable names are internally stored as strings, so even if the user is not doing string processing, the interpreter is. Slowing down the string deallocation process by even a little could have a serious impact on performance.
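The end-of-life bookkeeping described above can be mimicked with ordinary Python objects, since CPython frees an object as soon as its reference count hits zero. The `Parent`/`Sub` classes here are stand-ins for the original string and a substring view:

```python
import weakref

class Parent:
    pass

class Sub:                       # stand-in for a substring view
    def __init__(self, parent):
        self.parent = parent     # the child owns a strong reference

p = Parent()
alive = weakref.ref(p)           # lets us observe the parent's lifetime
sub = Sub(p)

del p                            # the "original" name is gone...
still_pinned = alive() is not None
print(still_pinned)              # True: the view still pins the parent

del sub                          # last view dies -> refcount hits zero
print(alive() is None)           # True: only now is the parent reclaimed
```

This is the Py_DECREF() responsibility in miniature: the parent's storage is only reclaimed when the last child releases it.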

3 Comments

That was very informative, thank you. I assumed live substrings could just bump up the reference count of the parent string. So the problem is that reference counting is not a very efficient way of doing garbage collection: it trades throughput for latency.
Typical implementation is to have one type for string slices and one type for the backing array, so you wouldn't ever worry about cycles and you wouldn't need anything more than reference-counting. Branching may be expensive but reference-counting requires branching already, and free() is much more expensive than a branch to begin with.
@Dietrich: You got me, there's actually a free list. Adding an extra branch really would make a difference.

CPython internally uses NUL-terminated strings in addition to storing a length. This is a very early design choice, present since the very first version of Python, and still true in the latest version.
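Because the length is stored alongside the NUL terminator, embedded NUL characters are perfectly legal at the Python level, which is easy to verify:

```python
s = "a\0b"
print(len(s))         # 3, not 1: the stored length wins over the terminator
print(s.split("\0"))  # ['a', 'b']
```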

You can see that in Include/unicodeobject.h where PyASCIIObject says "wchar_t representation (null-terminated)" and PyCompactUnicodeObject says "UTF-8 representation (null-terminated)". (Recent CPython implementations select from one of 4 back-end string types, depending on the Unicode encoding needs.)
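The per-code-point width of whichever back-end representation gets chosen shows up directly in object sizes. A rough, CPython-specific demonstration of PEP 393's 1-, 2-, and 4-byte representations:

```python
import sys

# Marginal cost of one extra character under each representation:
one  = sys.getsizeof("a" * 9)          - sys.getsizeof("a" * 8)           # ASCII
two  = sys.getsizeof("\u0100" * 9)     - sys.getsizeof("\u0100" * 8)      # BMP
four = sys.getsizeof("\U0001F600" * 9) - sys.getsizeof("\U0001F600" * 8)  # astral

print(one, two, four)   # 1 2 4
```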

Many Python extension modules expect a NUL terminated string. It would be difficult to implement substrings as slices into a larger string and preserve the low-level C API. Not impossible, as it could be done using a copy-on-C-API-access. Or Python could require all extension writers to use a new subslice-friendly API. But that complexity is not worthwhile given the problems found from experience in other languages which implement subslice references, as Dietrich Epp described.

I see little in Kevin's answer that is applicable to this question. The decision had nothing to do with the lack of cyclic garbage collection before Python 2.0, nor could it. Substring slices are implemented with an acyclic data structure. 'Competently-implemented' isn't a relevant requirement, as it would take a perverse sort of incompetence or malice to turn it into a cyclic data structure.

Nor would there necessarily be extra branch overhead in the deallocator. If the source string were one type and the substring slice another, Python's normal type dispatch would automatically select the correct deallocator, with no additional overhead. Even if there were an extra branch, we know that branching overhead in this case is not "expensive": Python 3.3 (because of PEP 393) has those 4 back-end Unicode types and decides what to do by branching. String access occurs much more often than deallocation, so any deallocation overhead due to branching would be lost in the noise.

It is mostly true that in CPython "variable names are internally stored as strings". (The exception is that local variables are stored as indices into a local array.) However, these names are also interned into a global dictionary using PyUnicode_InternInPlace(). There is therefore no deallocation overhead because these strings are not deallocated, outside of cases involving dynamic dispatch using non-interned strings, like through getattr().
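The interning behavior is observable via sys.intern(): it returns the canonical copy of a string, so two equal strings built independently at runtime collapse to a single object, and name lookups can compare pointers rather than characters.

```python
import sys

a = sys.intern("user_agent")
b = sys.intern("".join(["user", "_agent"]))  # built at runtime, then interned

print(a is b)   # True: both names now share one canonical string object
```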

