
By looking at the CPython implementation, it seems the return value of a string's split() is a list of newly allocated strings. However, since strings are immutable, it seems one could have made substrings out of the original string by pointing at offsets into it.

Am I understanding the current behavior of CPython correctly? Are there reasons for not opting for this space optimization? One reason I can think of is that the parent string cannot be freed until all of its substrings are.
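A quick, CPython-specific check confirms the first part of the question: split() copies characters into brand-new string objects rather than returning views into the parent.

```python
import sys

s = "alpha beta gamma"
parts = s.split()

# Each part is a distinct object, not a view into s.
print([p is s for p in parts])   # all False

# Each part carries its own full object header plus its own character
# data, which is what you would expect from independent allocations.
print(all(sys.getsizeof(p) >= sys.getsizeof("") for p in parts))
```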

3 Comments

  • What does the CPython string look like? Is it NULL terminated? Commented May 8, 2017 at 6:29
  • @Kasramvd Read it as offsets. Editing out the redundancy Commented May 8, 2017 at 6:32
  • GvR has rejected view slices for strings because of the problem of space recovery. Numerical python does use view slices because there is usually just a few master arrays which needs to be kept around awhile anyway. If you are in a similar position with a few master strings and many slices, you can write a substring view class. There may even be one on PyPI. Commented May 13, 2017 at 23:23
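Along the lines of the last comment, a substring-view class is easy to sketch in pure Python. The `StrView` name and API below are made up for illustration; this is not CPython's machinery, just a (parent, start, stop) triple that materializes a real str only on demand.

```python
class StrView:
    __slots__ = ("_s", "_start", "_stop")

    def __init__(self, s, start=0, stop=None):
        self._s = s                      # strong reference pins the parent
        self._start = start
        self._stop = len(s) if stop is None else stop

    def __str__(self):                   # copy characters only when asked
        return self._s[self._start:self._stop]

    def __len__(self):
        return self._stop - self._start

    def split_views(self, sep=" "):
        # Yield views over the parent rather than newly allocated strings.
        i = self._start
        while i < self._stop:
            j = self._s.find(sep, i, self._stop)
            if j == -1:
                j = self._stop
            yield StrView(self._s, i, j)
            i = j + len(sep)

log = "GET /a HTTP/1.1"
views = list(StrView(log).split_views())
print([str(v) for v in views])   # ['GET', '/a', 'HTTP/1.1']
```

Note that every view holds a strong reference to `log`, which is exactly the space-recovery problem the comment mentions.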

3 Answers


Without a crystal ball I can't tell you why CPython does it that way. However, there are some reasons why you might choose to do it that way.

The problem is that a small string might hold a reference to a much larger backing array. For example, suppose I read in an 8 GB HTTP access log file to analyze which user agents access my site the most, and I do that just by fp.read() and then running a regex on the whole file at once, rather than going line by line.

I want to know about the top 10 most common user agents, so I keep this around in a list.

Then I want to do the same analysis for 100 other files, to see how the top 10 user agents have changed over time. Boom! My program is trying to use 800 GB of memory and gets killed. Why? How do I debug this?
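Python's bytes/memoryview pair already exhibits exactly this retention hazard, so it makes a convenient demonstration: a tiny view pins the entire backing buffer, much as shared substrings would. (The 10 MB buffer below stands in for the 8 GB log file.)

```python
big = bytes(10_000_000)           # stand-in for an 8 GB log file
tiny = memoryview(big)[:16]       # a 16-byte "substring" of it

del big                           # the original name is gone...
print(tiny.obj is not None)       # ...but the buffer is still alive,
print(len(tiny.obj))              # pinned by the 16-byte view: 10000000
```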

Java used this sharing technique prior to Java 7, so the same reasoning applies. See Java 7 String - substring complexity and JDK-4513622: (str) keeping a substring of a field prevents GC for object.

Also note that having strings share memory would require you to follow a pointer from the string object to the string data. In CPython, the string data is usually placed directly after a header in memory, so you don't need to follow a pointer. This reduces the number of allocations required and reduces data dependencies when reading strings.
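The inline layout is visible from Python, albeit indirectly (this is a CPython-specific observation): because the characters sit right after the object header, an ASCII string's reported size grows by exactly one byte per character, with no separately allocated buffer to point at.

```python
import sys

empty = sys.getsizeof("")
print(sys.getsizeof("abcd") - empty)   # 4: one byte per ASCII character,
                                       # stored inline after the header
```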


8 Comments

Yeah, this is probably the reason (the one I anticipated in my question).
Interesting, never thought of that. Seems like a problem that should/could be fixed at the GC level by just splitting the string into new logical memory blocks though.
@monoceres: Perhaps you could write a GC implementation for us that would do that... it sounds like no small task.
@monoceres For reference, when Oracle removed string sharing from Java, they had GC experts working on improving the GC for almost 20 years and had what is in my mind the best GC on the planet, bar none.
@Kasramvd: That kind of misses the point of the example... it illustrates how a problem could arise. But there are a couple of other reasons why the example is reasonable. You may want to do multiple passes of analysis, so using fp as an iterator would make you do twice as much IO. Additionally, even though file iteration is fast, doing a regex or string search on a contiguous buffer is faster than doing it line-by-line.

In the current CPython implementation, strings are reference-counted; it is assumed that a string cannot hold references to other objects because a string is not a container. This means that garbage collection does not need to inspect or trace over string objects (because they're entirely covered by the reference counting). But it's actually worse than that: Old versions of Python did not have a tracing garbage collector at all; GC was new in 2.0. Before that, any cyclic garbage would simply leak.

A competently-implemented substring-to-offset algorithm should not form cycles. So in theory, a cyclic garbage collector is not a prerequisite for this. However, because we're doing reference counting instead of tracing, the child objects become responsible for Py_DECREF()ing their parent objects at end-of-life. Otherwise the parent leaks. This means you cannot just chuck the whole string into the free list when it reaches end-of-life; you have to check whether it's a substring, and branching is potentially expensive. Python was historically designed to do string processing (like Perl, but with nicer syntax), which means creating and destroying a lot of strings. Furthermore, all variable names are internally stored as strings, so even if the user is not doing string processing, the interpreter is. Slowing down the string deallocation process by even a little could have a serious impact on performance.
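The end-of-life bookkeeping described above can be mimicked with ordinary Python objects, since CPython frees an object as soon as its reference count hits zero. The `Parent`/`Sub` classes here are stand-ins for the original string and a substring view:

```python
import weakref

class Parent:
    pass

class Sub:                       # stand-in for a substring view
    def __init__(self, parent):
        self.parent = parent     # the child owns a strong reference

p = Parent()
alive = weakref.ref(p)           # lets us observe the parent's lifetime
sub = Sub(p)

del p                            # the "original" name is gone...
still_pinned = alive() is not None
print(still_pinned)              # True: the view still pins the parent

del sub                          # last view dies -> refcount hits zero
print(alive() is None)           # True: only now is the parent reclaimed
```

This is the Py_DECREF() responsibility in miniature: the parent's storage is only reclaimed when the last child releases it.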

3 Comments

That was very informative, thank you. I assumed live substrings could just bump up the reference count of the parent string. So the problem is that reference counting is not a very efficient way of doing garbage collection: it trades throughput for latency.
Typical implementation is to have one type for string slices and one type for the backing array, so you wouldn't ever worry about cycles and you wouldn't need anything more than reference-counting. Branching may be expensive but reference-counting requires branching already, and free() is much more expensive than a branch to begin with.
@Dietrich: You got me, there's actually a free list. Adding an extra branch really would make a difference.

CPython internally uses NUL-terminated strings in addition to storing a length. This is a very early design choice, present since the very first version of Python, and still true in the latest version.
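Because the length is stored alongside the NUL terminator, embedded NUL characters are perfectly legal at the Python level, which is easy to verify:

```python
s = "a\0b"
print(len(s))         # 3, not 1: the stored length wins over the terminator
print(s.split("\0"))  # ['a', 'b']
```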

You can see that in Include/unicodeobject.h where PyASCIIObject says "wchar_t representation (null-terminated)" and PyCompactUnicodeObject says "UTF-8 representation (null-terminated)". (Recent CPython implementations select from one of 4 back-end string types, depending on the Unicode encoding needs.)
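The per-code-point width of whichever back-end representation gets chosen shows up directly in object sizes. A rough, CPython-specific demonstration of PEP 393's 1-, 2-, and 4-byte representations:

```python
import sys

# Marginal cost of one extra character under each representation:
one  = sys.getsizeof("a" * 9)          - sys.getsizeof("a" * 8)           # ASCII
two  = sys.getsizeof("\u0100" * 9)     - sys.getsizeof("\u0100" * 8)      # BMP
four = sys.getsizeof("\U0001F600" * 9) - sys.getsizeof("\U0001F600" * 8)  # astral

print(one, two, four)   # 1 2 4
```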

Many Python extension modules expect a NUL terminated string. It would be difficult to implement substrings as slices into a larger string and preserve the low-level C API. Not impossible, as it could be done using a copy-on-C-API-access. Or Python could require all extension writers to use a new subslice-friendly API. But that complexity is not worthwhile given the problems found from experience in other languages which implement subslice references, as Dietrich Epp described.

I see little in Kevin's answer that is applicable to this question. The decision had nothing to do with the lack of cyclic garbage collection before Python 2.0, nor could it. Substring slices are implemented with an acyclic data structure. 'Competently-implemented' isn't a relevant requirement, as it would take a perverse sort of incompetence or malice to turn it into a cyclic data structure.

Nor would there necessarily be extra branch overhead in the deallocator. If the source string were one type and the substring slice another, Python's normal type dispatch would automatically select the correct deallocator, with no additional overhead. Even if there were an extra branch, we know that branching overhead in this case is not "expensive": Python 3.3 (because of PEP 393) has those 4 back-end Unicode types and decides what to do by branching. String access occurs much more often than deallocation, so any deallocation overhead due to branching would be lost in the noise.

It is mostly true that in CPython "variable names are internally stored as strings". (The exception is that local variables are stored as indices into a local array.) However, these names are also interned into a global dictionary using PyUnicode_InternInPlace(). There is therefore no deallocation overhead because these strings are not deallocated, outside of cases involving dynamic dispatch using non-interned strings, like through getattr().
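The interning behavior is observable via sys.intern(): it returns the canonical copy of a string, so two equal strings built independently at runtime collapse to a single object, and name lookups can compare pointers rather than characters.

```python
import sys

a = sys.intern("user_agent")
b = sys.intern("".join(["user", "_agent"]))  # built at runtime, then interned

print(a is b)   # True: both names now share one canonical string object
```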

