Until Java 6, we had a constant-time substring on String. In Java 7, why did they decide to go with copying the char array - and degrading to linear time complexity - when something like StringBuilder was exactly meant for that?
5 Answers
Why they decided to change it is discussed in Oracle bug #4513622, "(str) keeping a substring of a field prevents GC for object":
When you call String.substring as in the example, a new character array for storage is not allocated. It uses the character array of the original String. Thus, the character array backing the original String cannot be GC'd until the substring's references can also be GC'd. This is an intentional optimization to prevent excessive allocations when using substring in common scenarios. Unfortunately, the problematic code hits a case where the overhead of the original array is noticeable. It is difficult to optimize for both edge cases. Any optimization for space/size trade-offs is generally complex and can often be platform-specific.
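To make that concrete, here is a minimal sketch of the failure mode the bug report describes (the class name, loop count, and array size are made up for illustration). On a pre-Java-7 JDK this loop retains roughly 200 MB of character data through 100 tiny keys; on Java 7+ the large arrays become garbage right away:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubstringRetention {

    // Long-lived collection that holds only tiny 8-character keys.
    static final List<String> KEYS = new ArrayList<String>();

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            char[] big = new char[1000000];
            Arrays.fill(big, 'x');
            String page = new String(big);   // large, short-lived String
            KEYS.add(page.substring(0, 8));  // tiny, long-lived substring
            // Pre-Java 7: each kept substring still points at page's
            // 1,000,000-char backing array, so ~200 MB stays reachable.
            // Java 7+:   substring copies just 8 chars, so the big
            //            arrays can be collected as soon as 'page' is.
        }
        System.out.println("kept " + KEYS.size() + " keys");
    }
}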
There's also this note, observing that what was once an optimization had by then become a pessimization according to tests:
For a long time preparations and planning have been underway to remove the offset and count fields from java.lang.String. These two fields enable multiple String instances to share the same backing character buffer. Shared character buffers were an important optimization for old benchmarks, but with current real-world code and benchmarks it's actually better not to share backing buffers. Shared char array backing buffers only "win" with very heavy use of String.substring. The negatively impacted situations can include parsers and compilers; however, current testing shows that overall this change is beneficial.
Comments
If you have a long-lived small substring of a short-lived, large parent string, the large char[] backing the parent string will not be eligible for garbage collection until the small substring itself becomes unreachable. This means a substring can take up much more memory than people expect.
The only time the Java 6 way performed significantly better was when someone took a large substring from a large parent string, which is a very rare case.
Clearly they decided that the tiny performance cost of this change was outweighed by the hidden memory problems caused by the old way. The determining factor is that the problem was hidden, not that there is a workaround.
2 Comments
It's just their crappy way of fixing some JVM garbage collection limitations.
Before Java 7, if we wanted to avoid the GC issue, we could always copy the substring instead of keeping a reference to it. It was just an extra call to the copy constructor:
String smallStr = new String(largeStr.substring(0,2)); // the copy drops the reference to largeStr's backing array
But now we can no longer have a constant-time substring. What a disaster.
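If constant-time slicing really does matter on Java 7+, one workaround (a sketch, not a drop-in substring replacement) is to work with CharSequence views instead of Strings; CharBuffer.wrap creates a read-only view over the original characters without copying them:

import java.nio.CharBuffer;

public class SliceView {
    public static void main(String[] args) {
        String large = "0123456789abcdefghijklmnopqrstuvwxyz";

        // A read-only view of indices 10..13 of 'large'; no chars are copied.
        CharSequence slice = CharBuffer.wrap(large, 10, 14);

        System.out.println(slice.length());   // 4
        System.out.println(slice.charAt(0));  // 'a'

        // Copying happens only if and when an actual String is needed.
        System.out.println(slice.toString()); // "abcd"
    }
}

Note that this re-creates the Java 6 trade-off: the view keeps the entire original String reachable for as long as the view itself is.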
4 Comments
If you care that much about copying, you shouldn't create those String instances in the first place, which already bears unnecessary copy operations even before calling substring. Since every CharsetDecoder, including those encapsulated in a Reader, operates on CharBuffer, that's your starting point. And it's already the solution: it implements CharSequence, so you can pass it to tools like the regex pattern matching engine, and it has copy-free subSequence and slice operations. You only need to create the final match-result strings. Even the simple java.util.Scanner works that way.
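A rough sketch of that approach (assuming a placeholder file name, input.txt): decode the bytes into a CharBuffer once, let the regex engine scan it as a CharSequence, and materialize Strings only for the actual matches:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharBufferScan {
    public static void main(String[] args) throws Exception {
        // Decode the whole file into one CharBuffer; no String holding
        // the full contents is ever created.
        byte[] bytes = Files.readAllBytes(Paths.get("input.txt"));
        CharBuffer chars = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes));

        // CharBuffer implements CharSequence, so the regex engine scans it directly.
        Matcher m = Pattern.compile("\\w+").matcher(chars);
        while (m.find()) {
            // Only the final match results become Strings.
            System.out.println(m.group());
        }
    }
}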
StringBuilder should solve such a problem, shouldn't it?

StringBuilder lets you work around the problem once you're aware that it exists, but it doesn't fix memory leaks in existing code. This change fixes memory leaks in existing code and, since buffer copies are usually hardware-supported, ends up not costing linear time for any substring that fits within one virtual memory page.
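For context, the pre-Java-7 StringBuilder workaround the reply alludes to looked roughly like this (a sketch with illustrative names): copy only the slice you need so the result no longer shares the parent's backing array:

public class StringBuilderWorkaround {
    public static void main(String[] args) {
        StringBuilder padding = new StringBuilder();
        for (int i = 0; i < 1000000; i++) {
            padding.append('x');
        }
        String large = "HEADER" + padding;

        // Pre-Java 7: shares large's million-plus-char backing array.
        String shared = large.substring(0, 6);

        // Workaround: copy just the six characters that are needed, so
        // 'large' and its backing array stay collectible.
        String copied = new StringBuilder(6).append(large, 0, 6).toString();

        System.out.println(shared.equals(copied)); // prints true
    }
}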