Until Java 6, we had a constant-time substring on String. In Java 7, why did they decide to go with copying the char array - and degrading to linear time complexity - when something like StringBuilder was exactly meant for that?
5 Answers
Why they decided to change it is discussed in Oracle bug #4513622, "(str) keeping a substring of a field prevents GC for object":
When you call String.substring as in the example, a new character array for storage is not allocated. It uses the character array of the original String. Thus, the character array backing the original String cannot be GC'd until the substring's references can also be GC'd. This is an intentional optimization to prevent excessive allocations when using substring in common scenarios. Unfortunately, the problematic code hits a case where the overhead of the original array is noticeable. It is difficult to optimize for both edge cases. Any optimization for space/size trade-offs is generally complex and can often be platform-specific.
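To make that concrete, here is a minimal sketch of the failure mode the bug report describes (the class name, loop count, and array size are made up for illustration). On a pre-Java-7 JDK this loop retains roughly 200 MB of character data through 100 tiny keys; on Java 7+ the large arrays become garbage right away:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubstringRetention {

    // Long-lived collection that holds only tiny 8-character keys.
    static final List<String> KEYS = new ArrayList<String>();

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            char[] big = new char[1000000];
            Arrays.fill(big, 'x');
            String page = new String(big);   // large, short-lived String
            KEYS.add(page.substring(0, 8));  // tiny, long-lived substring
            // Pre-Java 7: each kept substring still points at page's
            // 1,000,000-char backing array, so ~200 MB stays reachable.
            // Java 7+:   substring copies just 8 chars, so the big
            //            arrays can be collected as soon as 'page' is.
        }
        System.out.println("kept " + KEYS.size() + " keys");
    }
}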
There's also this note, observing that what was once an optimization had by then become a pessimization according to tests:
For a long time preparations and planning have been underway to remove the offset and count fields from java.lang.String. These two fields enable multiple String instances to share the same backing character buffer. Shared character buffers were an important optimization for old benchmarks, but with current real-world code and benchmarks it's actually better not to share backing buffers. Shared char array backing buffers only "win" with very heavy use of String.substring. The negatively impacted situations can include parsers and compilers; however, current testing shows that overall this change is beneficial.
Comments
If you have a long-lived small substring of a short-lived, large parent string, the large char[] backing the parent string will not be eligible for garbage collection until the small substring itself becomes unreachable. This means a substring can take up much more memory than people expect.
The only time the Java 6 way performed significantly better was when someone took a large substring from a large parent string, which is a very rare case.
Clearly they decided that the tiny performance cost of this change was outweighed by the hidden memory problems caused by the old way. The determining factor is that the problem was hidden, not that there is a workaround.
2 Comments
It's just their crappy way of fixing some JVM garbage collection limitations.
Before Java 7, if we wanted to avoid the GC issue, we could always copy the substring instead of keeping a reference to it. It was just an extra call to the copy constructor:
String smallStr = new String(largeStr.substring(0,2)); // the copy drops the reference to largeStr's backing array
But now we can no longer have a constant-time substring. What a disaster.
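If constant-time slicing really does matter on Java 7+, one workaround (a sketch, not a drop-in substring replacement) is to work with CharSequence views instead of Strings; CharBuffer.wrap creates a read-only view over the original characters without copying them:

import java.nio.CharBuffer;

public class SliceView {
    public static void main(String[] args) {
        String large = "0123456789abcdefghijklmnopqrstuvwxyz";

        // A read-only view of indices 10..13 of 'large'; no chars are copied.
        CharSequence slice = CharBuffer.wrap(large, 10, 14);

        System.out.println(slice.length());   // 4
        System.out.println(slice.charAt(0));  // 'a'

        // Copying happens only if and when an actual String is needed.
        System.out.println(slice.toString()); // "abcd"
    }
}

Note that this re-creates the Java 6 trade-off: the view keeps the entire original String reachable for as long as the view itself is.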
4 Comments
If you care that much about copying, you shouldn't create those String instances in the first place, which already bears unnecessary copy operations even before calling substring. Since every CharsetDecoder, including those encapsulated in a Reader, operates on CharBuffer, that's your starting point. And it's already the solution: it implements CharSequence, so you can pass it to tools like the regex pattern matching engine, and it has copy-free subSequence and slice operations. You only need to create the final match-result strings. Even the simple java.util.Scanner works that way.
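A rough sketch of that approach (assuming a placeholder file name, input.txt): decode the bytes into a CharBuffer once, let the regex engine scan it as a CharSequence, and materialize Strings only for the actual matches:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharBufferScan {
    public static void main(String[] args) throws Exception {
        // Decode the whole file into one CharBuffer; no String holding
        // the full contents is ever created.
        byte[] bytes = Files.readAllBytes(Paths.get("input.txt"));
        CharBuffer chars = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes));

        // CharBuffer implements CharSequence, so the regex engine scans it directly.
        Matcher m = Pattern.compile("\\w+").matcher(chars);
        while (m.find()) {
            // Only the final match results become Strings.
            System.out.println(m.group());
        }
    }
}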
StringBuilder should solve such a problem, shouldn't it?

StringBuilder lets you work around the problem once you're aware that it exists, but it doesn't fix memory leaks in existing code. This change fixes memory leaks in existing code and, since buffer copies are usually hardware-supported, ends up not costing linear time for any substring that fits within one virtual memory page.
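For context, the pre-Java-7 StringBuilder workaround the reply alludes to looked roughly like this (a sketch with illustrative names): copy only the slice you need so the result no longer shares the parent's backing array:

public class StringBuilderWorkaround {
    public static void main(String[] args) {
        StringBuilder padding = new StringBuilder();
        for (int i = 0; i < 1000000; i++) {
            padding.append('x');
        }
        String large = "HEADER" + padding;

        // Pre-Java 7: shares large's million-plus-char backing array.
        String shared = large.substring(0, 6);

        // Workaround: copy just the six characters that are needed, so
        // 'large' and its backing array stay collectible.
        String copied = new StringBuilder(6).append(large, 0, 6).toString();

        System.out.println(shared.equals(copied)); // prints true
    }
}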