Tuesday, March 11, 2008

All about intern()

Posted by enicholas on June 26, 2006 at 02:16 PM

Strings are a fundamental part of any modern programming language, every bit as important as numbers. You'd think that Java programmers would go out of their way to have a solid understanding of them. Sadly, that isn't always the case.

I was going through the source code to Xerces (the XML parser included in Java) today, when I found a very surprising line:

com.sun.org.apache.xerces.internal.impl.XMLScanner:395
protected final static String fVersionSymbol = "version".intern();

There are a number of strings defined like this, and every one of them is being interned. So what exactly is intern()? Well, as you no doubt know, there are two different ways to compare objects in Java. You can use the == operator, or you can use the equals() method. The == operator compares whether two references point to the same object, whereas the equals() method compares whether two objects contain the same data.

One of the first lessons you learn in Java is that you should usually use equals(), not ==, to compare two strings. If you compare, say, new String("Hello") == new String("Hello"), you will in fact receive false, because they are two different string instances. If you use equals() instead, you will receive true, just as you'd expect. Unfortunately, the equals() method can be fairly slow: when the two strings are different objects of the same length, it has to fall back to a character-by-character comparison.
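To make the difference concrete, here is a minimal sketch of the two kinds of comparison. The class and method names are made up for illustration; only == and equals() come from the language itself.

```java
class StringComparisonDemo {
    // Identity comparison: true only if both references point to the same object.
    static boolean sameInstance(String a, String b) {
        return a == b;
    }

    // Equality comparison: true if both strings contain the same characters.
    static boolean sameContents(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        // new String(...) always creates a fresh object, even for identical text.
        String first = new String("Hello");
        String second = new String("Hello");

        System.out.println(sameInstance(first, second)); // false
        System.out.println(sameContents(first, second)); // true
    }
}
```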

Since the == operator compares identity, all it has to do is compare two pointers to see if they are the same, so it will be much faster than equals(). If you're going to be comparing the same strings repeatedly, you can get a significant performance advantage by reducing each comparison to an identity check rather than an equality check. The basic algorithm is:

1) Create a hash set of Strings
2) Check to see if the String you're dealing with is already in the set
3) If so, return the one from the set
4) Otherwise, add this string to the set and return it

After following this algorithm, you are guaranteed that if two strings contain the same characters, they are also the same instance. This means that you can safely compare strings using == rather than equals(), gaining a significant performance advantage with repeated comparisons.
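The four steps above could be sketched roughly like this. SimpleInterner is a hypothetical name, and one detail differs from the description: a HashSet can tell you that a string is present but cannot hand back the stored instance, so this sketch keeps a HashMap from each string to its canonical copy instead. It is also not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

class SimpleInterner {
    // Step 1: the pool of canonical strings (a map, so the stored instance can be returned).
    private final Map<String, String> pool = new HashMap<>();

    String intern(String s) {
        // Step 2: check whether an equal string is already in the pool.
        String canonical = pool.get(s);
        if (canonical != null) {
            // Step 3: if so, return the pooled instance.
            return canonical;
        }
        // Step 4: otherwise, add this string to the pool and return it.
        pool.put(s, s);
        return s;
    }

    public static void main(String[] args) {
        SimpleInterner interner = new SimpleInterner();
        String a = interner.intern(new String("Hello"));
        String b = interner.intern(new String("Hello"));
        System.out.println(a == b); // true
    }
}
```

Every string that goes through the same SimpleInterner comes back as the one pooled instance, which is what makes the == comparison safe.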

Fortunately, Java already includes an implementation of the algorithm above. It's the intern() method on java.lang.String. new String("Hello").intern() == new String("Hello").intern() returns true, whereas without the intern() calls it returns false.
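Here is a small sketch of that comparison using the real String.intern() method. The class and helper names are invented for the example.

```java
class InternDemo {
    // Compares the canonical (interned) instances of two strings by identity.
    static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String first = new String("Hello");
        String second = new String("Hello");

        System.out.println(first == second);                // false
        System.out.println(sameAfterIntern(first, second)); // true
    }
}
```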

So why was I so surprised to see protected final static String fVersionSymbol = "version".intern(); in the Xerces source code? Obviously this string will be used for many comparisons. Doesn't it make sense to intern it?

Sure it does. That's why Java already does it. All constant strings that appear in a class are automatically interned. This includes both your own constants (like the above "version" string) as well as other strings that are part of the class file format -- class names, method and field signatures, and so forth. It even extends to constant string expressions: "Hel" + "lo" is processed by javac exactly the same as "Hello", and "Hel" + "lo" == "Hello" will return true.
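A quick sketch of the constant-expression case, next to a concatenation that happens at run time for contrast. The class and method names are hypothetical. The prefix is passed in as a parameter so that the compiler cannot treat the second expression as a constant.

```java
class ConstantConcatDemo {
    // Both operands are literals, so javac folds this into the single constant "Hello".
    static boolean constantConcat() {
        return ("Hel" + "lo") == "Hello";
    }

    // prefix is only known at run time, so the concatenation creates a new String object.
    static boolean runtimeConcat(String prefix) {
        return (prefix + "lo") == "Hello";
    }

    public static void main(String[] args) {
        System.out.println(constantConcat());     // true
        System.out.println(runtimeConcat("Hel")); // false
    }
}
```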

So the result of calling intern() on a constant string like "version" is by definition going to be the exact same string you passed in. "version" == "version".intern(), always. You only need to intern strings when they are not constants, and you want to be able to quickly compare them to other interned strings.
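As a sketch of where intern() is redundant and where it actually does something: VERSION below is a hypothetical stand-in for the Xerces field, and buildAtRuntime() simply simulates a string that arrives at run time, for instance from a parser.

```java
class RedundantInternDemo {
    // Hypothetical stand-in for the Xerces constant; a literal is already interned.
    static final String VERSION = "version";

    // Builds the text "version" at run time, so the result is not a constant.
    static String buildAtRuntime() {
        return new StringBuilder("ver").append("sion").toString();
    }

    public static void main(String[] args) {
        // Interning a constant just returns the same instance.
        System.out.println(VERSION == VERSION.intern()); // true

        String parsed = buildAtRuntime();
        System.out.println(parsed == VERSION);           // false
        System.out.println(parsed.intern() == VERSION);  // true
    }
}
```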

There can also be a memory advantage to interning strings -- you only keep one copy of the string's characters in memory, no matter how many times you refer to it. That's the main reason why class file constant strings are interned: think about how many classes refer to (say) java.lang.Object. The name of the class java.lang.Object has to appear in every single one of those classes, but thanks to the magic of intern(), it only appears in memory once.

The bottom line? intern() is a useful method and can make life easier -- but make sure that you're using it responsibly.
