String interning can save storage memory and increase string comparison performance. category 'experiment', language C#, created 03-Sep-2009, version V1.0, by Luc Pattyn |
License: The author hereby grants you a worldwide, non-exclusive license to use and redistribute the files and the source code in the article in any way you see fit, provided you keep the copyright notice in place; when code modifications are applied, the notice must reflect that. The author retains copyright to the article, you may not republish or otherwise make available the article, in whole or in part, without the prior written consent of the author. Disclaimer: This work is provided “as is”, without any express or implied warranties or conditions or guarantees. You, the user, assume all risk in its use. In no event will the author be liable to you on any legal theory for any special, incidental, consequential, punitive or exemplary damages arising out of this license or the use of the work or otherwise. |
In .NET strings get interned sometimes in order to save memory space and/or CPU cycles. I decided to investigate this a bit.
The concept of string interning is wide spread; it is documented everywhere, including here. The basic idea is to avoid storing duplicate strings, saving memory for storing the strings, and maybe also cycles when string comparison starts with comparing the references.
I did not really know what is triggering the interning; is it applied all the time (which would have its cost) or under some circumstances only, or under user control? The answers as far as I gathered so far seem to be:
string.Intern(string)
.Some code peeking with Reflector also showed that a string comparison such as if (string1==string2)...
gets compiled exactly the same as if (string.Equals(string1, string2)) ...
so they both rely on
the methods provided by the string class. And both String.Compare() and String.Equals() start by comparing the
string reference: equal references are bound to result in equal values.
I decided to create a little program that generated, listed and compared a lot of strings, while measuring both
the memory footprint of those strings and the execution time. The memory footprint is obtained by comparing
Process.PeakWorkingSet64
both before and after the operation. Warning: this number can never decrease
(hence the "peak" in its name), so each run of the program can only generate one reliable result. The elapsed time
is measured using a Stopwatch instance.
The environment used is the Microsoft .NET Framework (version 2.0 or above) and the C# programming language. The code is pretty simple:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text;
namespace StringInterning {
class Program {
static void Main(string[] args) {
int differences=1000;
int repeats=1000;
int action=0;
if (args.Length==1) int.TryParse(args[0].ToLower(), out action);
switch (action) {
case 1:
testStrings(differences*repeats, 1, false);
break;
case 2:
testStrings(differences, repeats, false);
break;
case 3:
testStrings(differences*repeats, 1, true);
break;
case 4:
testStrings(differences, repeats, true);
break;
case 5:
// a secret way to test all four at once (with inaccurate WS result though)
testStrings(differences, repeats, true);
testStrings(differences*repeats, 1, false);
testStrings(differences, repeats, false);
testStrings(differences*repeats, 1, true);
break;
default:
log("StringInterning needs one numeric argument in the range [1,4] to select the test case");
break;
}
Console.Write("Hit ENTER to terminate...");
Console.ReadLine();
}
private static void log(string s) {
Console.WriteLine(s);
}
private static void testStrings(int differences, int repeats, bool interning) {
GC.Collect();
Process process=Process.GetCurrentProcess();
process.Refresh();
long WS=process.PeakWorkingSet64;
string title="Generating "+differences.ToString("#,###")+" different strings";
if (repeats<=1) title+=" once"; else title+=" "+repeats.ToString("#,###")+" times";
title+=interning?" with interning":" without interning";
log(title);
List<string> list=new List<string>(differences*repeats);
Stopwatch stopwatch=new Stopwatch();
stopwatch.Start();
for (int repeat=0; repeat<repeats; repeat++) {
for (int count=0; count<differences; count++) {
string s=string.Format("text line {0}", count);
if (interning) s=string.Intern(s);
list.Add(s);
}
}
stopwatch.Stop();
GC.Collect();
process.Refresh();
WS=(process.PeakWorkingSet64-WS)/1024/1024;
log("took "+stopwatch.Elapsed.TotalMilliseconds.ToString("#,###")+" msec and required "+WS+" MB");
stopwatch.Reset();
stopwatch.Start();
long matches=0;
int listCount=list.Count;
for (int i1=0; i1<listCount; i1+=100) {
string s1=list[i1];
for (int i2=0; i2<listCount; i2+=100) {
string s2=list[i2];
if (s1==s2) matches++;
}
}
stopwatch.Stop();
log("Comparing 1% of them took "+stopwatch.Elapsed.TotalMilliseconds.ToString("#,###")+
" msec, and found "+matches.ToString("#,###")+" matches");
}
}
}
And the results were (always generating 1 million strings and comparing 1% of them):
action differences repeats generationTime storage comparisonTime matches
1 1,000 1000 1.3 sec 68 MB 12 sec 10,000
2 1,000,000 1 1.3 sec 68 MB 12 sec 10,000,000
3 1,000 1000 2.4 sec 96 MB 12 sec 10,000
4 1,000,000 1 1.0 sec 5 MB 2 sec 10,000,000
String interning works: it makes sense to call string.Intern()
explicitly provided there will be
enough duplicates to amortize the effort; the storage can go down and so can the comparison time, provided enough
matches are present in the set.
Perceler |
Copyright © 2012, Luc Pattyn |
Last Modified 02-Sep-2013 |