Inconsistent Behavior of String Methods for Special Unicode Characters
Take a look at the following function:
string StripSpaces(string input)
{
    // Collapse runs of spaces: keep replacing two spaces with one
    // until no double space is found.
    while (input.IndexOf("  ") >= 0)
    {
        input = input.Replace("  ", " ");
    }
    return input;
}
Can you think of an input causing it to loop infinitely? No? Try calling it like this, then:
var result = StripSpaces(" \ufffd ");
Yes, in this case the method never returns, because of the special Unicode replacement character U+FFFD. It seems that different String methods handle it in different ways:
IndexOf() ignores it when searching for patterns, so it finds two spaces in the above input string.
Replace() is aware of it and therefore doesn't replace the two spaces with a single one, keeping the string unchanged and causing an infinite loop in the above method.
When will you encounter the replacement character? Typically it is returned when a file contains an invalid byte value for the given encoding.
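To see how the character can sneak into a string, here is a minimal sketch that decodes a byte sequence containing a byte that is never valid in UTF-8 (0xFF). It relies on the default behavior of Encoding.UTF8, which substitutes U+FFFD for invalid bytes instead of throwing. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Decodes bytes as UTF-8 with the default decoder, which substitutes
    // U+FFFD for byte sequences that are not valid UTF-8.
    public static string DecodeUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // IndexOf(char) compares char values directly, so it reliably
    // detects the replacement character.
    public static bool ContainsReplacementChar(string text)
    {
        return text.IndexOf('\ufffd') >= 0;
    }
}
```

Calling ReplacementCharDemo.DecodeUtf8(new byte[] { 0x20, 0xFF, 0x20 }) yields a string with U+FFFD between two spaces, which is exactly the kind of input that trips up StripSpaces.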
What can you do about it? Strip it from the input string like this:
input = input.Replace("\ufffd", "");
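If you would rather keep the character, another option is to make both calls search the same way. As far as I can tell, the mismatch arises because IndexOf(string) does a culture-sensitive search by default, while Replace(string, string) does an ordinal one. The sketch below passes StringComparison.Ordinal to IndexOf so the loop condition and the replacement agree; the class name is hypothetical.

```csharp
using System;

public static class SpaceCollapser
{
    public static string StripSpaces(string input)
    {
        // Ordinal search compares raw char values, which matches how
        // Replace(string, string) looks for the pattern. The loop can
        // therefore only continue while Replace still has work to do.
        while (input.IndexOf("  ", StringComparison.Ordinal) >= 0)
        {
            input = input.Replace("  ", " ");
        }
        return input;
    }
}
```

With this version, SpaceCollapser.StripSpaces(" \ufffd ") returns immediately and leaves the string unchanged, because the two spaces are not adjacent in ordinal terms.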
Lesson of the day? Never trust your input.