Strip \uXXXX From String and Replace it With the Correct Unicode Character


This post was originally published in 2009
It may contain stale & outdated information. Or it may have grown more awesome with age, like the author.

About a month ago, when reading DBPedia data into a database, I discovered ‘\uXXXX’ escape sequences appearing where pretty Unicode characters should be within my strings. The strings were to be compared to … other strings, which would have the proper Unicode characters, so I had to replace the ‘\uXXXX’ in my strings. I couldn’t find a class to do this, but found enough information to understand what needed to be done.

The below function is what I came up with.

import java.util.ArrayList;

/**
 * Replaces each backslash-u escape (a backslash, 'u' and four hex digits) in a string
 * with the Unicode character it encodes (for example: '\u1E09').
 * @param slashed string containing backslash-u escapes to be replaced with their Unicode characters
 * @return Unicode string with every backslash-u escape converted into its character.
 * @author Michael Robinson
 */
public String unslashUnicode(String slashed){
	ArrayList<String> pieces = new ArrayList<String>();
	while(slashed.contains("\\u")){//while an escape remains in the string
		int start = slashed.indexOf("\\u");
		pieces.add(slashed.substring(0,start));//add the bit before the escape
		//parse the four hex digits after the backslash and 'u' into a char
		char c = (char) Integer.parseInt(slashed.substring(start+2,start+6), 16);
		slashed = slashed.substring(start+6);//keep only what follows the escape
		pieces.add(c+"");//add the unicode
	}
	String temp = "";
	for(String s : pieces){
		temp = temp + s;//put humpty dumpty back together again
	}
	slashed = temp + slashed;//append whatever followed the last escape
	return slashed;
}

Note that my strings only ever contained unicode slashed as ‘\uXXXX’, never as ‘\UXXXX’. The above function, therefore, will need some modification if it is to be used with capital ‘U’ slashed unicode characters.
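For anyone who does need the capital-U form, here is a rough sketch of one way to handle both cases in a single pass with a regular expression. It is not the function from this post: the names `UnicodeUnescaper`, `ESCAPE` and `unescape` are made up for illustration, and it assumes the capital form carries eight hex digits (as it does in N-Triples), which may not match your data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
final class UnicodeUnescaper {
    // Group 1: four hex digits after a backslash and lowercase 'u'.
    // Group 2: eight hex digits after a backslash and capital 'U'
    // (the eight-digit length is an assumption borrowed from N-Triples).
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\u([0-9A-Fa-f]{4})|\\\\U([0-9A-Fa-f]{8})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String hex = m.group(1) != null ? m.group(1) : m.group(2);
            long value = Long.parseLong(hex, 16);
            // Values beyond the Unicode range are left exactly as they were found.
            String replacement = value <= Character.MAX_CODE_POINT
                    ? new String(Character.toChars((int) value))
                    : m.group();
            // quoteReplacement stops '$' and backslashes being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last escape
        return out.toString();
    }
}
```

With this sketch, `UnicodeUnescaper.unescape("Fran\\u00E7ois")` returns `François`, and a string with no escapes comes back unchanged. Code points above U+FFFF come out as a surrogate pair, which is why it uses `Character.toChars` rather than a cast to `char`.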
