Strip \uXXXX From a String and Replace It With the Correct Unicode Character

This post was originally published in 2009
It may contain stale & outdated information. Or it may have grown more awesome with age, like the author.

About a month ago, while reading DBpedia data into a database, I discovered ‘\uXXXX’ escape sequences (a backslash, a ‘u’ and four hex digits) appearing where pretty Unicode characters should have been within my strings. The strings were to be compared to … other strings, which would have the proper Unicode characters, so I had to replace each ‘\uXXXX’ in my strings with the character it stands for. I couldn’t find a class to do this, but I found enough information to understand what needed to be done.

The function below is what I came up with.

/**
 * Replaces every Unicode escape in a string (a backslash, then 'u', then four hex digits)
 * with the Unicode character it encodes (for example: '\u1E09').
 *
 * @param slashed string containing escapes to be replaced with their Unicode characters
 * @return string with each well-formed escape converted; malformed escapes are left untouched
 * @author Michael Robinson mike@pagesofinterest.net
 */
public String unslashUnicode(String slashed){

	StringBuilder result = new StringBuilder(slashed.length());
	int pos = 0;//start of the part of the string not yet copied

	while(true){//while there is an escape left in the string

		int start = slashed.indexOf("\\u", pos);
		if(start < 0){
			break;
		}
		result.append(slashed, pos, start);//add the bit before the escape

		//need four characters after the 'u'; the first must be a hex digit, not a sign
		if(start + 6 <= slashed.length() && Character.digit(slashed.charAt(start + 2), 16) >= 0){
			try{
				result.append((char) Integer.parseInt(slashed.substring(start + 2, start + 6), 16));//add the unicode
				pos = start + 6;
				continue;
			}
			catch(NumberFormatException e){
				//not four hex digits, so fall through and keep the text as it is
			}
		}
		result.append("\\u");//malformed escape, copy it unchanged
		pos = start + 2;
	}
	result.append(slashed, pos, slashed.length());//put humpty dumpty back together again

	return result.toString();
}

Note that my strings only ever contained Unicode escaped as ‘\uXXXX’, never as ‘\UXXXX’. The above function will therefore need some modification if it is to be used with capital-‘U’ escapes.
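
A minimal sketch of one such modification follows. It is not the code used on the DBpedia data, and it rests on an assumption: that a capital ‘U’ escape carries eight hex digits (as in the N-Triples format) rather than four, while the lowercase form keeps its four digits. The class name UnicodeUnescaper and the method name unescape are made up for illustration. It swaps the manual index arithmetic for a regular expression, and leaves anything that does not match the pattern, or that falls outside the Unicode range, exactly as it was.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescaper {

    // A backslash, then either 'u' + 4 hex digits or 'U' + 8 hex digits.
    // ASSUMPTION: the eight-digit capital form follows the N-Triples convention.
    private static final Pattern ESCAPE =
            Pattern.compile("\\\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))");

    public static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0; // end of the previous match

        while (m.find()) {
            out.append(input, last, m.start()); // text before the escape

            String hex = m.group(1) != null ? m.group(1) : m.group(2);
            // Parse as long: eight hex digits can exceed Integer.MAX_VALUE.
            long value = Long.parseLong(hex, 16);

            if (value <= Character.MAX_CODE_POINT) {
                // Writes a surrogate pair when the code point needs one.
                out.appendCodePoint((int) value);
            } else {
                out.append(m.group()); // not a Unicode code point: keep the raw text
            }
            last = m.end();
        }
        out.append(input, last, input.length()); // whatever follows the last escape
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslash is Java source syntax for one literal backslash.
        System.out.println(unescape("caf\\u00E9"));
        System.out.println(unescape("\\U0001F600"));
    }
}
```

Characters above U+FFFF do not fit in a single Java char, which is why the sketch uses appendCodePoint instead of a cast to char.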
