HTML-Encoding UTF-8 Characters

Object Partners

It happens sometimes that a web page isn’t served as UTF-8, but there’s a need to display characters that the page’s encoding can’t represent. Thankfully HTML offers numeric character references, an encoding that allows displaying any arbitrary Unicode character (of course, only if the font supports the character, but that’s another topic). Sadly, though, there aren’t any quick helpers, like the Apache Commons Lang StringEscapeUtils, to do the encoding. StringEscapeUtils will translate double-quotes into &quot; and ampersands into &amp; and other common characters into their named entities, but it doesn’t seem to touch characters that have no named entity, such as Japanese text.

For example, there may be a need to offer a language drop-down, and the decided-upon best way to do that is to offer each language in that language. Then rather than seeing “Japanese” in English, the user would see 日本語 and can recognize their desired language. While it displays in the browser as “Japanese” in Japanese, in the HTML source it’s written as the encoded string &#x65e5;&#x672c;&#x8a9e;.
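To see where those hexadecimal numbers come from, here is a minimal JDK-only sketch. The class and method names (CodePointDemo, toHexReferences) are made up for illustration, and unlike a real encoder it turns every character, ASCII included, into a reference.

```java
final class CodePointDemo {
    // Hypothetical helper: writes every code point as &#x<hex>;
    static String toHexReferences(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp ->
                out.append("&#x").append(Integer.toHexString(cp)).append(';'));
        return out.toString();
    }

    public static void main(String[] args) {
        // \u65e5\u672c\u8a9e is 日本語, written with escapes so the
        // source file itself stays ASCII
        System.out.println(toHexReferences("\u65e5\u672c\u8a9e"));
        // prints &#x65e5;&#x672c;&#x8a9e;
    }
}
```

Each reference is simply the Unicode code point of one character, so 日 (U+65E5) becomes &#x65e5; and so on.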

If the HTML page isn’t being delivered in UTF-8, and the font used has the Unicode characters, the HTML-encoded string will display properly. Even if the page is delivered in UTF-8, the encoded characters will still display correctly, so it’s a nice safety net. Plus it allows storage of such text in databases, on file systems, or in file types that don’t support UTF-8 (since technically it’s all ASCII when encoded). Of course, the HTML-encoding is really only useful if the end target is HTML, but it may be the case that the stored files exist to serve HTML, like, well, HTML files.
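The “it’s all ASCII when encoded” point is easy to check. The following is only a sketch using the JDK’s CharsetEncoder; the class and method names (AsciiCheck, isPureAscii) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class AsciiCheck {
    // Hypothetical helper: true when every character fits in US-ASCII
    static boolean isPureAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // The raw Japanese text cannot be stored in an ASCII-only medium
        System.out.println(isPureAscii("\u65e5\u672c\u8a9e"));       // prints false
        // The HTML-encoded form can
        System.out.println(isPureAscii("&#x65e5;&#x672c;&#x8a9e;")); // prints true
    }
}
```

Anything that accepts plain ASCII, such as a Latin-1 database column or a properties file, should therefore accept the encoded form unchanged.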

Since all strings in Java are Unicode (stored internally as UTF-16), it’s easy to forget that a string may have characters that aren’t going to be displayed correctly once it reaches the browser. This little snippet will close that gap. It can be used to encode strings going to a database or file, too. There’s no corresponding decode mechanism, but it’s pretty simple to pull apart the ampersand-octothorpe-number-semicolon strings to get the original characters back; plus, curiously, these strings have usually already been decoded by the time a Servlet receives them, if that’s where the application is working.

import org.apache.commons.lang3.StringEscapeUtils;

/**
 * Takes a Java string and encodes each non-ASCII code point as an
 * ampersand-octothorpe-x-hex-digits-semicolon
 * HTML-encoded character
 *
 * @param string the text to encode
 * @return HTML-encoded String containing only ASCII characters
 */
private String htmlEncode(final String string) {
  final StringBuilder stringBuilder = new StringBuilder();
  // Step by code point rather than by char, so that a surrogate pair
  // (a character outside the BMP) yields one reference, not two
  for (int i = 0; i < string.length(); ) {
    final int codePoint = string.codePointAt(i);
    if (codePoint < 0x80) {
      // Encode common HTML equivalent characters
      stringBuilder.append(
          StringEscapeUtils.escapeHtml4(String.valueOf((char) codePoint)));
    } else {
      // Why isn't this done in escapeHtml4()?
      stringBuilder.append(String.format("&#x%x;", codePoint));
    }
    i += Character.charCount(codePoint);
  }
  return stringBuilder.toString();
}
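
For the missing decode direction, pulling the ampersand-octothorpe-number-semicolon strings apart could look roughly like this. It is a sketch and not part of the snippet above: the names (HtmlReferenceDecoder, decodeHexReferences) are invented here, it only understands the hexadecimal form that htmlEncode() produces, and it leaves named entities such as &amp; and decimal references alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlReferenceDecoder {
    // Matches hexadecimal references such as &#x65e5; (one to six hex digits)
    private static final Pattern HEX_REFERENCE =
            Pattern.compile("&#x([0-9a-fA-F]{1,6});");

    // Hypothetical helper: replaces each hex reference with its character
    static String decodeHexReferences(String encoded) {
        Matcher matcher = HEX_REFERENCE.matcher(encoded);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the text between the previous match and this one
            out.append(encoded, last, matcher.start());
            int codePoint = Integer.parseInt(matcher.group(1), 16);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Not a Unicode code point: keep the reference as written
                out.append(matcher.group());
            }
            last = matcher.end();
        }
        out.append(encoded, last, encoded.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decodeHexReferences("&#x65e5;&#x672c;&#x8a9e;");
        System.out.println(decoded.equals("\u65e5\u672c\u8a9e")); // prints true
    }
}
```

A full round trip would also need to unescape the named entities that escapeHtml4() emits for ASCII characters; Commons Lang has StringEscapeUtils.unescapeHtml4() for that.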
