Elliot Haughin

Building UTF8 Compatible CodeIgniter Applications

23rd February 2010 / Image: (cc) Roche Photo

UTF-8 allows your site to represent characters other than those in the basic English alphabet. More often than not, your CodeIgniter application will contain forms where users can enter their name, and this is where you’ll most commonly see unusual characters cropping up. To make sure your site can properly deliver all of these to the browser, you need to use UTF-8, an encoding that represents any Unicode character in one to four bytes.
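To make the "one to four bytes" point concrete, here is a small sketch that prints the bytes behind a few characters. The helper name utf8_hex_bytes() is made up for this illustration, and the script assumes the source file itself is saved as UTF-8.

```php
<?php
// Assumes this source file is saved as UTF-8.

/**
 * Return the bytes of a string as space-separated hex pairs.
 * (Hypothetical helper, used only for this illustration.)
 */
function utf8_hex_bytes($str)
{
    // bin2hex() yields two hex digits per byte; split them into pairs.
    return implode(' ', str_split(bin2hex($str), 2));
}

foreach (array('A', 'ñ', '€', '😀') as $char)
{
    echo $char, ' => ', utf8_hex_bytes($char), "\n";
}
// A => 41
// ñ => c3 b1
// € => e2 82 ac
// 😀 => f0 9f 98 80
```

Plain ASCII stays at one byte, while accented letters, symbols and emoji grow to two, three and four bytes respectively.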

This guide assumes you are reasonably competent at installing PHP extensions, adding config variables to your php.ini, and overriding CodeIgniter core files with the MY_ prefix. If you’re not sure about any of these, please make sure you consult a professional.

PHP

PHP has a few issues when working with UTF-8. Because UTF-8 is a variable-length encoding, many characters take up more than one byte. PHP’s standard string functions work on bytes, so there are a few ‘multibyte unsafe operations’ which do not recognise characters longer than one byte. To fix this, we can use mbstring. The PHP manual explains:

While there are many languages in which every necessary character can be represented by a one-to-one mapping to an 8-bit value, there are also several languages which require so many characters for written communication that they cannot be contained within the range a mere byte can code (A byte is made up of eight bits. Each bit can contain only two distinct values, one or zero. Because of this, a byte can only represent 256 unique values (two to the power of eight)). Multibyte character encoding schemes were developed to express more than 256 characters in the regular bytewise coding system.

When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it probably fails to detect the beginning or ending of the multibyte character and ends up with a corrupted garbage string that most likely loses its original meaning.

mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience.
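As a rough illustration of the difference, the sketch below compares the byte-based functions with their mbstring counterparts. It assumes the file is saved as UTF-8, that mbstring is loaded, and that mbstring.func_overload is not enabled (otherwise strlen() and substr() would already behave like the mb_ versions).

```php
<?php
// Assumes: file saved as UTF-8, mbstring loaded, mbstring.func_overload off.
$name = 'Törölä';

echo strlen($name), "\n";              // 9: bytes, since each ö and ä takes two
echo mb_strlen($name, 'UTF-8'), "\n";  // 6: characters

// Cutting by bytes can split a character in half ...
$broken = substr($name, 0, 2);             // "T" plus the first byte of "ö"
// ... while the multibyte-aware version cuts on character boundaries.
$clean = mb_substr($name, 0, 2, 'UTF-8');  // "Tö"

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($clean, 'UTF-8'));  // bool(true)
```

The $broken string is exactly the kind of "corrupted garbage string" the manual warns about: it ends in half a character and is no longer valid UTF-8.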

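Once you’ve installed, configured and enabled this extension with mbstring.func_overload set (as in the configuration further down), several core PHP functions are automatically overloaded by their mbstring equivalents. These include mail(), strlen(), strpos(), strrpos(), substr(), strtolower(), strtoupper() and substr_count(), along with the ereg() family; the PHP manual has the full list.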

You can install mbstring in a few different ways. For simplicity, here’s how you would do it using a package manager (APT) on Ubuntu:

sudo apt-get install php5-mbstring

Once the installation is complete, you’ll need to enable the extension and add some configuration to your php.ini file:

extension=mbstring.so

[mbstring]
mbstring.language             = Neutral ; Neutral language (UTF-8), the default
mbstring.internal_encoding    = UTF-8   ; Internal encoding
mbstring.encoding_translation = On      ; Enable HTTP input encoding translation
mbstring.http_input           = auto    ; Detect the HTTP input character set automatically
mbstring.http_output          = UTF-8   ; HTTP output encoding
mbstring.detect_order         = auto    ; Automatic character encoding detection order
mbstring.substitute_character = none    ; Do not print invalid characters
mbstring.func_overload        = 7       ; Overload mail(), the str*() and the ereg*() functions
default_charset               = UTF-8   ; Charset sent in the Content-Type header

With that out of the way, restart your webserver, and you’re set up with mbstring.

Database

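Because the database may be storing some of our unusual characters, it also needs an encoding that supports them. On your database, change the character set to a UTF-8 compatible one. Under MySQL, I tend to use the utf8 character set with the “utf8_general_ci” collation. Note that MySQL’s utf8 stores at most three bytes per character; the full four-byte range needs utf8mb4, available from MySQL 5.5.3.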
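The connection CodeIgniter opens should use the same character set as the tables. As a sketch, these are the relevant lines of application/config/database.php, assuming the stock 'default' connection group; the rest of the file is left out.

```php
<?php
// application/config/database.php (excerpt; 'default' group assumed)

// Character set used when talking to the database.
$db['default']['char_set'] = 'utf8';

// Collation used for comparisons and sorting.
$db['default']['dbcollat'] = 'utf8_general_ci';
```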

CodeIgniter

We all know and love CodeIgniter, our favourite PHP framework. By default, though, there are some places where CodeIgniter just doesn’t deal with UTF-8 encoded characters very well. Because CodeIgniter is so flexible, we can write our own versions of some of the core methods and have them override the defaults, fixing the problems.

If you’re using the form_helper, you’ll need to modify the form_open method to tell the form to use UTF-8 encoding. This new file will also fix a potential issue with htmlspecialchars.

Create the file: application/helpers/MY_form_helper.php containing:
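The helper file itself is not reproduced in this text. As a stand-in, the sketch below shows the two ideas in isolation: asking the browser to submit UTF-8 through the accept-charset attribute, and always naming the charset when calling htmlspecialchars. The function names utf8_escape() and utf8_form_open() are hypothetical and are not CodeIgniter’s own form_prep() or form_open().

```php
<?php
// Hypothetical helpers for illustration; not the original MY_form_helper.php.

/**
 * Escape a value for HTML output, naming the charset explicitly so the
 * result does not depend on PHP's default.
 */
function utf8_escape($str)
{
    return htmlspecialchars((string) $str, ENT_QUOTES, 'UTF-8');
}

/**
 * Build an opening <form> tag that asks the browser to submit UTF-8.
 */
function utf8_form_open($action, array $attributes = array())
{
    // Defaults first; anything passed in $attributes overrides them.
    $attributes = array_merge(
        array('method' => 'post', 'accept-charset' => 'utf-8'),
        $attributes
    );

    $html = '<form action="' . utf8_escape($action) . '"';
    foreach ($attributes as $name => $value)
    {
        $html .= ' ' . $name . '="' . utf8_escape($value) . '"';
    }

    return $html . '>';
}

echo utf8_form_open('/users/save', array('id' => 'profile')), "\n";
// <form action="/users/save" method="post" accept-charset="utf-8" id="profile">
```

In a real MY_form_helper.php you would give the function the same name as the one being replaced (form_open) so that CodeIgniter picks up your version instead of its own.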

Now we move on to the slightly trickier stuff. The XMLRPC library bundled with CodeIgniter has some code that isn’t safe with UTF-8, in particular its calls to htmlentities and htmlspecialchars. The file is pretty big, so I won’t put it all here.

MY_Xmlrpc.php

Place this file in application/libraries/MY_Xmlrpc.php

The email library has a similar problem, so here’s a version of the email library with the fix in place:

MY_Email.php

Place this file in application/libraries/MY_Email.php

Now, finally, before we go ahead and build our application, we need to make the header.php in our view files (or your equivalent) declare UTF-8:
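The original markup is not reproduced in this text, so what follows is only a sketch of what such a header view might contain. It assumes a $title variable passed in by the controller (with a fallback so the file also runs on its own), and it declares the charset both in the HTTP header and in a meta tag.

```php
<?php
// application/views/header.php (hypothetical contents)

// $title is assumed to come from the controller; fall back for standalone use.
$title = isset($title) ? $title : 'My Site';

// Declare the charset in the HTTP response, unless output has already started.
if ( ! headers_sent())
{
    header('Content-Type: text/html; charset=utf-8');
}
?>
<!DOCTYPE html>
<html>
<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
	<title><?php echo htmlspecialchars($title, ENT_QUOTES, 'UTF-8'); ?></title>
</head>
<body>
```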

You should now be ready to build your UTF-8 web application in CodeIgniter, but I’m not going to leave you without an extra helper to make the journey a little easier. First of all, create a new config file:

application/config/international.php

Now create a new helper:

application/helpers/international.php

There are some helpful functions in here, like utf8_to_uri(), which takes accented characters and converts them into standard alphabet characters:

 
$utf8 = "Iñtërnâtiônàlizætiøn";
echo utf8_to_uri($utf8);
 
// Produces: (string) "Internationalization"
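The helper’s source is not included in this text. To give a feel for how this kind of conversion can work, here is a minimal sketch built on strtr() with a lookup table. The function name ascii_fold() and the tiny character map are assumptions made for this example, not the implementation of utf8_to_uri(), and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical stand-in; not the article's utf8_to_uri() implementation.
// Assumes this source file is saved as UTF-8.

function ascii_fold($str)
{
    // Small illustrative map; a real helper would cover far more characters.
    static $map = array(
        'ñ' => 'n', 'ë' => 'e', 'â' => 'a', 'ô' => 'o',
        'à' => 'a', 'æ' => 'a', 'ø' => 'o',
    );

    // With an array argument, strtr() replaces whole byte sequences,
    // so a multibyte character is never split in half.
    return strtr($str, $map);
}

echo ascii_fold('Iñtërnâtiônàlizætiøn'), "\n";
// Internationalization
```

Characters missing from the map pass through unchanged, so a URI-safe version would still need a final step that strips or encodes whatever is left.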

As you can see, Unicode isn’t exactly straightforward. But if you want to build a truly international web application, it certainly is important!

  • Sami Törölä
    Nice to see you back, with quality content as always! I was just struggling with character encoding issues and your function helped to debug the problem, nice timing. There also seems to be a problem in CI's text_helper: at least character_limiter produces invalid UTF-8 strings. I did a quick fix to get things running, but it would need proper fixes too.
  • @Elliot: Good article dude, you picked up on quite a few bits I missed. It might be worth mentioning that the MY_form_helper.php is already included in the SVN version of CodeIgniter. I submitted it myself. ^_^

    @Neil Bradley: My article focuses mainly on the database side of getting UTF-8 working and this article covers the PHP work required. Between them it covers the whole picture of UTF-8 and CodeIgniter.
  • Awesome new post Elliot. Good to see you back. :D
    Phil Sturgeon had some additional UTF-8 support code ( http://philsturgeon.co.uk/news/2009/08/UTF-8-support-for-CodeIgniter ). Not sure if you have seen these. Wondering what your thoughts are on those?