UTF-8 allows your site to represent characters other than those in the basic English alphabet. More often than not, your CodeIgniter application will contain forms where users can enter their name, and this is where you’ll most commonly see unusual characters cropping up. To make sure your site can properly send all of these to the browser, you need to use UTF-8, a way of encoding any Unicode character into one to four bytes of data.
This guide assumes you are reasonably competent in installing PHP extensions, adding config variables to your php.ini, and using MY_ CodeIgniter overloading. If you’re not sure about any of these, please make sure you consult a professional.
PHP
PHP has a few issues when working with UTF-8. Because UTF-8 encodes each character using a variable number of bytes, some characters are longer than one byte. PHP has a few ‘multibyte unsafe operations’ that treat every byte as a character and so do not detect characters longer than one byte. To fix this, we can use mbstring.
While there are many languages in which every necessary character can be represented by a one-to-one mapping to an 8-bit value, there are also several languages which require so many characters for written communication that they cannot be contained within the range a mere byte can code (A byte is made up of eight bits. Each bit can contain only two distinct values, one or zero. Because of this, a byte can only represent 256 unique values (two to the power of eight)). Multibyte character encoding schemes were developed to express more than 256 characters in the regular bytewise coding system.
When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it probably fails to detect the beginning or ending of the multibyte character and ends up with a corrupted garbage string that most likely loses its original meaning.
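To make the problem concrete, here is a minimal sketch comparing a byte-oriented function with its multibyte-aware counterpart. It assumes mbstring is installed and that mbstring.func_overload is not active (with overloading enabled, strlen() and substr() would already behave like the mb_ versions). The sample name is written with explicit byte escapes so the result does not depend on how the file is saved.

```php
<?php
// "José" written with explicit UTF-8 bytes: é is the two bytes 0xC3 0xA9.
$name = "Jos\xC3\xA9";

// strlen() counts bytes; mb_strlen() counts characters in the given encoding.
$byte_length = strlen($name);             // 5 bytes
$char_length = mb_strlen($name, 'UTF-8'); // 4 characters

// substr() cuts on a byte offset and splits é in half,
// leaving a dangling 0xC3 byte that is not valid UTF-8.
$byte_cut = substr($name, 0, 4);
$char_cut = mb_substr($name, 0, 4, 'UTF-8');

var_dump($byte_length, $char_length);
var_dump(mb_check_encoding($byte_cut, 'UTF-8')); // false: the string is now corrupt
var_dump($char_cut === $name);                   // true: all four characters kept
```

The byte-based cut is exactly the kind of corrupted string described above: it looks fine until it reaches the browser or the database.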
mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience (listed below).
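As an illustration of the conversion side, the following sketch turns an ISO-8859-1 string into UTF-8 with mb_convert_encoding() and checks the result with mb_check_encoding(). The sample word and variable names are only for illustration.

```php
<?php
// "café" as it would arrive in ISO-8859-1: é is the single byte 0xE9.
$latin1 = "caf\xE9";

// Convert from ISO-8859-1 to UTF-8; é becomes the two bytes 0xC3 0xA9.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump(mb_check_encoding($latin1, 'UTF-8')); // false: a lone 0xE9 is not valid UTF-8
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // true
var_dump(bin2hex($utf8));                      // "636166c3a9"
```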
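Once you’ve installed, configured and enabled this extension, and set mbstring.func_overload in your php.ini, several core PHP functions will be automatically overloaded by mbstring. With a value of 7, mail() is replaced by mb_send_mail(); the string functions strlen(), strpos(), strrpos(), substr(), strtolower(), strtoupper(), stripos(), strripos(), strstr(), stristr(), strrchr() and substr_count() are replaced by their mb_ equivalents; and the regular expression functions ereg(), eregi(), ereg_replace(), eregi_replace() and split() are replaced by mb_ereg(), mb_eregi(), mb_ereg_replace(), mb_eregi_replace() and mb_split(). Function overloading was deprecated in PHP 7.2 and removed in PHP 8.0, so this applies to older PHP versions only.
You can install mbstring in a few different ways. For simplicity, here’s how you would do it using a package manager (APT) on Ubuntu: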
sudo apt-get install php5-mbstring
Once the installation is complete, you’ll need to enable the extension and add some configuration to your php.ini file:
extension=mbstring.so

[mbstring]
mbstring.language = Neutral ; Set default language to Neutral (UTF-8) (default)
mbstring.internal_encoding = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On ; HTTP input encoding translation is enabled
mbstring.http_input = auto ; Set HTTP input character set detection to auto
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order = auto ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset = UTF-8 ; Default charset sent in the Content-Type header
mbstring.func_overload = 7 ; Overload mail(), string and regex functions
With that out of the way, restart your webserver, and you’re set up with mbstring.
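After the restart, a quick check along these lines can confirm that the extension is loaded and show what the settings resolve to. This is only a sketch; the choice of directives to print is arbitrary, and on PHP versions where mbstring.func_overload no longer exists, ini_get() simply reports false for that entry.

```php
<?php
// Report whether mbstring is loaded and what the relevant settings resolve to.
// ini_get() returns false for a directive this PHP version does not know.
$report = array('mbstring loaded' => extension_loaded('mbstring'));

$directives = array('default_charset', 'mbstring.language', 'mbstring.func_overload');
foreach ($directives as $directive) {
    $report[$directive] = ini_get($directive);
}

foreach ($report as $label => $value) {
    echo $label . ': ' . var_export($value, true) . PHP_EOL;
}
```

Run it through the web server rather than the command line if the two use different php.ini files.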
Database
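Because the database may be storing some of our unusual characters, it also needs an encoding that supports them. Change your database’s character set to a UTF-8 compatible one. Under MySQL, I tend to use the utf8 character set with the “utf8_general_ci” collation. Note that MySQL’s utf8 stores at most three bytes per character; characters that need four bytes require utf8mb4.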
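On the application side, CodeIgniter’s database configuration has matching settings. The fragment below is a sketch in the style of application/config/database.php for CodeIgniter 1.7/2.x; only the two charset-related keys are shown and the connection settings are left out.

```php
<?php
// Fragment of application/config/database.php (CodeIgniter 1.7/2.x style).
// Character set used for the connection, and the collation to go with it.
$db['default']['char_set'] = 'utf8';
$db['default']['dbcollat'] = 'utf8_general_ci';
```

These values should agree with whatever character set and collation the tables themselves use.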
CodeIgniter
We all know and love CodeIgniter, our favourite PHP framework. By default, though, there are some places where CodeIgniter doesn’t deal with UTF-8 encoded characters very well. Because CodeIgniter is almost completely flexible, we can write our own versions of some of the core methods and have them override the defaults, fixing the problems.
If you’re using the form_helper, you’ll need to modify the form_open method to tell the form to use UTF-8 encoding. This new file will also fix a potential issue with htmlspecialchars.
Create the file: application/helpers/MY_form_helper.php containing:
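As a rough idea of the kind of change involved, here is a simplified, hypothetical stand-in for an overridden form_open(). It is not the original helper: CodeIgniter’s real function also resolves the action through site_url() and supports hidden fields, and both are left out so the sketch runs on its own. It shows the two points mentioned above, an accept-charset attribute on the form and an explicit charset passed to htmlspecialchars().

```php
<?php
// Hypothetical, simplified stand-in for an overridden form_open().
// Helper functions in CodeIgniter are wrapped in function_exists() checks,
// so a definition loaded first from MY_form_helper.php takes precedence.
if (!function_exists('form_open')) {
    function form_open($action = '', $attributes = array())
    {
        // Default to POST and ask the browser to submit the form as UTF-8.
        // Caller-supplied attributes override these defaults.
        $attributes = array_merge(
            array('method' => 'post', 'accept-charset' => 'utf-8'),
            $attributes
        );

        // Always name the charset so escaping does not depend on PHP defaults.
        $form = '<form action="' . htmlspecialchars($action, ENT_QUOTES, 'UTF-8') . '"';
        foreach ($attributes as $name => $value) {
            $form .= ' ' . $name . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
        }

        return $form . '>';
    }
}

echo form_open('/users/save') . PHP_EOL;
```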
Now we move on to the slightly more tricky stuff. The XMLRPC library bundled with CodeIgniter has some code that isn’t safe with UTF-8, in particular its use of htmlentities and htmlspecialchars. This file is pretty big, so I won’t put it all here.
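To see why these two functions matter, the sketch below escapes the same UTF-8 string twice with htmlentities(): once naming UTF-8 as the charset, and once naming ISO-8859-1, which was PHP’s default before version 5.4. The sample word is an assumption chosen for illustration.

```php
<?php
// "café" in UTF-8: é is the two bytes 0xC3 0xA9.
$text = "caf\xC3\xA9";

// Charset stated explicitly: é is recognised as a single character.
$safe = htmlentities($text, ENT_QUOTES, 'UTF-8');

// Charset wrongly taken to be ISO-8859-1: each byte is treated as its own
// character, so é turns into two unrelated entities.
$mangled = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');

echo $safe . PHP_EOL;    // caf&eacute;
echo $mangled . PHP_EOL; // caf&Atilde;&copy;
```

Passing the charset explicitly on every call avoids depending on the PHP version or php.ini.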
MY_Xmlrpc.php
Place this file in application/libraries/MY_Xmlrpc.php
The email library has a similar problem, so here’s a version of the email library with the fix in place:
MY_Email.php
Place this file in application/libraries/MY_Email.php
Now, finally, before we go ahead and build our application, we need to set the header.php in our view files (or your version of this) to use UTF-8:
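A minimal sketch of what the top of such a header view could do is shown here. The variable name and markup are hypothetical; the point is that the charset is declared both in the HTTP Content-Type header and in a meta tag.

```php
<?php
// Hypothetical top of a header.php view.
// Send the charset in the HTTP header; this only works before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Repeat it in the markup for pages that are saved or opened without headers.
$charset_meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />';

echo "<html>\n<head>\n" . $charset_meta . "\n</head>\n";
```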
You should now be ready to build your UTF-8 web application in CodeIgniter, but I’m not going to leave you without an extra helper to make the journey a little easier. First of all, create a new config file:
application/config/international.php
Now create a new helper:
application/helpers/international.php
There are some helpful functions in here, such as utf8_to_uri(), which takes accented characters and converts them into standard alphabet characters:
$utf8 = "Iñtërnâtiônàlizætiøn";
echo utf8_to_uri($utf8); // Produces: (string) "Internationalization"
As you can see, Unicode isn’t exactly straightforward. But if you want to build a truly international web application, it certainly is important!
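To give a feel for how such a conversion can work, here is a small sketch built on strtr() with a replacement map. The function name sketch_utf8_to_uri() and the map are hypothetical and are not taken from the helper; the map covers only the characters in the example, with æ mapped to a so that the result matches the output shown. It assumes the file is saved as UTF-8.

```php
<?php
// Hypothetical sketch: fold a handful of accented Latin letters to ASCII.
// The map is deliberately tiny; a real helper would need far more entries.
function sketch_utf8_to_uri($text)
{
    $map = array(
        'ñ' => 'n', 'ë' => 'e', 'â' => 'a', 'ô' => 'o',
        'à' => 'a', 'æ' => 'a', 'ø' => 'o',
    );

    // strtr() with an array replaces whole keys, so each multi-byte
    // character is matched and replaced as a unit.
    return strtr($text, $map);
}

echo sketch_utf8_to_uri("Iñtërnâtiônàlizætiøn") . PHP_EOL;
```

Characters missing from the map pass through unchanged, so a production version would also need to decide what to do with anything it does not recognise.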