Saturday, May 1, 2010

PHP: Convert Word Characters to UTF-8 HTML Numbered Entities


Have you ever pasted text from a Word document straight into a Web page and ended up with question marks "?", boxes or other weird characters? What about when you store the same text in a database and you lose all those nice curly single quotes and curly double quotes? This post shows how to fix that by converting Microsoft Word's extended characters (ISO-8859-1, or strictly speaking its Windows-1252 superset, which is where the curly quotes live) to UTF-8 compliant HTML numbered entity codes (decimal).

The ISO 8859-1 Character Set

First you need to understand a few things about the ISO 8859-1 character set. There are 256 possible characters, numbered 0 to 255. Some of these have not been assigned a value, while others will display correctly on an ISO 8859-1 or UTF-8 Web page. You can produce a character in PHP using the chr() function, for example chr(255). The function wraps around for values greater than 255 (it works modulo 256), so chr(255) is the same as chr(511) and chr(767).
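The wrap-around is easy to check for yourself. This is a minimal sketch that relies only on chr() and ord(); the variable names are just for illustration.

```php
<?php
// chr() reduces its argument modulo 256, so these three calls
// all yield the same single byte (0xFF).
$a = chr(255);
$b = chr(511); // 511 % 256 === 255
$c = chr(767); // 767 % 256 === 255

var_dump($a === $b && $b === $c); // bool(true)
var_dump(ord($b));                // int(255)
```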

So to jump straight to the point, the characters that need to be converted to a numbered entity range from 128 to 255, inclusive. A few of these probably display properly in Internet Explorer but not in other browsers; we will include them anyway just to be thorough. Within this range, five characters need to be excluded: 129, 141, 143, 144 and 157. They either are not assigned a value or do not display correctly, as would be the case with a backspace.
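As a quick sanity check on that range, the sketch below lists the code points that remain once the five exclusions are removed. The variable name $convertible is made up for this example.

```php
<?php
// Code points the article excludes from conversion.
$exclude = array(129, 141, 143, 144, 157);

// Every value from 128 to 255 inclusive, minus the exclusions.
// array_values() re-indexes the result from 0.
$convertible = array_values(array_diff(range(128, 255), $exclude));

echo count($convertible), "\n"; // 123 (128 values minus 5 exclusions)
echo $convertible[0], "\n";     // 128
echo $convertible[1], "\n";     // 130 (129 was skipped)
```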

HTML Numbered Entities

Numbered entities are written on Web pages in the format &#255; where the numerical value varies. The ISO 8859-1 range described above matches up with the numbered entity equivalent, so chr(255) corresponds to &#255;. Keep in mind that &#255; is not the same as &#511; or &#767;, so the desired range is restricted to the ISO-8859-1 range set out in the paragraph above.
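To see that difference concretely, the sketch below decodes two numbered entities to UTF-8 with html_entity_decode() and prints the resulting bytes. It is only an illustration, not part of the conversion function, and the variable names are arbitrary.

```php
<?php
// Decode numeric entities to UTF-8 and inspect the resulting bytes.
$y255 = html_entity_decode('&#255;', ENT_QUOTES, 'UTF-8');
$y511 = html_entity_decode('&#511;', ENT_QUOTES, 'UTF-8');

echo bin2hex($y255), "\n"; // c3bf (U+00FF)
echo bin2hex($y511), "\n"; // c7bf (U+01FF)

// chr() wraps at 256, but numbered entities do not.
var_dump(chr(255) === chr(511)); // bool(true)
var_dump($y255 === $y511);       // bool(false)
```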

Converting the ISO-8859-1 Characters to HTML Numbered Entities

First, a list needs to be made of all the characters that need converting. This is best accomplished with a for loop and an array where the numbered entity is the key and the ASCII character is the value. Once that is done, you assign the find and replace variables according to the desired conversion direction: from ASCII characters to numbered entities, or vice versa. A simple switch statement does the trick. Finally, you convert all the applicable characters within the string and return the converted value.

PHP Code:
function ConvertCharacters($String, $ConvertTo='entity') {
    // Build character map list (numbered entity => single-byte character)
    $characterMap = array();
    $exclude = array(129, 141, 143, 144, 157);
    for($i=128; $i<=255; $i++)
        $characterMap['&#'.$i.';'] = chr($i);
    // Drop the unassigned or non-displayable characters
    foreach($exclude as $i)
        unset($characterMap['&#'.$i.';']);

    // Assign find and replace variables
    switch($ConvertTo) {
        case 'ascii': // To ascii characters
            $find = array_keys($characterMap);
            $replace = array_values($characterMap);
            break;
        case 'entity': // To numbered entities
        default: // Unknown values fall back to entities
            $find = array_values($characterMap);
            $replace = array_keys($characterMap);
            break;
    }

    // Convert characters within string and return results
    return str_replace($find, $replace, $String);
} // ConvertCharacters()

You can call the PHP function using one of three methods:
$String = 'string of text';
$ConvertedString = ConvertCharacters($String); // Convert to entity
$ConvertedString = ConvertCharacters($String, 'entity'); // Convert to entity
$ConvertedString = ConvertCharacters($String, 'ascii'); // Convert to ascii
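To show what a round trip looks like on real input, here is a self-contained sketch. It rebuilds the same entity-to-character map in a hypothetical helper, buildCharacterMap(), and uses strtr() instead of str_replace(), so it is an illustration of the idea rather than the function above. The sample string is an assumption: a Word-style snippet whose curly double quotes are the single bytes 0x93 and 0x94.

```php
<?php
// Hypothetical helper: the same map as in the article
// (numbered entity => single-byte character), built compactly.
function buildCharacterMap() {
    $map = array();
    $exclude = array(129, 141, 143, 144, 157);
    foreach (array_diff(range(128, 255), $exclude) as $i) {
        $map['&#' . $i . ';'] = chr($i);
    }
    return $map;
}

$map = buildCharacterMap();

// Bytes 0x93 and 0x94 are the curly double quotes in Word's
// single-byte (Windows-1252) encoding.
$original = "\x93Hello\x94";

// Characters to entities: flip the map so the raw byte is the search key.
$entities = strtr($original, array_flip($map));

// Entities back to characters.
$restored = strtr($entities, $map);

echo $entities, "\n";              // &#147;Hello&#148;
var_dump($restored === $original); // bool(true)
```

Note that this approach assumes the input is in a single-byte encoding. If the string is already UTF-8, each byte of a multi-byte character would be converted separately and the text would come out garbled.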