I came across an annoying bug in my application: for some reason my Euro symbol wasn’t being HTML entity decoded for use in a PDF, while all the other symbols were. At first I assumed this might be a bug in PHP’s html_entity_decode function, but of course a quick trip to the official PHP documentation proved me completely and utterly wrong.
The problem was in the character set all along!
Most of the time we simply call html_entity_decode with the string we want decoded and perhaps the flag that controls how quotes are handled. However, there is a third, often overlooked parameter that sets the character set PHP uses when decoding these HTML entities, and that’s where the trick lies!
If you leave this parameter out (or pass a character set PHP doesn’t recognise), PHP versions before 5.4 assume ISO-8859-1, known as Western European, Latin-1. PHP 5.4 and later default to UTF-8 instead. However, Latin-1 omits the Euro sign as well as a few French and Finnish letters, which are all added in ISO-8859-15, or Western European, Latin-9. An entity with no equivalent in the chosen character set is simply left undecoded, which is exactly what was happening to my Euro symbol.
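To check which entities a given character set leaves behind, a small sketch like the following can help. The helper name undecodedEntities and the sample string are my own inventions for illustration, and the type declarations assume PHP 7 or later:

```php
<?php
// Hypothetical helper (not part of PHP): decode $html with the given
// character set and return any named entities that were left untouched.
function undecodedEntities(string $html, string $charset): array
{
    $decoded = html_entity_decode($html, ENT_QUOTES, $charset);

    // Anything still shaped like "&name;" was not decoded.
    preg_match_all('/&[A-Za-z][A-Za-z0-9]*;/', $decoded, $matches);

    return $matches[0];
}

// Sample input mixing Latin-1 friendly and Latin-9 only characters.
$sample = 'Price: 10 &euro;, &OElig;uvre, &Scaron;koda, caf&eacute;';

print_r(undecodedEntities($sample, 'ISO-8859-1'));
print_r(undecodedEntities($sample, 'ISO-8859-15'));
```

With ISO-8859-1 the Euro, OE ligature and S-caron entities should come back as leftovers, while with ISO-8859-15 the list should be empty. The &eacute; entity decodes either way, since it exists in both character sets.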
So to successfully decode our string containing the Euro symbol, we simply need to run:
$decoded = html_entity_decode($eurostring, ENT_QUOTES, 'ISO-8859-15');
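Keep in mind that the bytes you get back depend on the character set you ask for, so it should match whatever encoding your PDF library and font expect (that part depends on your setup). As a rough sketch, here is the same Euro entity decoded into three of the supported character sets. The helper name decodedHex is made up for this example:

```php
<?php
// Hypothetical helper: decode $html using $charset and return the
// resulting bytes as a hex string, so the differences are visible.
function decodedHex(string $html, string $charset): string
{
    return bin2hex(html_entity_decode($html, ENT_QUOTES, $charset));
}

$eurostring = '&euro;';

foreach (['ISO-8859-15', 'cp1252', 'UTF-8'] as $charset) {
    // Print the charset name followed by the decoded bytes in hex.
    printf("%-12s %s\n", $charset, decodedHex($eurostring, $charset));
}
```

The Euro sign sits at byte A4 in ISO-8859-15 and at byte 80 in cp1252, and it is the three-byte sequence E2 82 AC in UTF-8, so feeding one encoding to a tool that expects another will show the wrong glyph even though the decode itself succeeded.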
And now you know.
For reference, these are the supported character sets, each listed as name | aliases (where there are any) | description:
- ISO-8859-1 | ISO8859-1 | Western European, Latin-1
- ISO-8859-15 | ISO8859-15 | Western European, Latin-9. Adds the Euro sign, French and Finnish letters missing in Latin-1 (ISO-8859-1).
- UTF-8 | ASCII compatible multi-byte 8-bit Unicode.
- cp866 | ibm866, 866 | DOS-specific Cyrillic charset. Supported since PHP 4.3.2.
- cp1251 | Windows-1251, win-1251, 1251 | Windows-specific Cyrillic charset. Supported since PHP 4.3.2.
- cp1252 | Windows-1252, 1252 | Windows-specific charset for Western European.
- KOI8-R | koi8-ru, koi8r | Russian. Supported since PHP 4.3.2.
- BIG5 | 950 | Traditional Chinese, mainly used in Taiwan.
- GB2312 | 936 | Simplified Chinese, national standard character set.
- BIG5-HKSCS | Big5 with Hong Kong extensions, Traditional Chinese.
- Shift_JIS | SJIS, 932 | Japanese
- EUC-JP | EUCJP | Japanese