“[Pulling] out an arbitrary substring which happens to cut a 2 byte UTF-8 sequence breaks the string;
<?php
header('Content-type: text/html; charset=utf-8');
$haystack = 'Iñtërnâtiônàlizætiøn';
// Position 13 is in the middle of the ô char
$substr = substr($haystack, 0, 13);
print "Substr: $substr<br>";
$substr now contains badly formed UTF-8 and your browser should display something weird as a result (probably a ?)”
“Handling UTF-8 with PHP”, phpwact.org
To work around this limitation, I used the following replacement substr code, which I extracted from “UTF-8 friendly replacement functions” v0.2, by Niels Leenheer & Andy Matsubara. For some reason, at the time of writing, Google only seems to find a PDF version of this document.
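// Named utf8_substr because PHP does not allow redeclaring the built-in substr().
function utf8_substr($str, $start, $length = NULL) {
    // Split into one array element per UTF-8 character. The final 4-byte
    // alternative is an addition to the extracted pattern, so that characters
    // outside the Basic Multilingual Plane are not silently dropped.
    preg_match_all('/[\x01-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3}/', $str, $arr);
    // $start and $length now count characters rather than bytes.
    if (is_int($length))
        return implode('', array_slice($arr[0], $start, $length));
    else
        return implode('', array_slice($arr[0], $start));
}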
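As a quick check, the sketch below runs a renamed variant of the replacement on the string from the quoted example and compares it with the byte-based built-in. The name `utf8_substr_demo` is hypothetical, chosen only to avoid clashing with PHP's own `substr`, and the sketch assumes the source file itself is saved as UTF-8. Validity is tested with `preg_match` and the `/u` modifier, which returns false when the subject is not well-formed UTF-8.

```php
<?php
// Hypothetical name: a renamed variant of the regex-based replacement.
function utf8_substr_demo($str, $start, $length = null) {
    // One array element per UTF-8 character (sequences of 1 to 3 bytes here).
    preg_match_all('/[\x01-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF][\x80-\xBF]/', $str, $arr);
    if (is_int($length)) {
        return implode('', array_slice($arr[0], $start, $length));
    }
    return implode('', array_slice($arr[0], $start));
}

$haystack = 'Iñtërnâtiônàlizætiøn';

// Byte-based: the 13th byte is the first half of the two-byte "ô".
$bytes = substr($haystack, 0, 13);
// Character-based: counts whole characters, so no sequence is split.
$chars = utf8_substr_demo($haystack, 0, 13);

// preg_match() with /u returns false on malformed UTF-8 input.
$bytesValid = preg_match('//u', $bytes) !== false;
$charsValid = preg_match('//u', $chars) !== false;

print "substr() result as hex: " . bin2hex($bytes) . "\n";
print "substr() result valid UTF-8: " . var_export($bytesValid, true) . "\n";
print "replacement result: $chars\n";
print "replacement result valid UTF-8: " . var_export($charsValid, true) . "\n";
```

Where the mbstring extension is available, `mb_substr($haystack, 0, 13, 'UTF-8')` is another way to get character-based slicing without a hand-written pattern.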