UTF-8 substr for PHP

“[Pulling] out an arbitrary substring which happens to cut a 2 byte UTF-8 sequence breaks the string;

<?php
header ('Content-type: text/html; charset=utf-8');
 
$haystack = 'Iñtërnâtiônàlizætiøn';
 
// Position 13 is in the middle of the ô char
$substr = substr($haystack, 0, 13);
 
print "Substr: $substr<br>";

$substr now contains badly formed UTF-8 and your browser should display something wierd as a result (probably a ?)”

“Handling UTF-8 with PHP”
phpwact.org


To work around this limitation, I used the following replacement substr code, which I extracted from “UTF-8 friendly replacement functions” v0.2, by Niels Leenheer & Andy Matsubara. At the time of writing, Google only seems to find a PDF version of this document.

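// PHP cannot redeclare the built-in substr(), so the replacement needs its own name.
function utf8_substr($str, $start, $length = NULL) {

    // Split the string into an array of whole UTF-8 characters (1 to 4 bytes each).
    preg_match_all('/[\x01-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3}/', $str, $arr);

    // Slice by character index instead of byte offset, then glue the pieces back together.
    if (is_int($length))
        return implode('', array_slice($arr[0], $start, $length));
    else
        return implode('', array_slice($arr[0], $start));
}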
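
To see the difference on the example string from the quote, here is a small self-contained sketch. It is not the code from the Leenheer and Matsubara document: the helper name utf8_chars_slice is made up for this illustration, and it splits the string with preg_split() and the /u modifier instead of the byte-range pattern above. It assumes PHP 7.1 or later (for the nullable type hint) and a source file saved as UTF-8.

```php
<?php
// Hypothetical helper, named for this illustration only.
function utf8_chars_slice(string $str, int $start, ?int $length = null): string {
    // With the /u modifier PCRE treats the subject as UTF-8, so an empty
    // pattern splits between characters rather than between bytes.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // the input was not valid UTF-8
    }
    // A null length means "up to the end of the array".
    return implode('', array_slice($chars, $start, $length));
}

$haystack = 'Iñtërnâtiônàlizætiøn';

// Byte-based: 13 bytes end on the first byte of the two-byte ô.
$byByte = substr($haystack, 0, 13);
// Character-based: 13 whole characters.
$byChar = utf8_chars_slice($haystack, 0, 13);

// preg_match() with /u returns false when the subject is malformed UTF-8.
echo 'substr() result is valid UTF-8: ', var_export(preg_match('//u', $byByte) === 1, true), "\n";
echo 'Character slice: ', $byChar, "\n"; // Iñtërnâtiônàl
```

The byte-based cut fails the validity check, while the character-based slice returns the first 13 characters intact. Negative offsets behave as they do in array_slice(), so utf8_chars_slice($haystack, -2) gives the last two characters.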