encoding - Word-wrapping multibyte characters in PHP for console output


I'm trying to align pieces of text to a fixed number of columns. The text is meant for logging purposes and may contain user data, so I cannot assume anything about the input. For ease of viewing, I want to make sure that, when viewed in a standard Linux console, the text is a fixed number of characters wide.

PHP does not have a multibyte equivalent of the wordwrap function, although several frameworks have their own version of it. I've been trying out several multibyte functions, amongst them the answers in this question, in order to encode UTF-8 text for Unicode display. The answers there seem to use mb_strlen to calculate substring lengths.

To ensure that the file I wanted to display is recognized as UTF-8, I prepended a byte order mark to the text. As far as I know, this 'should' cause Linux to recognize and format it correctly.

Then I tried to encode long strings. Strings such as 'ëëëëëëëë...' are correctly cut off after the specified n character positions. However, a string such as '₩₩₩₩₩₩₩₩₩...' is not. It contains characters that are double width in the Unix console, and it is cut off later than it is supposed to be. In fact, it seems that twice as many characters as expected are displayed in this case, seemingly because some or all of the double-width characters are read as width 1. The internal encoding is correctly set to UTF-8, using:

mb_internal_encoding("utf-8"); 

Note that when I set mb_internal_encoding to some other value I get other weird results: strings are cut off 50% too fast, and a string of 70 characters gets cut off at position 26. The cut-offs also happen in the middle of characters, resulting in mojibake.
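The mismatch described above comes down to three different ways of measuring a string: bytes, code points, and terminal columns. The snippet below is only a small illustration of that difference; describe_lengths is a hypothetical helper name, and mb_strwidth (from mbstring) is used as a rough column count. As far as I can tell mb_strwidth does not treat combining marks as zero-width, so it is an approximation rather than a full answer.

```php
<?php
// Hypothetical helper: report a string's size in bytes, code points and
// (approximate) terminal columns. Requires the mbstring extension.
function describe_lengths(string $s): array
{
    return [
        'bytes'       => strlen($s),               // raw byte count
        'code_points' => mb_strlen($s, 'UTF-8'),   // what mb_* based wrappers count
        'columns'     => mb_strwidth($s, 'UTF-8'), // East Asian wide characters count as 2
    ];
}

$narrow = str_repeat("\u{00EB}", 3); // "ëëë", one column per character
$wide   = str_repeat("\u{65E5}", 3); // three CJK ideographs, two columns each

print_r(describe_lengths($narrow));
print_r(describe_lengths($wide));
```

Both strings have 3 code points, but the second one needs 6 columns, which is why a wrapper built on mb_strlen lets wide text run to roughly twice the intended width.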

My code for line breaks in the class I work with looks like this, with mb_wordwrap redirecting to one of the example functions:

private function breakintolines($text) {
    $dtxt = $this->mb_wordwrap($text, self::line_length, PHP_EOL, true);
    return explode(PHP_EOL, $dtxt);
}

File writing happens using the classic C-style file writing functions; the following snippet writes one of the lines:

fwrite($this->file, $line);
fwrite($this->file, PHP_EOL);

I verified the output using a gamut of UTF characters and they all seem to display correctly. I seem to be lost on this problem; what is going on here?

Note: trying to write such an algorithm ourselves seems like an inordinate amount of work, because there also appear to be combining characters. For example, the sequence \u0068\u0300 should occupy only 1 character in the terminal. Experimenting, this code:

$str = json_decode("\"\\u0068\\u0300\"");
var_dump($str);
echo mb_strlen($str);

prints out: string(3) "h̀" 2

Use grapheme functions

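There is a nice and clean solution for part of the question. Use the mb_wordwrap function from the linked question, but replace the mb_* functions with the grapheme_* functions. Grapheme is part of the 'intl' extension for PHP and counts user-perceived characters (grapheme clusters), so a base letter followed by combining marks counts as one character. See this page for more information.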

Thus, use grapheme_strlen instead of mb_strlen, grapheme_substr instead of mb_substr, and so on. Inside the word-wrapping functions you can then think of a string as a collection of graphemes, each of which is internally a C-style PHP string of n bytes. Because the grapheme_* functions operate the same way as the mb_* functions, the code retains the same structure.

For example:

$str = json_decode("\"\\u0068\\u0300\"");
var_dump($str);
echo grapheme_strlen($str);

This will output: string(3) "h̀" 1.
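To make the "collection of graphemes, each a short byte string" view concrete, here is a small sketch that splits a string into grapheme clusters and prints the size of each one. split_graphemes is a hypothetical helper, not something intl provides under that name; it only relies on grapheme_strlen and grapheme_substr, and it calls grapheme_substr once per cluster, which is fine for short log lines but not tuned for long input.

```php
<?php
// Hypothetical helper: split a UTF-8 string into grapheme clusters.
// Requires the intl extension (grapheme_*) and mbstring (mb_strlen).
function split_graphemes(string $s): array
{
    $parts = [];
    $count = grapheme_strlen($s); // number of user-perceived characters
    for ($i = 0; $i < $count; $i++) {
        $parts[] = grapheme_substr($s, $i, 1); // one cluster, possibly several bytes
    }
    return $parts;
}

$parts = split_graphemes("h\u{0300}i"); // "h" + combining grave accent, then "i"
foreach ($parts as $g) {
    printf("%s: %d byte(s), %d code point(s)\n", $g, strlen($g), mb_strlen($g, 'UTF-8'));
}
```

The first cluster is 3 bytes and 2 code points but still a single character on screen, so a wrapper has to cut between clusters and never inside one.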

This still does not correctly compute the width of wider characters in one go. In order to fix those problems, you need an actual width implementation. There is a C implementation that uses a list of ranges for double-width characters and a longer list of zero-width characters that is searched with a binary search; it is reimplemented in PHP here.

The code below should in theory word-wrap normal text correctly while still supporting strange UTF-8 characters. It does not support languages with other wrapping rules (wrapping at non-space characters) and does not support strange whitespace (whitespace outside the ASCII range). However, the code should keep every line within $width columns (with $cut enabled, since otherwise an overlong word is left intact).

/**
 * Word-wrap a multi-byte (UTF-8) string by display width.
 *
 * @param string $string the initial string
 * @param int    $width  maximum width of a line, in terminal columns
 * @param string $break  character(s) to use as the line break
 * @param bool   $cut    whether to force-chop long words
 * @return string the wrapped string
 */
private function mb_wordwrap($string, $width = 75, $break = "\n", $cut = false) {
    $string = (string) $string;
    if ($string === '') {
        return '';
    }
    $break = (string) $break;
    if ($break === '') {
        trigger_error('break string cannot be empty', E_USER_ERROR);
    }
    $width = (int) $width;
    if ($width === 0 && $cut) {
        trigger_error('cannot force cut when width is zero', E_USER_ERROR);
    }
    if (mb_check_encoding($string, 'ascii')) {
        return wordwrap($string, $width, $break, $cut);
    }
    $result = '';

    // "width" variables are measured in display columns,
    // "length" variables in bytes. note: string length != string width!
    $breakwidth = $this->truestringwidth($break);
    $breaklength = strlen($break);
    $breakgraphemes = grapheme_strlen($break);

    $laststartlength = $laststartwidth = 0; // where the current line starts
    $lastspacelength = $lastspacewidth = 0; // last whitespace seen on this line
    $spacelength = $spacewidth = 0;         // size of that whitespace character

    $g_sz = grapheme_strlen($string);
    if (!is_int($g_sz)) {
        // grapheme_strlen() can fail (e.g. on invalid UTF-8); leave the text untouched
        return $string;
    }
    $wpos = 0; // current position in columns
    $lpos = 0; // current position in bytes

    // iterate over graphemes, measure using the true width,
    // cut using byte offsets (for speed). the loop header advances
    // $wpos and $lpos, so "continue" never skips the bookkeeping.
    for ($i = 0; $i < $g_sz; ++$i, $wpos += $charwidth, $lpos += $charlength) {
        $char = grapheme_substr($string, $i, 1);
        $charlength = strlen($char);
        $charwidth = $this->truestringwidth($char);
        $lookahead_wpos = $wpos + $charwidth;

        // if we have a line break, preserve it and start anew.
        // assumes $break covers whole graphemes in $string ("\n", "\r\n").
        if (substr($string, $lpos, $breaklength) === $break) {
            $result .= substr($string, $laststartlength, $lpos - $laststartlength + $breaklength);
            $i += $breakgraphemes - 1;
            $charlength = $breaklength;
            $charwidth = $breakwidth;
            $laststartlength = $lastspacelength = $lpos + $breaklength;
            $laststartwidth = $lastspacewidth = $wpos + $breakwidth;
            continue;
        }

        // if we match a 'whitespace' character,
        if (preg_match("/\\h/u", $char)) {
            // exclude the space itself, do not use the lookahead
            if ($wpos - $laststartwidth >= $width) {
                // the line is already full: break here and drop the space
                $result .= substr($string, $laststartlength, $lpos - $laststartlength) . $break;
                $laststartlength = $lastspacelength = $lpos + $charlength;
                $laststartwidth = $lastspacewidth = $wpos + $charwidth;
                continue;
            }
            $lastspacelength = $lpos;
            $lastspacewidth = $wpos;
            $spacelength = $charlength;
            $spacewidth = $charwidth;
            continue;
        }

        // this grapheme would overflow the line
        if ($lookahead_wpos - $laststartwidth > $width) {
            if ($lastspacelength > $laststartlength) {
                // break at the last whitespace on this line and drop it
                $result .= substr($string, $laststartlength, $lastspacelength - $laststartlength) . $break;
                $laststartlength = $lastspacelength = $lastspacelength + $spacelength;
                $laststartwidth = $lastspacewidth = $lastspacewidth + $spacewidth;
            } elseif ($cut && $lpos > $laststartlength) {
                // no whitespace to break at: force a cut before this grapheme
                $result .= substr($string, $laststartlength, $lpos - $laststartlength) . $break;
                $laststartlength = $lastspacelength = $lpos;
                $laststartwidth = $lastspacewidth = $wpos;
            }
        }
    }
    if ($laststartlength !== $lpos) {
        $result .= substr($string, $laststartlength, $lpos - $laststartlength);
    }
    return $result;
}

// sum of the display widths of all code points in $str
private function truestringwidth($str) {
    $w = 0;
    $n = mb_strlen($str, 'UTF-8');
    for ($i = 0; $i < $n; ++$i) {
        $char = mb_substr($str, $i, 1, 'UTF-8');
        $w += $this->truecharwidth($char);
    }
    return $w;
}

// display width (0, 1 or 2 columns) of a single code point
private function truecharwidth($char) {
    $ucs = $this->uniord($char);
    // for non-unicode characters, return 1.
    // consoles replace them with 'replacement characters', which have width 1!
    if ($ucs === false) {
        return 1;
    }

    // sorted list of non-spacing (zero-width) character ranges
    $combi = [
        [ 0x0300, 0x036f ], [ 0x0483, 0x0486 ], [ 0x0488, 0x0489 ],
        [ 0x0591, 0x05bd ], [ 0x05bf, 0x05bf ], [ 0x05c1, 0x05c2 ],
        [ 0x05c4, 0x05c5 ], [ 0x05c7, 0x05c7 ], [ 0x0600, 0x0603 ],
        [ 0x0610, 0x0615 ], [ 0x064b, 0x065e ], [ 0x0670, 0x0670 ],
        [ 0x06d6, 0x06e4 ], [ 0x06e7, 0x06e8 ], [ 0x06ea, 0x06ed ],
        [ 0x070f, 0x070f ], [ 0x0711, 0x0711 ], [ 0x0730, 0x074a ],
        [ 0x07a6, 0x07b0 ], [ 0x07eb, 0x07f3 ], [ 0x0901, 0x0902 ],
        [ 0x093c, 0x093c ], [ 0x0941, 0x0948 ], [ 0x094d, 0x094d ],
        [ 0x0951, 0x0954 ], [ 0x0962, 0x0963 ], [ 0x0981, 0x0981 ],
        [ 0x09bc, 0x09bc ], [ 0x09c1, 0x09c4 ], [ 0x09cd, 0x09cd ],
        [ 0x09e2, 0x09e3 ], [ 0x0a01, 0x0a02 ], [ 0x0a3c, 0x0a3c ],
        [ 0x0a41, 0x0a42 ], [ 0x0a47, 0x0a48 ], [ 0x0a4b, 0x0a4d ],
        [ 0x0a70, 0x0a71 ], [ 0x0a81, 0x0a82 ], [ 0x0abc, 0x0abc ],
        [ 0x0ac1, 0x0ac5 ], [ 0x0ac7, 0x0ac8 ], [ 0x0acd, 0x0acd ],
        [ 0x0ae2, 0x0ae3 ], [ 0x0b01, 0x0b01 ], [ 0x0b3c, 0x0b3c ],
        [ 0x0b3f, 0x0b3f ], [ 0x0b41, 0x0b43 ], [ 0x0b4d, 0x0b4d ],
        [ 0x0b56, 0x0b56 ], [ 0x0b82, 0x0b82 ], [ 0x0bc0, 0x0bc0 ],
        [ 0x0bcd, 0x0bcd ], [ 0x0c3e, 0x0c40 ], [ 0x0c46, 0x0c48 ],
        [ 0x0c4a, 0x0c4d ], [ 0x0c55, 0x0c56 ], [ 0x0cbc, 0x0cbc ],
        [ 0x0cbf, 0x0cbf ], [ 0x0cc6, 0x0cc6 ], [ 0x0ccc, 0x0ccd ],
        [ 0x0ce2, 0x0ce3 ], [ 0x0d41, 0x0d43 ], [ 0x0d4d, 0x0d4d ],
        [ 0x0dca, 0x0dca ], [ 0x0dd2, 0x0dd4 ], [ 0x0dd6, 0x0dd6 ],
        [ 0x0e31, 0x0e31 ], [ 0x0e34, 0x0e3a ], [ 0x0e47, 0x0e4e ],
        [ 0x0eb1, 0x0eb1 ], [ 0x0eb4, 0x0eb9 ], [ 0x0ebb, 0x0ebc ],
        [ 0x0ec8, 0x0ecd ], [ 0x0f18, 0x0f19 ], [ 0x0f35, 0x0f35 ],
        [ 0x0f37, 0x0f37 ], [ 0x0f39, 0x0f39 ], [ 0x0f71, 0x0f7e ],
        [ 0x0f80, 0x0f84 ], [ 0x0f86, 0x0f87 ], [ 0x0f90, 0x0f97 ],
        [ 0x0f99, 0x0fbc ], [ 0x0fc6, 0x0fc6 ], [ 0x102d, 0x1030 ],
        [ 0x1032, 0x1032 ], [ 0x1036, 0x1037 ], [ 0x1039, 0x1039 ],
        [ 0x1058, 0x1059 ], [ 0x1160, 0x11ff ], [ 0x135f, 0x135f ],
        [ 0x1712, 0x1714 ], [ 0x1732, 0x1734 ], [ 0x1752, 0x1753 ],
        [ 0x1772, 0x1773 ], [ 0x17b4, 0x17b5 ], [ 0x17b7, 0x17bd ],
        [ 0x17c6, 0x17c6 ], [ 0x17c9, 0x17d3 ], [ 0x17dd, 0x17dd ],
        [ 0x180b, 0x180d ], [ 0x18a9, 0x18a9 ], [ 0x1920, 0x1922 ],
        [ 0x1927, 0x1928 ], [ 0x1932, 0x1932 ], [ 0x1939, 0x193b ],
        [ 0x1a17, 0x1a18 ], [ 0x1b00, 0x1b03 ], [ 0x1b34, 0x1b34 ],
        [ 0x1b36, 0x1b3a ], [ 0x1b3c, 0x1b3c ], [ 0x1b42, 0x1b42 ],
        [ 0x1b6b, 0x1b73 ], [ 0x1dc0, 0x1dca ], [ 0x1dfe, 0x1dff ],
        [ 0x200b, 0x200f ], [ 0x202a, 0x202e ], [ 0x2060, 0x2063 ],
        [ 0x206a, 0x206f ], [ 0x20d0, 0x20ef ], [ 0x302a, 0x302f ],
        [ 0x3099, 0x309a ], [ 0xa806, 0xa806 ], [ 0xa80b, 0xa80b ],
        [ 0xa825, 0xa826 ], [ 0xfb1e, 0xfb1e ], [ 0xfe00, 0xfe0f ],
        [ 0xfe20, 0xfe23 ], [ 0xfeff, 0xfeff ], [ 0xfff9, 0xfffb ],
        [ 0x10a01, 0x10a03 ], [ 0x10a05, 0x10a06 ], [ 0x10a0c, 0x10a0f ],
        [ 0x10a38, 0x10a3a ], [ 0x10a3f, 0x10a3f ], [ 0x1d167, 0x1d169 ],
        [ 0x1d173, 0x1d182 ], [ 0x1d185, 0x1d18b ], [ 0x1d1aa, 0x1d1ad ],
        [ 0x1d242, 0x1d244 ], [ 0xe0001, 0xe0001 ], [ 0xe0020, 0xe007f ],
        [ 0xe0100, 0xe01ef ]
    ];

    /* test for 8-bit control characters */
    if ($ucs === 0) {
        return 0;
    }
    if ($ucs < 32 || ($ucs >= 0x7f && $ucs < 0xa0)) {
        return 0;
    }

    /* binary search in table of non-spacing characters */
    if ($this->binaryintervalsearch($combi, $ucs)) {
        return 0;
    }

    /* if we arrive here, ucs is not a combining or c0/c1 control character */
    return 1 + (int) (
        $ucs >= 0x1100 &&
        ($ucs <= 0x115f ||                      /* hangul jamo init. consonants */
        $ucs == 0x2329 || $ucs == 0x232a ||
        ($ucs >= 0x2e80 && $ucs <= 0xa4cf &&
            $ucs != 0x303f) ||                  /* cjk ... yi */
        ($ucs >= 0xac00 && $ucs <= 0xd7a3) ||   /* hangul syllables */
        ($ucs >= 0xf900 && $ucs <= 0xfaff) ||   /* cjk compatibility ideographs */
        ($ucs >= 0xfe10 && $ucs <= 0xfe19) ||   /* vertical forms */
        ($ucs >= 0xfe30 && $ucs <= 0xfe6f) ||   /* cjk compatibility forms */
        ($ucs >= 0xff00 && $ucs <= 0xff60) ||   /* fullwidth forms */
        ($ucs >= 0xffe0 && $ucs <= 0xffe6) ||
        ($ucs >= 0x20000 && $ucs <= 0x2fffd) ||
        ($ucs >= 0x30000 && $ucs <= 0x3fffd))
    );
}

// decode the first UTF-8 sequence in $c to its code point (false on error)
private function uniord($c) {
    $h = ord($c[0]);
    if ($h <= 127) {
        return $h;
    }
    if ($h >= 192 && $h <= 223) {
        return ($h - 192) * 64 + (ord($c[1]) - 128);
    }
    if ($h >= 224 && $h <= 239) {
        return ($h - 224) * 4096 + (ord($c[1]) - 128) * 64 + (ord($c[2]) - 128);
    }
    if ($h >= 240 && $h <= 247) {
        return ($h - 240) * 262144 + (ord($c[1]) - 128) * 4096 + (ord($c[2]) - 128) * 64
            + (ord($c[3]) - 128);
    }
    if ($h >= 248 && $h <= 251) {
        return ($h - 248) * 16777216 + (ord($c[1]) - 128) * 262144 + (ord($c[2]) - 128) * 4096
            + (ord($c[3]) - 128) * 64 + (ord($c[4]) - 128);
    }
    if ($h >= 252 && $h <= 253) {
        return ($h - 252) * 1073741824 + (ord($c[1]) - 128) * 16777216 + (ord($c[2]) - 128) * 262144
            + (ord($c[3]) - 128) * 4096 + (ord($c[4]) - 128) * 64 + (ord($c[5]) - 128);
    }
    if ($h >= 254) { // error
        return false;
    }
    return 0;
}

// it is assumed that the interval array is sorted!
// it is assumed that we have a simple array (indexed 0, 1, 2, ...).
private function binaryintervalsearch($array, $element) {
    if (count($array) === 1) {
        return $array[0][0] <= $element && $element <= $array[0][1];
    } elseif (count($array) === 0) {
        return false;
    }

    // split the array into 2 halves around the central element.
    $tc = count($array) >> 1;
    // rightmost element of the left half
    if ($array[$tc - 1][1] >= $element) {
        return $this->binaryintervalsearch(array_slice($array, 0, $tc), $element);
    } elseif ($array[$tc][0] <= $element) {
        return $this->binaryintervalsearch(array_slice($array, $tc), $element);
    }
    return false;
}
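The interval lookup can also be written as a loop over index bounds, which avoids copying half of the table with array_slice on every step. The following is only a sketch of that variant and not part of the original answer; in_ranges and $sample are hypothetical names, and it assumes the same sorted, non-overlapping [start, end] pairs as the zero-width table.

```php
<?php
// Sketch: iterative binary search over sorted, non-overlapping [start, end] ranges.
function in_ranges(array $ranges, int $cp): bool
{
    $lo = 0;
    $hi = count($ranges) - 1;
    while ($lo <= $hi) {
        $mid = ($lo + $hi) >> 1;
        if ($cp < $ranges[$mid][0]) {
            $hi = $mid - 1;      // code point lies before this range
        } elseif ($cp > $ranges[$mid][1]) {
            $lo = $mid + 1;      // code point lies after this range
        } else {
            return true;         // start <= code point <= end
        }
    }
    return false;
}

// A few entries from the zero-width table, for illustration only.
$sample = [[0x0300, 0x036f], [0x0483, 0x0486], [0x200b, 0x200f]];
var_dump(in_ranges($sample, 0x0300)); // combining grave accent
var_dump(in_ranges($sample, 0x0068)); // plain "h"
```

The first call reports true and the second false. Whether the saved copies matter depends on how much text is wrapped; for occasional log lines the recursive version is probably fast enough.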
