c - Workaround for glibc's printf truncation bug in multi-byte locales?


Certain GNU-based OS distros (Debian) are still impacted by a bug in GNU libc that causes the printf family of functions to return a bogus -1 when the specified level of precision would truncate a multi-byte character. The bug was fixed in 2.17 and backported to 2.16. Debian has an archived bug for this, but the maintainers appear to have no intention of backporting the fix to the 2.13 used by Wheezy.

The text below is quoted from https://sourceware.org/bugzilla/show_bug.cgi?id=6530. (Please do not edit the block quoting inline again.)

Here's a simpler testcase for this bug, courtesy of Jonathan Nieder:

    #include <stdio.h>
    #include <locale.h>

    int main(void)
    {
        int n;

        setlocale(LC_CTYPE, "");
        n = printf("%.11s\n", "author: \277");
        perror("printf");
        fprintf(stderr, "return value: %d\n", n);
        return 0;
    }

Under the C locale, that'll do the right thing:

$ lang=c ./test author: &#65533; printf: success return value: 10 

But not under a UTF-8 locale, since \277 isn't a valid UTF-8 sequence:

    $ LANG=en_US.utf8 ./test
    printf: Invalid or incomplete multibyte or wide character

It's worth noting that printf will also overwrite the first character of the output array with \0 in this context.

I am trying to retrofit a MUD codebase to support UTF-8, and unfortunately the code is riddled with cases where an arbitrary sprintf precision is used to limit how much text is sent to output buffers. The problem is made worse by the fact that most programmers don't expect a -1 return in this context, which can result in uninitialized memory reads and badness that cascades down from that. (I've already caught a few cases in valgrind.)

Has anyone come up with a concise workaround for this bug in their code that doesn't involve rewriting every single invocation of a formatting string with an arbitrary length precision? I'm fine with truncated UTF-8 characters being written to my output buffer, as it's trivial to clean that up in my output processing prior to the socket write, and it seems like overkill to invest this much effort in a problem that will go away given a few more years.

I'm guessing, and it seems to be confirmed by the comments on the question, that you don't use much of the C library's locale-specific functionality. In that case you'd probably be better off not changing the locale to a UTF-8 based one, and leaving it in the single-byte locale your code assumes.
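As a rough illustration of what staying in the single-byte locale buys you, here is a minimal sketch. The names `init_locale` and `copy_limited` are made up for this example and are not from your codebase; the only assumption is that precision on `%s` counts bytes in the "C" locale, which is what the quoted testcase shows.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical startup hook: pin LC_CTYPE to the "C" locale instead of
 * calling setlocale(LC_CTYPE, "") and inheriting a UTF-8 locale from the
 * environment. Returns 1 on success, 0 on failure. */
int init_locale(void)
{
    return setlocale(LC_CTYPE, "C") != NULL;
}

/* Hypothetical helper: copy at most max_bytes bytes of text into out.
 * Returns what snprintf returns: the number of bytes that would have been
 * written (excluding the terminating '\0'), or a negative value on error. */
int copy_limited(char *out, size_t out_size, const char *text, int max_bytes)
{
    return snprintf(out, out_size, "%.*s", max_bytes, text);
}
```

With the locale pinned like this, `copy_limited(buf, sizeof buf, "author: \277", 11)` should behave like the `LANG=C` run in the question and return a non-negative byte count rather than -1.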

When you do need to process UTF-8 strings as UTF-8 strings, you can use specialized code. It's not too hard to write your own UTF-8 processing routines. You can even download the Unicode Character Database and do some sophisticated character classification. If you'd prefer to use a third-party library to handle UTF-8 strings, there's ICU, as mentioned in the comments. It's a pretty heavyweight library though; a previous question recommends a few lighter-weight alternatives.
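To give an idea of how small such a routine can be, here is a sketch of the cleanup step you describe doing before the socket write: dropping a multi-byte character that got cut off at the end of a buffer. The function name `utf8_trim_incomplete_tail` is hypothetical, and the sketch assumes the only damage you care about is an incomplete sequence at the very end; it does not validate the rest of the string and leaves stray continuation bytes alone.

```c
#include <stddef.h>

/* Hypothetical helper: given len bytes at s, return the length to keep
 * after dropping an incomplete UTF-8 sequence at the very end, if any.
 * Only the tail is inspected; the rest of the string is not validated. */
size_t utf8_trim_incomplete_tail(const char *s, size_t len)
{
    size_t i = len;
    size_t cont = 0;   /* continuation bytes (10xxxxxx) seen at the tail */
    size_t need;       /* continuation bytes the lead byte asks for */
    unsigned char c;

    /* Walk back over at most 3 trailing continuation bytes. */
    while (i > 0 && cont < 3 && ((unsigned char)s[i - 1] & 0xC0) == 0x80) {
        i--;
        cont++;
    }
    if (i == 0)
        return len;    /* nothing but continuation bytes; leave as-is */

    c = (unsigned char)s[i - 1];
    if (c < 0x80)
        need = 0;                      /* ASCII byte */
    else if ((c & 0xE0) == 0xC0)
        need = 1;                      /* lead byte of a 2-byte sequence */
    else if ((c & 0xF0) == 0xE0)
        need = 2;                      /* lead byte of a 3-byte sequence */
    else if ((c & 0xF8) == 0xF0)
        need = 3;                      /* lead byte of a 4-byte sequence */
    else
        return len;    /* not a lead byte; not this function's problem */

    /* Lead byte present but too few continuation bytes: cut it off. */
    if (cont < need)
        return i - 1;
    return len;
}
```

You would call it on the buffer after formatting and write a terminating '\0' at the returned offset, for example `buf[utf8_trim_incomplete_tail(buf, strlen(buf))] = '\0';`.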

It might also be possible to switch the C locale back and forth as necessary so you can use the C library's functionality. You'll want to check the performance impact of this however, as switching locales can be an expensive operation.
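If you go the locale-switching route, one way to contain it is a wrapper that formats under the "C" locale and then puts the previous locale back. The sketch below is only an illustration: `snprintf_c_locale` is a made-up name, it switches only LC_CTYPE, and because `setlocale` changes process-wide state it is not safe to use while other threads are formatting text.

```c
#include <locale.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: run vsnprintf with LC_CTYPE temporarily set to
 * "C", then restore the locale that was active before the call.
 * Returns the vsnprintf result, or -1 if the locale could not be saved.
 * Not thread-safe: setlocale affects the whole process. */
int snprintf_c_locale(char *out, size_t out_size, const char *fmt, ...)
{
    va_list ap;
    int n;
    char *saved;
    const char *current = setlocale(LC_CTYPE, NULL);  /* query only */

    if (current == NULL)
        return -1;

    /* The string returned by setlocale may be overwritten by a later
     * setlocale call, so keep a private copy before switching. */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    setlocale(LC_CTYPE, "C");

    va_start(ap, fmt);
    n = vsnprintf(out, out_size, fmt, ap);
    va_end(ap);

    setlocale(LC_CTYPE, saved);   /* restore the previous locale */
    free(saved);
    return n;
}
```

Each call pays for two `setlocale` calls plus an allocation, which is the cost to measure before replacing hot-path `snprintf` calls with something like this.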

