tech-userlevel archive


Re: wide characters and i18n



On Jul 16, 2010, at 11:50 AM, Sad Clouds wrote:

> Sometimes I do it from left to right, but occasionally I may need to do
> it from right to left. For example if you have a filename:
> 
> some_long_file_name.txt
> 
> To quickly extract the suffix '.txt' you just scan the string from
> right to left, until you hit '.' char. I think with utf-8 this type of
> string manipulation would be quite messy and you would have to use a
> special library that understands utf-8 encodings, etc.

Nope, because:

- ASCII characters are expressed in utf-8 identically ('.' is '.')
- No multibyte utf-8 character contains a byte that is also an ASCII
character: every byte of a multibyte sequence has the high bit set. So you
can't accidentally mistake part of some other character for '.'

Thus, any kind of processing that searches for ASCII characters ('.', '/',
newline, spaces, etc.) can safely be ignorant of utf-8.

Of course some things need to care, like counting characters (vs bytes) or
truncation (to make sure the cut doesn't land in the middle of a multibyte
character), but there are many cases where it just doesn't matter. Good old
strrchr() would do just fine for your example here.
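
To make both halves of that concrete, here is a minimal sketch. The names
file_suffix() and utf8_char_count() are made up for illustration, not from
any library, and the input is assumed to be valid utf-8. "Characters" here
means code points: the count skips continuation bytes (10xxxxxx).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: pointer to the last '.' and what follows it,
 * or NULL if the name has no '.'. Plain byte search; no utf-8
 * awareness is needed because '.' never occurs inside a multibyte
 * sequence. */
const char *file_suffix(const char *name)
{
    return strrchr(name, '.');
}

/* Hypothetical helper: count code points (not bytes) by counting
 * every byte that is NOT a continuation byte. Assumes valid utf-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With the name "r\xc3\xa9sum\xc3\xa9.txt" (each accented e is two bytes),
file_suffix() still returns ".txt", while strlen() reports 12 bytes and
utf8_char_count() reports 10 characters. The suffix helper is deliberately
naive: it doesn't treat directory components or leading-dot names
specially.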

Even if you do need to scan utf-8 backwards, it's not SO hard, because it's
easy to tell when you've reached the beginning of a character: the first
byte of a multibyte character has its high two bits set to 11, each
additional byte has them set to 10, and an ASCII byte has the high bit
clear.
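
A rough sketch of that backward scan, and of the truncation case mentioned
earlier, using the same bit test. The function names are invented for this
example, and the code assumes the buffer holds valid utf-8; it does no
validation.

```c
#include <stddef.h>

/* A continuation byte has the form 10xxxxxx. */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: given a byte offset pos that sits on a
 * character boundary in s, return the offset where the previous
 * character starts. Returns 0 if pos is already 0. */
size_t utf8_prev(const char *s, size_t pos)
{
    if (pos == 0)
        return 0;
    do {
        pos--;
    } while (pos > 0 && utf8_is_continuation((unsigned char)s[pos]));
    return pos;
}

/* Hypothetical helper: largest length <= max_bytes at which the
 * len-byte string s can be cut without splitting a character. */
size_t utf8_truncate_len(const char *s, size_t len, size_t max_bytes)
{
    size_t cut;

    if (len <= max_bytes)
        return len;
    cut = max_bytes;
    /* If the byte at the cut point is a continuation byte, the cut
     * would land mid-character, so back up to the lead byte. */
    while (cut > 0 && utf8_is_continuation((unsigned char)s[cut]))
        cut--;
    return cut;
}
```

For the 4-byte string "a\xc3\xa9z", utf8_prev(s, 3) steps back over the
two-byte character to offset 1, and truncating to at most 2 bytes yields a
length of 1 rather than cutting the character in half.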

