Re: wide characters and i18n

To: Sad Clouds <cryintothebluesky%googlemail.com@localhost>
Subject: Re: wide characters and i18n
From: Ty Sarna <ty%sarna.org@localhost>
Date: Fri, 16 Jul 2010 12:34:31 -0400

On Jul 16, 2010, at 11:50 AM, Sad Clouds wrote:

> Sometimes I do it from left to right, but occasionally I may need to do
> it from right to left. For example if you have a filename:
> 
> some_long_file_name.txt
> 
> To quickly extract the suffix '.txt' you just scan the string from
> right to left, until you hit '.' char. I think with utf-8 this type of
> string manipulation would be quite messy and you would have to use a
> special library that understands utf-8 encodings, etc.

Nope, because:

- ASCII characters are expressed in utf-8 identically ('.' is '.')
- No non-ASCII utf-8 character includes in its multibyte representation any 
byte which is also an ASCII character (all bytes of multibyte utf-8 characters 
have the high bit set). Thus, you can't accidentally mistake part of some other 
character as '.'

Thus, any kind of processing that searches for ASCII characters ('.', 
'/', newline, spaces, etc) can safely be ignorant of utf-8.

Of course some things need to care, like counting characters (vs bytes), 
truncation (to make sure it's not in the middle of a multibyte character), etc, 
etc, but there are many cases where it just doesn't matter. Good old strrchr() 
would do just fine for your example here.
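
As a minimal sketch of the suffix case (the helper name file_suffix is made up 
for illustration, and the input is assumed to be a NUL-terminated utf-8 
string), plain strrchr() is all it takes:

```c
#include <string.h>

/* Hypothetical helper: return a pointer to the last '.' in name, i.e. the
 * start of the suffix, or NULL if the name contains no '.'.
 * This is byte-oriented, yet safe on utf-8: every byte of a multibyte
 * sequence has the high bit set, so none of them can equal '.' (0x2E). */
const char *file_suffix(const char *name)
{
    return strrchr(name, '.');
}
```

For example, given "r\xc3\xa9sum\xc3\xa9.txt" (the utf-8 bytes of a name with 
two accented letters), this returns a pointer to ".txt", the same as it would 
for the pure-ASCII some_long_file_name.txt.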

Even if you do need to scan utf-8 backwards, it's not SO hard, because it's 
easy to tell when you have reached the beginning of a character: the high two 
bits of the first byte of a multibyte character are 11, vs 10 for each 
additional byte (and a single-byte ASCII character has the high bit clear).
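
A rough sketch of that backward scan, assuming the buffer holds valid utf-8 
(the function names here are invented for the example, not from any library):

```c
#include <stddef.h>

/* Continuation bytes of a multibyte character look like 10xxxxxx. */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Given a byte offset pos into s (0 < pos <= length of s), return the offset
 * at which the character ending just before pos begins. Steps back one byte,
 * then keeps stepping while it is still on a continuation byte. */
size_t utf8_prev(const char *s, size_t pos)
{
    do {
        pos--;
    } while (pos > 0 && is_continuation((unsigned char)s[pos]));
    return pos;
}

/* Count characters (not bytes) by walking the buffer from right to left. */
size_t utf8_count_backward(const char *s, size_t len)
{
    size_t n = 0;
    while (len > 0) {
        len = utf8_prev(s, len);
        n++;
    }
    return n;
}
```

This does no validation; on malformed input it simply stops at the first 
non-continuation byte it meets (or at offset 0), so treat it as an 
illustration rather than a hardened routine.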
