NetBSD-Users archive


Re: Unicode to ASCII



Silas wrote:
> Bob Proulx wrote:
> >    iconv -f UTF-8 -t ASCII//TRANSLIT <filein >fileout
> 
> It seems it is not possible on NetBSD 9.0 iconv :-(

It looks like //TRANSLIT is a GNU glibc extension not available in
NetBSD's version of libc.  Sorry.

> $ echo 'pão' | iconv -f UTF-8 -t ASCII//TRANSLIT
> iconv: iconv_open(ASCII//TRANSLIT, UTF-8): Invalid argument

NetBSD's iconv can translate from one codeset to another, but it
doesn't know how to transliterate.  Nothing of the kind is listed in
its documentation.

    man iconv

     -t    Specifies the destination codeset name as to_name.

And that is all it says.  So it can change codesets.

    $ echo 'pão' | iconv -f UTF-8 -t LATIN1 | od -tx1 -c
    0000000   70  e3  6f  0a
               p 343   o  \n

I passed the output through od to show the LATIN1 byte e3 (octal 343),
since the raw byte would turn into a mishmash here in what will be a
UTF-8 mailing.  But I can show that it can be converted back.

    $ echo 'pão' | iconv -f UTF-8 -t LATIN1 | iconv -f LATIN1 -t UTF-8
    pão
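
The same round trip can be sketched outside of iconv.  This is only an
illustration in Python 3, which is an assumption here (the thread
itself only uses iconv).  It shows that ã becomes the single byte e3
in LATIN1 and that the conversion is reversible, while plain ASCII has
no byte for it at all.

```python
# Round trip: text -> LATIN1 bytes -> text again.
text = "pão"

latin1_bytes = text.encode("latin-1")  # the bytes a LATIN1 conversion yields
print(" ".join(f"{b:02x}" for b in latin1_bytes))  # 70 e3 6f

round_trip = latin1_bytes.decode("latin-1")
print(round_trip == text)  # True

# ASCII has no code for the a-with-tilde, so a plain conversion fails.
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode:", err.object[err.start:err.end])
```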

> Is there something that could be installed from pkgsrc (or another
> iconv implementation) to make it work?

For transliteration it looks like you would need the GNU version of
iconv.  Sorry!

    https://manpages.debian.org/buster/manpages/iconv.1.en.html

    -t to-encoding, --to-code=to-encoding
        Use to-encoding for output characters.

        If the string //IGNORE is appended to to-encoding, characters that
        cannot be converted are discarded and an error is printed after
        conversion.

        If the string //TRANSLIT is appended to to-encoding, characters
        being converted are transliterated when needed and possible. This
        means that when a character cannot be represented in the target
        character set, it can be approximated through one or several
        similar looking characters. Characters that are outside of the
        target character set and cannot be transliterated are replaced
        with a question mark (?) in the output.
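
Where GNU iconv is not an option, a rough approximation of the
behavior described above can be sketched in a few lines of Python 3.
This is not how GNU iconv implements //TRANSLIT, and the helper name
to_ascii is made up for this illustration.  It assumes a Python 3
interpreter is available.  The idea is to decompose each accented
letter into a base letter plus combining marks, drop the marks, and
replace whatever is still outside ASCII with a question mark.

```python
import unicodedata


def to_ascii(text):
    """Approximate ASCII transliteration (hypothetical helper).

    This is a sketch, not GNU iconv's //TRANSLIT algorithm.
    """
    # NFKD splits an accented letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the accent itself
        # Anything still outside ASCII has no decomposition to fall back on.
        out.append(ch if ord(ch) < 128 else "?")
    return "".join(out)


print(to_ascii("pão"))  # pao
```

Unlike GNU iconv, this sketch has no table of multi-letter
replacements, so a character such as ß comes out as a question mark
rather than "ss".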

Bob

