NetBSD-Users archive


Re: Unicode to ASCII



Todd Gruhn wrote:
> I extracted the "text" from a large PDF using a NetBSD prog called
> pdftotext(1).

pdftotext is really awesome.  I find "pdftotext -layout" to do a truly
excellent job with most PDF files I need to deal with from banks and
things here.

> I got the desired ASCII text, but it has many occurrences of the sequence
> \x{80}\x{9c} ... \x{80}\x{9d}

Do you know what charset that is in natively?

> Is there a nice and universal utility that can convert these to ASCII chars?
> Someone mentioned EMACS... What about in pkgsrc?

I'll be honest and say I did not look, but on another system I routinely
use "iconv" for this type of thing.  I will cross my fingers and hope it
is available in pkgsrc.

    iconv -f UTF-8 -t ASCII//TRANSLIT <filein >fileout

That assumes UTF-8 in and ASCII out, but your input may well be in some
other code set, such as a Windows code page.  For example, to convert
CP1252 to UTF-8:

    iconv -f CP1252 -t UTF-8 <filein >fileout
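
If you do not know the input charset, one rough way to pick between the
two commands above is to let iconv itself check whether the file is valid
UTF-8.  The sketch below is only that, a sketch: "is_utf8" and "to_ascii"
are hypothetical helper names of mine, the CP1252 fallback is a guess, and
"//TRANSLIT" is an extension that not every iconv supports.  For what it
is worth, the bytes you quoted look like the tail of the UTF-8 encoding of
curly double quotes (E2 80 9C and E2 80 9D), but that is an assumption I
have not verified against your file.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of iconv.

# Succeeds if the file decodes cleanly as UTF-8.
# iconv exits non-zero when it meets an invalid input sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Writes an ASCII approximation of the file to stdout.
to_ascii() {
    if is_utf8 "$1"; then
        # Valid UTF-8: transliterate straight to ASCII.
        iconv -f UTF-8 -t ASCII//TRANSLIT "$1"
    else
        # Not UTF-8: assume CP1252 (a guess) and go through UTF-8 first.
        iconv -f CP1252 -t UTF-8 "$1" | iconv -f UTF-8 -t ASCII//TRANSLIT
    fi
}
```

With those defined, "to_ascii filein >fileout" would do the conversion.
Plain ASCII input passes through unchanged, since ASCII is a subset of
UTF-8.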

Hopefully, even if incomplete, this is still useful.

Bob

