Re: Unicode to ASCII

To: Netbsd-Users-List <netbsd-users%netbsd.org@localhost>
Subject: Re: Unicode to ASCII
From: Bob Proulx <bob%proulx.com@localhost>
Date: Fri, 19 Feb 2021 20:08:25 -0700

Todd Gruhn wrote:
> I extracted the "text" from a large PDF using a NetBSD prog called
> pdftotext(1).

pdftotext is really awesome.  I find that "pdftotext -layout" does a
truly excellent job with most of the PDF files I need to deal with from
banks and the like.

> I got the desired ASCII text, but it has many occurrences of the sequence
> \x{80}\x{9c} ... \x{80}\x{9d}

Do you know what charset that is in natively?
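
One way to check is to look at the raw bytes.  This is only a sketch,
and it assumes the file is UTF-8 and that the sequences are the tails of
the three-byte encodings of the curly double quotes U+201C and U+201D
(e2 80 9c and e2 80 9d).  The sample string is made up for illustration;
on the real file you would run od on the file itself.

```shell
# Hypothetical sample: the word "quoted" wrapped in UTF-8 curly quotes.
sample=$(printf '\342\200\234quoted\342\200\235')

# Dump the bytes in hex and squeeze the whitespace onto one line,
# which makes e2 80 9c / e2 80 9d easy to spot.
hex=$(printf '%s' "$sample" | od -An -tx1 | tr -s ' \n' ' ')
echo "$hex"
```

If the dump shows e2 in front of each 80 9c and 80 9d pair, the input is
very likely UTF-8 and the characters are curly double quotes.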

> Is there a nice and universal utility that can convert these to ASCII chars?
> Someone mentioned EMACS... What about in pkgsrc?

I'll be honest and say I did not look, but on another system I
routinely use "iconv" for this type of thing.  I will cross my fingers
and hope it is available in pkgsrc.

    iconv -f UTF-8 -t ASCII//TRANSLIT <filein >fileout

That assumes UTF-8 in and ASCII out, but your input may well be in some
other code set or code page.  For example, for CP1252 in and UTF-8 out:

    iconv -f CP1252 -t UTF-8 <filein >fileout
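
//TRANSLIT is an extension rather than something POSIX requires, so it
may behave differently from one iconv implementation to another.  If it
does not do what you want, a byte-level substitution is a possible
fallback.  The following is a minimal sketch, assuming the input is
UTF-8 and that only the curly double quotes U+201C and U+201D need
replacing; the helper name ascii_quotes and the sample text are
hypothetical.

```shell
# UTF-8 byte sequences for the curly quotes U+201C and U+201D.
lq=$(printf '\342\200\234')
rq=$(printf '\342\200\235')

# Hypothetical helper: replace both curly quotes with a plain ASCII quote.
# LC_ALL=C makes sed treat the input as raw bytes.
ascii_quotes() {
    LC_ALL=C sed -e "s/$lq/\"/g" -e "s/$rq/\"/g"
}

# Made-up sample input; on a real file use: ascii_quotes <filein >fileout
result=$(printf '%s\n' "${lq}quoted${rq}" | ascii_quotes)
echo "$result"
```

This only handles the two characters listed, so any other non-ASCII
characters in the file would pass through unchanged.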

Hopefully this is useful even if it is incomplete.

Bob
