NetBSD-Users archive


Re: Unicode to ASCII



On 2021-02-21 07:42, RVP wrote:
On Fri, 19 Feb 2021, Todd Gruhn wrote:

I extracted the "text" from a large PDF using a NetBSD prog called
pdftotext(1).

I got the desired ASCII text, but it has many occurrences of the sequence
\x{80}\x{9c} ... \x{80}\x{9d}

Is there a nice and universal utility that can convert these to ASCII chars?


Those look like Unicode code points rather than UTF-8:

I agree that that isn't UTF-8. A UTF-8 encoded character
cannot start with 0x80, since bytes in the range 0x80-0xBF
are only ever continuation bytes.
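
To make the point about 0x80 concrete, here is a small Python sketch. The helper names (utf8_byte_role, is_valid_utf8) are made up for illustration, and the 0xE2 lead byte in the last line is only my own example of a valid sequence that happens to contain the same two trailing bytes; nothing in the thread says that is what the PDF actually contained.

```python
def utf8_byte_role(b: int) -> str:
    """Classify a single byte by the role it can play in UTF-8."""
    if b <= 0x7F:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if 0x80 <= b <= 0xBF:
        return "continuation"   # 10xxxxxx: only valid after a lead byte
    if 0xC2 <= b <= 0xF4:
        return "lead"           # starts a 2-, 3- or 4-byte sequence
    return "invalid"            # 0xC0, 0xC1, 0xF5..0xFF never occur in UTF-8


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_byte_role(0x80))            # continuation
print(is_valid_utf8(b"\x80\x9c"))      # False: no lead byte in front
print(is_valid_utf8(b"\xe2\x80\x9c"))  # True: hypothetical lead byte added
```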

pdftotext is being really ugly if it converts text into a stream of bytes and, for a Unicode character, just emits whatever bytes make up the code point. How could you, in that case, even differentiate between U+6161 and "aa" (0x61 0x61), for example?
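
The ambiguity can be shown in a few lines of Python. This is only a sketch of the hypothetical "dump the code point as raw bytes" behaviour discussed above (I am not claiming pdftotext does this), compared against a real UTF-8 encoding of the same character. The variable names are my own.

```python
code_point = 0x6161

# Hypothetical raw dump: the code point written out as two big-endian bytes.
raw_dump = code_point.to_bytes(2, "big")

# Proper UTF-8 encoding of the same character.
utf8 = chr(code_point).encode("utf-8")

print(raw_dump)                         # b'aa'
print(raw_dump == "aa".encode("ascii")) # True: indistinguishable from "aa"
print(utf8.hex())                       # e685a1: three bytes, none of them ASCII
```

With the raw dump a reader of the byte stream has no way to tell one CJK character from two Latin letters. UTF-8 avoids this because every byte of a multi-byte sequence has its high bit set.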

U+809c = https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=809C
U+809d = https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=809D

Rather than trying to convert to ASCII (which is either a) nonsensical,
or b) already being done, above, with the \x{} representation), what
you should do is set your locale to a UTF-8 one and then use the font
which covers the code-points you're likely to encounter.

If you set a proper locale then pdftotext can just convert Unicode
to UTF-8 in a "lossless" manner.

''Converting Unicode to UTF-8 in a "lossless" manner'' makes no sense.

UTF-8 is already an encoding of Unicode characters. It's always lossless. You can convert back and forth all day long.
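
A quick way to convince yourself of the round trip, sketched in Python. The sample strings and the round_trips helper are my own choices for illustration; the third sample uses the two code points mentioned earlier in the thread.

```python
def round_trips(text: str) -> bool:
    """Encode to UTF-8 and decode again; True if nothing was lost."""
    return text.encode("utf-8").decode("utf-8") == text


samples = [
    "plain ASCII",
    "\u201cquoted\u201d",   # curly double quotes
    "\u809c\u809d",         # U+809C and U+809D
    "\U0001F600",           # a code point outside the BMP
]

for s in samples:
    print(len(s), len(s.encode("utf-8")), round_trips(s))
```

The character count and the byte count differ for everything except pure ASCII, but the text that comes back is identical each time.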

And I was assuming that the representation \x{80} was a way to display the non-printable character 0x80. But if it literally put "\x{80}" in the stream, then it's again a different story...
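
Which of the two cases applies can be checked mechanically. The following Python sketch assumes the escape text, if present, has exactly the form \x{HH} with two hex digits, as in the original question; the function names and the regular expression are my own, not anything pdftotext provides.

```python
import re

# Matches the six literal ASCII characters \x{HH}, HH being two hex digits.
ESCAPE_TEXT = re.compile(rb"\\x\{([0-9A-Fa-f]{2})\}")


def describe(data: bytes) -> str:
    """Say whether data holds literal \\x{HH} text or real high bytes."""
    if ESCAPE_TEXT.search(data):
        return "literal escape text"
    if any(b >= 0x80 for b in data):
        return "raw non-ASCII bytes"
    return "plain ASCII"


def unescape(data: bytes) -> bytes:
    """Turn each literal \\x{HH} back into the single byte it names."""
    return ESCAPE_TEXT.sub(lambda m: bytes([int(m.group(1), 16)]), data)


print(describe(b"\\x{80}\\x{9c}"))   # literal escape text
print(describe(b"\x80\x9c"))         # raw non-ASCII bytes
print(unescape(b"\\x{80}\\x{9c}"))   # b'\x80\x9c'
```

Running describe() over the first few kilobytes of the pdftotext output would settle whether the file really contains the byte 0x80 or just a printable rendering of it.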

  Johnny

--
Johnny Billquist                  || "I'm on a bus
                                  ||  on a psychedelic trip
email: bqt%softjar.se@localhost             ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol

