NetBSD-Users archive


Re: Unicode to ASCII



On Sun, 21 Feb 2021, Johnny Billquist wrote:

pdftotext is really ugly if it converts text and creates a stream of bytes, but if it's a Unicode character, it just creates all the bytes required to encode the character. How can you in that case even differentiate between U+6161 and "AA" for example?


I presume, in such a case, that pdftotext will choose the non-surprising
behaviour of printing "AA" as "AA" rather than \x{61}\x{61} ;-)


''Converting Unicode to UTF-8 in a "lossless" manner'' makes no sense.

UTF-8 already is Unicode characters.


Well, they're separate things, actually (code points vs. an encoding
format)--better discussed in its own thread. You can use other
encoding formats for the same Unicode code points: UCS-2, UTF-16, UTF-32,
UTF-7, UCS-4, ...
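To make the code point vs. encoding distinction concrete, here is a small Python sketch (not from the original thread) that writes one code point out in several encoding forms. The codec names are Python's own, and U+6161 is simply the code point mentioned earlier in the thread.

```python
# One Unicode code point, several encoding forms.
text = "\u6161"  # a single code point, U+6161

# Codec names below are Python's; "-be" means big-endian with no byte-order mark.
encoded = {
    name: text.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-32-be", "utf-7")
}

for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}")

# Every well-formed byte sequence decodes back to the same code point.
roundtrip_ok = all(raw.decode(name) == text for name, raw in encoded.items())
print(roundtrip_ok)
```

The byte strings differ in length and content, yet each decodes back to the same single code point, provided you know which encoding form was used.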

It's always lossless. You can convert back and forth all day long.


Again, not quite, which is why I put quotes around my "lossless". (Also,
my peculiar sense of humour getting in the way of good explanations.)

For example, here are 3 different UTF-8 encodings of the same Unicode
code point for the ASCII character 'A':

A = 0x41
A = 0xC1 0x81
A = 0xE0 0x81 0x81

Conforming UTF-8 decoders are supposed to reject the last 2 (and any
other overlong form) as ill-formed, but roll-your-own implementations
often accept them as a plain 'A'--which leads to black-hats slipping
input past your filters and cracking your website... (Also an
interesting topic better discussed elsewhere.)
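A rough Python sketch of the difference, using the three byte sequences listed above. The naive_decode function is a hypothetical roll-your-own decoder written only for illustration; it handles a single 1-, 2- or 3-byte sequence and does no shortest-form check. For comparison, Python's built-in "utf-8" codec is strict and refuses the two longer forms.

```python
# The three byte sequences listed above.
candidates = [b"\x41", b"\xc1\x81", b"\xe0\x81\x81"]


def naive_decode(raw: bytes) -> int:
    """Hypothetical roll-your-own decoder for ONE 1-3 byte sequence.

    It only extracts the payload bits and never checks that the
    shortest possible form was used.
    """
    lead = raw[0]
    if lead < 0x80:            # 0xxxxxxx: plain ASCII
        return lead
    if lead >> 5 == 0b110:     # 110xxxxx: 2-byte lead
        cp = lead & 0x1F
    elif lead >> 4 == 0b1110:  # 1110xxxx: 3-byte lead
        cp = lead & 0x0F
    else:
        raise ValueError("lead byte not handled by this sketch")
    for byte in raw[1:]:       # 10xxxxxx continuation bytes
        cp = (cp << 6) | (byte & 0x3F)
    return cp


def strict_accepts(raw: bytes) -> bool:
    """True if Python's built-in UTF-8 codec accepts the bytes."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


naive = [naive_decode(c) for c in candidates]
strict = [strict_accepts(c) for c in candidates]
print([hex(cp) for cp in naive])  # the naive decoder sees 0x41 every time
print(strict)                     # the strict codec accepts only the first
```

A filter that scans raw bytes for 0x41 while a later stage decodes naively would miss the two longer forms, which is roughly how the trouble starts.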

-RVP

