NetBSD-Users archive
Re: Unicode to ASCII
On Sun, 21 Feb 2021, Johnny Billquist wrote:
pdftotext is really ugly if it converts text and creates a stream of bytes, 
but if it's a Unicode character, it just creates all the bytes
required to encode the character. How can you in that case even differentiate 
between U+6161 and "aa" for example?
I presume, in such a case, that pdftotext will choose the non-surprising
behaviour of printing "aa" as "aa" rather than \x{61}\x{61} ;-)
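As a rough illustration of why the two cases stay distinguishable in UTF-8 (Python is assumed here purely for convenience; nothing in the thread depends on it), the single code point U+6161 and the two-letter ASCII string "aa" (0x61 0x61) produce different byte sequences:

```python
# U+6161 is ONE code point (a CJK ideograph); "aa" is TWO code points, U+0061 twice.
single = "\u6161"
double = "aa"

single_bytes = single.encode("utf-8")
double_bytes = double.encode("utf-8")

print(len(single), single_bytes.hex(" "))  # 1 e6 85 a1
print(len(double), double_bytes.hex(" "))  # 2 61 61
```

In UTF-8 every byte of a multi-byte sequence has its high bit set, so it cannot be mistaken for ASCII text.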
''Converting Unicode to UTF-8 in a "lossless" manner'' makes no sense.
UTF-8 already is Unicode characters.
Well, they're separate things, actually (code points vs. an encoding
format)--better discussed in its own thread. You can use other
encoding formats for the same Unicode code points: UCS-2, UTF-16, UTF-32,
UTF-7, UCS-4, ...
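To make the code point vs. encoding distinction concrete, here is a small sketch (again assuming Python, with its standard codec names) that encodes the same code point in three encoding forms and checks that each one round-trips:

```python
# One code point, several encoding forms. The codec names are Python's own.
code_point = "\u6161"

encodings = {}
for name in ("utf-8", "utf-16-be", "utf-32-be"):
    encoded = code_point.encode(name)
    # Each valid encoding decodes back to the same code point.
    assert encoded.decode(name) == code_point
    encodings[name] = encoded
    print(f"{name:10} {encoded.hex(' ')}")

# utf-8      e6 85 a1
# utf-16-be  61 61
# utf-32-be  00 00 61 61
```

Note that the UTF-16BE bytes of U+6161 are 0x61 0x61, the same bytes as ASCII "aa": a byte stream can only be interpreted once you know which encoding produced it.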
It's always lossless. You can convert 
back and forth all day long.
Again, not quite, which is why I put quotes around my "lossless". (Also,
my peculiar sense of humour getting in the way of good explanations.)
For example, here are 3 different UTF-8 encodings of the same Unicode
code-point for the character ASCII 'A':
A = 0x41
A = 0xC1 0x81
A = 0xE0 0x81 0x81
Strictly speaking, only the first is valid: the other two are "overlong"
forms that a conforming decoder must reject (RFC 3629). Roll-your-own
implementations often accept them anyway and decode all 3 (or more!) to
'A'--which leads to black-hats cracking your website... (Also an
interesting topic better discussed elsewhere.)
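A minimal sketch of the difference, assuming Python: `naive_decode_one` below is a hypothetical roll-your-own decoder written only for this illustration (it unpacks the payload bits and validates nothing), while `strict_accepts` asks Python's built-in UTF-8 codec, which rejects overlong forms.

```python
def naive_decode_one(seq: bytes) -> int:
    """Hypothetical lenient decoder: extracts payload bits, checks nothing."""
    first = seq[0]
    if first < 0x80:                      # 0xxxxxxx
        return first
    if first & 0xE0 == 0xC0:              # 110xxxxx 10xxxxxx
        return ((first & 0x1F) << 6) | (seq[1] & 0x3F)
    if first & 0xF0 == 0xE0:              # 1110xxxx 10xxxxxx 10xxxxxx
        return ((first & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)
    raise ValueError("sequence length not handled in this sketch")


def strict_accepts(seq: bytes) -> bool:
    """True if Python's built-in strict UTF-8 decoder accepts the bytes."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


candidates = [b"\x41", b"\xc1\x81", b"\xe0\x81\x81"]
for seq in candidates:
    print(seq.hex(" "), "| naive:", chr(naive_decode_one(seq)),
          "| strict accepts:", strict_accepts(seq))
```

The naive decoder yields 'A' for all three sequences, while the strict one accepts only 0x41. The trouble starts when a filter inspects the raw bytes and a later, more lenient layer decodes them.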
-RVP