NetBSD-Users archive


Re: Unicode to ASCII



On Fri, 19 Feb 2021, Todd Gruhn wrote:

I extracted the "text" from a large PDF using a NetBSD prog called
pdftotext(1).

I got the desired ASCII text, but it has many occurrences of the sequence
\x{80}\x{9c} ... \x{80}\x{9d}

Is there a nice and universal utility that can convert these to ASCII chars?


Those look like Unicode code points rather than UTF-8:

U+809c = https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=809C
U+809d = https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=809D

Rather than trying to convert to ASCII (which is either a) nonsensical,
or b) already being done above with the \x{} representation), what
you should do is set your locale to a UTF-8 one and then use a font
that covers the code points you're likely to encounter.

If you set a proper locale then pdftotext can just convert Unicode
to UTF-8 in a "lossless" manner.
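
To make the code-point-to-UTF-8 step concrete, here is a minimal sketch
in POSIX sh of how a single code point maps to UTF-8 bytes. The function
name utf8_bytes is hypothetical (it is not part of pdftotext or of any
NetBSD tool), and it only illustrates the standard encoding rules; it
does not reject surrogates or out-of-range values.

```shell
#!/bin/sh
# utf8_bytes: hypothetical helper, for illustration only.
# Prints the UTF-8 encoding of one Unicode code point as hex bytes.
# Usage: utf8_bytes 0x809C
utf8_bytes() {
    cp=$(( $1 ))    # accepts decimal or 0x-prefixed hex
    if [ "$cp" -lt 128 ]; then
        # 1 byte: 0xxxxxxx (plain ASCII)
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        printf '%02x %02x\n' \
            $(( 0xC0 | (cp >> 6) )) \
            $(( 0x80 | (cp & 0x3F) ))
    elif [ "$cp" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x\n' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x %02x\n' \
            $(( 0xF0 | (cp >> 18) )) \
            $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    fi
}

utf8_bytes 0x809C    # prints: e8 82 9c
utf8_bytes 0x809D    # prints: e8 82 9d
```

Every code point has exactly one such byte sequence, and the sequence
decodes back to the same code point, which is why the conversion loses
nothing.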

Put this minimal set of environment variables in your ~/.xinitrc or
~/.xsession file. (The NetBSD console doesn't handle UTF-8 natively yet,
I think, so the settings below are not useful there.)

For a US English locale:

export LANG=en_US.UTF-8
export LC_CTYPE=$LANG
export LC_ALL=""
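
These three variables are consulted in a fixed order: a non-empty LC_ALL
overrides everything, then LC_CTYPE, then LANG, and an empty LC_ALL (as
set above) counts as unset. Here is a minimal sketch of that lookup in
POSIX sh. The helper names effective_ctype and is_utf8_name are
hypothetical, and falling back to "C" when nothing is set is an
assumption (the real default is implementation-defined).

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# effective_ctype LC_ALL LC_CTYPE LANG
# Prints the locale name that would be used for the LC_CTYPE category.
effective_ctype() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # non-empty LC_ALL wins
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then LC_CTYPE
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # then LANG
    else
        printf 'C\n'            # assumption: default locale is C
    fi
}

# is_utf8_name NAME
# Succeeds if the locale name carries a UTF-8 codeset suffix.
is_utf8_name() {
    case $1 in
        *.UTF-8|*.UTF-8@*|*.utf8|*.utf8@*) return 0 ;;
        *) return 1 ;;
    esac
}

ctype=$(effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}")
if is_utf8_name "$ctype"; then
    echo "LC_CTYPE resolves to $ctype (UTF-8)"
else
    echo "LC_CTYPE resolves to $ctype (not UTF-8)"
fi
```

This only inspects the names. On a running system, `locale charmap'
reports the character set actually in effect, which also catches a
locale name that is not installed.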

For fonts (in xterm):
Use bitmap fonts with the widest glyph coverage:
$ xterm -fn '-misc-*-r-normal--20-*-iso10646-1' \
	-fw '-misc-*-r-normal-ko-18-*-iso10646-1' \
	-fg Black -bg Ivory ...

For TTF fonts, install `noto-ttf' (warning: 800MB+, but you get
practically every font):

# pkgin install noto-ttf

Then, start xterm like this:

$ xterm -fa 'Noto Mono:style=Regular' \
	-fd 'Noto Sans Mono CJK JP:style=Regular' \
	...

You can choose other fonts for -fn and -fa (the standard ASCII ones)
if you want.

-RVP

