NetBSD-Users archive


Re: Ideas for stripping tags from document



On 2021-01-18 00:21, Todd Gruhn wrote:
Hey Johnny, that thing with tr -d did not work. When I read the
manpage I got an idea:

[...]

That's weird. tr -d should definitely work. But...

character classes (in this case [:cntrl:]). It turns out that one can do

s/[[:cntrl:]]/\n/g

using Perl. That fixed the problem with \x{d}. I still need to fix \x{92},
\x{93}, etc.

It would be nice to do: system(tr -d .... $text), then write the
result to a filehandle.
Where do you get the octal values for \x{92}, \x{93}, etc.?

Uh... I said 'tr -d "\223\224\225" < infile > outfile'. You cannot do 'tr -d "\223\224\225" "text"'.

tr -d takes only one argument: the set of characters to delete. It reads stdin, deletes those characters, and writes the result to stdout.
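
A minimal sketch of how text that is not in a file can still be run through tr: pipe it to stdin. The function name strip_tags is made up for illustration, and the LC_ALL=C setting is an assumption added here so that tr treats its input as plain bytes regardless of locale; neither comes from the original command.

```shell
# strip_tags: hypothetical helper name, not part of the original thread.
# Deletes the bytes with octal values 223, 224 and 225 (hex 93, 94, 95)
# from stdin and writes the rest to stdout.
# LC_ALL=C is an assumption: it makes tr work on bytes, not characters.
strip_tags() {
    LC_ALL=C tr -d '\223\224\225'
}

# With files, exactly as in the original command:
#   strip_tags < infile > outfile

# With a string: feed it to stdin through a pipe instead of passing it
# as an argument.
printf 'a\223b\224c\225d\n' | strip_tags
```

From Perl, the same idea would mean opening a pipe to tr rather than handing the text to system() as an argument.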

Octal values from hex values? Plenty of ways to do that, if you can't even figure it out in your head. But in this case, it's actually rather trivial to do in your head.
0x93 is 0x80 + 0x10 + 0x3. 0x80 is 0200. That leaves 0x13. 0x10 is 020. Left with 3 -> 0200 + 020 + 3 = 0223.
But I also have my trusty HP calculators around, and also some weird operating systems that have easy commands to just convert, and then again, you could also just use bc...

Smurf:/Users/bqt> bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
ibase=16
obase=8
93
223
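
For a non-interactive version of the same conversion, the shell's printf can do it, since it accepts a 0x-prefixed hex constant for a numeric format. This is only a sketch; the function name hex_to_octal is invented here.

```shell
# hex_to_octal: hypothetical helper name, not part of the original thread.
# Takes hex digits without the 0x prefix and prints the octal value.
hex_to_octal() {
    printf '%o\n' "0x$1"
}

hex_to_octal 93    # prints 223

# The bc session above can also be run as a one-liner. Setting obase
# before ibase avoids having the 8 read in base 16 (harmless here, but
# it matters for bases above 9). bc wants upper-case hex digits.
#   echo 'obase=8; ibase=16; 93' | bc
```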


  Johnny


On Sun, Jan 17, 2021 at 5:35 AM Johnny Billquist <bqt%update.uu.se@localhost> wrote:

On 2021-01-17 10:57, Ignatios Souvatzis (GSG) wrote:


On 17 January 2021 00:01:23 CET, Johnny Billquist <bqt%update.uu.se@localhost> wrote:
On 2021-01-16 19:45, Todd Gruhn wrote:
I have a large document (18,000 lines). It is full of tags such as <93>,
<94>, <95>.

If I view the doc in a Perl editor I see \x{93}, \x{94}, \x{95} ...

Is there a pkg or command to strip these tags and leave the text ?

tr -d "\223\224\225" < infile > outfile

I'd convert them to ", ", and maybe *, if you really want pure ASCII, but yes.

Well, he did ask how to strip them.

But sure, tr can be used for replacing them with other characters as
well, obviously. Trivial, in fact.
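
As a sketch of the replacement variant: with two sets, tr maps each character in the first to the character at the same position in the second. The mapping below (223 and 224 to a double quote, 225 to an asterisk) follows the suggestion above. The function name ascii_quotes and the LC_ALL=C setting are assumptions added for illustration.

```shell
# ascii_quotes: hypothetical helper name, not part of the original thread.
# Replaces octal 223 and 224 with " and octal 225 with *.
# LC_ALL=C is an assumption: it makes tr work on bytes, not characters.
ascii_quotes() {
    LC_ALL=C tr '\223\224\225' '""*'
}

# With files:
#   ascii_quotes < infile > outfile
printf '\223quoted\224 \225 item\n' | ascii_quotes
```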

    Johnny

--
Johnny Billquist                  || "I'm on a bus
                                  ||  on a psychedelic trip
email: bqt%softjar.se@localhost   ||  Reading murder books
pdp is alive!                     ||  tryin' to stay hip" - B. Idol



