src log message improvement script

To: tech-repository%NetBSD.org@localhost
Subject: src log message improvement script
From: Thomas Klausner <wiz%NetBSD.org@localhost>
Date: Tue, 22 Aug 2023 20:17:09 +0200

Hi!

The current conversion infrastructure I packaged in wip/cvs2hg using
devel/py-hg-fastimport expects log messages to be in UTF-8 encoding.
(An earlier version of py-hg-fastimport fell back to ISO-8859-1 if a
string was not convertible as UTF-8.)

For this reason I've already been fixing some encoding problems in the
other repositories I've been converting.

For src/ this became too much to be done by hand, because there were
numerous imports and other commits that touched a lot of files and
were in ISO-8859-1 encoding.

So I wrote a script for that. It parses RCS files, only looking at the
commit messages, and tries to decode each line as UTF-8. If that
fails, it decodes the line as ISO-8859-1 instead. If anything changed,
it writes the file back out in UTF-8 encoding.
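
The per-line fallback described above can be sketched roughly as follows.
This is only an illustration: the helper name decode_line and the sample
bytes are made up here and do not appear in the attached script.

```python
def decode_line(raw):
    """Decode one log line; return (text, fell_back)."""
    try:
        return raw.decode('utf-8'), False
    except UnicodeDecodeError:
        # ISO-8859-1 assigns a character to every byte value,
        # so this second attempt cannot fail.
        return raw.decode('iso-8859-1'), True


# b'\xed' followed by an ASCII letter is not valid UTF-8.
text, fell_back = decode_line(b'Jarom\xedr')
print(text, fell_back)
```

Because the ISO-8859-1 fallback accepts any byte sequence, text in a third
encoding is not detected; it is silently decoded with the wrong characters.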

(I originally tried converting each log message as a whole, but there
is at least one import that has a single log message containing
characters in both UTF-8 and ISO-8859-1; that's why I switched to the
line-based approach.)
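
A small sketch of why the whole-message approach breaks on such a log
message. The two-line message below is invented for illustration, and
decode_line is a hypothetical helper, not a function from the attached
script.

```python
def decode_line(raw):
    """Try UTF-8 first, then fall back to ISO-8859-1."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso-8859-1')


# Invented message: first line in UTF-8, second line in ISO-8859-1.
message = ('fix by Jarom\u00edr\n'.encode('utf-8') +
           'reviewed by Ren\u00e9\n'.encode('iso-8859-1'))

# Whole-message decoding: the single ISO-8859-1 byte forces the fallback
# for the entire message, which garbles the valid UTF-8 line.
whole = decode_line(message)

# Line-based decoding: each line gets the encoding that fits it.
per_line = ''.join(decode_line(line)
                   for line in message.splitlines(keepends=True))

print(repr(whole))
print(repr(per_line))
```

With the line-based variant both names survive, while the whole-message
variant turns the UTF-8 line into mojibake.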

When looking at the generated changes, I noticed that there were some
other encodings used (e.g. ISO-8859-2). So I fixed up those entries as
I noticed them.
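
To illustrate how an ISO-8859-2 entry ends up wrong after the
ISO-8859-1 fallback, here is a sketch using the one name that the
script's replacement table corrects. The round trip shown is only a
demonstration; the script itself uses a plain string replacement.

```python
# 'č' (U+010D) is byte 0xE8 in ISO-8859-2.
raw = 'Dole\u010dek'.encode('iso-8859-2')

# The ISO-8859-1 fallback reads the same byte as 'è' (U+00E8).
wrong = raw.decode('iso-8859-1')

# Going back to bytes and decoding with the right codec repairs it.
fixed = wrong.encode('iso-8859-1').decode('iso-8859-2')

print(raw, wrong, fixed)
```

Since nothing in the bytes says which encoding was meant, such cases have
to be found by reading the generated diff.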

The attached script, when run on the src repository from earlier this
month, generates a diff that looks good to me. I've attached it.

(I have not tried converting the resulting repository: a conversion of
src takes more than one week on my main machine, which is not stable
enough for that right now.)

Please review the script and/or the diff. I'd like admins to run it on
src/ when we're happy with it.

Cheers,
 Thomas

#!/usr/bin/env python3
#
# pylint recognizes some variables incorrectly as constants
# pylint: disable=C0103

import argparse


def end_of_log(log_lines):
    '''Try decoding as utf-8 first, then iso-8859-1'''
    decoded = []
    changes = False
    for log_line in log_lines:
        try:
            decoded_line = log_line.decode('utf-8')
            if args.verbose:
                print(f'utf-8: {decoded_line}')
        except UnicodeDecodeError:
            decoded_line = log_line.decode('iso-8859-1')
            changes = True
            if args.verbose:
                print(f'iso-8859-1: {decoded_line}')
        decoded.append(decoded_line)
    result = []
    for dec_line in decoded:
        line = dec_line
        line = line.replace('Doleèek', 'Doleček')
        line = line.replace('"access"âand', '"access" and')
        line = line.replace("Hereâs", "Here's")
        line = line.replace("donât", "don't")
        line = line.replace("Iâd", "I'd")
        line = line.replace("itâs", "it's")
        line = line.replace("canât", "can't")
        line = line.replace("theyâve", "they've")
        if line != dec_line:
            changes = True
        result.append(line.encode('utf-8'))
    return changes, result


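def read_file(f):
    '''Re-encode the log messages of the RCS file read from f.

    Returns (changes, output): whether any log line was altered, and
    the complete list of output lines as bytes.
    '''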
    in_body = False
    in_log = False
    at_count = 0
    lines = f.readlines()
    output = []
    changes = False
    for line in lines:
        if args.verbose:
            print(f'line {line}')
        if line == b'desc\n':
            in_body = True
            if args.verbose:
                print(f'desc: {line}')
            output.append(line)
        elif in_body and line == b'log\n':
            in_log = True
            at_count = 0
            log = []
            if args.verbose:
                print(f'body: {line}')
            output.append(line)
        elif in_log:
            if args.verbose:
                print(f'in_log: {line}')
            at_count += line.count(b'@')
            log.append(line)
            if at_count % 2 == 0:
                more_changes, result = end_of_log(log)
                if more_changes:
                    changes = True
                in_log = False
                output += result
        else:
            output.append(line)
    return changes, output


parser = argparse.ArgumentParser(description='Convert encoding of RCS log ' +
                                 'messages from iso-8859-1 to utf-8')
parser.add_argument('file', help='RCS file to edit')
parser.add_argument('-v', dest='verbose', action='store_true',
                    help='print debugging output')
args = parser.parse_args()


changed = False
with open(args.file, 'rb') as file:
    changed, file_data = read_file(file)

if changed:
    with open(args.file, 'wb') as file:
        file.writelines(file_data)
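
The script finds the end of a log message by counting '@' characters. RCS
strings are delimited by '@', and a literal '@' inside a string is written
as '@@', so the string is complete once the running count is even. A
minimal sketch of that check on its own follows; the function name
log_line_count and the sample lines are hypothetical, and the first line
is assumed to start with the opening '@'.

```python
def log_line_count(lines):
    """Return how many lines the '@'-delimited string at lines[0] spans."""
    at_count = 0
    for n, line in enumerate(lines, start=1):
        at_count += line.count(b'@')
        # Opening '@', closing '@', and doubled '@@' pairs add up to an
        # even total exactly when the string has been closed.
        if at_count % 2 == 0:
            return n
    return len(lines)  # unterminated string: consume everything


sample = [
    b'@user@@example.org fixed it\n',  # opening '@' plus an escaped '@@'
    b'second line\n',
    b'@\n',                            # closing '@'
    b'text\n',                         # next RCS keyword, not part of the log
]
print(log_line_count(sample))
```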

Attachment: src.diff.xz
Description: application/xz
