Hi!

The current conversion infrastructure I packaged in wip/cvs2hg using devel/py-hg-fastimport expects log messages to be in UTF-8 encoding. (An earlier version of py-hg-fastimport fell back to ISO-8859-1 if a string was not convertible as UTF-8.)

For this reason I've already been fixing some encoding problems in the other repositories I've been converting. For src/ this became too much to be done by hand, because there were numerous imports and other commits that touched a lot of files and were in ISO-8859-1 encoding.

So I wrote a script for that. It parses RCS files, only looking at the commit messages, and tries to parse each line as UTF-8. If this doesn't work, it parses it as ISO-8859-1. If there were any changes, it writes the file out in UTF-8 encoding. (I originally tried converting each log message as a whole, but there is at least one import that has a single log message containing characters in both UTF-8 and ISO-8859-1; that's why I switched to the line-based approach.)

When looking at the generated changes, I noticed that there were some other encodings used (e.g. ISO-8859-2). So I fixed up those entries as I noticed them.

The attached script, when run on the src repository from earlier this month, generates a diff that looks good to me. I've attached it. (I have not tried converting the resulting repository because my main machine is not stable enough right now for a conversion taking more than one week, like src does on it.)

Please review the script and/or the diff. I'd like admins to run it on src/ when we're happy with it.

Cheers,
 Thomas
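To illustrate why the fallback has to happen per line rather than per message, here is a minimal sketch. The sample log text and the helper name `decode_line` are made up for this example and are not taken from the script or from the src repository.

```python
# Hypothetical two-line log message: the first line is stored as UTF-8,
# the second as ISO-8859-1 (as can happen with an import whose log text
# was pasted together from different sources).
utf8_line = 'Patch from Doleček\n'.encode('utf-8')
latin1_line = 'Großes Update für src\n'.encode('iso-8859-1')
message = utf8_line + latin1_line


def decode_line(raw):
    """Decode as UTF-8, falling back to ISO-8859-1 (which accepts any byte)."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso-8859-1')


# Whole-message fallback: the ISO-8859-1 bytes make UTF-8 decoding fail,
# so the entire message is read as ISO-8859-1 and the UTF-8 line is garbled.
whole = decode_line(message)

# Line-based fallback: each line is decoded with the encoding that fits it.
per_line = ''.join(decode_line(line)
                   for line in message.splitlines(keepends=True))

print(repr(whole))
print(repr(per_line))
```

With the whole-message approach the already-correct UTF-8 line comes out as mojibake, while the line-based approach leaves it alone and only converts the ISO-8859-1 line.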
#!/usr/bin/env python3
#
# pylint recognizes some variables incorrectly as constants
# pylint: disable=C0103
import argparse


def end_of_log(log_lines):
    '''Try decoding as utf-8 first, then iso-8859-1'''
    decoded = []
    changes = False
    for log_line in log_lines:
        try:
            decoded_line = log_line.decode('utf-8')
            if args.verbose:
                print(f'utf-8: {decoded_line}')
        except UnicodeDecodeError:
            # iso-8859-1 maps every byte to a character, so this cannot fail
            decoded_line = log_line.decode('iso-8859-1')
            changes = True
            if args.verbose:
                print(f'iso-8859-1: {decoded_line}')
        decoded.append(decoded_line)
    result = []
    for dec_line in decoded:
        line = dec_line
        # manual fix-ups for entries that used other encodings
        # (e.g. iso-8859-2), added as they were noticed in the diff
        line = line.replace('Doleèek', 'Doleček')
        line = line.replace('"access"âand', '"access" and')
        line = line.replace("Hereâs", "Here's")
        line = line.replace("donât", "don't")
        line = line.replace("Iâd", "I'd")
        line = line.replace("itâs", "it's")
        line = line.replace("canât", "can't")
        line = line.replace("theyâve", "they've")
        if line != dec_line:
            changes = True
        result.append(line.encode('utf-8'))
    return changes, result


def read_file(f):
    '''Return (changes, output lines) with all log messages in utf-8'''
    in_body = False
    in_log = False
    at_count = 0
    lines = f.readlines()
    output = []
    changes = False
    for line in lines:
        if args.verbose:
            print(f'line {line}')
        if line == b'desc\n':
            # the per-revision log/text entries follow the description
            in_body = True
            if args.verbose:
                print(f'desc: {line}')
            output.append(line)
        elif in_body and line == b'log\n':
            # the @-delimited log message starts on the next line
            in_log = True
            at_count = 0
            log = []
            if args.verbose:
                print(f'body: {line}')
            output.append(line)
        elif in_log:
            if args.verbose:
                print(f'in_log: {line}')
            at_count += line.count(b'@')
            log.append(line)
            # literal '@' is doubled inside RCS strings, so an even count
            # means the closing '@' has been seen
            if at_count % 2 == 0:
                more_changes, result = end_of_log(log)
                if more_changes:
                    changes = True
                in_log = False
                output += result
        else:
            output.append(line)
    return changes, output


parser = argparse.ArgumentParser(description='Convert encoding of RCS log ' +
                                 'messages from iso-8859-1 to utf-8')
parser.add_argument('file', help='RCS file to edit')
parser.add_argument('-v', dest='verbose', action='store_true',
                    help='print verbose debugging output')
args = parser.parse_args()

changed = False
with open(args.file, 'rb') as file:
    changed, file_data = read_file(file)
# only rewrite the file if at least one log line was converted
if changed:
    with open(args.file, 'wb') as file:
        file.writelines(file_data)
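The script finds the end of a log message by counting `@` characters. RCS strings are enclosed in `@`, and a literal `@` inside the string is written as `@@`, so the string is closed exactly when the running count becomes even. The following is a small standalone sketch of that idea; the helper name `split_rcs_string` and the sample lines are hypothetical and only serve as illustration.

```python
def split_rcs_string(lines):
    """Split off the first @-delimited RCS string from a list of byte lines.

    Assumes lines[0] is the line carrying the opening '@'. Returns a tuple
    (string_lines, remaining_lines).
    """
    at_count = 0
    for index, line in enumerate(lines):
        at_count += line.count(b'@')
        # Opening '@' makes the count odd, every escaped '@@' adds two,
        # and only the closing '@' makes it even again.
        if at_count % 2 == 0:
            return lines[:index + 1], lines[index + 1:]
    raise ValueError('unterminated RCS string')


# Hypothetical excerpt: a log message containing an escaped '@', followed
# by the keyword that starts the next section.
sample = [
    b'@Mail user@@example.org about this.\n',
    b'Second line.\n',
    b'@\n',
    b'text\n',
]
log_lines, rest = split_rcs_string(sample)
print(log_lines)
print(rest)
```

An empty log message (`@@` on a single line) also has an even count, so it is treated as a complete string right away.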
Attachment: src.diff.xz
Description: application/xz