Hi!

The current conversion infrastructure I packaged in wip/cvs2hg using devel/py-hg-fastimport expects log messages to be in UTF-8 encoding. (An earlier version of py-hg-fastimport fell back to ISO-8859-1 if a string was not convertible as UTF-8.)

For this reason I've already been fixing some encoding problems in the other repositories I've been converting.

For src/ this became too much to be done by hand, because there were numerous imports and other commits that touched a lot of files and were in ISO-8859-1 encoding.

So I wrote a script for that. It parses RCS files, only looking at the commit messages, and tries to parse each line as UTF-8. If this doesn't work, it parses it as ISO-8859-1. If there were any changes, it writes the file out in UTF-8 encoding.

(I originally tried converting each log message as a whole, but there is at least one import that has a single log message containing characters in both UTF-8 and ISO-8859-1; that's why I switched to the line-based approach.)

When looking at the generated changes, I noticed that there were some other encodings used (e.g. ISO-8859-2). So I fixed up those entries as I noticed them.

The attached script, when run on the src repository from earlier this month, generates a diff that looks good to me. I've attached it.

(I have not tried converting the resulting repository because my main machine is not stable enough right now for a conversion taking more than one week, like src does on it.)

Please review the script and/or the diff. I'd like admins to run it on src/ when we're happy with it.

Cheers,
 Thomas

#!/usr/bin/env python3
#
# pylint recognizes some variables incorrectly as constants
# pylint: disable=C0103

import argparse


def end_of_log(log_lines):
    '''Try decoding as utf-8 first, then iso-8859-1'''
    decoded = []
    changes = False
    for log_line in log_lines:
        try:
            decoded_line = log_line.decode('utf-8')
            if args.verbose:
                print(f'utf-8: {decoded_line}')
        except UnicodeDecodeError:
            # iso-8859-1 maps every byte, so this fallback cannot fail
            decoded_line = log_line.decode('iso-8859-1')
            changes = True
            if args.verbose:
                print(f'iso-8859-1: {decoded_line}')
        decoded.append(decoded_line)
    result = []
    for dec_line in decoded:
        line = dec_line
        # manual fixups for entries that were in yet another encoding
        line = line.replace('Doleèek', 'Doleček')
        line = line.replace('"access"âand', '"access" and')
        line = line.replace("Hereâs", "Here's")
        line = line.replace("donât", "don't")
        line = line.replace("Iâd", "I'd")
        line = line.replace("itâs", "it's")
        line = line.replace("canât", "can't")
        line = line.replace("theyâve", "they've")
        if line != dec_line:
            changes = True
        result.append(line.encode('utf-8'))
    return changes, result


def read_file(f):
    '''Return (changes, output lines) with all log messages recoded'''
    in_body = False
    in_log = False
    at_count = 0
    lines = f.readlines()
    output = []
    changes = False
    for line in lines:
        if args.verbose:
            print(f'line {line}')
        if line == b'desc\n':
            in_body = True
            if args.verbose:
                print(f'desc: {line}')
            output.append(line)
        elif in_body and line == b'log\n':
            in_log = True
            at_count = 0
            log = []
            if args.verbose:
                print(f'body: {line}')
            output.append(line)
        elif in_log:
            if args.verbose:
                print(f'in_log: {line}')
            # the log string is delimited by '@', with '@@' as an escaped
            # '@', so it ends once the running count of '@' is even
            at_count += line.count(b'@')
            log.append(line)
            if at_count % 2 == 0:
                more_changes, result = end_of_log(log)
                if more_changes:
                    changes = True
                in_log = False
                output += result
        else:
            output.append(line)
    return changes, output


parser = argparse.ArgumentParser(description='Convert encoding of RCS log ' +
                                 'messages from iso-8859-1 to utf-8')
parser.add_argument('file', help='RCS file to edit')
parser.add_argument('-v', dest='verbose', action='store_true',
                    help='print debugging output')
args = parser.parse_args()

changed = False
with open(args.file, 'rb') as file:
    changed, file_data = read_file(file)
if changed:
    with open(args.file, 'wb') as file:
        file.writelines(file_data)

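
For anyone who wants to try the line-based fallback on its own, here is a minimal sketch, separate from the script. The helper name recode_line and the sample message are made up for illustration; they are not taken from the src repository.

```python
def recode_line(raw):
    """Return (changed, utf8_bytes) for one raw log line."""
    try:
        raw.decode('utf-8')
        return False, raw
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value, so this decode cannot fail.
        return True, raw.decode('iso-8859-1').encode('utf-8')


# Hypothetical log message with one line in each encoding.
message = [
    'Grüße\n'.encode('utf-8'),
    'Grüße\n'.encode('iso-8859-1'),
]

recoded = [recode_line(line) for line in message]
fixed = b''.join(data for _, data in recoded)
```

Decoding the joined message as UTF-8 in one go raises UnicodeDecodeError because of the second line, and decoding the whole message as ISO-8859-1 instead would garble the first line. Since the ISO-8859-1 fallback accepts any byte sequence, lines in other encodings (such as ISO-8859-2) pass through silently with wrong characters, which is why those still need manual fixups.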
Attachment:
src.diff.xz
Description: application/xz