tech-repository archive


Re: The first step away from CVS



Joerg Sonnenberger wrote:
> On Thu, Jan 07, 2010 at 08:44:58AM +1300, Lloyd Parkes wrote:
>> I've been having a look at Git and Mercurial and I've worked out
>> that the first thing we need to do (regardless of what we do second)
>> is to convert all the CVS commit messages to UTF-8.
>
> The vast majority of all commit messages are either ASCII or Latin1.

Certainly the majority of commit messages are US-ASCII, but I expect that the remainder contain a good variety of character sets. Our Japanese colleagues can be quite prolific. The bulk of each message is US-ASCII, but the committers seem to write their names using their local character sets from time to time.
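
One way to replace expectation with numbers is to tally the raw commit messages by coarse encoding class. The following is a minimal sketch in Python, not a tool anyone on this thread has written: the function names and the sample messages are hypothetical, and extracting the raw message bytes from the CVS ,v files is assumed to happen elsewhere.

```python
from collections import Counter


def classify(message: bytes) -> str:
    """Return a coarse encoding class for one raw commit message."""
    try:
        message.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Strict UTF-8 decoding rejects most legacy 8-bit text.
        message.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Latin-1, EUC-JP, Shift_JIS and so on all end up here.
        return "other-8bit"


def tally(messages) -> Counter:
    """Count messages per encoding class."""
    return Counter(classify(m) for m in messages)


# Hypothetical sample data, for illustration only.
samples = [
    b"Fix typo in man page.",
    b"caf\xe9",                  # Latin-1 "cafe" with e-acute
    "修正".encode("utf-8"),       # already UTF-8
    b"\xc6\xfc\xcb\xdc",         # two kanji in EUC-JP
]
print(tally(samples))
```

Anything landing in the "other-8bit" bucket still needs a per-committer or per-message decision about which character set it really is; the tally only shows how big that problem is.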

> Ignoring the few remaining cases is not that problematic.

Ignored? How? Some versioning systems require that the character set be identified, and the two character sets you just mentioned do not cover the whole 8-bit range. So not only would we get some things incorrectly labelled as Latin-1, but we would also get commit messages that cannot be encoded at all (values 0x7f to 0x9f are not valid Latin-1 characters).
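
To illustrate the point about the 0x7f to 0x9f range, here is a small Python sketch that flags messages which cannot honestly be labelled Latin-1. The function name is hypothetical. Note that Python's own "latin-1" codec maps every byte value to the code point of the same number and so never raises an error, which is why the check has to be done on the bytes explicitly.

```python
from typing import List


def undefined_latin1_offsets(message: bytes) -> List[int]:
    """Return offsets of bytes in 0x7f-0x9f, which Latin-1 does not assign."""
    return [i for i, b in enumerate(message) if 0x7F <= b <= 0x9F]


# Decoding alone proves nothing: every possible byte value "succeeds".
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")  # never raises

# Hypothetical message with Windows-1252 curly quotes (0x93 and 0x94).
suspect = b"\x93quoted\x94"
print(undefined_latin1_offsets(suspect))
```

A message with a non-empty result was written in some other character set, whatever a blanket "treat it all as Latin-1" conversion would claim.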

IMHO the versioning systems that do require that the character set be identified are better designed and less hackish than the others.

I have spent maybe 15 years running IMAP servers, some of them moderately large, and this problem came up there from time to time before MIME became endemic. In my experience there is no substitute for getting character sets right from the beginning, and I think we now have an opportunity to sort this out while we still have a repository that is amenable to dirty tricks.

Cheers,
Lloyd

