Subject: bin/21645: Localized comments and indent(1)
To: None <gnats-bugs@gnats.netbsd.org>
From: None <mishka@terabyte.com.ua>
List: netbsd-bugs
Date: 05/22/2003 18:57:53
>Number:         21645
>Category:       bin
>Synopsis:       indent(1) doesn't handle non English characters in comments
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    bin-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Thu May 22 15:59:01 UTC 2003
>Closed-Date:
>Last-Modified:
>Originator:     Mishka
>Release:        NetBSD 1.6R
>Organization:
Terabyte ACS
>Environment:
System: NetBSD batraq.anything3d.com 1.6R NetBSD 1.6R (BATRAQ) #0: Fri Apr 25 14:37:48 EEST 2003 mishka@batraq.anything3d.com:/usr/home/mishka/netbsd/src-current/sys/arch/i386/compile/BATRAQ i386
Architecture: i386
Machine: i386
>Description:

	Greetings!

	indent(1) is an excellent tool for normalizing ugly code,
	but it currently mishandles non-English characters in
	comments. Sometimes it "eats" them, and sometimes it breaks
	a comment line at an unreasonable place. English text
	passes through correctly.

	This appears to happen because, in some places, the
	character class is determined by comparing a signed char
	against a numeric value with the relational operators
	"<", ">", "<=", ">=". In some codesets many characters sit
	in the second half of the extended ASCII table (codes above
	<7F>), where a signed char is negative, so the comparison
	gives the wrong answer.

	For example, in Russian (KOI8-R) the letter "A" has the
	code <E1>, which is greater than <7F> (the last character
	in the first half of the ASCII table). A *signed* variable
	"foo" holding that code, compared to, say, space (code
	<20>), will be less than space!

		char foo;
		foo = 0xe1;	/* Cyrillic A */
		if (foo > 0x20)
			printf("Is character.\n");
		else
			printf("Is control.\n");

	The program above prints "Is control." on i386, where plain
	char is signed.

	In general, this problem is not limited to indent(1).
	It would also appear outside comments (fortunately, C
	source itself does not allow non-English characters).

>How-To-Repeat:

	Create any C source file whose comments contain characters
	from above the first half of the ASCII table, then run
	indent on it. Please note: to reproduce the effect, indent
	must have to split long lines, preferably several times
	within one comment.

>Fix:

	Please use the patch below. With this patch setlocale() is
	not strictly needed, but if functions such as isalnum()
	come into use, it should be enabled. Perhaps in some
	locales the blank characters are not only ' ' and '\t',
	who knows?


Index: indent.c
===================================================================
RCS file: /cvsroot/src/usr.bin/indent/indent.c,v
retrieving revision 1.13
diff -u -r1.13 indent.c
--- indent.c	2002/05/26 22:53:38	1.13
+++ indent.c	2003/05/22 15:08:16
@@ -61,6 +61,7 @@
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
+#include <locale.h>
 #define EXTERN
 #include "indent_globs.h"
 #undef  EXTERN
@@ -104,6 +105,8 @@
         |		      INITIALIZATION		      |
         \*-----------------------------------------------*/
 
+	if (!setlocale(LC_ALL, ""))
+		fprintf(stderr, "indent: can't set locale.\n");
 
 	hd_type = 0;
 	ps.p_stack[0] = stmt;	/* this is the parser's stack */
Index: pr_comment.c
===================================================================
RCS file: /cvsroot/src/usr.bin/indent/pr_comment.c,v
retrieving revision 1.7
diff -u -r1.7 pr_comment.c
--- pr_comment.c	2002/05/26 22:53:38	1.7
+++ pr_comment.c	2003/05/22 15:08:31
@@ -47,6 +47,7 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#include <ctype.h>
 #include "indent_globs.h"
 
 /*
@@ -184,7 +185,7 @@
 
 	while (1) {		/* this loop will go until the comment is
 				 * copied */
-		if (*buf_ptr > 040 && *buf_ptr != '*')
+		if (!iscntrl(*buf_ptr) && *buf_ptr != '*')
 			ps.last_nl = 0;
 		CHECK_SIZE_COM;
 		switch (*buf_ptr) {	/* this checks for various spcl cases */
@@ -376,7 +377,8 @@
 			/* remember we saw a blank */
 
 			++e_com;
-			if (now_col > adj_max_col && !ps.box_com && unix_comment == 1 && e_com[-1] > ' ') {
+			if (now_col > adj_max_col && !ps.box_com && unix_comment == 1
+				&& !iscntrl(e_com[-1]) && !isblank(e_com[-1])) {
 				/*
 				 * the comment is too long, it must be broken up
 				 */
@@ -399,7 +401,7 @@
 				}
 				*e_com = '\0';	/* print what we have */
 				*last_bl = '\0';
-				while (last_bl > s_com && last_bl[-1] < 040)
+				while (last_bl > s_com && iscntrl(last_bl[-1]) )
 					*--last_bl = 0;
 				e_com = last_bl;
 				dump_line();


	--
	Best regards,
	Mishka.
>Release-Note:
>Audit-Trail:
>Unformatted: