tech-userlevel archive


Re: alignment or compiler bug?



>         max_write = ps->ps_max_write - sizeof(*fwi);

>                 data_len = MIN(*resid, max_write);
>                 if (data_len > PAGE_SIZE)
>                         data_len = data_len & ~(PAGE_SIZE - 1);

>         (void)memcpy(fwi + 1, buf + written, data_len);

> This code will, rarely, crash in memcpy().  gdb shows an unexpectedly
> huge data_len, bigger than max_write.  My explanation is that
> memcpy overwrites data_len because (fwi + 1) did not hold the expected
> value.

Another possibility: *resid is, like max_write, signed, and is
negative.  (If ps->ps_max_write is less than sizeof(*fwi), max_write
could be negative, but your "bigger than max_write" makes it sound as
though that's not it.)

I'm assuming data_len is of unsigned type.  Try printing it out in hex.
Are the low bits zero?  If not, this increases the plausibility of the
"corruption" theory, because of the clearing of the low bits if it's
larger than PAGE_SIZE.

> Adding an intermediate variable and a test on it makes the problem vanish:

>                 char *data;

>                 /*
>                  * If (fwi + 1) is directly used as a memcpy argument,
>                  * we get a rare strange sigsegv, with data_len corrupted
>                  * by the memcpy() operation itself.
>                  */
>                 data = (char *)(void *)(fwi + 1);
>                 if (data != ((char *)fwi) + sizeof(*fwi))
>                         DERRX(EX_SOFTWARE, "%s:%d\n", __func__, __LINE__);

>                 (void)memcpy(data, buf + written, data_len);

> Is there a subtle memory alignment problem I missed, or can we call
> that a compiler bug?

I'd have to look at the code at the machine/assembly level to be
reasonably confident calling it either one.  I feel quite sure it is
not a data alignment issue at the C abstract machine level; the code
you've added *should* not make any difference there (well, unless
DERRX() expands into something unexpected).

Some things you might try:

- Look at the assembly/machine code.  See if it looks broken.  (What
   hardware is this on?  If it's one I know, I can have a look.)

- Leave the "data" variable there, including the code you added to set
   it, but still pass fwi+1 to the memcpy.

- If it doesn't break the semantics, make data have static storage
   duration rather than automatic.

The last two are trying to probe the possibility that it's not the code
you added that's relevant, but the additional space on the stack for
the variable, or the additional code in the text segment.

Another possibility, not sure how to check this one out: that it
happens only when a page fault, TLB miss, or some such occurs at the
wrong moment, and is actually a kernel bug of some sort.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mouse%rodents-montreal.org@localhost
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

