Re: HT bug in some Intel CPUs ?

To: Manuel Bouyer <bouyer%antioche.eu.org@localhost>
Subject: Re: HT bug in some Intel CPUs ?
From: Greg Oster <oster%cs.usask.ca@localhost>
Date: Thu, 06 Aug 2009 08:55:39 -0600

Manuel Bouyer writes:
> Hi,
> after fighting with an upgrade from NetBSD-3 to NetBSD-5/i386 of two
> identical servers, I came to the conclusion that hyperthreading is
> broken on this CPU, causing corrupted registers or memory reads
> (I couldn't determine which).
> The CPU is:
> cpu0: Intel (686-class), 3000.22 MHz, id 0xf4a
> cpu0: features bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
> cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
> cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
> cpu0: features2 641d<SSE3,MONITOR,DS-CPL,CID,xTPR>
> cpu0: features3 20100000<EM64T>
> cpu0: "Intel(R) Xeon(TM) CPU 3.00GHz"
> cpu0: I-cache 12K uOp cache 8-way
> cpu0: L2 cache 2 MB 64B/line 8-way
> cpu0: ITLB 4K/4M: 64 entries
> cpu0: DTLB 4K/4M: 64 entries

Interesting... the CPUs in the box I'm having grief upgrading from 
NetBSD-3 to NetBSD-5/i386 look like this:

cpu0: Intel (686-class), 3000.35 MHz, id 0xf41
cpu0: features bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu0: features2 641d<SSE3,MONITOR,DS-CPL,CID,xTPR>
cpu0: features3 20100000<EM64T>
cpu0: "Intel(R) Xeon(TM) CPU 3.00GHz"
cpu0: I-cache 12K uOp cache 8-way
cpu0: L2 cache 1 MB 64B/line 8-way
cpu0: ITLB 4K/4M: 64 entries
cpu0: DTLB 4K/4M: 64 entries
cpu0: using thermal monitor 1
cpu0: calibrating local timer
cpu0: apic clock running at 200 MHz
cpu0: 32 page colors

They don't show up as hyperthreaded in 3.0, but do in 5.0.1.
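
For comparing the two dmesg outputs, here is a rough sketch of how the
"id" value and the HTT feature flag decode. It assumes "id" is the raw
CPUID leaf 1 EAX signature and that the first "features" word is the
leaf 1 EDX value; the names decode_sig(), has_htt() and
check_values_from_thread() are made up for illustration and are not
NetBSD functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded CPUID leaf 1 EAX signature (assumed to be dmesg's "id"). */
struct cpu_sig {
    unsigned family;    /* display family */
    unsigned model;     /* display model */
    unsigned stepping;
};

/* Hypothetical helper: split the signature into family/model/stepping. */
static struct cpu_sig decode_sig(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xf;
    unsigned base_model = (eax >> 4) & 0xf;
    unsigned ext_family = (eax >> 20) & 0xff;
    unsigned ext_model = (eax >> 16) & 0xf;
    struct cpu_sig s;

    s.stepping = eax & 0xf;
    s.family = base_family;
    if (base_family == 0xf)
        s.family += ext_family;     /* extended family only counts here */
    s.model = base_model;
    if (base_family == 0x6 || base_family == 0xf)
        s.model |= ext_model << 4;  /* extended model only counts here */
    return s;
}

/* Hypothetical helper: HTT is bit 28 of the CPUID leaf 1 EDX flags. */
static bool has_htt(uint32_t edx)
{
    return ((edx >> 28) & 1u) != 0;
}

/* Sanity check against the values quoted in this thread. */
static void check_values_from_thread(void)
{
    struct cpu_sig a = decode_sig(0xf4a);   /* Manuel's CPU */
    struct cpu_sig b = decode_sig(0xf41);   /* my CPU */

    assert(a.family == 15 && a.model == 4 && a.stepping == 10);
    assert(b.family == 15 && b.model == 4 && b.stepping == 1);
    assert(has_htt(0xbfebfbffu));           /* "features" word above */
}
```

If that decoding is right, both CPUs are family 15, model 4 and differ
in stepping (0xa vs 0x1), in addition to the L2 cache size. Note that
the HTT flag by itself only says the package can report more than one
logical processor, not that hyperthreading is actually enabled.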

> I'll summarize my debug session: from the symptoms I came to the conclusion
> that ci_ilevel was maybe not restored properly or corrupted.
> I added some checks to splraiseipl() and splx(), including in splx():
>                                 if ((int)x < 0 || (int)x >= NIPL) { \
>                                         printf("splx(%d)\n", (int)x); \
>                                         panic("splx()"); \
>                                 } \
> 
> This does fire quite fast after some activity (within minutes). x did have
> -1 in the instance where I did print x's value (in previous attempts this
> was just a KASSERT).
> splx() was always called from mutex_vector_exit() via
> MUTEX_SPIN_SPLRESTORE().
> Looking at the lock value from ddb, mtxs_ipl did have the right value.
> The other CPU was always in the process of acquiring a lock.
> To me it looks like a hardware bug in the bus-locked operations which
> causes adjacent values to appear corrupted to the other CPU, maybe for
> a short time. Another possibility is register corruption between the 2
> threads.
> 
> Both servers are stable with a kernel using only one CPU (but HT still enabled
> in BIOS).
> 
> Did someone else notice something similar, or have information about
> such a bug?
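
For anyone wanting to play with the same kind of check outside the
kernel, here is a minimal userland sketch of the range test quoted
above. It is only a model, not the kernel macro: SKETCH_NIPL is a
placeholder (the real NIPL is port-specific), and ipl_in_range() and
checked_splx() are hypothetical names. Where the quoted kernel check
panics, this version just reports the bad value and returns an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Placeholder for the kernel's NIPL; the real value is port-specific. */
#define SKETCH_NIPL 8

/* Same condition as the quoted check: valid levels are 0 .. NIPL-1. */
static bool ipl_in_range(int x)
{
    return x >= 0 && x < SKETCH_NIPL;
}

/*
 * Hypothetical stand-in for splx(): store the new level in *cur only
 * if it is in range. Returns 0 on success, -1 for a bogus level.
 */
static int checked_splx(int *cur, int x)
{
    if (!ipl_in_range(x)) {
        fprintf(stderr, "splx(%d)\n", x);
        return -1;
    }
    *cur = x;
    return 0;
}
```

With this model, a saved level of -1 (the value reported above) is
rejected and the current level is left untouched, while any level from
0 to SKETCH_NIPL - 1 is accepted.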

I'd like to think that whatever issue you're seeing is the same one I 
have... Would disabling hyperthreading help at all to provide a 
datapoint?  (After an uptime of 330+ days, having 3 hangs in a week 
isn't giving me warm, fuzzy feelings :-/ The machine in question 
is an IBM x336 that has been rock-stable under 3.0)

Later...

Greg Oster
