port-sparc: Re: -current on Ultra 5+ - now it's major siop0 lossage

Subject: Re: -current on Ultra 5+ - now it's major siop0 lossage
To: eeh@netbsd.org
From: Greg Earle <earle@isolar.DynDNS.ORG>
List: port-sparc
Date: 01/25/2001 16:43:14
>> First I tried the January 14th snapshot kernel.  I still get the rash of
>>  
>>	hme0: invalid packet size 2048; dropping
>> 
>> errors as described previously, when using the Ethernet.  And I started
>> seeing these:
>> 
>> Jan 25 05:05:21 netbsd4me /netbsd: DMA IRQ: bus fault dma fifo empty, DSP=0xc0069fec DSA=0xc006df00: last msg_in=0x0 status=0xff
>> Jan 25 05:05:21 netbsd4me /netbsd: siop0: scsi bus reset
>> Jan 25 05:05:21 netbsd4me /netbsd: cmd 0x1a3ed00 (target 0:0) in reset list
>> Jan 25 05:05:21 netbsd4me /netbsd: cmd 0x1a3e888 (target 0:0) in reset list
>> Jan 25 05:05:21 netbsd4me /netbsd: cmd 0x1a3ed00 (status 2) about to be processed
>> Jan 25 05:05:21 netbsd4me /netbsd: cmd 0x1a3e888 (status 2) about to be processed
>> Jan 25 05:05:21 netbsd4me /netbsd: siop0: target 0 now synchronous at 20.0Mhz, offset 16
> 
>> every once in a while, on disk transfers over the SCSI bus.  (Actually, I
>> checked my messages logs, and I've gotten these ever since using the vanilla
>> 1.5 GENERIC kernel, but back then it only happened once or twice per
>> reboot cycle.)
> 
> I seem to recall having seen the 2048 packet size problem
> before and that it was related to interrupt latency problems,
> but I may be mistaken.  Paul Kranenburg did the HME driver, 
> so the best thing to do would be to send him email on the subject.

OK, I'm copying Paul.

In the meantime I've gone back to 1.5 and built a new 1.5 kernel with the
"#define HMEDEBUG" in hme.c turned off and with _HME_NDESC changed from 32 to
128.  Even if I'm still dropping lots of supposedly-2048-byte packets, the
"hme0: invalid packet size 2048; dropping" message is gone, and I'm still
getting almost 900 Kbytes/sec transfers from ftp.NetBSD.ORG.

> As far as the siop problems are concerned, they are new to
> me.  Manuel has added tag queuing to that driver recently,
> so you should make certain he's aware of the problem.

OK, I'm copying Manuel too  :-)

I'm pretty sure I've only seen these siop0/SCSI bus DMA errors (see above) when
booted under a post-1.5 kernel.  Was that when the tagged queueing support was
added?  Is there a way to turn it off?  I'm willing to try building another
-current kernel if I can figure out how.  I'd like to go to -current, but I
need stability first and foremost on this system, and right now, with 1.5 and
my two hme.c changes, I have a sort of stable-but-delicate equilibrium.

Can one put in a kernel config line with

	sd0 at scsibus? target 0 lun 0 flags 0xNNNNNN	# SCSI disks

for systems with the Symbios Logic 53c875-based PCI SCSI boards?

If so, what's the value for 0xNNNNNN?  0x500000?

> Having said all that, this could be an issue with the PCI
> controller.  What sort of machine is this?  So far we have
> mostly been dealing with machines that have UltraSPARC IIi
> processors with the on-board PCI controller.  If you have 
> a machine with an UltraSPARC II and a psycho or psycho+, 
> (first congratulations for having gotten it to boot) 
> you probably have issues with the PCI drivers and the
> IOMMU's streaming buffer cache.  Try disabling it.

I have a plain vanilla 333 MHz Ultra 5+.  Should I "try disabling" the IOMMU's
streaming buffer cache anyway (not that I would know how to do that ... )?

Here are the relevant boot messages:

NetBSD 1.5 (NETBSD4ME) #0: Thu Jan 25 06:51:41 PST 2001
    root@netbsd4me:/usr/src/1.5/sys/arch/sparc64/compile/NETBSD4ME
total memory = 128 MB
avail memory = 86880 KB
using 4075 buffers containing 32600 KB of memory  [Why so much for bufcache??]
bootpath: /pci@1f,0/pci@1,0/scsi@1,0/disk@0,0
mainbus0 (root): SUNW,Ultra-5_10
cpu0 at mainbus0: SUNW,UltraSPARC-IIi @ 333 MHz, version 0 FPU
cpu0: physical 4K instruction (32 b/l), 4K data (32 b/l), 2048K external (64 b/l)
psycho0 at mainbus0 addr 0xfffc4000
sabre: bus range 0 to 2; simba b, PCI bus 1; simba a, PCI bus 2
DVMA map: c0002000 to ffffe000
pci0 at psycho0
pci0: i/o space, memory space enabled
Sun Microsystems product 0x2000 (miscellaneous prehistoric) at pci0 dev 0 function 0 not configured
simba0 at pci0 dev 1 function 0: Sun Microsystems Simba PCI bridge (rev. 0x13)
pci1 at simba0 bus 2
pci1: i/o space, memory space enabled
siop0 at pci1 dev 1 function 0: Symbios Logic 53c875 (ultra-wide scsi)
siop0: using on-board RAM
siop0: interrupting at vector 16
scsibus0 at siop0: 16 targets, 8 luns per target
[...]
hme0 at pci2 dev 1 function 1: address 08:00:20:ac:52:d4
nsphy0 at hme0 phy 1: DP83840 10/100 media interface, rev. 1
nsphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
hme0: using vector 33 for interrupt
[...]
scsibus0: waiting 2 seconds for devices to settle...
sd0 at scsibus0 target 0 lun 0: <QUANTUM, XP39100S, LYK8> SCSI2 0/direct fixed
siop0: target 0 now synchronous at 20.0Mhz, offset 16
sd0: 8682 MB, 5899 cyl, 20 head, 150 sec, 512 bytes/sect x 17781520 sectors
root on sd0a dumps on sd0b
[...]

Oh.  While I'm here, two other (unrelated) things:

[...]
Starting mountd.
Jan 25 16:15:37 netbsd4me /netbsd: Non-unique normal route, mask not entered
Setting securelevel: kern.securelevel: 0 -> 1
[...]
Updating motd.
Alignment error: dsfsr=00000000:00800001 dsfar=ffffffff:ffff98c4 \
isfsr=00000000:00000000 pc=0x10c340
Starting lpd.
[...]

I don't see anything in "netstat -r" to tell me what the cause of that
"Non-unique normal route" message is.

And what is the "Alignment error:" from?

Thanks, and sorry for asking so many questions.

	- Greg