Current-Users archive


re: 10.99.9 amd64 panic



Martin Husemann writes:
> On Fri, Sep 29, 2023 at 09:52:42AM +0000, Chavdar Ivanov wrote:
> > Sep 29 01:53:13 ymir /netbsd: [ 228407.9443196] panic: kernel diagnostic assertion "offset < map->dm_mapsize" failed: file "/home/sysbuild/src/sys/arch/x86/x86/bus_dma.c", line 826 bad offset 0x0 >= 0x0
> [..]
> > Sep 29 01:53:13 ymir /netbsd: [ 228407.9543802] bus_dmamap_sync() at netbsd:bus_dmamap_sync+0x326
> > Sep 29 01:53:13 ymir /netbsd: [ 228407.9543802] rge_rxeof() at netbsd:rge_rxeof+0x179
>
> This is a bug in the rge(4) driver (unrelated to userland resource usage
> by the build), maybe a race triggered more easily when the system is
> under heavy load.

hmm, this seems like corruption to me.

> bus_dma.c", line 826 bad offset 0x0 >= 0x0

says that offset == 0 (which is right, this seems to be this call):

1241   /* Invalidate the RX mbuf and unload its map. */
1242   bus_dmamap_sync(sc->sc_dmat, rxq->rxq_dmamap, 0,
1243       rxq->rxq_dmamap->dm_mapsize, BUS_DMASYNC_POSTREAD);

offset is the 0 / 3rd arg here, but the *second* 0x0 value (the
map's dm_mapsize) seems to be corrupted, and shouldn't be zero.
ie, there's no case where the driver creates a zero-length dma
map; it should always be either RGE_TX_LIST_SZ, RGE_RX_LIST_SZ,
or RGE_JUMBO_FRAMELEN.  so for this assert to trigger, saying the
passed offset is beyond the mapping because the mapping is zero
length, makes it pretty clear that the bus_dmamap_t has been
corrupted.
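
to make the failed check concrete, here is a minimal userland sketch
of the condition from the panic message.  struct fake_dmamap and
sync_offset_ok() are hypothetical names for illustration only; the
real check is the KASSERT in bus_dma.c, and this is not the kernel's
code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one bus_dmamap_t field involved here. */
struct fake_dmamap {
	size_t dm_mapsize;	/* total length of the mapping, in bytes */
};

/*
 * Mirrors the asserted condition "offset < map->dm_mapsize".
 * Returns true when the assertion would hold, false when it would fire.
 */
static bool
sync_offset_ok(const struct fake_dmamap *map, size_t offset)
{
	return offset < map->dm_mapsize;
}
```

with dm_mapsize == 0 the check is false for every offset, including
0, which matches the "bad offset 0x0 >= 0x0" text in the panic.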

the timing does seem to indicate that an out-of-memory problem
may be relevant here.. oh, i think i may see a problem.

1110 rge_newbuf(struct rge_softc *sc, int idx)
...
1126         if (bus_dmamap_load_mbuf(sc->sc_dmat, rxmap, m, BUS_DMA_NOWAIT))
1127                 goto out;  
...
1151 out:            
1152         if (m != NULL)
1153                 m_freem(m);
1154         return (ENOMEM);

so, if bus_dmamap_load_mbuf() fails, we return ENOMEM, not
ENOBUFS.  however, the callers only consider ENOBUFS as an
error case:

1176 rge_rx_list_init(struct rge_softc *sc)
...
1184                 if (rge_newbuf(sc, i) == ENOBUFS)
1185                         return (ENOBUFS);

and

1212 rge_rxeof(struct rge_softc *sc)
...
1271                 if (rge_newbuf(sc, i) == ENOBUFS) {

so in this case, the code thinks a buffer was allocated, but it
wasn't... i haven't dug deeper into what this may cause the
code to do wrong yet, but it seems problematic.

certainly, both callers should check for != 0, not == ENOBUFS,
to avoid this problem.
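
as a rough illustration of the difference between the two checks,
here is a small userland sketch.  caller_fails_old() and
caller_fails_new() are hypothetical helpers, not driver functions;
they only model how a caller interprets the value returned by
rge_newbuf().  this is not a tested patch.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Models the existing caller check: only ENOBUFS counts as failure,
 * so an ENOMEM return is silently treated as success.
 */
static bool
caller_fails_old(int error)
{
	return error == ENOBUFS;
}

/*
 * Models the suggested check: any non-zero return counts as failure.
 */
static bool
caller_fails_new(int error)
{
	return error != 0;
}
```

feeding ENOMEM (what the "out:" path returns) to the old check
reports no failure, while the new check catches it; both agree for
ENOBUFS and for 0.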


.mrg.

