Re: 3com 3c905c-txm

From: Donald Becker (becker@scyld.com)
Date: Sat May 13 2000 - 00:17:29 EDT

    On Sat, 13 May 2000, Andrew Morton wrote:
    > Donald Becker wrote:
    > >
    > > [ big snip. I'll follow up on linux-vortex ]
    > >
    > > 4000 PCI cycles is a *very* long time. Ages. It should only take 0 to 2
    > > PCI cycles to queue a packet.

    To clarify: the "0 to 2 PCI cycles" is the count for ideal hardware, not
    necessarily what we have to work with now. The best case is either
       Just queuing the descriptor, knowing that the chip will be looking at
       the descriptor sometime soon. This is possible if you check that
         1) the previously queued packet has not yet been sent, and
         2) the transmit queue has at least two packets waiting.
    or
       A single PCI memory write that wakes the Tx unit up. PCI memory writes
       are very efficient on some systems, since they are queued in a write
       buffer while the CPU continues to work. I/O space writes usually have
       the semantics that they must complete before the processor can do more
       work. That might be over a microsecond, or about a thousand instructions
       on a fast machine.

       Having to read, either I/O or memory, is always expensive.
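
    To put rough numbers on the costs above, here is a back-of-the-envelope
    calculation in Python. It assumes a conventional 33 MHz PCI bus and a
    hypothetical 1 GHz CPU retiring about one instruction per cycle. Neither
    figure is stated in this thread, so treat the results as order-of-magnitude
    estimates only.

```python
# Back-of-the-envelope cost arithmetic. Both constants are assumptions,
# not figures taken from the thread.
PCI_CLOCK_HZ = 33_000_000      # conventional 33 MHz PCI bus
CPU_CLOCK_HZ = 1_000_000_000   # hypothetical "fast machine", ~1 instruction/cycle


def pci_cycles_to_us(cycles):
    """Convert a count of PCI bus cycles to microseconds."""
    return cycles / PCI_CLOCK_HZ * 1e6


def instructions_during(us):
    """Instructions the CPU could have retired during a stall of `us` microseconds."""
    return us * 1e-6 * CPU_CLOCK_HZ


if __name__ == "__main__":
    print(f"4000 PCI cycles = {pci_cycles_to_us(4000):.0f} us")
    print(f"1 us stall = {instructions_during(1.0):.0f} instructions")
```

    Under these assumptions 4000 PCI cycles is on the order of a hundred
    microseconds, which is why it counts as "ages" next to a posted write.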

    > Alas I seem to have lost the ability to reproduce[*].

    You imagined the whole thing.
    Quit eating those mushrooms.

    I know all about not being able to reproduce problems. Some versions of the
    eepro100 chip have a bug where they switch into "broken mode". The hardware
    and driver will work fine for weeks, then something will go wrong
    (presumably with the internal firmware). Despite resetting everything, the
    chip will stop again after sending just a few packets.

    The problem for me is when someone encounters this, makes a driver change,
    and their modified driver works for a week without a problem. They proclaim
    that their new driver is much more reliable, and that they have fixed The
    Bug.

    Usually they haven't fixed anything, or have even introduced bugs, but to them
    all evidence points to a successful fix. When I say "that's not a fix",
    they bypass me and submit a patch to Linus. Linus, not knowing the whole
    story, puts the patch in. After all, here is someone Doing Something about
    The Problem, as opposed to Donald, who is trying to keep everything a
    secret over on the mailing lists. (I'm trying to minimize what he has to
    deal with, and trying to minimize change points in the mainline kernel.)

    The bottom line is that for well-established code you should establish
    what the actual bug is. That means being able to reproduce it at will, and
    having a good explanation of how it is occurring. Ideally you should
    measure or directly demonstrate what is happening.

    There are things that mask bugs, but don't fix them. Putting in locks or
    randomly reordering the code frequently has this effect. Locks, especially,
    slow the code down and can reduce the symptom frequency without removing the
    true problem.

    > of three Linux boxes and one NT, the best I can get is 240 loops, in the
    > DownStall in boomerang_start_xmit(). 3c905B. Still much higher than
    > we expect.

    Quick test: histogram of the loop count.
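
    A minimal sketch of such a histogram, written as an offline Python script
    rather than driver code. It assumes the loop count from the DownStall wait
    has already been collected somehow (for example, logged once per transmit);
    that collection step, the function names, and the bucket width of 32 are all
    hypothetical choices for illustration.

```python
from collections import Counter


def loop_histogram(counts, bucket=32):
    """Group loop counts into buckets of width `bucket`; return {bucket_start: n}."""
    hist = Counter((c // bucket) * bucket for c in counts)
    return dict(sorted(hist.items()))


def format_histogram(hist, bucket=32, width=40):
    """Render the histogram as text bars scaled to the largest bucket."""
    if not hist:
        return ""
    peak = max(hist.values())
    lines = []
    for start, n in hist.items():
        # Every non-empty bucket gets at least one mark so outliers stay visible.
        bar = "#" * max(1, n * width // peak)
        lines.append(f"{start:5d}-{start + bucket - 1:<5d} {n:6d} {bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    samples = [0, 1, 0, 2, 3, 0, 1, 240, 0, 2, 35, 1, 0, 238]
    print(format_histogram(loop_histogram(samples)))
```

    The shape is what matters: a single cluster near zero with a few distant
    outliers would suggest something different from a distribution that is
    uniformly high.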

    Donald Becker becker@scyld.com
    Scyld Computing Corporation
    410 Severn Ave. Suite 210
    Annapolis MD 21403
