fork/wait race in 2.4.0-pre?

From: Adam J. Richter (adam@yggdrasil.com)
Date: Sat Dec 23 2000 - 10:32:17 EST


            I reported this problem a few months ago in bug-glibc and
    did not get any response, although that is not unexpected since it is
    unclear where the problem is. So that bug report and this report
    will probably serve just to chronicle the problem in case anybody
    sees something similar.

            Anyhow, the problem is that somehow fork or vfork (makes no
    difference) will return an apparently valid pid and then the child
    process will disappear. Calling wait or waitpid will return errno 10
    (ECHILD, "no child process"), and will continue to return errno 10
    if wait or waitpid is called again. I got lucky with some strategically
    placed printf's at a point where this problem sometimes appears and
    was able to determine that, at least when wait() is called, the
    signal handler for SIGCLD (17) is SIG_IGN (1), so it seems less
    likely that some userland facility is reaping the process, especially
    since one of the places where this problem occurs is a very simple
    program that does little more than fork and wait.

            This usually happens during the "configure" phase of our
    build process, which is right after about 2.5GB of sources
    have been extracted from CVS to a directory tree, so there may
    be some IO congestion that could lead to unusual timing relationships,
    leading to unusual results from race conditions. Also, the problem
    started occurring occasionally when the machine in question got
    an 866MHz CPU, and started occurring more often when it got a 1GHz
    CPU. So, more instructions per time slice seem to be a relevant
    factor.

            Anyhow, I know this is a very slippery bug and it may
    be months before it is tracked down either here or elsewhere, but
    I thought it would be helpful to at least document it for the
    linux-kernel archives.

    Adam J. Richter __ ______________ 4880 Stevens Creek Blvd, Suite 104
    adam@yggdrasil.com \ / San Jose, California 95129-1034
    +1 408 261-6630 | g g d r a s i l United States of America
    fax +1 408 261-6631 "Free Software For The Rest Of Us."