2.4.0-test6 network socket problems

From: J. Scott Kasten (jsk@tetracon-eng.net)
Date: Fri Oct 13 2000 - 15:26:46 EDT

  • Next message: Richard B. Johnson: "RE: large memory support for x86"

    I'm working with test6 on an embedded QED MIPS arch in big endian mode. I
    have run into some bizarre socket problems that appear to affect both udp
    and tcp transport. Applications actively using sockets (examples, ftp,
    tftp, others...) will unexpectedly stop receiving data on the socket, even
    though data is present. The process will be forever sleeping on the read
    even though data is queued up. To illustrate my point, I've dug deep into
    the udp code (net/ipv4/udp.c) and the datagram core (net/core/datagram.c)
    researching the simple tftp example.

    After much debugging, here is what I know:

    I have followed the packets in from the network driver all the way to
    udp_rcv() in udp.c. I see it do the sk lookup and drop it off in
    udp_queue_rcv_skb(). Everything is fine on that end.

    On the process end, I've been watching in the udp_recvmsg() function, also
    in udp.c. Under normal operation, I see it pick up the data from the
    correct skb and return. When the rare condition that causes failure
    occurs, skb_recv_datagram() returns a NULL and err is set to -ERESTARTSYS.
    It is only when the process gets hung on that socket that I see this
    happen. It never revisits this portion of the code again, however, until
    the sender stops transmitting data from ACK timeouts, I see the packets
    continue to pile up on the udp_rcv() side without incident.

    I further looked at datagram.c to see what the skb_recv_datagram() was
    doing. It was spinning through the do {} while() loop waiting for
    wait_for_packet() to hand it something. It is in that routine that the
    error code is generated. The signal_pending() function returns true
    and sock_intr_errno() returns the -ERESTARTSYS code, which gets passed
    back down the chain from here.

    The structure of the tftp code that I'm working with is such that it does
    a generic blocking read() on the socket file handle and uses an alarm to
    wake up when the critical timeout is reached. Not the most glorious code,
    but demonstrates a problem none-the-less. The read() never returns an
    EINTR or EAGAIN or anything. It's just hung. I'm assuming that the
    signal_pending() return comes from my alarm(), which means that the
    process had already been sitting on that socket for a while not seeing the
    data that is clearly already present. Thus, there may be two problems
    here, the signal not returning, and data trapped in the skb.

    I would appreciate it if anyone more familiar with this code could point
    me better to what I should be looking at, or at least explain what should
    be happening that isn't.

    TIA,

    -S-

    --
    

    J. Scott Kasten Email: jsk AT tetracon-eng DOT net

    "In most cases, all an argument proves is that two people were present......"

    - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/



    This archive was generated by hypermail 2b29 : Fri Oct 13 2000 - 15:28:34 EDT