Re: SMP kernel lockup, 2.2.14 and 2.2.15pre15

From: Patrick J. LoPresti (patl@cag.lcs.mit.edu)
Date: Fri Mar 31 2000 - 17:37:01 EST

    I have finally reproduced my lockup on 2.2.14 with the IKD patches.
    Here are the backtraces (sans arguments) for the two CPUs as reported
    by kdb.

    Backtrace for CPU 1:

      stext_lock + 0x5bb
      __wait_on_buffer + 0xd9
      sync_block + 0x9f
      sync_direct + 0x22
      ext2_sync_file + 0x4b
      sys_fsync + 0x85

    Backtrace for CPU 0:

      add_timer + 0x3a
      tcp_send_delayed_ack + 0x34
      tcp_delack_timer + 0x3a
      timer_bh + 0x37a
      do_bottom_half + 0x89
      do_IRQ + 0x52
      common_interrupt + 0x18
      do_no_page + 0x42
      handle_mm_fault + 0x107
      do_page_fault + 0x12d
      error_code + 0x2d
      memcpy_toiovec + 0x38
      tcp_recvmsg + 0x377
      inet_recvmsg + 0x72
      sock_recvmsg + 0x37
      sock_read + 0x82
      sys_read + 0xc8

    So one process is calling fsync() and another is calling read() on
    a TCP socket. It is not obvious to me why this is deadlocked.
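
    For anyone who wants to poke at the same two code paths from user
    space, the following is a rough sketch of that syscall pattern only:
    one thread appends to a file and calls fsync(), while another reads
    from a TCP socket. It is not the commercial program and makes no
    claim to reproduce the lockup. The names fsync_loop, tcp_read_loop
    and run_once, the loopback connection, and the buffer sizes are all
    assumptions made for illustration.

```python
import os
import socket
import tempfile
import threading


def fsync_loop(path, iterations):
    """Append a block to a file and fsync() it, repeatedly (the CPU 1 path)."""
    with open(path, "ab") as f:
        for _ in range(iterations):
            f.write(b"x" * 4096)
            f.flush()                 # push Python's buffer to the kernel
            os.fsync(f.fileno())      # then force the kernel to write it out


def tcp_read_loop(sock, nbytes):
    """Read from a TCP socket until nbytes have arrived (the CPU 0 path)."""
    received = 0
    while received < nbytes:
        chunk = sock.recv(65536)
        if not chunk:                 # peer closed the connection
            break
        received += len(chunk)
    return received


def run_once(iterations=100, nbytes=1 << 20):
    """Run both loops concurrently; return (bytes received, file size)."""
    # Hypothetical setup: a loopback TCP connection inside one process.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    listener.close()

    fd, path = tempfile.mkstemp()
    os.close(fd)
    result = {}

    def reader():
        result["received"] = tcp_read_loop(receiver, nbytes)

    threads = [
        threading.Thread(target=fsync_loop, args=(path, iterations)),
        threading.Thread(target=reader),
    ]
    for t in threads:
        t.start()
    sender.sendall(b"\0" * nbytes)    # feed the reader while fsync() runs
    sender.close()
    for t in threads:
        t.join()
    receiver.close()

    size = os.path.getsize(path)
    os.unlink(path)
    return result["received"], size
```

    Calling run_once() in a loop exercises both paths at once; whether
    that is enough to hit the same lock contention is an open question.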

    When I do "go" and then hit Pause again, CPU 1 is always stuck at
    exactly the same place. CPU 0 is also exactly the same except for the
    most recent 5 or 6 frames; it seems like I always catch it while
    handling the interrupt and attempting to send the delayed ack, which
    then sets itself up to fire again a little later.
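
    Comparing successive samples by eye gets tedious, so here is a small
    sketch that splits two backtraces into the outer frames they share
    and the inner frames that differ. It assumes each line has the
    "symbol + 0xoffset" form shown above; parse_backtrace and
    split_common are hypothetical helper names, not part of kdb.

```python
def parse_backtrace(text):
    """Parse lines of the form 'symbol + 0xNN' into (symbol, offset) pairs.

    The innermost (most recent) frame is expected first, as in the
    backtraces quoted in this message.
    """
    frames = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        symbol, _, offset = line.partition(" + ")
        frames.append((symbol, int(offset, 16)))
    return frames


def split_common(sample_a, sample_b):
    """Return (inner_a, inner_b, common_outer) for two parsed backtraces.

    Frames are compared starting from the outermost one, so common_outer
    is the part of the stack that did not change between the samples.
    """
    common = 0
    for frame_a, frame_b in zip(reversed(sample_a), reversed(sample_b)):
        if frame_a != frame_b:
            break
        common += 1
    cut_a = len(sample_a) - common
    cut_b = len(sample_b) - common
    return sample_a[:cut_a], sample_b[:cut_b], sample_a[cut_a:]
```

    Feeding it the CPU 0 trace from each pause should leave sys_read
    through do_no_page in the common part, with only the interrupt and
    timer frames varying.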

    Note that "do_no_page + 0x42" is the instruction immediately following
    a call to do_anonymous_page. I suspect do_anonymous_page is where I
    am stuck, and the backtrace is being confused by the presence of the
    interrupt. But I am not sure.

    I am hoping a wizard can just look at these backtraces and see the
    problem. Failing that, I would appreciate ideas for what to try next.

    This crash is not easy to reproduce; this time it took almost a week
    of continuously running the offending operations. The program which
    elicits the crash is (unfortunately) commercial, so I do not have the
    source. It runs entirely as a regular user, however, so this is
    definitely a kernel bug.

    I would be glad to provide any additional information (e.g., snippets
    of disassembly) which would be useful.

    Help, please?

     - Pat
