[PATCH] Fast csum_partial_copy_generic and more

From: kumon@flab.fujitsu.co.jp
Date: Fri May 19 2000 - 09:20:27 EDT

  • Next message: Alan Cox: "Re: suid GUI apps"

    Hi,

    Here is a patch to speedup csum_partial_copy_generic() on i686 SMP
    upto 50%, and also I added some analysis of further optimization from
    the view point of SMP cache behavior.

    [Patch Summary]
    Attatched patch optimizes csum_partial_copy_generic().

    Measurement using the WEB-BENCH, the consumption time at the function
    is reduced by 33%, so 1.5 times faster than the original.

    [Background of csum_partial_copy_generic]
     In the funcion of i686 code, long-word transfer are unfolded 16
    times, one loop copies exactly 64 bytes, except the beginning and the
    final fraction processing.
     From the observation of the csum_partial_copy_generic behavior, it
    produces lots of cache misses, and this is the main reason of
    slowness.

    [How]
     To accelerate the function, use dummy read as pre-fetch. Of course
    pre-fetching must be done only within the accessible area. And the
    top word of a cache block should be prefetched, because the CPU gets
    the requested word first when the cache is miss-hit, the first word is
    always needed earlier than the other word in the block.

    To keep track to the block top word, a new pointer is added, which
    always points the first word of the cache block which contains the
    63th byte of a source block. This pointer never points outside of the
    source region, so it is safe to read. This patch doesn't make sense
    if the CPU is not a super-scalar type execution.

    Strictly speaking, this prefetch may read just after source regionn at
    most 3 byte. But it never causes trouble, because this excessive area
    and the last transfered byte reside in a same cache block.

    [Benchmarking Result]

    We used Web-bench as a workload, and measure three versions of
    csum_partial_copy_generic.

    Version synopsis,
    2.3.99-pre8-base: the original kernel. but SLAB_POISONING is disabled
            to compare performance to older kernels. but the older data
            is not attached.
    2.3.99-pre8-AS: Add "Artur Skawina <skawina@geocities.com>" patch.
            This pache is also attached at the end of mail.
    2.3.99-pre8-Pf: Add my pre-fetch patch.
            This patch also included in this mail.

    We obtained the following profile.

    The number is average consumption time (unit is us) for one
    web-transaction processing.

    Machine is 4 SMP Xeon 450MHz 2MB with 2GB, w/o HIMEM,
    so actually only1 GB is recognized.

    2.3.99-pre8-Pf
            2.3.99-pre8-AS
                    2.3.99-pre8-base
    990.3 1023.8 1019.3 TOTAL(OS)
     64.9 58.2 54.4 default_idle
     24.8 21.8 20.8 cpu_idle

     63.4 94.8 98.3 csum_partial_copy_generic
     84.2 82.8 82.2 stext_lock
     76.0 76.8 77.7 boomerang_interrupt
     36.6 36.5 36.9 boomerang_rx
     33.4 33.6 34.3 boomerang_start_xmit
     28.6 29.0 29.4 schedule
     22.4 24.2 24.1 kmalloc
     21.5 23.5 22.0 kfree
     19.5 20.2 19.9 tcp_v4_rcv
     18.0 18.4 18.3 wait_for_completion
     17.9 18.3 17.9 __kfree_skb
     16.7 16.5 15.7 __wake_up
     11.0 11.3 10.8 do_IRQ
            rest dropped

    Csum_partial_copy_generic becomes 98.3us->63.4us by using Pf patch, so
    the patch gained 1.5 times speedup.

    Unfortunately, AS version does not show a significant gain. If the
    cache is hit,it may show some advantage. But unfortunately, in the
    current execution environment, the patch is difficult to hide
    cache-miss latency.

    By using the user-land benchmark, the new patch also reduce time even
    when the source operand is aligned at the cache boundary.

    By applying the patch, stext_lock re-appear to the top of time
    consumption race. Last time stext_lock lost its position by the
    poll() kernel-lock avoidance.

    **
     The following shows the break down of stext_lock at tahe prefetched
    version. Do_close() is the current worst spin-lock waiter. Do_close()
    locks the lock relatively longer time. It tighten the bottleneck by
    itself.

            us from where lockvar
            63.4 TOTAL
    --------------------------------
            12.1 do_close+144 0xc025b080<-kernel_lock
            8.71 sys_fcntl+142 0xc025b080
            7.65 schedule+1684 0xc025b080
            6.27 sys_open+72 0xc025b080
            6.07 _fput+27 0xc025b080
            5.97 old_mmap+301 0xc025b080
            5.53 sys_newstat+30 0xc025b080
            3.42 tcp_v4_rcv+610 0x2c(%ebx)
            2.95 boomerang_start_xmit+239 0x188(%ebx)
            1.09 tcp_accept+38 0x2c(%esi)
            0.65 wait_for_connect+621 0x2c(%ebp)

    At this point, there are some different approaches to reduce the
    kernel overhead.

    1. Shorten the kernel-lock region in do_close() or others.
    2. Reduce cache misses in csum_partial_copy_generic().
    3. Use other kind of NIC, because boomerang_interrupt() do too many
       in/out's those add heavy overhead.
       As we've measured eepro100, speedo_interrupt and related functions
       need only a half execution time of boomerang_interrupt and others.

    I think, the second point, miss-reduction, is very important.
    At first, a user program (apache) provides the network data, then if
    csum_partial_copy_generic runs on the same CPU, cache shoud be hit.
    Another measurement shows massive 2nd-cache miss-hit occurs at
    csum_partial_copy_generic for data load, but why?
    The time reduction by the pre-fetching also supports the occurence of
    miss-hit.

    I assume, the data-provider: caller of write() or send(), and the
    data-consumer: csum_partial_copy_generic() are not running on the same
    CPUs. If it is true, we can reduce overhead by forcing it on the same
    CPU, but I have no idea, how??

    The rest is PATCH.

    ------------------
    Prefetch version. (Pf)
    ------------------

    diff -rc linux-2.3.99-pre8/arch/i386/lib/checksum.S linux-2.3.99-pre8-Pf/arch/i386/lib/checksum.S
    *** linux-2.3.99-pre8/arch/i386/lib/checksum.S Thu Mar 23 07:23:54 2000
    --- linux-2.3.99-pre8-Pf/arch/i386/lib/checksum.S Mon May 15 22:45:41 2000
    ***************
    *** 394,419 ****
              movl ARGBASE+8(%esp),%edi #dst
              movl ARGBASE+12(%esp),%ecx #len
              movl ARGBASE+16(%esp),%eax #sum
    ! movl %ecx, %edx
              movl %ecx, %ebx
              shrl $6, %ecx
              andl $0x3c, %ebx
              negl %ebx
              subl %ebx, %esi
              subl %ebx, %edi
              lea 3f(%ebx,%ebx), %ebx
              testl %esi, %esi
              jmp *%ebx
      1: addl $64,%esi
              addl $64,%edi
              ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
              ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
              ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
              ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
      3: adcl $0,%eax
              dec %ecx
              jge 1b
    ! 4: andl $3, %edx
              jz 7f
              cmpl $2, %edx
              jb 5f
    --- 394,425 ----
              movl ARGBASE+8(%esp),%edi #dst
              movl ARGBASE+12(%esp),%ecx #len
              movl ARGBASE+16(%esp),%eax #sum
    ! # movl %ecx, %edx
              movl %ecx, %ebx
    + movl %esi, %edx
              shrl $6, %ecx
              andl $0x3c, %ebx
              negl %ebx
              subl %ebx, %esi
              subl %ebx, %edi
    + lea -1(%esi),%edx
    + andl $-32,%edx
              lea 3f(%ebx,%ebx), %ebx
              testl %esi, %esi
              jmp *%ebx
      1: addl $64,%esi
              addl $64,%edi
    + SRC(movb -32(%edx),%bl) ; SRC(movb (%edx),%bl)
              ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
              ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
              ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
              ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
      3: adcl $0,%eax
    + addl $64, %edx
              dec %ecx
              jge 1b
    ! 4: movl ARGBASE+12(%esp),%edx #len
    ! andl $3, %edx
              jz 7f
              cmpl $2, %edx
              jb 5f

    ------------------
    Artur Skawina patch. (AS)
    ------------------
    diff -urNp /img/linux-2.3.99pre6pre5/arch/i386/lib/checksum.S linux-2.3.99pre6pre5as/arch/i386/lib/checksum.S
    --- /img/linux-2.3.99pre6pre5/arch/i386/lib/checksum.S Wed Mar 29 20:53:25 2000
    +++ linux-2.3.99pre6pre5as/arch/i386/lib/checksum.S Sat Apr 22 10:43:28 2000
    @@ -374,81 +373,119 @@ DST( movb %cl, (%edi) )
     
     /* Version for PentiumII/PPro */
     
    +/*
    + This is
    + o 70% slower when the source is not 32 bit aligned [ie (long)src&3]
    + o 190% slower when the destination is not 32 bit aligned
    + o 260% slower when both source and destination are not 32 bit aligned
    + o 175% slower when destination is not 64 bit aligned and source _is_ [ie (long)dst&4]
    + o whether source is 64 bit aligned or not does not seem to make much difference
    + */
    +
     #define ROUND1(x) \
    - SRC(movl x(%esi), %ebx ) ; \
    - addl %ebx, %eax ; \
    - DST(movl %ebx, x(%edi) ) ;
    + SRC(movl x(%esi), %edx ) ;\
    + addl %edx, %eax ;\
    + SRC(movl x+4(%esi), %ebx ) ;\
    + DST(movl %edx, x(%edi) ) ;\
    + adcl %ebx, %eax ;\
    + DST(movl %ebx, x+4(%edi) ) ;\
     
     #define ROUND(x) \
    - SRC(movl x(%esi), %ebx ) ; \
    - adcl %ebx, %eax ; \
    - DST(movl %ebx, x(%edi) ) ;
    + SRC(movl x(%esi), %edx ) ;\
    + adcl %edx, %eax ;\
    + SRC(movl x+4(%esi), %ebx ) ;\
    + DST(movl %edx, x(%edi) ) ;\
    + adcl %ebx, %eax ;\
    + DST(movl %ebx, x+4(%edi) ) ;\
    +
    +#define ROUNDL(x) \
    + SRC(movl x(%esi), %edx ) ;\
    + adcl %edx, %eax ;\
    + SRC(movl x+4(%esi), %ebx ) ;\
    + adcl %ebx, %eax ;\
    + DST(movl %edx, x(%edi) ) ;\
    + DST(movl %ebx, x+4(%edi) ) ;\
     
     #define ARGBASE 12
                     
     csum_partial_copy_generic:
             pushl %ebx
    - pushl %edi
    + movl ARGBASE+12-4*2(%esp),%ebx #len
             pushl %esi
    - movl ARGBASE+4(%esp),%esi #src
    - movl ARGBASE+8(%esp),%edi #dst
    - movl ARGBASE+12(%esp),%ecx #len
    - movl ARGBASE+16(%esp),%eax #sum
    - movl %ecx, %edx
    - movl %ecx, %ebx
    - shrl $6, %ecx
    - andl $0x3c, %ebx
    + movl ARGBASE+4-4*1(%esp),%esi #src
    + movl %ebx, %ecx
    + pushl %edi
    + movl ARGBASE+8-4*0(%esp),%edi #dst
    + andl $0x38, %ebx
    + addl %ebx, %esi
    + shrl $6, %ecx # len /= 64 (number of longwords per iteration)
    + addl %ebx, %edi
             negl %ebx
    - subl %ebx, %esi
    - subl %ebx, %edi
    + movl ARGBASE+16-4*0(%esp),%eax #sum
             lea 3f(%ebx,%ebx), %ebx
    - testl %esi, %esi
    + testl %eax,%eax # CF=0
             jmp *%ebx
    -1: addl $64,%esi
    +1:
    + addl $64,%esi
             addl $64,%edi
    - ROUND1(-64) ROUND(-60) ROUND(-56) ROUND(-52)
    - ROUND (-48) ROUND(-44) ROUND(-40) ROUND(-36)
    - ROUND (-32) ROUND(-28) ROUND(-24) ROUND(-20)
    - ROUND (-16) ROUND(-12) ROUND(-8) ROUND(-4)
    -3: adcl $0,%eax
    + ROUND1(-64) ROUND (-56)
    + ROUND (-48) ROUND (-40)
    + ROUND (-32) ROUND (-24)
    + ROUND (-16) ROUNDL(-8)
    +3:
    + adcl $0,%eax
             dec %ecx
             jge 1b
    -4: andl $3, %edx
    +
    + movl ARGBASE+12(%esp),%edx #len
    +
    + testl $4,%edx
    + jz 4f
    + SRC(movl (%esi), %ebx )
    + addl %ebx, %eax
    + DST(movl %ebx, (%edi) )
    + leal 4(%esi), %esi
    + leal 4(%edi), %edi
    + adcl $0, %eax
    +4:
    + andl $3, %edx
             jz 7f
             cmpl $2, %edx
             jb 5f
     SRC( movw (%esi), %dx )
    - leal 2(%esi), %esi
     DST( movw %dx, (%edi) )
    - leal 2(%edi), %edi
             je 6f
    + leal 2(%esi), %esi
             shll $16,%edx
    + leal 2(%edi), %edi
     5:
     SRC( movb (%esi), %dl )
     DST( movb %dl, (%edi) )
    -6: addl %edx, %eax
    +6:
    + addl %edx, %eax
             adcl $0, %eax
     7:
     .section .fixup, "ax"
    -6001: movl ARGBASE+20(%esp), %ebx # src_err_ptr
    - movl $-EFAULT, (%ebx)
    +6001:
             # zero the complete destination (computing the rest is too much work)
             movl ARGBASE+8(%esp),%edi # dst
             movl ARGBASE+12(%esp),%ecx # len
    + movl ARGBASE+20(%esp), %ebx # src_err_ptr
             xorl %eax,%eax
    + movl $-EFAULT, (%ebx)
             rep; stosb
    - jmp 7b
    + jmp 7b
     6002: movl ARGBASE+24(%esp), %ebx # dst_err_ptr
             movl $-EFAULT, (%ebx)
             jmp 7b
     .previous
     
    - popl %esi
             popl %edi
    + popl %esi
             popl %ebx
             ret
                                     
     #undef ROUND
     #undef ROUND1
    -
    +
     #endif

    --
    Computer Systems Laboratory, Fujitsu Labs.
    kumon@flab.fujitsu.co.jp
    

    - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/



    This archive was generated by hypermail 2b29 : Fri May 19 2000 - 09:25:35 EDT