Re: Code optimization <LEA Instruction>

From: Jamie Lokier (lkd@tantalophile.demon.co.uk)
Date: Fri Jan 28 2000 - 19:36:59 EST

  • Next message: Zach Brown: "Re: power managing maestro.c 0.14 available."

    Richard B. Johnson wrote:
    > This is from GCC version 2.7.2.3, the exact utility that
    > ../linux/Documentation/Changes requires.

    From another recent thread, we know that Linux 2.3.recent spits out a
    few warnings about cache-aligned structures with GCC 2.7.2.3.

    So I guess none of the people that write that code use GCC 2.7.2.3 :-)

    > In most cases LEA is used where an offset is already known at compile-
    > time. Note the stack operations using LEA and that the "pop" instruction
    > uses the stack-pointer that has just been changed in value using LEA. This
    > is a 3-clock penality.
    >
    > 333: 8d 65 e0 lea 0xffffffe0(%ebp),%esp
    > 336: 5b pop %ebx

    The AGI occurs when address calculation is dependent on the prior
    instruction, in the pipeline. There is no AGI in this example.

    In fact AGIs are precisely because address calculation is done earlier
    than ALU arithmetic, so in this example the use of "lea" actually helps!

    (Aside: %esp is treated specially anyway, so additional rules apply
    where "push" and "pop" are concerned).

    Quite apart from all that, the alternative is three instructions long and
    two bytes longer:

       0: 89 e5 mov %esp,%ebp
       2: 83 c5 e0 add $0xffffffe0,%ebp
       5: 5b pop %ebx

    Everything tells me "lea" is better than "add" here...

    > 667: 8d 65 e0 lea 0xffffffe0(%ebp),%esp
    > 66a: 5b pop %ebx

    etc., etc. all these are smaller and faster with "lea". We win :-)

    > 00000a08 <schedule>:
    > d78: 8d 1c 9d 0d 00 00 00 lea 0xd(,%ebx,4),%ebx
    > d7f: 8d 0c dd 00 00 00 00 lea 0x0(,%ebx,8),%ecx

    These are multiply-accumulate operations. GCC decided that's the
    fastest way to say %ecx = 32 * %ebx + 104. There is an AGI stall as
    there so it would be nice if appropriate scheduling would insert
    something in the middle, but it's still faster than using adds and shifts.

    > dad: 8d 76 00 lea 0x0(%esi),%esi ! NOP

    It's as fast as a nop too, provided nobody wrote to %esi in the
    preceding cycle. It's done that way because it's a faster way to say
    "nop 3 bytes long" for jump target alignment.

    > fda: 8d 5c 1e 01 lea 0x1(%esi,%ebx,1),%ebx
    > fde: 89 5d cc mov %ebx,0xffffffcc(%ebp)

    This means %ebx = %esi + %ebx + 1.
    You'd have to use two instructions to get the same result with "add".

    There is no AGI stall between the two instructions because the
    instruction's address calculation is not dependent on the first. There
    may be an AGI due to stalls between dependent pairs on a Pentium, but
    you haven't shown enough context for me to say.

    You show lots of these. An AGI due to "lea" occurs when you have the
    lea _after_ the instruction that sets a register used in the address.
    On a Pentium, due to pairing, there can be two instructions in between.

    > 1115: 8d 76 00 lea 0x0(%esi),%esi
    > 1118: 8b 71 38 mov 0x38(%ecx),%esi

    That's ok. There's no AGI between these two because they're not dependent.

    > 2e98: 8d 87 1a 03 00 00 lea 0x31a(%edi),%eax
    > 2e9e: 50 push %eax

    The first instruction's equivalent using "add" is two instructions long,
    involving a move and an add. It would be larger code and in most cases
    slower. That's because %edi and %eax are not the same register. "lea"
    is the x86 way of saying "add A to CONSTANT plus B times POWEROF2 and
    store in C". "add" can only do "add A to B".

    > 2f24: 8d 90 08 fb ff ff lea 0xfffffb08(%eax),%edx
    > 2f2a: 29 fa sub %edi,%edx

    There's no AGI here. Max speed if the pairing state is working nicely.

    > 2fe5: 8d 75 ec lea 0xffffffec(%ebp),%esi
    > 2fe8: 56 push %esi
    > 2fe9: 8d 87 ac 04 00 00 lea 0x4ac(%edi),%eax
    > 2fef: 50 push %eax
    >
    > 2ff5: 8d 5d d8 lea 0xffffffd8(%ebp),%ebx
    > 2ff8: 53 push %ebx
    >
    > 2ff9: 8d 87 b4 04 00 00 lea 0x4b4(%edi),%eax
    > 2fff: 50 push %eax
    >
    > 304c: 8d 65 cc lea 0xffffffcc(%ebp),%esp
    > 304f: 5b pop %ebx

    No stalls in any of these either. We still win :-)

    All this, and I believe GCC 2.7.2.3 doesn't even know about AGIs. From
    your examples, it seems to work out nicely without considering them.

    have a radiant day,
    -- Jamie

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.rutgers.edu
    Please read the FAQ at http://www.tux.org/lkml/



    This archive was generated by hypermail 2b29 : Sat Jan 29 2000 - 05:26:31 EST