experchange > asm.x86

Terje Mathisen (01-03-19, 12:31 PM)
Let's make this a Happy New Year and try to avoid namecalling and abuse.

It is perfectly OK to disagree about stack setups/parameter passing and
local vars, but the way to solve this is by writing working code using
the various styles and then benchmark it: What is faster? What is easier
to write and maintain?

Personally I rather like passing a few parameters in registers,
particularly to leaf functions, but for recursion using the stack is
pretty much always at least as good.

As long as you are using a post-1986 CPU you can use stack-relative
addressing, in which case EBP is perfectly usable as a regular register,
you just need to do the housekeeping/math needed to figure out exactly
where everything is located at all times, while the stack grows and shrinks.

Terje
Alex McDonald (01-03-19, 05:18 PM)
On 03-Jan-19 10:31, Terje Mathisen wrote:

> Personally I rather like passing a few parameters in registers,
> particularly to leaf functions, but for recursion using the stack is
> pretty much always at least as good.


One thing I haven't seen discussed here is using two stacks. (It might
have been, and I've missed it, so apologies in that case. And yes, this
is very Forth-ish.)

The first stack is for CALL/RET and PUSH/POP (or MOV) pairs for temp
storage, and is based on xSP; the second stack is the classic
parameter/locals frame, and can be based off just about any register
(xBP for instance).

C-type languages use this technique, except the single physical stack
has the two stacks (call/ret on xSP and frame on xBP) interwoven.
Separating them out can make for some interesting efficiencies, like
tail-call elimination, and some significant simplifications.

> As long as you are using a post-1986 CPU you can use stack-relative
> addressing, in which case EBP is perfectly usable as a regular register,
> you just need to do the housekeeping/math needed to figure out exactly
> where everything is located at all times, while the stack grows and
> shrinks.


Even that can be avoided by using MOV to a fixed negative offset from
xSP instead of PUSH/POP. Then xSP doesn't need to change at all inside
the function.

64 bit Linux has the concept of a "red zone": the 128 bytes below RSP,
from [rsp-128] up to [rsp-1]. Windows has a different implementation,
with a 32 byte "shadow space" just above the return address, allocated
by the caller. Both reduce the number of RSP adjustments needed around
calls.

I can't find the reference (Xmas depleted my google-fu), but there was
some discussion about these mechanisms benefiting performance by
reducing PUSH/POP activity (inefficient on later Intel
microarchitectures?) and replacing it with MOVs. I may be
misremembering, though.
Robert Wessel (01-03-19, 05:38 PM)
On Thu, 3 Jan 2019 15:18:27 +0000, Alex McDonald
<alex> wrote:

[..]
>has the two stacks (call/ret on xSP and frame on xBP) interwoven.
>Separating them out can make for some interesting efficiencies, like
>tail-call elimination, and some significant simplifications.


That was the common implementation on IPF, where the register save
stack was separate (and since all calls left return addresses in a
register, that was where return addresses went). It also prevents
some types of stack-smashing attacks.
Alex McDonald (01-03-19, 10:28 PM)
[retitled]

On 03-Jan-19 15:38, Robert Wessel wrote:

> On Thu, 3 Jan 2019 15:18:27 +0000, Alex McDonald
> <alex> wrote:
> That was the common implementation on IPF, where the register save
> stack was separate (and since all calls left return addresses in a
> register, that was where return addresses went). It also prevents
> some types of stack-smashing attacks.


What is IPF? Apart from stuff about an IPF file format for Atari I can't
get any appropriate Google hits.
Mateusz Viste (01-03-19, 10:46 PM)
On Thu, 03 Jan 2019 20:28:22 +0000, Alex McDonald wrote:
> What is IPF? Apart from stuff about an IPF file format for Atari I can't
> get any appropriate Google hits.




Mateusz
Frank Kotler (01-03-19, 11:54 PM)
On 01/03/2019 05:31 AM, Terje Mathisen wrote:
> Let's make this a Happy New Year and try to avoid namecalling and abuse.


This is an official Moderator message:

What Terje said!

Best,
Frank
Robert Wessel (01-04-19, 08:13 AM)
On Thu, 3 Jan 2019 20:28:22 +0000, Alex McDonald
<alex> wrote:

>[retitled]
>On 03-Jan-19 15:38, Robert Wessel wrote:
>What is IPF? Apart from stuff about an IPF file format for Atari I can't
>get any appropriate Google hits.


Intel Itanium. Also IA-64.
R.Wieser (01-04-19, 11:16 AM)
Terje,

> As long as you are using a post-1986 CPU you can use stack-relative
> addressing, in which case EBP is perfectly usable as a regular register,


I know, and I'm sure Bernhard knows that as well. That was not the issue,
nor that he thought of his own way to solve it (even though it appears to be
just source-code eye candy).

What he is not doing is responding to straightforward questions (I *still*
have not heard what that "maximum # of arguments" count on the stack is
for, or whether it's even there to begin with and not just a mistake by
Mike), as well as to remarks about its general functioning.

I've been a newsgroup citizen long enough to recognise that kind of behaviour
(refusal to listen or respond to anything that does not support their
particular vision, I mean), and I rather dislike it (nothing good has ever
come from conversations with such individuals). I guess it shows.

Regards,
Rudy Wieser
Bernhard Schornak (01-04-19, 01:04 PM)
R.Wieser wrote:

> Terje,
>> As long as you are using a post-1986 CPU you can use stack-relative
>> addressing, in which case EBP is perfectly usable as a regular register,

> I know, and I'm sure bernard knows that as well.


Yes. What I called "Intelligent Design" is code for recent
processors, not for hardware shown in museums. I developed
it before iNTEL came up with the less sophisticated, down-
graded version they have published in their 'optimisation
guides' for a couple of years now.

And, since I have researched this in depth for more than a
decade, I do know (and can prove it experimentally) that this
kind of stack management is faster than abusing rBP. As it
was originally developed for OS/2, it works fast on 32 bit
machines, where the APIs/ABIs of all important OSes follow
C calling conventions. While ID's speed boost was marked
in OS/2's 32 bit environment, it is slightly smaller in 64
bit environments due to the changed parameter passing con-
ventions, but it still beats conventional code using RBP. The
"Intelligent Design" text published in my blog was written
in 2005 as part of my former website (down since 2006). It
is on my ToDo list to rewrite this paper completely to add
the experience of the past 13 years...

Greetings from Augsburg

Bernhard Schornak

P.S.: Dear Rudy,

it would be nice if you stopped mobbing me. I have always
kept my posts as polite as possible and never called you names
/ treated you unduly or whatever might have caused your current
mood, hatred, ??. To settle this undeclared war, I will stop
replying to you after this post (sorry for the fallback, but
some things cannot be left undisputed) and *expect* you to stop
mentioning me as long as you don't refer to code or work I
have published.
Terje Mathisen (01-04-19, 02:45 PM)
Bernhard Schornak wrote:
> R.Wieser wrote:
> Yes. What I called "Intelligent Design" is code for recent
> processors, not for hardware shown in museums. I developed
> it before iNTEL came up with the less sophisticated, down-
> graded version they publish in their 'optimisation guides'
> since a couple of years.
> And, since I research this in depth for more than a decade
> now, I do know (and can prove it experimentally) that this
> kind of stack management is faster than abusing rBP. As it


This is where you are wrong!

Not specifically, i.e. using a single ESP update followed by MOV is
probably faster on many cpus, but not in general:

Simply because long series of PUSH/POP are so common in both compiler
and assembler code, cpu architects have a lot of good reasons for trying
to make this sort of code significantly faster, and they almost
certainly will do so. (The most obvious hw optimization is to regard
multiple sequential PUSH or POP operations as a single macro op, this is
effectively the same as using explicit MOV operations while avoiding the
code size impact.)

In the meantime you have to ask yourself:

Did you measure these speedups in smaller micro benchmarks, or as part
of a substantial code base? The reason I'm asking is because time and
time again it turns out that smaller code is faster code!

A few PUSH/POP instructions of one byte each saves significant code
space compared with MOV EAX,[ESP+nnn] operations taking at least 3 bytes
each.

Terje
James Harris (01-04-19, 04:58 PM)
On 04/01/2019 12:45, Terje Mathisen wrote:
> Bernhard Schornak wrote:
> This is where you are wrong!


Can those who call for manners please be tactful...?

[..]
> A few PUSH/POP instructions of one byte each saves significant code
> space compared with MOV EAX,[ESP+nnn] operations taking at least 3 bytes
> each.


One thing I took from the earlier discussion is that it's probably worth
having a stack-frame layout where the callee-save registers can either
be pushed or moved into place. Then the best mechanism for a given
target CPU and set of optimisation goals can be used without altering
anything substantial.
Rod Pemberton (01-05-19, 12:56 AM)
On Fri, 4 Jan 2019 10:16:16 +0100
"R.Wieser" <address> wrote:

> What is is not responding to a straight-forward questions (I *still*
> have not heard why that "maximum # of arguments" count should be on
> the stack for, or if its even there to begin with and not just a
> mistake by Mike), as well as remarks toward its general functioning.


Mike tends to post once and go. At least, he's been doing so on
alt.os.development for a few years now.

Anyway, he suggested using ESP only. So, he's omitted using the frame
pointer EBP. He also said no PUSHes or POPs, only MOVs.

How do you find the prior stackframe without having saved it via EBP?
How do you clean up the stack without PUSHes and POPs?

So, he's likely using the "maximum # of arguments" count to
either manually restore ESP or to use a "RET imm16" instruction for
stack cleanup.

This should be similar to -fomit-frame-pointer for GCC.

Rod Pemberton
R.Wieser (01-05-19, 08:34 AM)
Rod,

> Mike tends to post once and go.


That does explain(?) why Bernhard jumped in so fast.

> How do you find the prior stackframe without having saved it via EBP?


Easy: by removing the current stackframe, i.e. adding a constant (fixed
for that procedure) to ESP. At least, when you use Bernhard's way of
doing things.

> So, he's likely using using the "maximum # of arguments" count
> to either manually restore ESP


That count does not even make sense (other than to compare against to check
for some kind of overflow), as the amount of *actually* used arguments can
be (way) less.

> or using a "RET imm16" instruction for stack cleanup.


Exactly. And I mentioned it as such in my reply to Mike.

Alas, none of this touches my two questions: why would that value need to
be present, and why does it need to be stored in the stackframe? (Mike: "Use
a stack frame which contains .... the maximum number of called function
arguments").

And mind you, this was a response to Mike's suggestion to James, which had
zero reference to Bernhard's work.

And for the record: even now that I get what Bernhard is doing there, I do
not see any use for that number to exist (outside of assemble time), nor
any reason to store it on the stack. AFAICS it's a simple, static value.

Regards,
Rudy Wieser
Bernhard Schornak (01-05-19, 08:59 AM)
Terje Mathisen wrote:

> Bernhard Schornak wrote:
> This is where you are wrong!


Am I? That implies you are an entity empowered to determine what *has
to be* wrong and what *has to be* right...

> Not specifically, i.e. using a single ESP update followed by MOV is probably faster on many cpus,
> but not in general:
> Simply because long series of PUSH/POP are so common in both compiler and assembler code, cpu
> architects have a lot of good reasons for trying to make this sort of code significantly faster, and
> they almost certainly will do so. (The most obvious hw optimization is to regard multiple sequential
> PUSH or POP operations as a single macro op, this is effectively the same as using explicit MOV
> operations while avoiding the code size impact.)


Have a look at AMD and iNTEL optimisation guides regarding write
combining (WC). Both tell you to prefer full 64 byte writes when
you want to gain immense improvements. The WC logic is triggered
by consecutive writes to ascending memory locations. When you're
forcing a WC cycle via

.p2align 5,,31
0:subq $0xF8, %rsp
movq %r15, 0x88(%rsp)
movq %r14, 0x90(%rsp)
movq %r13, 0x98(%rsp)
movq %r12, 0xA0(%rsp)
movq %r11, 0xA8(%rsp)
movq %r10, 0xB0(%rsp)
movq %rbp, 0xB8(%rsp)

it is much faster than

.p2align 5,,31
0:movq %rsp, %rbp
push %r15
push %r14
push %r13
push %r12
push %r11
push %r10
push %rbx # replaces RBP above
subq $0xC0, %rsp

because the PUSHes work downwards, preventing the internal logic
from switching to a WC sequence. If properly coded, the PUSHless
version combines those 7 register saves into one write sequence,
updating the corresponding cache line in one gulp. The last few
processor generations all have optimised prefetch mechanisms, so
it is no problem to feed them with lengthy instructions, as long
as the code to be executed can be reduced internally to simple
execution blocks. A MOV is always a basic operation, while PUSH
and POP are complex operations (move register or data *and*
update rSP before / after moving).

> In the meantime you have to ask yourself:


Do I? Sorry, but I hate psycho-games... ;) Your sentence implies
"You are wrong and I am right!". If you mean that, just write it
down as clear text instead of beating around the bush.

> Did you measure these speedups in smaller micro benchmarks, or as part of a substantial code base?
> The reason I'm asking is because time and time again it turns out that smaller code is faster code!


My test suite for OS/2 (until 2009) was

.align 2,0x90
_test:subl $0x80, %esp
nop
nop
movl %edx, 0x68(%esp)
movl %ecx, 0x6C(%esp)
movl %ebx, 0x70(%esp)
movl %edi, 0x74(%esp)
movl %esi, 0x78(%esp)
movl %ebp, 0x7C(%esp)
...
do something to let the load/store unit retire
...
movl 0x68(%esp), %edx
movl 0x6C(%esp), %ecx
movl 0x70(%esp), %ebx
movl 0x74(%esp), %edi
movl 0x78(%esp), %esi
movl 0x7C(%esp), %ebp
addl $0x80, %esp
ret

versus

.align 2,0x90
_test:movl %esp, %ebp
push %edx
push %ecx
push %ebx
push %edi
push %esi
push %ebp
...
do something to let the load/store unit retire
...
pop %ebp # restore in reverse push order; LEAVE not needed
pop %esi
pop %edi
pop %ebx
pop %ecx
pop %edx
ret

tested on Athlon and Phenom and for 64 bit Windows

.p2align 5,,31
_test:subq $0xF8, %rsp
movq %r15, 0xA0(%rsp)
movq %r14, 0xA8(%rsp)
movq %r13, 0xB0(%rsp)
movq %r12, 0xB8(%rsp)
movq %r11, 0xC0(%rsp)
movq %r10, 0xC8(%rsp)
movq %rbx, 0xD0(%rsp)
movq %r9, 0xD8(%rsp)
movq %r8, 0xE0(%rsp)
movq %rdx, 0xE8(%rsp)
movq %rcx, 0xF0(%rsp)
...
do something to let the load/store unit retire
...
movq 0xA0(%rsp), %r15
movq 0xA8(%rsp), %r14
movq 0xB0(%rsp), %r13
movq 0xB8(%rsp), %r12
movq 0xC0(%rsp), %r11
movq 0xC8(%rsp), %r10
movq 0xD0(%rsp), %rbx
movq 0xD8(%rsp), %r9
movq 0xE0(%rsp), %r8
movq 0xE8(%rsp), %rdx
movq 0xF0(%rsp), %rcx
addq $0xF8, %rsp
ret

versus

.p2align 5,,31
_test:movq %rsp, %rbp
push %r15
push %r14
push %r13
push %r12
push %r11
push %r10
push %rbx
push %r9
push %r8
push %rdx
push %rcx
subq $0xC0, %rsp
...
do something to let the load/store unit retire
...
addq $0xC0, %rsp # discard the local area first
pop %rcx # restore in reverse push order
pop %rdx
pop %r8
pop %r9
pop %rbx
pop %r10
pop %r11
pop %r12
pop %r13
pop %r14
pop %r15
ret

tested on Phenom, Bulldozer and Ryzen.

Turned out the two NOPs at the beginning (initially intended as a
filler for the remaining execution pipes) were superfluous. Both
(32 and 64 bit) tests show advantages for the PUSHless versions.
The gain for the 32 bit version is larger, so I guess that iNTEL
and AMD 'tuned' their schedulers to detect PUSH/POP sequences and
apply some exceptional handling.

To keep timing errors low, some code (lasting few clocks) should
be executed between register saving and restoring. Forcing reads
following writes to the same cache line causes avoidable penalty
cycles.

> A few PUSH/POP instructions of one byte each saves significant code space compared with MOV
> EAX,[ESP+nnn] operations taking at least 3 bytes each.


AFAIK, modern processors handle instructions with up to 15 bytes.
Assuming a 32 byte instruction fetch window, sequences of 3 or 4
byte instructions (REX prefix for the 64 bit versions of the 'old'
registers) shouldn't be a problem for recent prefetch mechanisms -
register pressure introduced by frequent rSP updates (forced via
PUSH/POP instructions) will slow down the machine much more than a
larger instruction size.

Besides that, it is much easier to keep a fixed stackframe alive
for the runtime of each function: no complex calculations of the
current position of rSP, no extra rSP updates, fast replacement
of function parameters passed to callees, updating parameters is
reduced to overwriting them, et cetera. Putting it all together,
it is a decent replacement for the outdated standard method of
using rBP as a second register for one single task - addressing
data on the stack and managing the stack itself.

The 'codebase' is a 24 MB function library plus 486 MB of programs.
All of them follow the "Intelligent Design" rulework, where PUSH
and POP are only used to write and read the processor flags.

Did anyone mention the MMX/XMM/YMM/ZMM registers yet? AFAIK, no
PUSH or POP instruction exists for them. Either you mix PUSH/POP
with MOV instructions plus rSP updates, or you use a consistent,
easy to use rulework to manage the stack without contortions...

Greetings from Augsburg

Bernhard Schornak
Terje Mathisen (01-07-19, 09:46 AM)
Bernhard Schornak wrote:
> Terje Mathisen wrote:
> Am I? Implies you are an entity empowered to determine what *has
> to be* wrong and what *has to be* right...


Sorry, we're getting into an argument because English isn't the native
language of either of us. :-(

The statement above was meant to be taken together with the following:
>> Not specifically, i.e. using a single ESP update followed by MOV is
>> probably faster on many cpus, but not in general:


I.e. I _accept_ that you have been measuring speedups!

Further down here it does seem like you have been doing these
measurements as isolated micro benchmarks, not by modifying gcc/clang to
generate different function prolog/epilog code for a big application.

What I tried to say is that this type of optimization (which is very
similar to the memcpy() / memmove() optimizations that have happened
over the last 10-15 years) is quite often worthwhile when a single
instance is called many times from many locations, but that it is far
harder to come up with something which also improves inline code, i.e.
when those big prolog/epilog sequences are repeated over 1000's of
functions.

> Have a look at AMD and iNTEL optimisation guides regarding write
> combining (WC). Both tell you to prefer full 64 byte writes when
> you want to gain immense improvements. The WC logic is triggered
> by consecutive writes to ascending memory locations. When you're
> forcing a WC cycle via


This is yet another temporary target! I.e. likely to change between cpu
generations.

If you push enough data on the stack that you have to worry about actual
memory transfer rates, instead of just the L1 cache, then you probably
have more problems. WC logic is only important when you stream data
between memory buffers, in which case you should probably use even wider
(non-temporal) vector stores instead.

At one point in time there was a micro benchmark operating on big arrays
where the fastest implementation (on an AMD?) was to first prefetch
about 4KB of data into L1 cache by actually loading one byte from each
cache line (no prefetch hint operations!), then do the desired
operations in place reading and writing the same 4KB, before finally
using NT stores to copy the 4KB out to the target buffer.

Each line's worth of data was touched 6 times instead of just 2, but
the final code ran 2-3 times faster. The key was that this code was
completely RAM limited, so you could do arbitrary amounts of work
inside the cpu as long as the RAM chips were accessed in optimal
patterns.

OK?

Terje
