experchange > fortran

bernard gingold (12-14-18, 05:31 PM)
On Friday, December 14, 2018 at 4:23:29 PM UTC+1, bernard gingold wrote:
> On Friday, December 14, 2018 at 4:00:26 PM UTC+1, Eugene Epshteyn wrote:
> > > complex(kind=8), dimension(64), volatile :: c

> > Why is "c" declared as volatile?

> "c" was declared as volatile in order to force compiler to generate code of computational do-loop.


Follow up message:
Now array "c" is not declared as a volatile and ifort still generates x87 code.

subroutine test()
!DIR$ ATTRIBUTES ALIGN : 64 :: c,a,b
complex(kind=8), dimension(64) :: c
complex(kind=8), dimension(64) :: a
complex(kind=8), dimension(64) :: b
integer(kind=4) :: i
integer(kind=4), parameter :: length = 64_4
! Exec code ...
c = 0.0_8
do i = 1_4, length
a(i) = CMPLX(real(i,kind=8),real(i,kind=8))!<--- If this line of code
! is uncommented ifort as expected

!a(i) = CMPLX(2.71_8,2.71_8)
b(i) = CMPLX(3.14_8,3.14_8)
end do
!c = a + b
c = a / b
! do i = 1_4, length
! c(i) = a(i) / b(i)
! end do
print*, c
end subroutine
bernard gingold (12-14-18, 05:36 PM)
On Friday, December 14, 2018 at 4:31:05 PM UTC+1, bernard gingold wrote:
[..]
> ! end do
> print*, c
> end subroutine


And the ifort -O3 assembly output:

movsd QWORD PTR [rsp], xmm4 #17.6
lea rdx, QWORD PTR [1+rax] #14.17
fld QWORD PTR [rsp] #17.6
lea rcx, QWORD PTR [2+rax] #14.17
movsd QWORD PTR [rsp], xmm8 #17.6
fld st(0) #17.6
fmul st, st(1) #17.6
lea rsi, QWORD PTR [3+rax] #14.17
cvtdq2ps xmm10, xmm7 #12.17
fld QWORD PTR [rsp] #17.6
movaps xmm9, xmm10 #12.17
fld st(0) #17.6
lea rdi, QWORD PTR [4+rax] #14.17
fmul st, st(1) #17.6
mov r8, rdi #14.10
unpcklps xmm9, xmm10 #12.17
paddd xmm7, xmm5 #12.10
cvtps2pd xmm11, xmm9 #12.10
faddp st(2), st #17.6
fxch st(1) #17.6
fdivr st, st(3) #17.6
shl rdx, 4 #14.10
movhlps xmm9, xmm9 #12.10
cvtps2pd xmm13, xmm9 #12.10
movups XMMWORD PTR [-16+test_$A.0.1+rdx], xmm11 #12.10
movsd QWORD PTR [rsp], xmm11 #17.6
fld QWORD PTR [rsp] #17.6
unpckhpd xmm11, xmm11 #17.6
fld st(0) #17.6
movsd QWORD PTR [rsp], xmm11 #17.6
fmul st, st(4) #17.6
fxch st(1) #17.6
fmul st, st(3) #17.6
fld QWORD PTR [rsp] #17.6
fld st(0) #17.6
fmulp st(5), st #17.6
shl rcx, 4 #14.10
fxch st(4) #17.6
faddp st(2), st #17.6
fxch st(1) #17.6
fmul st, st(2) #17.6
fstp QWORD PTR [rsp] #17.6
fxch st(3) #17.6
fmulp st(2), st #17.6
movsd xmm1, QWORD PTR [rsp] #17.6
fxch st(2) #17.6
fsubp st(1), st #17.6
fmulp st(1), st #17.6
fstp QWORD PTR [rsp] #17.6
movsd xmm12, QWORD PTR [rsp] #17.6
movsd QWORD PTR [rsp], xmm4 #17.6
fld QWORD PTR [rsp] #17.6
movsd QWORD PTR [rsp], xmm8 #17.6
fld st(0) #17.6
fmul st, st(1) #17.6
fld QWORD PTR [rsp] #17.6
fld st(0) #17.6
fmul st, st(1) #17.6
movups XMMWORD PTR [-16+test_$A.0.1+rcx], xmm13 #12.10
faddp st(2), st #17.6
fxch st(1) #17.6
fdivr st, st(3) #17.6
movsd QWORD PTR [rsp], xmm13 #17.6
fld QWORD PTR [rsp] #17.6
unpckhpd xmm13, xmm13 #17.6
fld st(0) #17.6
movsd QWORD PTR [rsp], xmm13 #17.6
fmul st, st(4) #17.6
fxch st(1) #17.6
fmul st, st(3) #17.6
fld QWORD PTR [rsp] #17.6
fld st(0) #17.6
fmulp st(5), st #17.6
unpckhps xmm10, xmm10 #12.17
fxch st(4) #17.6
faddp st(2), st #17.6
cvtps2pd xmm3, xmm10 #12.10
fxch st(1) #17.6
fmul st, st(2) #17.6
fstp QWORD PTR [rsp] #17.6
fxch st(3) #17.6
fmulp st(2), st #17.6
movsd xmm0, QWORD PTR [rsp] #17.6
fxch st(2) #17.6
fsubp st(1), st #17.6
fmulp st(1), st #17.6
fstp QWORD PTR [rsp] #17.6
movsd xmm14, QWORD PTR [rsp] #17.6
movsd QWORD PTR [rsp], xmm4 #17.6
fld QWORD PTR [rsp] #17.6
movsd QWORD PTR [rsp], xmm8 #17.6
fld st(0) #17.6
fmul st, st(1) #17.6
fld QWORD PTR [rsp] #17.6
fld st(0) #17.6
fmul st, st(1) #17.6
shl rsi, 4 #14.10
faddp st(2), st #17.6
fxch st(1) #17.6
fdivr st, st(3)
bernard gingold (12-14-18, 05:38 PM)
On Friday, December 14, 2018 at 4:36:38 PM UTC+1, bernard gingold wrote:
[..]
> faddp st(2), st #17.6
> fxch st(1) #17.6
> fdivr st, st(3)


gfortran 5.5 output:

lea rdx, [rbx+8]
xor eax, eax
.L6:
movsd xmm1, QWORD PTR [rsp+1552+rax]
movsd xmm2, QWORD PTR [rcx+rax]
movapd xmm4, xmm1
movsd xmm6, QWORD PTR [rsp+528+rax]
movapd xmm5, xmm2
andpd xmm4, xmm0
andpd xmm5, xmm0
movsd xmm3, QWORD PTR [rsi+rax]
ucomisd xmm5, xmm4
jbe .L10
movapd xmm4, xmm1
divsd xmm4, xmm2
mulsd xmm1, xmm4
addsd xmm1, xmm2
movapd xmm2, xmm6
mulsd xmm2, xmm4
addsd xmm2, xmm3
mulsd xmm3, xmm4
subsd xmm3, xmm6
divsd xmm2, xmm1
divsd xmm3, xmm1
movapd xmm1, xmm3
.L5:
movsd QWORD PTR [rbx+rax], xmm2
movsd QWORD PTR [rdx+rax], xmm1
add rax, 16
cmp rax, 1024
jne .L6
Eugene Epshteyn (12-14-18, 07:25 PM)
Have you experimented with various -x options?

bernard gingold (12-14-18, 07:41 PM)
On Friday, December 14, 2018 at 6:25:54 PM UTC+1, Eugene Epshteyn wrote:
> Have you experimented with various -x options?
>


I was using to run my experiments.
I will try to issue x,Qx options now.
Option: xAVX -> No Impact
Option: xCORE-AVX2 -> No Impact
Option: xHASWELL -> No Impact
Option: xSKYLAKE -> No Impact
Steve Lionel (12-14-18, 08:43 PM)
On 12/14/2018 12:41 PM, bernard gingold wrote:
> On Friday, December 14, 2018 at 6:25:54 PM UTC+1, Eugene Epshteyn wrote:
> I was using to run my experiments.
> I will try to issue x,Qx options now.
> Option: xAVX -> No Impact
> Option: xCORE-AVX2 -> No Impact
> Option: xHASWELL -> No Impact
> Option: xSKYLAKE -> No Impact


As I mentioned earlier, the compiler doesn't use AVX instructions if it
thinks that the overhead of starting AVX outweighs the benefit. This
subroutine is very short, and it involves complex division.

The optimization report (-qopt-report=3) has some additional information:

Begin optimization report for: TEST

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (TEST) [2] D:\Projects\t.f90(1,12)

Report from: Loop nest, Vector & Auto-parallelization optimizations
[loop, vec, par]

LOOP BEGIN at D:\Projects\t.f90(10,6)
remark #15300: LOOP WAS VECTORIZED
remark #15449: unmasked aligned unit stride stores: 2
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 12
remark #15477: vector cost: 3.000
remark #15478: estimated potential speedup: 4.000
remark #15488: --- end vector cost summary ---
remark #25015: Estimate of max trip count of loop=4
LOOP END

Non-optimizable loops:

LOOP BEGIN at D:\Projects\t.f90(9,6)
remark #15529: loop was not vectorized: volatile assignment was not
vectorized. Try using non-volatile assignment. [ D:\Projects\t.f90(9,6) ]
LOOP END

LOOP BEGIN at D:\Projects\t.f90(19,6)
remark #15529: loop was not vectorized: volatile assignment was not
vectorized. Try using non-volatile assignment. [ D:\Projects\t.f90(18,9) ]
LOOP END

Report from: Code generation optimizations [cg]

D:\Projects\t.f90(18,9):remark #34046: complex divide implemented using
x87 instructions to maintain precision.
D:\Projects\t.f90(18,9):remark #34048: consider using
complex-limited-range option to boost run time performance.
D:\Projects\t.f90(1,12):remark #34051: REGISTER ALLOCATION : [TEST]
D:\Projects\t.f90:1
bernard gingold (12-14-18, 09:24 PM)
On Friday, December 14, 2018 at 7:43:19 PM UTC+1, Steve Lionel wrote:
[..]
> Twitter: @DoctorFortran
> LinkedIn:
> Blog:


Thanks for your response.
The volatile attribute was later removed; see the post at 4:36 PM.
Maintaining precision at the cost of 180 x87 and AVX instructions!! ... sic.
When the precision of the complex data type was set to single precision, the x87 instructions were eliminated.

Steve, you pasted this compiler output (see below). I presume it relates
to the simple data-initialization loop or to the array syntax.
LOOP BEGIN at D:\Projects\t.f90(10,6)
[..]
Thomas Koenig (12-15-18, 04:44 PM)
bernard gingold <beniekg> schrieb:
[..]
> ! end do
> print*, c
> end subroutine


FWIW, here is what a recent gfortran makes of this code,
with -Ofast -fverbose-asm -S -march=native on a Ryzen 7:

.L2:
# a.f90:11: a(i) = CMPLX(real(i,kind=8),real(i,kind=8))!<--- If
# this line of code
vxorps %xmm0, %xmm0, %xmm0 # tmp105
# a.f90:15: b(i) = CMPLX(3.14_8,3.14_8)
vmovsd %xmm2, 1632(%rsp,%rax) # tmp128, MEM[symbol: b, index: ivtmp.58_34, offset: 0B]
vmovsd %xmm1, 1640(%rsp,%rax) # tmp127, MEM[symbol: b, index: ivtmp.58_34, offset: 0B]
# a.f90:11: a(i) = CMPLX(real(i,kind=8),real(i,kind=8))!<--- If
# this line of code
vcvtsi2ss %edx, %xmm0, %xmm0 # i, tmp105, tmp105
vcvtss2sd %xmm0, %xmm0, %xmm0 # tmp105, _2
# a.f90:10: do i = 1_4, length
incl %edx # i
# a.f90:11: a(i) = CMPLX(real(i,kind=8),real(i,kind=8))!<--- If
# this line of code
vmovddup %xmm0, %xmm0 # _2, tmp107
vmovaps %xmm0, 608(%rsp,%rax) # tmp107, MEM[symbol: a, index: ivtmp.58_34, offset: 0B]
addq $16, %rax #, ivtmp.58
# a.f90:10: do i = 1_4, length
cmpl $65, %edx #, i
jne .L2 #,
# a.f90:15: b(i) = CMPLX(3.14_8,3.14_8)
xorl %eax, %eax # ivtmp.41
.p2align 4
.p2align 3
.L3:
vmovaps 608(%rsp,%rax), %xmm0 # MEM[symbol: a, index: ivtmp.41_33, offset: 0B], vect__46.12
vmovaps 1632(%rsp,%rax), %xmm5 # MEM[symbol: b, index: ivtmp.41_33, offset: 0B], vect__9.8
vpermilpd $1, %xmm0, %xmm4 #, vect__46.12, vect__46.17
vpermilpd $3, %xmm5, %xmm1 #,, vect__9.21
vpermilpd $0, %xmm5, %xmm2 #,, vect__9.9
vmovaps %xmm5, (%rsp) # vect__9.8, %sfp
# a.f90:18: c = a / b
vmulpd %xmm1, %xmm1, %xmm3 # vect__9.21, vect__9.21, vect_powmult_47.25
vmulpd %xmm4, %xmm1, %xmm1 # vect__46.17, vect__9.21, vect__45.22
vmovaps %xmm0, %xmm4 # vect__46.12, vect__44.23
vfmadd132pd %xmm2, %xmm1, %xmm4 # vect__9.9, vect__45.22, vect__44.23
vfmsub132pd %xmm2, %xmm1, %xmm0 # vect__9.9, vect__45.22, vect__44.24
vfmadd231pd %xmm2, %xmm2, %xmm3 # vect__9.9, vect__9.9, vect__12.27
vmovsd %xmm4, %xmm0, %xmm0 # vect__44.23, vect__44.24, tmp115
vdivpd %xmm3, %xmm0, %xmm0 # vect__12.27, tmp115, vect__54.28
vmovaps %xmm0, (%rbx,%rax) # vect__54.28, MEM[symbol: c, index: ivtmp.41_33, offset: 0B]
addq $16, %rax #, ivtmp.41
cmpq $1024, %rax #, ivtmp.41
jne .L3 #,
# a.f90:22: print*, c
gah4 (12-16-18, 02:20 AM)
On Thursday, December 13, 2018 at 1:04:58 PM UTC-8, Steve Lionel wrote:

(snip)

> I'm going to suggest that you are wasting your time with this. The
> compiler does a pretty good job of estimating performance gains, and can
> also keep track of a lot more if you're not using intrinsics. On the
> topic of AVX in particular, there is a penalty for starting to use AVX
> that can be heavy if you have sparse uses of AVX, as the processor will
> go "in and out" of AVX mode. The Intel compiler understands this and
> won't use AVX unless it sees a clear advantage.


I suspect you are right, and I also don't know the OP's reason
for doing it, but sometimes you do things just to see that they
can be done. That is, just for the fun of it.

Doing things for fun usually isn't a waste of time, because
the whole purpose is fun.

You might also say that doing crossword puzzles or sudoku
is a waste of time, as nothing useful results.

> I am, in general, not in favor of "micro-optimization", especially if
> you haven't done the profiling and analysis to see if the code you're
> attacking has a meaningful impact. Just because AVX instructions exist
> that doesn't mean you will always benefit from their use.


Yes. Well, I haven't followed it quite closely enough, but
it would be interesting to look at Tensorflow in regards to
AVX instructions. The idea, as well as I know it, is to
describe the array operation (at any rank) and let it optimize
the code for doing that operation.

In the OP's case, there will be a lot of subroutine call overhead,
and as noted load/store overhead. One might arrange routines
to do more combined operations (multiply add), or even
operations like dot product or matrix multiply, optimized
using specific instructions. (Or, if someone already did that,
look into improving what is already done.)
robin.vowels (12-16-18, 02:47 AM)
On Saturday, December 15, 2018 at 2:31:05 AM UTC+11, bernard gingold wrote:
[..]
> do i = 1_4, length
> a(i) = CMPLX(real(i,kind=8),real(i,kind=8))!<--- If this line of code
> ! is uncommented ifort as expected


If you are expecting the result to be of double precision
(based on your explicit conversion of the arguments to double),
you will be disappointed.
The result is ALWAYS of SINGLE precision, UNLESS the KIND of the result
is also specified.

> !a(i) = CMPLX(2.71_8,2.71_8)
> b(i) = CMPLX(3.14_8,3.14_8)


Ditto for this line.
[..]
bernard gingold (12-16-18, 10:38 PM)
On Sunday, December 16, 2018 at 1:47:16 AM UTC+1, robin....@gmail.com wrote:
> On Saturday, December 15, 2018 at 2:31:05 AM UTC+11, bernard gingold wrote:
> If you are expecting the result to be of double precision
> (based on your explicit conversion of the arguments to double),
> you will be disappointed.
> The result is ALWAYS of SINGLE precision, UNLESS the KIND of the result
> is also specified.
> Ditto for this line.


My only question was -- why did the Intel compiler generate circa 180 machine-code instructions for complex division?
It is true that I did not experiment with the compiler's floating-point precision switches, and what I was seeing could be
a precise but slow implementation of complex division.
Using the options -O3 and -fp-model fast=2 eliminated the x87 code, which was probably generated to enable
proper handling of the potential overflow and/or catastrophic cancellation stemming from the complex division formulae.
P.S.
DCMPLX was used.
This is the result:

mov eax, DWORD PTR [-112+rbp] #19.1
movsxd rax, eax #19.1
imul rax, rax, 16 #19.1
mov edx, offset flat: test_$A #19.1
add rdx, rax #19.1
add rdx, -16 #19.1
movupd xmm0, XMMWORD PTR [rdx] #19.1
mov eax, DWORD PTR [-112+rbp] #19.1
movsxd rax, eax #19.1
imul rax, rax, 16 #19.1
mov edx, offset flat: test_$B #19.1
add rdx, rax #19.1
add rdx, -16 #19.1
movupd xmm1, XMMWORD PTR [rdx] #19.1
movapd xmm2, xmm1 #19.1
unpcklpd xmm2, xmm1 #19.1
mulpd xmm2, xmm0 #19.1
shufpd xmm0, xmm0, 1 #19.1
movapd xmm3, xmm1 #19.1
unpckhpd xmm3, xmm1 #19.1
mulpd xmm0, xmm3 #19.1
xorpd xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip] #19.1
addpd xmm2, xmm0 #19.1
mulpd xmm1, xmm1 #19.1
movapd xmm0, xmm1 #19.1
shufpd xmm0, xmm1, 1 #19.1
addpd xmm1, xmm0 #19.1
divpd xmm2, xmm1 #19.1
mov eax, DWORD PTR [-112+rbp] #19.13
movsxd rax, eax #19.13
imul rax, rax, 16 #19.13
mov edx, offset flat: test_$C #19.1
add rdx, rax #19.1
add rdx, -16 #19.13
movupd XMMWORD PTR [rdx], xmm2 #19.1
mov eax, 1 #20.1
add eax, DWORD PTR [-112+rbp] #20.1
mov DWORD PTR [-112+rbp], eax #20.1
mov eax, DWORD PTR [-112+rbp] #20.1
cmp eax, 64 #20.1
jle ..B1.7
Thomas Koenig (12-16-18, 11:38 PM)
bernard gingold <beniekg> schrieb:

> My only question was -- Why Intel Compiler generated circa. 180
> machine code instructions for complex division?


I cannot speak for Intel, but complex division is complex (pun
intended).

Look at the pseudo-code generated by gfortran for the simple statement

c = a / b

where a, b and c are complex:
<bb 2> :
_9 = REALPART_EXPR <*a_5(D)>;
_10 = IMAGPART_EXPR <*a_5(D)>;
_1 = COMPLEX_EXPR <_9, _10>;
_11 = REALPART_EXPR <*b_6(D)>;
_12 = IMAGPART_EXPR <*b_6(D)>;
_2 = COMPLEX_EXPR <_11, _12>;
_13 = ABS_EXPR <_11>;
_14 = ABS_EXPR <_12>;
_15 = _13 < _14;
if (_15 != 0)
goto <bb 4>; [50.00%]
else
goto <bb 5>; [50.00%]

<bb 4> :
_16 = _11 / _12;
_17 = _11 * _16;
_18 = _12 + _17;
_19 = _9 * _16;
_20 = _10 + _19;
_21 = _10 * _16;
_22 = _21 - _9;
_23 = _20 / _18;
_24 = _22 / _18;
_38 = _23;
_39 = _24;
goto <bb 3>; [100.00%]

<bb 5> :
_25 = _12 / _11;
_26 = _12 * _25;
_27 = _11 + _26;
_28 = _10 * _25;
_29 = _9 + _28;
_30 = _9 * _25;
_31 = _10 - _30;
_32 = _29 / _27;
_33 = _31 / _27;
_40 = _32;
_41 = _33;

<bb 3> :
# _36 = PHI <_38(4), _40(5)>
# _37 = PHI <_39(4), _41(5)>
_3 = COMPLEX_EXPR <_36, _37>;
_34 = _36;
_35 = _37;
REALPART_EXPR <*c_7(D)> = _34;
IMAGPART_EXPR <*c_7(D)> = _35;

(This is from the *cplxlower0 dump, which you will
get with -fdump-tree-cplxlower0 )
bernard gingold (12-17-18, 10:00 AM)
On Sunday, December 16, 2018 at 10:38:48 PM UTC+1, Thomas Koenig wrote:
[..]
> IMAGPART_EXPR <*c_7(D)> = _35;
> (This is from the *cplxlower0 dump, which you will
> get with -fdump-tree-cplxlower0 )


Yes, I fully agree with you.
It seems to me that compilers (ifort) will generate both a safe-and-slow implementation and an unsafe-and-fast(er) implementation.
The difference lies in the handling of possibly unsafe computations.
As regards gfortran, I cannot see two different implementations of complex division: the disassembly strictly follows a standard formula for complex division.

Without optimization (index calculations removed):
movapd xmm4, xmm0
divsd xmm4, xmm1
mulsd xmm0, xmm4
addsd xmm0, xmm1
movapd xmm1, xmm3
mulsd xmm1, xmm4
addsd xmm1, xmm2
mulsd xmm2, xmm4
subsd xmm2, xmm3
divsd xmm1, xmm0
divsd xmm2, xmm0
movapd xmm0, xmm2
With optimization: -O3

movapd xmm3, xmm1
divsd xmm3, xmm0
mulsd xmm1, xmm3
addsd xmm0, xmm1
movapd xmm1, xmm3
mulsd xmm1, xmm5
addsd xmm1, xmm2
mulsd xmm2, xmm3
divsd xmm1, xmm0
bernard gingold (12-17-18, 10:04 AM)
On Sunday, December 16, 2018 at 10:38:48 PM UTC+1, Thomas Koenig wrote:
[..]
> IMAGPART_EXPR <*c_7(D)> = _35;
> (This is from the *cplxlower0 dump, which you will
> get with -fdump-tree-cplxlower0 )


Out of curiosity I checked the implementation of std::complex in g++, and it seems that this safe builtin function handles complex division even when -O3 is set:

COMPILER_RT_ABI double _Complex
__divdc3(double __a, double __b, double __c, double __d)
{
    int __ilogbw = 0;
    double __logbw = crt_logb(crt_fmax(crt_fabs(__c), crt_fabs(__d)));
    if (crt_isfinite(__logbw))
    {
        __ilogbw = (int)__logbw;
        __c = crt_scalbn(__c, -__ilogbw);
        __d = crt_scalbn(__d, -__ilogbw);
    }
    double __denom = __c * __c + __d * __d;
    double _Complex z;
    __real__ z = crt_scalbn((__a * __c + __b * __d) / __denom, -__ilogbw);
    __imag__ z = crt_scalbn((__b * __c - __a * __d) / __denom, -__ilogbw);
    if (crt_isnan(__real__ z) && crt_isnan(__imag__ z))
    {
        if ((__denom == 0.0) && (!crt_isnan(__a) || !crt_isnan(__b)))
        {
            __real__ z = crt_copysign(CRT_INFINITY, __c) * __a;
            __imag__ z = crt_copysign(CRT_INFINITY, __c) * __b;
        }
        else if ((crt_isinf(__a) || crt_isinf(__b)) &&
                 crt_isfinite(__c) && crt_isfinite(__d))
        {
            __a = crt_copysign(crt_isinf(__a) ? 1.0 : 0.0, __a);
            __b = crt_copysign(crt_isinf(__b) ? 1.0 : 0.0, __b);
            __real__ z = CRT_INFINITY * (__a * __c + __b * __d);
            __imag__ z = CRT_INFINITY * (__b * __c - __a * __d);
        }
        else if (crt_isinf(__logbw) && __logbw > 0.0 &&
                 crt_isfinite(__a) && crt_isfinite(__b))
        {
            __c = crt_copysign(crt_isinf(__c) ? 1.0 : 0.0, __c);
            __d = crt_copysign(crt_isinf(__d) ? 1.0 : 0.0, __d);
            __real__ z = 0.0 * (__a * __c + __b * __d);
            __imag__ z = 0.0 * (__b * __c - __a * __d);
        }
    }
    return z;
}
bernard gingold (12-17-18, 06:16 PM)
On Sunday, December 16, 2018 at 10:38:48 PM UTC+1, Thomas Koenig wrote:
[..]
> IMAGPART_EXPR <*c_7(D)> = _35;
> (This is from the *cplxlower0 dump, which you will
> get with -fdump-tree-cplxlower0 )


Thomas, do you plan to implement a floating-point-safe version of complex division?
As far as I was able to investigate, gfortran, at least on Intel CPUs, always generates the same machine code.
Looking at your example, I see that only a guard against zero is implemented.
Thanks in advance.
