
bernard gingold (12-13-18, 11:11 AM)
I decided to write a set of Fortran interfaces to specific parts of the Intel AVX-AVX3 intrinsics.
Each vector intrinsic is wrapped in a thin C layer which calls the appropriate SIMD intrinsic.
For your convenience I post a simple code snippet.
BEGIN example:

Compiler: icc 15

C-side

#include <immintrin.h>

__m256d v256_add_pd(__m256d v1, __m256d v2) {
    return _mm256_add_pd(v1, v2);
}
--------------------------------------------------------------
Compiler: ifort 15

Fortran-side

type, bind(c) :: m256d
   real(c_double), dimension(0:3) :: v
end type m256d

interface
   type(m256d) function v256_add_pd(v1,v2) bind(c,name='v256_add_pd')
      use, intrinsic :: iso_c_binding
      import :: m256d
      type(m256d), intent(in) :: v1
      type(m256d), intent(in) :: v2
   end function v256_add_pd
end interface

END example

The code compiles and links fine, but at run time the interoperability between the Fortran and C code fails and produces a wrong result.
While debugging, the C-side assembly (debug build) executes the vaddpd machine instruction, which leaves its result in the YMM0 register.
The Fortran-side compiler-generated assembly, however, never issues an instruction to copy the contents of YMM0 into the automatic variable of type m256d, which should be compatible with the C declaration of __m256d.
On the C side, __m256d (defined in immintrin.h shipped with the Intel compiler) contains an interoperable data member, a fixed-size array double v[4], and the struct is aligned on a 32-byte boundary.
I suppose the ifort compiler fails to recognize the return type of my C wrapper around the Intel _mm256_add_pd intrinsic as fully compatible with the derived type 'm256d'.
The problem probably stems from __m256d being a native vector type that is returned directly in a YMM register rather than through the ordinary struct-return convention, so the call interfaces differ even though the memory layouts match.
Louis Krupp (12-13-18, 12:00 PM)
On Thu, 13 Dec 2018 01:11:43 -0800 (PST), bernard gingold
<beniekg> wrote:

[..]
> On the C side, __m256d (defined in immintrin.h shipped with the Intel compiler) contains an interoperable data member, a fixed-size array double v[4], and the struct is aligned on a 32-byte boundary.
> I suppose the ifort compiler fails to recognize the return type of my C wrapper around the Intel _mm256_add_pd intrinsic as fully compatible with the derived type 'm256d'.
> The problem probably stems from __m256d being a native vector type that is returned directly in a YMM register rather than through the ordinary struct-return convention, so the call interfaces differ even though the memory layouts match.


Stupid question: Does the Fortran interface need to declare its
arguments as passed by value?

For the benefit of those of us who aren't 100% sure what the code is
supposed to do (add two 256-bit values?), a small main program with an
indication of the expected result would be helpful.

Louis
bernard gingold (12-13-18, 12:20 PM)
On Thursday, December 13, 2018 at 11:00:52 AM UTC+1, Louis Krupp wrote:
> On Thu, 13 Dec 2018 01:11:43 -0800 (PST), bernard gingold
> <beniekg> wrote:
> Stupid question: Does the Fortran interface need to declare its
> arguments as passed by value?
> For the benefit of those of us who aren't 100% sure what the code is
> supposed to do (add two 256-bit values?), a small main program with an
> indication of the expected result would be helpful.
> Louis


Hi Louis,
Thanks for the reply.
I added the 'value' attribute to the dummy arguments; unfortunately it did not help (see the sketch below).
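For reference, the modified interface looked roughly like this (a sketch reconstructed for the post, not necessarily the exact code I compiled):

interface
   type(m256d) function v256_add_pd(v1,v2) bind(c,name='v256_add_pd')
      import :: m256d
      type(m256d), value :: v1
      type(m256d), value :: v2
   end function v256_add_pd
end interface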
The code (C-side) is performing a vector floating-point addition of two __m256d data types.
Please have a look here:

BEGIN example

PROGRAM test_v256_add

! (type m256d and the interface to v256_add_pd are assumed to be
!  accessible here, e.g. from a module, as declared earlier)

!DIR$ ATTRIBUTES ALIGN : 64 :: vres,v1,v2 ! 64-byte alignment, which satisfies the 32-byte requirement of __m256d
type(m256d) :: vres,v1,v2
! Exec code .....

vres%v = 0.0_8
v1%v = 3.14_8
v2%v = 2.71_8

vres = v256_add_pd(v1,v2)
print *, "v256_add_pd: -- result:", vres%v

! Upon the end of execution vres%v should contain the result of the vector
! addition: {3.14+2.71, 3.14+2.71, 3.14+2.71, 3.14+2.71}
! Unfortunately vres%v contains junk values (an uninitialized-stack
! memory pattern).

END PROGRAM test_v256_add
bernard gingold (12-13-18, 12:21 PM)
On Thursday, December 13, 2018 at 11:20:30 AM UTC+1, bernard gingold wrote:
[..]
bernard gingold (12-13-18, 01:57 PM)
On Thursday, December 13, 2018 at 11:21:33 AM UTC+1, bernard gingold wrote:
> On Thursday, December 13, 2018 at 11:20:30 AM UTC+1, bernard gingold wrote:


Disassembly of C function:

push rbp #6.17
mov rbp, rsp #6.17
and rsp, -32 #6.17
vmovapd ymm1, YMMWORD PTR .L_2il0floatpacket.1[rip] #10.19
vaddpd ymm2, ymm1, YMMWORD PTR .L_2il0floatpacket.0[rip] #11.12
vxorpd ymm0, ymm0, ymm0 #8.27
vmovapd YMMWORD PTR [-32+rsp], ymm0 #8.25
Result --> vmovapd YMMWORD PTR [-32+rsp], ymm2
mov rsp, rbp #13.1
pop rbp #13.1
ret
Milan Curcic (12-13-18, 04:02 PM)
Won't the compiler use these intrinsics on its own where appropriate, if compiled with -AVX or -AVX2 (I may be misremembering exact flag name)? Why the need for Fortran bindings?
bernard gingold (12-13-18, 04:13 PM)
On Thursday, December 13, 2018 at 3:02:14 PM UTC+1, Milan Curcic wrote:
> Won't the compiler use these intrinsics on its own where appropriate, if compiled with -AVX or -AVX2 (I may be misremembering exact flag name)? Why the need for Fortran bindings?


I'm not sure whether the Intel Fortran compiler translates Fortran array syntax to the C intrinsics during compilation and only then converts it to machine code.
I suppose some intermediate representation of the array syntax is used during the optimization phase and is later lowered to machine code.

As for the Fortran bindings: I would like the possibility of vectorising code manually with SIMD intrinsics, just as is done in C or C++ code.
Myroslav Zapukhlyak (12-13-18, 05:26 PM)
In immintrin.h

__m256 is a vector of floats (i.e. c_float), not c_double; only __m256d corresponds to c_double in the Fortran type definition:

typedef float __m256 __attribute__((__vector_size__(32)));
typedef double __m256d __attribute__((__vector_size__(32)));
typedef __int64 __m256i __attribute__((__vector_size__(32)));
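
So the corresponding interoperable Fortran types would be roughly as follows (a sketch following the vector sizes above, assuming use, intrinsic :: iso_c_binding is in effect):

type, bind(c) :: m256        ! __m256:  8 x c_float  (32 bytes)
   real(c_float), dimension(0:7) :: v
end type m256

type, bind(c) :: m256d       ! __m256d: 4 x c_double (32 bytes)
   real(c_double), dimension(0:3) :: v
end type m256d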

On Thursday, December 13, 2018 at 3:13:06 PM UTC+1, bernard gingold wrote:
[..]
Steve Lionel (12-13-18, 06:12 PM)
On 12/13/2018 9:13 AM, bernard gingold wrote:
> I'm not sure whether the Intel Fortran compiler translates Fortran array syntax to the C intrinsics during compilation and only then converts it to machine code.
> I suppose some intermediate representation of the array syntax is used during the optimization phase and is later lowered to machine code.


The Intel compilers themselves don't use the "C intrinsics". Rather,
they recognize operations that can be (they think) made faster by use of
the AVX instructions (if you have told the compiler it's ok to use
them.) If you look at the assembly for array-heavy Fortran code you'll
often see the AVX instructions there.

The instruction intrinsics are in C for C programmers who want to use
them specifically. I expect they are used a lot in the math library and
MKL. I would not recommend their use in Fortran programs.
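
For example, something as plain as the sketch below (made-up names), compiled with an option such as -xAVX, will typically show vaddpd on YMM registers in the generated code, with no intrinsics anywhere in the source:

subroutine add_arrays(a, b, c, n)
   integer, intent(in) :: n
   real(8), intent(in)  :: a(n), b(n)
   real(8), intent(out) :: c(n)
   ! Plain array syntax; the optimizer is free to vectorize this with AVX.
   c = a + b
end subroutine add_arrays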
bernard gingold (12-13-18, 06:34 PM)
On Thursday, December 13, 2018 at 5:12:58 PM UTC+1, Steve Lionel wrote:
[..]


Steve thanks for the answer.

Finally I was able to call the intrinsics-based code successfully. The C wrapper's declaration was changed to void function(double * __restrict, double * __restrict, double * __restrict), and the Fortran interface was changed accordingly to a subroutine declaration (a sketch is below).
Now everything works correctly; only one thing bothers me -- two load instructions and one store instruction are the cost for a single vector addition. Even before working on this example I had expected to be able to keep the result in a YMM register, and no matter which combination I tried, it did not succeed.
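
Roughly, the working combination looks like the sketch below (a reconstruction for illustration, with made-up names; the arguments are assumed to be 32-byte aligned as in the earlier example):

! The C wrapper now has the prototype
!   void v256_add_pd(double * __restrict a, double * __restrict b, double * __restrict c);
! and internally does _mm256_load_pd / _mm256_add_pd / _mm256_store_pd.
interface
   subroutine v256_add_pd(a, b, c) bind(c, name='v256_add_pd')
      use, intrinsic :: iso_c_binding, only : c_double
      real(c_double), dimension(4), intent(in)  :: a, b
      real(c_double), dimension(4), intent(out) :: c
   end subroutine v256_add_pd
end interface

It is called, for example, as call v256_add_pd(v1%v, v2%v, vres%v) with the aligned variables from the earlier example program.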
Steve Lionel (12-13-18, 11:04 PM)
On 12/13/2018 11:34 AM, bernard gingold wrote:
> Finally I was able to call the intrinsics-based code successfully. The C wrapper's declaration was changed to void function(double * __restrict, double * __restrict, double * __restrict), and the Fortran interface was changed accordingly to a subroutine declaration.
> Now everything works correctly; only one thing bothers me -- two load instructions and one store instruction are the cost for a single vector addition. Even before working on this example I had expected to be able to keep the result in a YMM register, and no matter which combination I tried, it did not succeed.


I'm going to suggest that you are wasting your time with this. The
compiler does a pretty good job of estimating performance gains, and can
also keep track of a lot more if you're not using intrinsics. On the
topic of AVX in particular, there is a penalty for starting to use AVX
that can be heavy if you have sparse uses of AVX, as the processor will
go "in and out" of AVX mode. The Intel compiler understands this and
won't use AVX unless it sees a clear advantage.

I am, in general, not in favor of "micro-optimization", especially if
you haven't done the profiling and analysis to see whether the code you're
attacking has a meaningful impact. Just because AVX instructions exist
doesn't mean you will always benefit from their use.
bernard gingold (12-14-18, 10:00 AM)
On Thursday, December 13, 2018 at 10:04:58 PM UTC+1, Steve Lionel wrote:
[..]


I work on a radar simulation project with heavy usage of the complex representation of radar signals.
I investigated how ifort handles complex arithmetic and it does not look promising, especially where complex division is concerned.
Ifort eagerly generates scalar x87 code no matter which coding style is used, i.e. array syntax or do-loops.
I came to the conclusion that decomposing an array of complex numbers into two separate streams, i.e. a real-part array and an imaginary-part array, will probably enable at least the use of scalar SIMD code; a rough sketch of what I mean follows below.
Interesting what the cost is of firing up the x87 unit and of the bypass between the SIMD and x87 units.
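
The split-array version of the division loop I have in mind would look roughly like this (just a sketch with made-up names; no attempt is made at the over/underflow protection a fully general complex division might need):

subroutine cdiv_soa(ar, ai, br, bi, cr, ci, n)
   integer, intent(in) :: n
   real(8), intent(in)  :: ar(n), ai(n), br(n), bi(n)  ! real/imag parts of a and b
   real(8), intent(out) :: cr(n), ci(n)                ! real/imag parts of c = a / b
   real(8) :: den
   integer :: i
   do i = 1, n
      den   = br(i)*br(i) + bi(i)*bi(i)
      cr(i) = (ar(i)*br(i) + ai(i)*bi(i)) / den
      ci(i) = (ai(i)*br(i) - ar(i)*bi(i)) / den
   end do
end subroutine cdiv_soa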
An example of the x87 code currently generated for the complex-division loop:
movups xmm0, XMMWORD PTR [-16+test_$B.0.1+rcx] #21.9
movups xmm1, XMMWORD PTR [-16+test_$A.0.1+rcx] #21.9
movsd QWORD PTR [-8+rsp], xmm0 #21.9
inc sil #22.6
fld QWORD PTR [-8+rsp] #21.9
unpckhpd xmm0, xmm0 #21.9
fld st(0) #21.9
movsd QWORD PTR [-8+rsp], xmm0 #21.9
fmul st, st(1) #21.9
fld QWORD PTR [-8+rsp] #21.9
fld st(0) #21.9
fmul st, st(1) #21.9
movsd QWORD PTR [-8+rsp], xmm1 #21.9
faddp st(2), st #21.9
fxch st(1) #21.9
fdivr st, st(3) #21.9
fld QWORD PTR [-8+rsp] #21.9
unpckhpd xmm1, xmm1 #21.9
fld st(0) #21.9
movsd QWORD PTR [-8+rsp], xmm1 #21.9
fmul st, st(4) #21.9
fxch st(1) #21.9
fmul st, st(3) #21.9
fld QWORD PTR [-8+rsp] #21.9
fld st(0) #21.9
fmulp st(5), st #21.9
fxch st(4) #21.9
faddp st(2), st #21.9
fxch st(1) #21.9
fmul st, st(2) #21.9
fstp QWORD PTR [-8+rsp] #21.9
fxch st(3) #21.9
fmulp st(2), st #21.9
movsd xmm3, QWORD PTR [-8+rsp] #21.9
fxch st(2) #21.9
fsubp st(1), st #21.9
fmulp st(1), st #21.9
fstp QWORD PTR [-8+rsp] #21.9
movsd xmm2, QWORD PTR [-8+rsp] #21.9
unpcklpd xmm3, xmm2 #21.9
movups XMMWORD PTR [-16+test_$C.0.1+rcx], xmm3 #21.9
add rcx, 16 #22.6
cmp sil, 64 #22.6
jle ..B1.6 # Prob 98% #22.6
fstp st(0) #
bernard gingold (12-14-18, 10:03 AM)
On Friday, December 14, 2018 at 9:00:47 AM UTC+1, bernard gingold wrote:
[..]
> cmp sil, 64 #22.6
> jle ..B1.6 # Prob 98% #22.6
> fstp st(0) #


I forgot to add the source code:
Compiler: IFORT 19.0.0, optimization: -O3
subroutine test()
   !DIR$ ATTRIBUTES ALIGN : 64 :: c,a,b
   complex(kind=8), dimension(64), volatile :: c
   complex(kind=8), dimension(64) :: a
   complex(kind=8), dimension(64) :: b
   integer(kind=4) :: i
   integer(kind=4), parameter :: length = 64_4
   ! Exec code ...
   c = 0.0_8
   do i = 1_4, length
      ! a(i) = CMPLX(real(i,kind=8),real(i,kind=8),kind=8)
      a(i) = CMPLX(2.71_8,2.71_8,kind=8)
      b(i) = CMPLX(3.14_8,3.14_8,kind=8)
   end do
   !c = a + b
   ! c = a / b
   do i = 1_4, length
      c(i) = a(i) / b(i)
   end do

end subroutine test
Eugene Epshteyn (12-14-18, 05:00 PM)
> complex(kind=8), dimension(64), volatile :: c

Why is "c" declared as volatile?
bernard gingold (12-14-18, 05:23 PM)
On Friday, December 14, 2018 at 4:00:26 PM UTC+1, Eugene Epshteyn wrote:
> > complex(kind=8), dimension(64), volatile :: c

> Why is "c" declared as volatile?


"c" was declared as volatile in order to force compiler to generate code of computational do-loop.
