Fortran Interface + CUDA C Call - 3D Array.

Hi,

I have some questions about my program.

My sample code is below.

Fortran side, which calls the CUDA C function:



CALL h_fld(HXS,EYS,EZS)

HXS, EYS and EZS are REAL*8 arrays of size NX*NY*NZ.

CUDA C:

#include <cuda.h>
#include <math.h>
#include <cutil.h>
#include "cutil_inline.h"

const double DTMDY=1.532517053655797E-03;
const double DTMDZ=1.532517053655797E-03;

const int NX=32;
const int NY=32;
const int NZ=32;

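__global__ void ker_hxs(double *HXS, double *EYS, double *EZS)
{
    int I, J, K, AR;

    /* Stop one cell short in J and K so that the J+1 and K+1 neighbours exist. */
    for (K = 0; K < NZ - 1; K++)
        for (J = 0; J < NY - 1; J++)
            for (I = 1; I < NX; I++)
            {
                AR = I + J * NX + K * NX * NY;   /* flat index of cell (I,J,K) */
                HXS[AR] = HXS[AR] - (EZS[AR + NX] - EZS[AR]) * DTMDY + (EYS[AR + (NX * NY)] - EYS[AR]) * DTMDZ;
            }
}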

extern "C" void h_fld_(double *HXS, double *EYS, double *EZS)
{
    double *d_a, *d_b, *d_c;   /* device copies of HXS, EYS, EZS */
    size_t SF = NX * NY * NZ * sizeof(double);

    cudaMalloc((void **) &d_a, SF);
    cudaMalloc((void **) &d_b, SF);
    cudaMalloc((void **) &d_c, SF);

    cudaMemcpy(d_a, HXS, SF, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, EYS, SF, cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, EZS, SF, cudaMemcpyHostToDevice);

    dim3 block(4, 4);
    dim3 grid(1, 1);

    ker_hxs<<<grid, block>>>(d_a, d_b, d_c);

    cudaMemcpy(HXS, d_a, SF, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}

With this program I get the correct result, but it is very slow compared to the serial code.
My question is: how do we set up the threads for 3D arrays? (I also tried cudaMemcpy3D.)
One more question: if I use more than 2 blocks or more than 32 threads, the output contains abnormal values.

thanks in advance.

On a quick inspection your interface to Fortran looks ok.

However, your kernel is missing the main point of GPU computing: Instead of splitting the computation into small subtasks which are then divided between the threads, each thread repeats the complete computation.

Since each GPU thread runs slower than a typical CPU core, and the throughput advantage comes only from running a large number of threads, this only slows the computation down.

Furthermore, as HXS is repeatedly modified this way, it also leads to wrong results.

Read the CUDA C Programming Guide for how to make your task suitable for GPU computation.
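
A minimal sketch of such a decomposition, assuming a GPU that supports 3D grids (compute capability 2.0 or later): each thread derives its own (i, j, k) from blockIdx and threadIdx and updates a single cell. The 8x8x4 block shape, the interior range used in the guard, and passing the two step factors as kernel arguments are my assumptions rather than part of your code, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

const int NX = 32;
const int NY = 32;
const int NZ = 32;

const double DTMDY = 1.532517053655797E-03;
const double DTMDZ = 1.532517053655797E-03;

// One thread updates exactly one cell (i, j, k), all indices 0-based.
__global__ void ker_hxs(double *HXS, const double *EYS, const double *EZS,
                        double dtmdy, double dtmdz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;

    // Assumed interior range: 1 <= i <= NX-2, 0 <= j <= NY-2, 0 <= k <= NZ-2.
    // Threads outside it do nothing, so the j+1 and k+1 reads stay in bounds.
    if (i < 1 || i >= NX - 1 || j >= NY - 1 || k >= NZ - 1)
        return;

    int ar = i + j * NX + k * NX * NY;   // flat index, i varies fastest
    HXS[ar] = HXS[ar] - (EZS[ar + NX] - EZS[ar]) * dtmdy
                      + (EYS[ar + NX * NY] - EYS[ar]) * dtmdz;
}

// Host wrapper, named as in the original post so that Fortran can
// CALL h_fld(HXS, EYS, EZS) with trailing-underscore name mangling.
extern "C" void h_fld_(double *HXS, double *EYS, double *EZS)
{
    size_t bytes = (size_t)NX * NY * NZ * sizeof(double);
    double *d_hxs, *d_eys, *d_ezs;

    cudaMalloc((void **)&d_hxs, bytes);
    cudaMalloc((void **)&d_eys, bytes);
    cudaMalloc((void **)&d_ezs, bytes);

    cudaMemcpy(d_hxs, HXS, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_eys, EYS, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_ezs, EZS, bytes, cudaMemcpyHostToDevice);

    // 256 threads per block; round the grid up so it covers every cell.
    dim3 block(8, 8, 4);
    dim3 grid((NX + block.x - 1) / block.x,
              (NY + block.y - 1) / block.y,
              (NZ + block.z - 1) / block.z);

    ker_hxs<<<grid, block>>>(d_hxs, d_eys, d_ezs, DTMDY, DTMDZ);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(HXS, d_hxs, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_hxs);
    cudaFree(d_eys);
    cudaFree(d_ezs);
}
```

Because every cell of HXS is now written by exactly one thread, and EYS and EZS are only read, the result should no longer depend on how many blocks or threads you launch.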

Hi tera,

Thank you for the fast reply.

I agree with you.

My program's requirements are different, though. My actual Fortran code looks like this:

NX1=NX-1;
NY1=NY-1;
NZ1=NZ-1;

 DO 30 K=1,NZ1
    DO 20 J=1,NY1
      DO 10 I=2,NX1
        
         HXS(I,J,K)=HXS(I,J,K)-(EZS(I,J+1,K)-EZS(I,J,K))*DTMDY+ (EYS(I,J,K+1)-EYS(I,J,K))*DTMDZ

10 CONTINUE
20 CONTINUE
30 CONTINUE

Do you mean splitting the for loop?

Like…

AR = I + NX*J + NX*NY*K; // To access 3D array

if(AR>0 && AR<NX*NY)

 HXS[AR]= HXS[AR]-(EZS[AR+NX]-EZS[AR])*DTMDY+(EYS[AR+(NX*NY)]-EYS[AR])*DTMDZ;

else

 HXS[AR]= HXS[AR]-(EZS[AR+NX]-EZS[AR])*DTMDY+(EYS[AR+(NX*NY)]-EYS[AR])*DTMDZ;

Thanks in advance
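
For checking results, here is a sketch of how the Fortran loop nest above maps onto flat, 0-based C arrays. Fortran stores HXS(I,J,K) column-major with I varying fastest, so the element sits at offset (I-1) + (J-1)*NX + (K-1)*NX*NY. The names fidx and hxs_reference are hypothetical helpers, not part of the original code, and the two step factors are passed in as arguments.

```cuda
#include <cuda_runtime.h>

const int NX = 32, NY = 32, NZ = 32;

// Flat 0-based offset of the Fortran element A(I,J,K), with I, J, K 1-based.
// Usable from both host and device code.
__host__ __device__ inline int fidx(int I, int J, int K)
{
    return (I - 1) + (J - 1) * NX + (K - 1) * NX * NY;
}

// CPU reference of the Fortran loop nest (same bounds: K = 1..NZ-1,
// J = 1..NY-1, I = 2..NX-1), handy for comparing against GPU output.
void hxs_reference(double *HXS, const double *EYS, const double *EZS,
                   double dtmdy, double dtmdz)
{
    for (int K = 1; K <= NZ - 1; K++)
        for (int J = 1; J <= NY - 1; J++)
            for (int I = 2; I <= NX - 1; I++) {
                int ar = fidx(I, J, K);
                HXS[ar] = HXS[ar]
                        - (EZS[fidx(I, J + 1, K)] - EZS[ar]) * dtmdy
                        + (EYS[fidx(I, J, K + 1)] - EYS[ar]) * dtmdz;
            }
}
```

With this mapping, the range test belongs on I, J and K rather than on the flat index AR: a per-thread kernel would use the same offsets and simply skip cells outside I = 2..NX-1, J = 1..NY-1, K = 1..NZ-1.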