Hi, I have a Fortran code which I’m trying to parallelize using CUDA. The Fortran code sends 2D arrays to a CUDA C function, which then calls a kernel to operate on them. The modified data are then sent back to the host and returned for use in Fortran.
Currently I’m using cudaMalloc() to allocate the arrays on the device and cudaMemcpy() to transfer the data in each direction between host and device.
extern "C" void foo( float* et )
{
    float* et_d;

    // array_mem_4 is the array size in bytes, defined elsewhere
    cudaMalloc( (void**) &et_d, array_mem_4 );
    cudaMemcpy( et_d, et, array_mem_4, cudaMemcpyHostToDevice );

    // LAUNCH KERNEL

    cudaMemcpy( et, et_d, array_mem_4, cudaMemcpyDeviceToHost );
    cudaFree( et_d );
}
Would it be faster to allocate the device storage with cudaMallocPitch() and then use cudaMemcpy2DToArray() to transfer the data in each direction between host and device? The 2D arrays in Fortran have dimensions that are multiples of 16. I ask because the host arrays are allocated in Fortran and then passed to C.
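For reference, here is a minimal sketch of what the pitched variant might look like. One detail worth noting: cudaMemcpy2DToArray() copies into a cudaArray, whereas memory obtained from cudaMallocPitch() is plain linear memory, and the matching copy routine for it is cudaMemcpy2D(). Everything else is an assumption made for illustration: the name foo_pitched, the scale kernel, the 16x16 block size, and the n1/n2 arguments (passed by value, as they would be through a bind(C) interface) are hypothetical. Since Fortran stores et(n1, n2) column-major, the contiguous "row" seen from C is n1 elements long and there are n2 of them. Whether this is actually faster than the plain cudaMalloc()/cudaMemcpy() version would have to be measured; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: scales every element in place. Consecutive rows of a
// pitched allocation are `pitch` BYTES apart, so step through a char* first.
__global__ void scale(float* data, size_t pitch, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // index along the contiguous run
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // index of the run
    if (x < width && y < height) {
        float* row = (float*)((char*)data + y * pitch);
        row[x] *= s;
    }
}

// Assumed interface: et is a Fortran array et(n1, n2); n1 and n2 arrive by value.
extern "C" void foo_pitched(float* et, int n1, int n2)
{
    float* et_d = NULL;
    size_t pitch = 0;                              // filled in by cudaMallocPitch
    size_t row_bytes = (size_t)n1 * sizeof(float); // bytes in one contiguous run

    cudaMallocPitch((void**)&et_d, &pitch, row_bytes, n2);

    // Host data is densely packed, so its pitch is simply row_bytes.
    cudaMemcpy2D(et_d, pitch, et, row_bytes, row_bytes, n2,
                 cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((n1 + block.x - 1) / block.x, (n2 + block.y - 1) / block.y);
    scale<<<grid, block>>>(et_d, pitch, n1, n2, 2.0f);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy2D(et, row_bytes, et_d, pitch, row_bytes, n2,
                 cudaMemcpyDeviceToHost);
    cudaFree(et_d);
}

// Small host-side driver standing in for the Fortran caller.
int main()
{
    const int n1 = 32, n2 = 16;                    // multiples of 16, as in the post
    float* et = (float*)malloc(n1 * n2 * sizeof(float));
    for (int i = 0; i < n1 * n2; ++i) et[i] = 1.0f;

    foo_pitched(et, n1, n2);

    printf("et[0] = %f, et[last] = %f\n", et[0], et[n1 * n2 - 1]);
    free(et);
    return 0;
}
```

The kernel has to take the pitch as an argument and index rows through it, which is the main change compared with the contiguous cudaMalloc() version.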