Slicing a device array in CUDA Fortran

Hello there. I’m in the process of porting a relatively large project to CUDA Fortran and I’ve run into an issue with one of the modules: even after a lot of optimization, it uses too much temporary device memory per thread to run in a grid of more than, say, 64x32 threads. That is fine, as long as we can run the rest of the program with more threads, for example 256x256. To make that work, we would like to execute that one module serially in strides of 64x32 threads. Here comes the problem:

I can’t figure out how to slice device arrays (both intent(in) and intent(out)) so that only a stride of the input/output arrays is passed to the module subroutines. We don’t want to copy to the host and back for this, since that would hurt performance too much. Here’s what I’ve tried:

  1. using standard Fortran array slicing notation, such as
call my_module_kernel_wrapper(myInput(strideBegin:strideEnd), myOutput(strideBegin:strideEnd))

outcome: “Profiled program has returned error code 139”. Note: I haven’t yet tried a minimal example like the one above, and I can’t find any information about host error codes. I haven’t run it under a profiler; this is just the message printed when executing it normally. Does anyone know whether the above notation is supported for device arrays?

  2. using temporary device arrays (in host code), such as
real(8), dimension(stride), device :: myTemp
... ! index calculation strideBegin, strideEnd
myTemp = myInput(strideBegin:strideEnd)

outcome: “More than one device-resident object in assignment”
Apparently device-to-device copies via array assignment are still not supported. One workaround that comes to mind is a CUDA C helper function just for the device-to-device copy, but that is a bit of a hassle (more build steps and/or dependencies) I’d like to avoid if there is a better solution.

Does anyone have a hint on how I can achieve a device array slice without copying to the host and back? Thanks a lot in advance.

When you use array slice notation for an actual argument, the compiler may create a temporary array for the slice, and that might be what is causing your problems.

One solution is to pass the starting address and a length. This is like the old F77 style.

call my_module_kernel_wrapper(myInput(strideBegin), strideEnd-strideBegin+1, myOutput(strideBegin))

Declare the corresponding dummy arguments in my_module_kernel_wrapper with * rather than : in the array dimension.
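
In case it helps, here is a rough sketch of what such a wrapper interface could look like with assumed-size dummies (the argument names are just placeholders, not your actual code):

subroutine my_module_kernel_wrapper(inSlice, n, outSlice)
  implicit none
  integer, intent(in) :: n
  ! assumed-size (*) instead of deferred-shape (:)
  real(8), device, dimension(*) :: inSlice
  real(8), device, dimension(*) :: outSlice
  ! ... launch the module kernel over elements 1..n of inSlice/outSlice
end subroutine my_module_kernel_wrapper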

You’re right, we still have some problems supporting all types of device-to-device transfers using array syntax. We’re working to address that in 1H of 2013. For now, you can use the CUDA API, so this might work:

istat = cudaMemcpy(myTemp, myInput(strideBegin), strideEnd-strideBegin+1, cudaMemcpyDeviceToDevice)

In effect, we’ve written the C wrappers for you. Just “use cudafor” in the program unit where you do this and the API is available. Note that the count is usually in units of the datatype, not bytes as it is in C.
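
Here is a small self-contained example of that pattern; the array sizes and names are made up for illustration:

program slice_copy_demo
  use cudafor
  implicit none
  integer, parameter :: n = 256*256, strideLen = 64*32
  integer :: strideBegin, strideEnd, istat
  real(8), device, allocatable :: myInput(:), myTemp(:)

  allocate(myInput(n), myTemp(strideLen))
  strideBegin = 1
  strideEnd   = strideBegin + strideLen - 1

  ! count is in elements of real(8), not bytes
  istat = cudaMemcpy(myTemp, myInput(strideBegin), &
                     strideEnd - strideBegin + 1, cudaMemcpyDeviceToDevice)
  if (istat /= cudaSuccess) print *, cudaGetErrorString(istat)
end program slice_copy_demo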

Finally, if the striding is sort of complicated, you can code it yourself. CUF kernels might be good for this:

!$cuf kernel do <<< *, * >>>
do i = 1, strideEnd - strideBegin + 1
   myTemp(i) = myInput(strideBegin + i - 1)
end do

This gives you some control over the number of blocks and threads that take part in the copy.
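
For example, if you want to keep the copy to the same 64x32 footprint as the rest of that module, something along these lines should work (the launch numbers are just an example):

! 64 blocks of 32 threads; <<< *, 32 >>> would let the compiler pick the grid
!$cuf kernel do <<< 64, 32 >>>
do i = 1, strideEnd - strideBegin + 1
   myTemp(i) = myInput(strideBegin + i - 1)
end do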

Brent, thank you a ton - your input was very helpful. In the end I went with the Fortran 77 notation, since I’m already passing in the length anyway. It works, even with 2D and 3D arrays (which I didn’t mention above to keep things simple). I might also mention that this implementation is part of a framework with a hybrid codebase that runs on both CPU and GPU - thanks to your help I was able to keep the GPU-only code in the wrapper to a minimum.
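
For reference, the 2D version of the interface ends up looking roughly like this (names simplified): pass the first element of the block of columns, e.g. myInput(1, strideBegin), plus the extents, and declare the last dimension assumed-size.

subroutine my_module_kernel_wrapper_2d(inSlice, nx, nCols, outSlice)
  implicit none
  integer, intent(in) :: nx, nCols
  ! leading extent passed explicitly, last dimension assumed-size
  real(8), device, dimension(nx, *) :: inSlice
  real(8), device, dimension(nx, *) :: outSlice
  ! ... launch kernels over inSlice(1:nx, 1:nCols)
end subroutine my_module_kernel_wrapper_2d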