Hello there. I’m in the process of porting a fairly large project to CUDA Fortran, and I’ve run into an issue with one of the modules: even after a lot of optimization, it uses too much temporary device memory per thread to run in a grid of more than, say, 64x32 threads. That’s acceptable, as long as we can run the rest of the program with more threads, for example 256x256. To do this, we would like to execute that one module serially in strides of 64x32 threads. Here comes the problem:
I can’t figure out how to slice device arrays (both intent(in) and intent(out)) so that only one stride of the input/output arrays is passed to the module subroutines. We don’t want to copy to the host and back for this, since that would hurt performance too much. Here’s what I’ve tried:
- using standard Fortran array slicing notation, such as
call my_module_kernel_wrapper(myInput(strideBegin:strideEnd), myOutput(strideBegin:strideEnd))
outcome: “Profiled program has returned error code 139”. Note: I haven’t yet tried a minimal example like the one shown above, and I can’t find any information about host error codes. I haven’t run it under a profiler; this is just the message when executing it normally. Does anyone know whether the above notation is supported for device arrays?
- using temporary device arrays (in host code), such as
real(8), dimension(stride), device :: myTemp
... ! index calculation of strideBegin, strideEnd
myTemp = myInput(strideBegin:strideEnd)
outcome: “More than one device-resident object in assignment”
Apparently device-to-device assignment is still not supported. One workaround that comes to mind would be a CUDA C helper function just for the device-to-device copy, but that’s a bit of a hassle (extra build steps and/or dependencies) I’d like to avoid if there is a better solution.
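For what it’s worth, what I was hoping would work for the temporary-array variant, without dropping down to CUDA C, is the runtime API’s device-to-device copy. I’m not certain the cudafor module exposes it exactly like this (the element count and the slice argument are my assumptions), so treat this as a sketch of what I tried to express, not working code:

```fortran
use cudafor
real(8), device, allocatable :: myTemp(:)
integer :: istat
allocate(myTemp(stride))
! Device-to-device copy of one stride. For typed CUDA Fortran arrays I
! assume the count is in elements, not bytes -- please correct me if not.
istat = cudaMemcpy(myTemp, myInput(strideBegin:strideEnd), stride, &
                   cudaMemcpyDeviceToDevice)
```

If something along these lines is supported, it would already solve the temporary-array approach without any CUDA C helper.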
Does anyone have a hint on how I can obtain a device array slice without copying to the host and back? Thanks a lot in advance.
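For reference, here is roughly what the minimal example of the slicing attempt would look like. All names are placeholders for the real module, and the kernel body is just a stand-in for the real work:

```fortran
! Minimal reproducer sketch of the slicing attempt (placeholder names).
module my_module
contains
  attributes(global) subroutine my_kernel(din, dout, n)
    real(8), device :: din(n), dout(n)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) dout(i) = 2.0d0 * din(i)  ! stand-in for the real work
  end subroutine my_kernel
end module my_module

program slice_test
  use cudafor
  use my_module
  integer, parameter :: n = 256*256, stride = 64*32
  real(8), device :: dIn(n), dOut(n)
  integer :: s
  dIn = 1.0d0
  do s = 1, n, stride
    ! Pass only one stride of the device arrays to the kernel:
    call my_kernel<<<stride/128, 128>>>(dIn(s:s+stride-1), &
                                        dOut(s:s+stride-1), stride)
  end do
end program slice_test
```

This is the pattern that produces the error code 139 in the full application.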