Okay, as my spamming of this board in recent days has shown, I’m trying to explore asynchronous memory movement with CUDA Fortran. Since my code involves moving 1D, 2D, and 3D arrays, I’ve been looking at the corresponding API calls. The 1D and 2D cases I understand, and I’ve already asked for help with the 3D ones, but in the meantime I thought I’d try decomposing my 3D arrays into 2D arrays and passing those instead.
This isn’t fun (a SELECT CASE in the kernel and lots of copy-and-paste), but it’s doable. However, when I try it I get:
> make
pgfortran -V10.5 -Mcuda=keepgpu,keepbin,keepptx,maxregcount:64,nofma -Kieee -fast -r4 -Mextend -Mpreprocess -Ktrap=fp -DFLXY -DDEG4 -c src/sorad.cudafor.flxdn.cudaapi.cuf
/tmp/pgnvda2rcamlNvqJn.nv4(0): Error: Formal parameter space overflowed in function soradcuf
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code 0 (src/sorad.cudafor.flxdn.cudaapi.cuf: 2620)
PGF90/x86-64 Linux 10.5-0: compilation aborted
make: *** [sorad.cudafor.flxdn.cudaapi.o] Error 2
So, I’m guessing I’ve hit a limit on the number of parameters one can pass to a device kernel? That seems plausible, since splitting 5 3D arrays into 29 2D arrays means I now have 56 (yes, 56) inputs and outputs in my kernel call.
I suppose my question is: is this limit PGI-specific, or have I run into something hard-coded into CUDA itself?
If the latter, is there a way around this limit that doesn’t involve cudaMemcpy3D et al.? (Say, passing a TYPE of arrays…though I’m not sure how to allocate that…)
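For what it’s worth, the only other dodge I can think of (besides the TYPE idea, which I still don’t know how to allocate) is to hoist the big arrays out of the argument list entirely and make them module-level device data, so the kernel references them directly and they never count against its formal parameter space. A rough, untested sketch of what I mean is below; the array names are made up, not my real variables:

module sorad_dev_data
  use cudafor
  implicit none
  ! Module-scope device arrays: kernels contained in this module can see
  ! these directly, so they never appear in the kernel's argument list.
  real, device, allocatable :: taua_dev(:,:), ssaa_dev(:,:)
contains
  attributes(global) subroutine soradcuf_modvars(m, np)
    ! Only scalars are passed now; all the big data lives in module variables.
    integer, value :: m, np
    integer :: i, k
    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    if (i <= m) then
      do k = 1, np
        ! ... the real work would go here ...
        taua_dev(i,k) = taua_dev(i,k) + ssaa_dev(i,k)
      end do
    end if
  end subroutine soradcuf_modvars
end module sorad_dev_data

On the host I’d then allocate(taua_dev(m,np)) and so on, fill them with ordinary host-to-device assignments, and launch soradcuf_modvars<<<grid,block>>>(m,np) with just the two scalars. Whether that plays nicely with the asynchronous copies I’m after is another question, of course.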