I was wondering what the syntax is for inline PTX in CUDA Fortran kernels. I’d hope the functionality exists without having to interface to CUDA C. I initially assumed it would be the same as in CUDA C, e.g. just calling
asm("prefetch.global.L1 [%0];" : : "r"(var) )
(The above is taken from a CUDA C post about prefetching, where I replaced ptr with var.)
However, compiling this gives me a syntax error:
NVFORTRAN-S-0034-Syntax error at or near ) (reduction.cuf: 254)
0 inform, 0 warnings, 1 severes, 0 fatal for device_reduce_warp_memaccesses_vec4_vectorized_prefetch
I looked for documentation of this in the CUDA Fortran programming guide, as well as in the PTX guide, but had no luck.
For context, I wanted to experiment with prefetching, since the above reduction kernel is severely limited by long scoreboard stalls. However, I can see myself playing with inline PTX in other contexts, so I would like to know the syntax in CUDA Fortran.