I’m trying to compare the performance of CUDA C and CUDA Fortran because we have a large Fortran code base and want to avoid porting it to C. So far I’ve been able to optimize most aspects of the CUDA Fortran version, but as soon as I add a well-placed #pragma unroll to the C code, the Fortran code falls behind.
E.g. this kernel code in CUDA C is unrolled:
uint end = min(gpu_shared_mem_block_size, n - block_offset);
#pragma unroll 16
for (uint j = 0; j < end; j++)
    sum += block_a[threadIdx.y][j] * block_b[j][threadIdx.x];
And I’m wondering if there is a way to do the same in this CUDA Fortran code:
block_end = min(16, n - shared_block_offset)
do k = 1, block_end
    sum = sum + A_shared(threadidx%x, k) * B_shared(k, threadidx%y)
end do
I’ve searched the documentation and have found “!$pgi unroll” and “!$acc unroll”. It seems that both only apply to host code and do not change the way kernel code is generated. Did I miss something?
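The only fallback I can think of is unrolling the loop by hand. To make clear what I mean, here is a minimal sketch in plain host C (not kernel code). The function name dot_unrolled4 and the unroll factor of 4 are only for illustration and are not part of my actual code:

```c
/* Hypothetical helper: dot product with the loop body unrolled by 4 by hand.
   Illustration only; the real kernel reads from shared-memory tiles. */
static float dot_unrolled4(const float *a, const float *b, unsigned end)
{
    float sum = 0.0f;
    unsigned j = 0;

    /* Main body: four multiply-adds per iteration. */
    for (; j + 4 <= end; j += 4) {
        sum += a[j]     * b[j];
        sum += a[j + 1] * b[j + 1];
        sum += a[j + 2] * b[j + 2];
        sum += a[j + 3] * b[j + 3];
    }

    /* Remainder loop for trip counts that are not a multiple of 4. */
    for (; j < end; j++)
        sum += a[j] * b[j];

    return sum;
}
```

The same pattern could presumably be written out in the Fortran do loop, but I would much rather have the compiler do this than maintain it by hand.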
Thanks in advance.