Dear Mat,
I am accelerating a section of code that looks like this:
!$acc region
!$acc do vector
do i = i_start, i_end
   !$acc do parallel
   do j = j_start, j_end
      .......
The default schedule the compiler gives me is:
Accelerator kernel generated
278, !$acc do vector(32)
283, !$acc do parallel
Cached references to size [32] block of 'jeven'
Cached references to size [32] block of 'jodd'
CC 1.3 : 117 registers; 1044 shared, 964 constant, 112 local memory bytes; 6 occupancy
My “i” index can reach a value of 270. As I understand it, each thread block on my GPU can run a maximum of 512 threads.
I want to override the compiler's default schedule so that each block launches 270 threads instead of 32.
I tried the following:
!$acc region
!$acc do vector(270)
do i = i_start, i_end
   !$acc do parallel
   do j = j_start, j_end
      .......
But for some reason this causes the compiler to run the “i” loop sequentially:
Accelerator kernel generated
278, !$acc do seq
Non-stride-1 accesses for array 'jeven'
Non-stride-1 accesses for array 'jodd'
283, !$acc do parallel
CC 1.3 : 116 registers; 20 shared, 960 constant, 48 local memory bytes; 6 occupancy
Any idea how to get this working?
Thank you for your help.