Dear all,
I have a calculation on a 2D array like this:
do i=1,ni
  do j=1,nj
    calculation with array(j,i) and array(j+1,i)
  end do
end do
This looks fine on the CPU because the first array index is the inner loop variable.
But if I run this code on the GPU like this:
tid=threadidx%x
bid=blockidx%x
i=(bid-1)*blockdim%x+tid
do j=1,nj
  calculation with array(j,i) and array(j+1,i)
end do
Then it looks bad because the first array index is not the thread index.
Do you have suggestions on how I can optimize this code? In my code normally ni is large (~ thousands) but nj is small (< 100).
Thanks,
Lam
Hi Lam,
For best performance you want consecutive threads to access consecutive elements along the stride-1 dimension (in Fortran this is the first index, i.e. down a column). Otherwise the accesses of neighbouring threads cannot be combined and memory divergence occurs.
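To make the stride argument concrete, here is a minimal sketch in plain Fortran that prints the element offsets touched by the first few threads for both layouts. The sizes ni = 4096 and nj = 64 and the helper name offset are assumptions chosen for illustration; they are not from the original code.

```fortran
program stride_demo
  implicit none
  ! Assumed sizes, chosen to match "ni large, nj small"
  integer, parameter :: ni = 4096, nj = 64
  integer :: i, j

  j = 1   ! all threads are at the same loop iteration j
  print '(a)', ' thread i   offset in array(j,i)   offset in array(i,j)'
  do i = 1, 4
     ! array(j,i) has nj rows, array(i,j) has ni rows
     print '(i9,2i23)', i, offset(j, i, nj), offset(i, j, ni)
  end do

contains

  ! Zero-based element offset of a(r,c) in a column-major array with ld rows
  ! (hypothetical helper, for illustration only)
  pure integer function offset(r, c, ld)
    integer, intent(in) :: r, c, ld
    offset = (r - 1) + (c - 1) * ld
  end function offset

end program stride_demo
```

With array(j,i), neighbouring threads are nj elements apart in memory; with array(i,j) they are one element apart, which is the access pattern the hardware can combine.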
If you can, swap the i and j indices, e.g. array(i,j).
If your host array must be arranged as “(j,i)” then consider creating a temp host array with an “(i,j)” index, transposing “(j,i)” into “(i,j)”, copying the temp array to the device, performing the computation, copying it back, then transposing it back to “(j,i)”. This may or may not be better, depending on whether the extra cost of the host-side transposes is less than the performance you gain from having stride-1 data access.
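The host-side round trip described above could look roughly like the following sketch. It uses only the standard transpose intrinsic; the array names a and a_t, the sizes, and the fill values are assumptions, and the device copy and kernel launch are left as a comment because they depend on your code.

```fortran
program transpose_roundtrip
  implicit none
  ! Assumed sizes, for illustration only
  integer, parameter :: ni = 4096, nj = 64
  real, allocatable :: a(:,:)     ! original host layout, a(j,i)
  real, allocatable :: a_t(:,:)   ! temporary host layout, a_t(i,j)
  real, allocatable :: a_ref(:,:) ! copy kept only to check the round trip
  integer :: i, j

  allocate(a(nj, ni), a_t(ni, nj), a_ref(nj, ni))

  ! Fill with arbitrary test values
  do i = 1, ni
     do j = 1, nj
        a(j, i) = real(j) + 0.001 * real(i)
     end do
  end do
  a_ref = a

  ! (j,i) -> (i,j): the thread index i becomes the stride-1 dimension
  a_t = transpose(a)

  ! Here a_t would be copied to the device, the kernel run on the
  ! (i,j) layout, and the result copied back into a_t.

  ! (i,j) -> (j,i): restore the layout the rest of the host code expects
  a = transpose(a_t)

  if (all(a == a_ref)) then
     print *, 'round trip preserved the data'
  else
     print *, 'round trip changed the data'
  end if
end program transpose_roundtrip
```

The two transposes are pure overhead, so it is worth timing them against the kernel speedup before committing to this approach.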
If the array is read-only, you could also take advantage of texture memory, which has faster random memory access but cannot be written from the kernel.
Hope this helps,
Mat
Hi Mat,
Because the elements of array(1:nj,i) interact with each other in my program, I thought it would be better to let one thread calculate the interactions within one column.
If I now switch i and j to have array(i,j), my code would look like this:
tid=threadidx%x
bid=blockidx%x
i=(bid-1)*blockdim%x+tid
calculation using row i array(i,1:nj)
Do you think this would be more efficient?
Thanks,
Lam
“Because in my program array(1:nj,i) interact with each other so I thought it’s better to let one thread calculate interaction among one column.”
On a CPU you’d want that, but because of how GPU threads access memory, you want the thread index to map to the stride-1 dimension.
“Do you think this would be more efficient?”
The memory access will be more efficient. It doesn’t guarantee that the overall performance will be better but in most cases it does help.
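As a rough illustration of the swapped layout, here is a minimal CUDA Fortran sketch with one thread per i and the array stored as a(i,j). The names process_rows, res, and tpb, the sizes, and the multiply-accumulate body are placeholders that stand in for the real calculation; the loop stops at nj-1 on the assumption that the array has exactly nj entries along j. It should build with something like nvfortran -cuda, though that is not tested against your setup.

```fortran
module column_kernels
  use cudafor
  implicit none
contains

  ! One thread per i. Because a is stored as a(i,j), neighbouring
  ! threads read neighbouring addresses at every iteration of the j loop.
  attributes(global) subroutine process_rows(a, res, ni, nj)
    integer, value :: ni, nj
    real :: a(ni, nj), res(ni)
    integer :: i, j
    real :: acc

    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i > ni) return          ! guard the last, partially filled block

    acc = 0.0
    do j = 1, nj - 1
       ! Placeholder for the real calculation with a(i,j) and a(i,j+1)
       acc = acc + a(i, j) * a(i, j + 1)
    end do
    res(i) = acc
  end subroutine process_rows

end module column_kernels

program main
  use cudafor
  use column_kernels
  implicit none
  ! Assumed sizes and block size, for illustration only
  integer, parameter :: ni = 4096, nj = 64, tpb = 128
  real, allocatable :: a(:,:), res(:)
  real, device, allocatable :: a_d(:,:), res_d(:)

  allocate(a(ni, nj), res(ni), a_d(ni, nj), res_d(ni))
  a = 1.0

  a_d = a                                   ! host -> device copy
  call process_rows<<<(ni + tpb - 1) / tpb, tpb>>>(a_d, res_d, ni, nj)
  res = res_d                               ! device -> host copy

  print *, 'res(1) =', res(1)
end program main
```

Each thread still walks along j on its own, so the interaction within one i stays inside a single thread; only the storage order changes.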