CUDA Fortran, 2D arrays in loops

lamtnguyen · June 9, 2014, 2:57am

Dear all,

I have a calculation on a 2D array like this:

do i = 1, ni
   do j = 1, nj
      ! calculation with array(j,i) and array(j+1,i)
   end do
end do

This looks fine on the CPU, as the first array index is the inner loop variable.

But if I run this code on the GPU like this:

tid = threadidx%x
bid = blockidx%x
i = (bid-1)*blockdim%x + tid

do j = 1, nj
   ! calculation with array(j,i) and array(j+1,i)
end do

Then it looks bad because the first array index is not the thread index.

Do you have suggestions on how I can optimize this code? In my code normally ni is large (~ thousands) but nj is small (< 100).

Thanks,

Lam

MatColgrove · June 9, 2014, 3:33pm

Hi Lam,

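For best performance you want consecutive threads to access consecutive elements along the stride-1 dimension (in Fortran this is the first index, i.e. going down a column). Otherwise memory divergence occurs: the accesses of neighbouring threads cannot be coalesced into a few wide memory transactions.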

If you can, swap the i and j indices, e.g. array(i,j).
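As a rough illustration of the swapped layout, here is a minimal sketch of a kernel in which each thread still handles one i, but the array is indexed as a(i,j). The kernel name sweep, the result array res, the nj+1 extent (needed because j+1 is read), and the multiply-add standing in for the real calculation are all assumptions for illustration, not taken from the original code.

```fortran
module column_kernels
  use cudafor
  implicit none
contains
  ! One thread per i. With a(i,j), threads i, i+1, ... of a warp read
  ! adjacent memory locations for each fixed j.
  attributes(global) subroutine sweep(a, res, ni, nj)
    integer, value :: ni, nj
    real, intent(in)  :: a(ni, nj+1)   ! assumed extent: element j+1 is read
    real, intent(out) :: res(ni)       ! hypothetical per-i result
    integer :: i, j
    real    :: acc

    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i > ni) return                 ! guard the partially filled last block

    acc = 0.0
    do j = 1, nj
       ! placeholder for the real calculation on a(i,j) and a(i,j+1)
       acc = acc + a(i, j) * a(i, j+1)
    end do
    res(i) = acc
  end subroutine sweep
end module column_kernels
```

With device arrays a_d(ni,nj+1) and res_d(ni), a launch could look like call sweep<<<(ni + 127)/128, 128>>>(a_d, res_d, ni, nj). The loop over j stays inside one thread, so the interaction along j is still handled by a single thread.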

If your host array must be arranged as “(j,i)”, then consider creating a temporary host array with an “(i,j)” layout: transpose “(j,i)” to “(i,j)”, copy the temp array to the device, perform the computation, copy it back, then transpose it back to “(j,i)”. Whether this is a net win depends on whether the extra host-side cost is smaller than the performance you gain by having stride-1 data access.
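The round trip described above could be sketched roughly as follows. The array sizes and the names a, at and at_d are made up for illustration, and the kernel launch is left as a comment because the actual computation is not shown in this thread.

```fortran
program transpose_roundtrip
  use cudafor
  implicit none
  integer, parameter :: ni = 4096, nj = 64   ! hypothetical sizes
  real, allocatable         :: a(:,:)        ! original host layout (j,i)
  real, allocatable         :: at(:,:)       ! temporary host layout (i,j)
  real, allocatable, device :: at_d(:,:)     ! device copy in (i,j) layout

  allocate(a(nj, ni))
  allocate(at(ni, nj))
  allocate(at_d(ni, nj))
  call random_number(a)

  at   = transpose(a)      ! (j,i) -> (i,j) on the host
  at_d = at                ! host-to-device copy by assignment

  ! ... launch the kernel that indexes at_d(i,j) here ...

  at = at_d                ! device-to-host copy
  a  = transpose(at)       ! back to the original (j,i) layout

  deallocate(a, at)
  deallocate(at_d)
end program transpose_roundtrip
```

The two transpose calls and the two copies are the extra cost mentioned above, so this only pays off if the kernel gains more than they take.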

If the array is read-only in the kernel, you could also take advantage of texture memory, which handles non-sequential access better but can only be read, not written, from the kernel.

Hope this helps,
Mat

lamtnguyen · June 9, 2014, 11:41pm

Hi Mat,

Because in my program the elements of array(1:nj,i) interact with each other, I thought it would be better to let one thread calculate the interactions within one column.

If I now switch i and j to get array(i,j), my code would look like this:

tid = threadidx%x
bid = blockidx%x
i = (bid-1)*blockdim%x + tid

! calculation using row i: array(i,1:nj)

Do you think this would be more efficient?

Thanks,

Lam

MatColgrove · June 10, 2014, 6:37pm

> Because in my program the elements of array(1:nj,i) interact with each other, I thought it would be better to let one thread calculate the interactions within one column.

On a CPU you’d want that, but on the GPU, because of how the threads in a warp access memory together, you want the thread index to map to the stride-1 dimension.

> Do you think this would be more efficient?

The memory access will be more efficient. It doesn’t guarantee that the overall performance will be better but in most cases it does help.
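Since the overall gain is not guaranteed, one way to find out is simply to time both layouts. The following is only a sketch under assumptions: the kernel names, the sizes, the nj+1 extent, and the multiply-add standing in for the real calculation are all invented for illustration. It uses CUDA events from the cudafor module for timing.

```fortran
module layout_kernels
  use cudafor
  implicit none
contains
  ! Original layout: neighbouring threads are nj+1 elements apart.
  attributes(global) subroutine sweep_ji(a, res, ni, nj)
    integer, value :: ni, nj
    real :: a(nj+1, ni), res(ni)
    integer :: i, j
    real :: acc
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i > ni) return
    acc = 0.0
    do j = 1, nj
       acc = acc + a(j, i) * a(j+1, i)   ! placeholder calculation
    end do
    res(i) = acc
  end subroutine sweep_ji

  ! Swapped layout: neighbouring threads read adjacent elements.
  attributes(global) subroutine sweep_ij(a, res, ni, nj)
    integer, value :: ni, nj
    real :: a(ni, nj+1), res(ni)
    integer :: i, j
    real :: acc
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i > ni) return
    acc = 0.0
    do j = 1, nj
       acc = acc + a(i, j) * a(i, j+1)   ! same placeholder calculation
    end do
    res(i) = acc
  end subroutine sweep_ij
end module layout_kernels

program time_layouts
  use cudafor
  use layout_kernels
  implicit none
  integer, parameter :: ni = 8192, nj = 64, tpb = 128   ! hypothetical sizes
  real, allocatable         :: a(:,:), at(:,:)
  real, allocatable, device :: a_d(:,:), at_d(:,:), res_d(:)
  type(cudaEvent) :: t0, t1
  real    :: ms_ji, ms_ij
  integer :: istat, nblk

  allocate(a(nj+1, ni), at(ni, nj+1))
  allocate(a_d(nj+1, ni), at_d(ni, nj+1), res_d(ni))
  call random_number(a)
  at   = transpose(a)
  a_d  = a                      ! host-to-device copies
  at_d = at
  nblk = (ni + tpb - 1) / tpb   ! enough blocks to cover all i

  istat = cudaEventCreate(t0)
  istat = cudaEventCreate(t1)

  istat = cudaEventRecord(t0, 0)
  call sweep_ji<<<nblk, tpb>>>(a_d, res_d, ni, nj)
  istat = cudaEventRecord(t1, 0)
  istat = cudaEventSynchronize(t1)
  istat = cudaEventElapsedTime(ms_ji, t0, t1)

  istat = cudaEventRecord(t0, 0)
  call sweep_ij<<<nblk, tpb>>>(at_d, res_d, ni, nj)
  istat = cudaEventRecord(t1, 0)
  istat = cudaEventSynchronize(t1)
  istat = cudaEventElapsedTime(ms_ij, t0, t1)

  print *, 'array(j,i) kernel time (ms): ', ms_ji
  print *, 'array(i,j) kernel time (ms): ', ms_ij

  istat = cudaEventDestroy(t0)
  istat = cudaEventDestroy(t1)
end program time_layouts
```

The first kernel launch in a program can include one-time startup costs, so it is worth running each kernel a few times and comparing the later timings. With the real calculation in place of the placeholder, the difference may be larger or smaller than what this toy loop shows.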

Mat

lamtnguyen · June 11, 2014, 1:22am

Thanks Mat,

Lam