Uncoalesced memory access vs. more loads from GM: optimization advice needed!

Which pays off better: coalesced memory access or less communication with global memory?

A large array needs to be recalculated on CUDA in each kernel invocation. The array is stored in global memory and loaded into shared memory for the computation.

When calculating the new value of an element, each thread requires the values of all of that element's adjacent elements (in 2D). It seems that in this setup we can't have both, so an expert comment on the following three scenarios (or a recommendation of a better one ;) ) would be highly appreciated.
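For concreteness, the sketches attached to the scenarios below all assume a dummy per-element update of roughly this shape; the real rule doesn't matter for the memory question, so a plain 3x3 average stands in for it:

```cuda
// Placeholder update rule: a plain 3x3 average stands in for the real
// (unspecified) rule. Each thread needs its own value plus all 8
// neighbours in 2D, which is what drives the loading schemes below.
__device__ float updateRule(const float n[3][3])
{
    float sum = 0.0f;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            sum += n[i][j];
    return sum / 9.0f;
}
```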

Scenario 1:

- Array is logically divided into 16x16 slices

- Each 16x16 Thread Block recalculates one 16x16 slice

- Threads make 1 LD from GM to SM

- Additionally, threads working on the boundary elements make an extra LD of an element from an adjacent block into a register

- After the calculation each thread makes 1 WR to the corresponding 16x16 slice element in GM

High register usage (ca. 20), worst case of 3 GM accesses per calculation, but all of the first loads can be done in a coalesced way (see the sketch below).
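A rough sketch of what I mean here (not the actual kernel): `in`, `out`, `w`, `h` are made-up names, the array is assumed row-major with width and height multiples of 16, and the diagonal neighbours are dropped to keep it short; the corner threads would need one extra register each for those.

```cuda
#define TILE 16

// Scenario 1 sketch: 16x16 block, coalesced tile load, extra register
// loads for the halo, coalesced write-back. Placeholder update rule.
__global__ void step_scenario1(const float *in, float *out, int w, int h)
{
    __shared__ float tile[TILE][TILE];

    int gx = blockIdx.x * TILE + threadIdx.x;   // global column
    int gy = blockIdx.y * TILE + threadIdx.y;   // global row

    // 1 coalesced LD per thread: the block's own 16x16 slice.
    tile[threadIdx.y][threadIdx.x] = in[gy * w + gx];

    // Boundary threads make the extra LD from the adjacent slice into a register.
    float left = 0.0f, right = 0.0f, up = 0.0f, down = 0.0f;
    if (threadIdx.x == 0        && gx > 0)     left  = in[gy * w + gx - 1];
    if (threadIdx.x == TILE - 1 && gx < w - 1) right = in[gy * w + gx + 1];
    if (threadIdx.y == 0        && gy > 0)     up    = in[(gy - 1) * w + gx];
    if (threadIdx.y == TILE - 1 && gy < h - 1) down  = in[(gy + 1) * w + gx];

    __syncthreads();

    // Neighbours come from shared memory inside the tile and from the
    // registers on the tile border.
    float c = tile[threadIdx.y][threadIdx.x];
    float l = (threadIdx.x == 0)        ? left  : tile[threadIdx.y][threadIdx.x - 1];
    float r = (threadIdx.x == TILE - 1) ? right : tile[threadIdx.y][threadIdx.x + 1];
    float u = (threadIdx.y == 0)        ? up    : tile[threadIdx.y - 1][threadIdx.x];
    float d = (threadIdx.y == TILE - 1) ? down  : tile[threadIdx.y + 1][threadIdx.x];

    // 1 coalesced WR of the result back to the same slice element.
    out[gy * w + gx] = 0.2f * (c + l + r + u + d);   // placeholder rule
}
```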

Scenario 2:

  • Array is logically divided into 16x16 slices

  • Each 18x18 Thread Block recalculates one 16x16 slice

  • Threads make 1 LD from GM to SM

  • After the calculation each inner thread makes 1 WR to the corresponding 16x16 slice element in GM

Low register usage (ca. 7), 2 GM accesses per calculation, but the loads are uncoalesced due to the offsets introduced by the extra columns and rows (see the sketch below).
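Again a rough sketch with the same made-up names and placeholder rule; the clamping at the array border is just one possible way to handle the edges. With 18x18 = 324 threads the block still fits under the threads-per-block limit.

```cuda
#define TILE 16
#define BLK  (TILE + 2)   // 18x18 block = 324 threads

// Scenario 2 sketch: the whole 18x18 tile (slice + halo) is loaded by the
// block itself; the -1 shift breaks the alignment, hence the uncoalesced loads.
__global__ void step_scenario2(const float *in, float *out, int w, int h)
{
    __shared__ float tile[BLK][BLK];

    int gx = (int)(blockIdx.x * TILE + threadIdx.x) - 1;
    int gy = (int)(blockIdx.y * TILE + threadIdx.y) - 1;

    // Clamp at the array border so the halo threads always read something valid.
    gx = min(max(gx, 0), w - 1);
    gy = min(max(gy, 0), h - 1);

    tile[threadIdx.y][threadIdx.x] = in[gy * w + gx];   // 1 LD per thread, uncoalesced
    __syncthreads();

    // Only the inner 16x16 threads compute and write back.
    if (threadIdx.x >= 1 && threadIdx.x <= TILE &&
        threadIdx.y >= 1 && threadIdx.y <= TILE)
    {
        float sum = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                sum += tile[threadIdx.y + dy][threadIdx.x + dx];

        out[gy * w + gx] = sum / 9.0f;   // 1 WR, placeholder rule
    }
}
```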

Scenario 3:

  • Array is transformed on the host so that it contains additional rows and columns that reflect the values of the boundary elements

  • Array is logically divided into 18x18 slices, where only the inner 16x16 elements contain the original information

  • Each 18x18 Thread Block recalculates the inner 16x16 elements of its 18x18 slice

  • Each Thread loads 1 element into SM

  • Only the inner 16x16 threads do the calculation

  • The inner 16x16 threads write their results to GM; after a sync, the other threads also take these results from the 16x16 calculation and write them to the padded array columns for the next iteration.

Low register usage, fully coalesced RD, uncoalesced WR (approx. 1 thread per warp diverging); see the sketch below.
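A sketch of this one, assuming the padded array is stored tile-by-tile (each 18x18 slice as 324 consecutive floats, which is what makes the read a single coalesced stream); `in`, `out` and `tilesX` are made-up names, and only the left/right halo columns are copied here, the top/bottom rows and corners would be handled the same way.

```cuda
#define TILE 16
#define PAD  (TILE + 2)          // 18
#define TSZ  (PAD * PAD)         // 324 floats per padded tile

// Scenario 3 sketch: fully coalesced tile read, inner 16x16 compute,
// border results replicated into the neighbours' halo cells for the
// next iteration. Placeholder update rule.
__global__ void step_scenario3(const float *in, float *out, int tilesX)
{
    __shared__ float tile[PAD][PAD];

    int tileId = blockIdx.y * tilesX + blockIdx.x;
    const float *src = in  + tileId * TSZ;
    float       *dst = out + tileId * TSZ;

    // Fully coalesced RD: consecutive threads stream consecutive addresses
    // of the block's own contiguous 324-float tile.
    int lin = threadIdx.y * PAD + threadIdx.x;
    (&tile[0][0])[lin] = src[lin];
    __syncthreads();

    int tx = threadIdx.x, ty = threadIdx.y;
    bool inner = (tx >= 1 && tx <= TILE && ty >= 1 && ty <= TILE);

    float result = 0.0f;
    if (inner)
    {
        for (int dy = -1; dy <= 1; ++dy)        // placeholder 3x3 rule
            for (int dx = -1; dx <= 1; ++dx)
                result += tile[ty + dy][tx + dx];
        result /= 9.0f;
    }
    __syncthreads();                            // everyone has read the old values

    if (inner)
    {
        tile[ty][tx]       = result;            // stash for the halo threads below
        dst[ty * PAD + tx] = result;            // WR of the inner 16x16 result
    }
    __syncthreads();

    // The remaining threads copy the fresh border results into the halo
    // cells of the adjacent tiles (uncoalesced, few threads per warp).
    if (tx == 0 && ty >= 1 && ty <= TILE && blockIdx.x > 0)
        out[(tileId - 1) * TSZ + ty * PAD + (PAD - 1)] = tile[ty][1];
    if (tx == PAD - 1 && ty >= 1 && ty <= TILE && blockIdx.x < tilesX - 1)
        out[(tileId + 1) * TSZ + ty * PAD + 0] = tile[ty][TILE];
}
```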

Hello,
I am trying to implement a cellular automaton on CUDA, and obviously I have the same problem and found the same solutions.

Have you done any benchmarks on these 3 solutions?

I don't understand why your reads are coalesced in solution 3.

I would love to revive this, as I am working through the same issues. Any solutions yet? With Cg I've obtained a ~35x speedup, but I can't get CUDA to even reach CPU speed.

Thanks