What pays off better: coalesced memory access or less communication with global memory?
A large array needs to be recalculated in CUDA on each kernel invocation. The array is stored in global memory and loaded into shared memory for the computation.
When calculating the new value of an element, each thread requires the values of all of its (2D-)adjacent array elements. It seems that in this setup we can't have both, so expert comments on the three scenarios below (or a recommendation of a better one ;) would be highly appreciated.
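For concreteness, the sketches below all assume a hypothetical per-element update of this shape; the exact formula does not matter for the question, so a plain 9-point average over the element and its 8 neighbours stands in for it:

```
// Hypothetical stand-in for the per-element update: the new value depends on
// the old value of the element and all 8 of its 2D-adjacent neighbours.
// `n` is the 3x3 neighbourhood around the element; a plain average is used.
__device__ float updateElement(const float n[3][3])
{
    float sum = 0.0f;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            sum += n[i][j];
    return sum / 9.0f;
}
```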
Scenario 1:
- Array is logically divided into 16x16 slices
- Each 16x16 thread block recalculates one 16x16 slice
- Threads make 1 LD from GM to SM
- Additionally, threads working on the boundary elements make an extra LD of an element from an adjacent block into a register
- After the calculation, each thread makes 1 WR to the corresponding 16x16 slice element in GM
High register usage (ca. 20) and a worst case of 3 GM accesses per calculation, but all first loads can be done in a coalesced way (rough sketch below).
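A rough sketch of what I have in mind for this scenario. It is simplified to the 4 edge neighbours for brevity (the full 8-neighbour case needs additional corner loads); names, the averaging update, the zero-valued array border, and width/height being multiples of 16 are all placeholder assumptions:

```
#define TILE 16

__global__ void stencilBlock16(const float* __restrict__ in, float* out,
                               int width, int height)
{
    __shared__ float sm[TILE][TILE];

    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;

    // 1 coalesced LD from GM to SM per thread.
    sm[threadIdx.y][threadIdx.x] = in[gy * width + gx];

    // Boundary threads additionally fetch the neighbour that lives in the
    // adjacent block directly into a register (extra, partly uncoalesced LDs).
    float left   = (threadIdx.x == 0        && gx > 0)          ? in[gy * width + gx - 1]   : 0.0f;
    float right  = (threadIdx.x == TILE - 1 && gx < width - 1)  ? in[gy * width + gx + 1]   : 0.0f;
    float top    = (threadIdx.y == 0        && gy > 0)          ? in[(gy - 1) * width + gx] : 0.0f;
    float bottom = (threadIdx.y == TILE - 1 && gy < height - 1) ? in[(gy + 1) * width + gx] : 0.0f;

    __syncthreads();

    // Inner neighbours come from shared memory, halo neighbours stay in registers.
    if (threadIdx.x > 0)        left   = sm[threadIdx.y][threadIdx.x - 1];
    if (threadIdx.x < TILE - 1) right  = sm[threadIdx.y][threadIdx.x + 1];
    if (threadIdx.y > 0)        top    = sm[threadIdx.y - 1][threadIdx.x];
    if (threadIdx.y < TILE - 1) bottom = sm[threadIdx.y + 1][threadIdx.x];

    // Hypothetical update (average of the element and its 4 edge neighbours),
    // followed by 1 coalesced WR back to GM.
    out[gy * width + gx] = (sm[threadIdx.y][threadIdx.x] + left + right + top + bottom) / 5.0f;
}
```

This would be launched with one 16x16 block per slice.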
Scenario 2:
- Array is logically divided into 16x16 slices
- Each 18x18 thread block recalculates one 16x16 slice
- Threads make 1 LD from GM to SM
- After the calculation, each inner thread makes 1 WR to the corresponding 16x16 slice element in GM

Low register usage (ca. 7) and only 2 GM accesses per calculation, but they are uncoalesced due to the offsets introduced by the extra columns and rows (rough sketch below).
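A rough sketch of this variant, again with placeholder names and the plain 9-point average standing in for the real update; array borders are handled by simple clamping just to keep the sketch self-contained:

```
#define TILE  16
#define BLOCK (TILE + 2)                 // 18x18 threads per block

__global__ void stencilBlock18(const float* __restrict__ in, float* out,
                               int width, int height)
{
    __shared__ float sm[BLOCK][BLOCK];

    // Shift by -1 so the block covers the one-element halo around its tile;
    // this offset is what makes the loads uncoalesced in this scenario.
    int gx = min(max((int)(blockIdx.x * TILE + threadIdx.x) - 1, 0), width  - 1);
    int gy = min(max((int)(blockIdx.y * TILE + threadIdx.y) - 1, 0), height - 1);

    sm[threadIdx.y][threadIdx.x] = in[gy * width + gx];    // 1 LD per thread
    __syncthreads();

    // Only the inner 16x16 threads do the calculation and the single WR.
    if (threadIdx.x >= 1 && threadIdx.x <= TILE &&
        threadIdx.y >= 1 && threadIdx.y <= TILE)
    {
        float sum = 0.0f;
        for (int dy = -1; dy <= 1; ++dy)                   // hypothetical 9-point average
            for (int dx = -1; dx <= 1; ++dx)
                sum += sm[threadIdx.y + dy][threadIdx.x + dx];
        out[gy * width + gx] = sum / 9.0f;
    }
}
```

This would be launched with one 18x18 block per 16x16 slice, so 68 of the 324 threads in a block only load and then idle.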
Scenario 3:
- Array is transformed on the host so that it contains additional rows and columns that reflect the values of the boundary elements
- Array is logically divided into 18x18 slices, where only the inner 16x16 elements contain the original information
- Each 18x18 thread block recalculates the inner 16x16 elements of its 18x18 slice
- Each thread loads 1 element into SM
- Only the inner 16x16 threads do the calculation
- The inner 16x16 threads write their results to GM; after a sync, the remaining threads take those same results and write them to the padded rows and columns for the next iteration
Low register usage, fully coalesced RDs, uncoalesced writes (approx. 1 thread per warp diverging); a rough sketch follows.
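The sketch below is only my reading of this variant: it assumes the padded array is an (18*nTilesX) x (18*nTilesY) grid in which every 18x18 block belongs to one slice (inner 16x16 = original data, border = copies of the neighbouring slices' boundary elements), that `in` and `out` are two such padded buffers swapped between iterations, and that the last step means the halo threads push the freshly computed boundary values into the padding of the adjacent slices. Names and the averaging update are again placeholders:

```
#define TILE  16
#define SLICE (TILE + 2)                 // 18: tile plus one padding element per side

__global__ void stencilPadded(const float* __restrict__ in, float* out,
                              int widthP, int heightP)    // padded dimensions
{
    __shared__ float sm[SLICE][SLICE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int gx = blockIdx.x * SLICE + tx;    // position in the padded array
    int gy = blockIdx.y * SLICE + ty;

    // One LD per thread from the padded array; this is the read the
    // scenario counts as fully coalesced.
    sm[ty][tx] = in[gy * widthP + gx];
    __syncthreads();

    // Only the inner 16x16 threads calculate; the result stays in a register
    // until all threads have finished reading the old values from SM.
    bool inner = (tx >= 1 && tx <= TILE && ty >= 1 && ty <= TILE);
    float result = 0.0f;
    if (inner) {
        for (int dy = -1; dy <= 1; ++dy)                   // hypothetical 9-point average
            for (int dx = -1; dx <= 1; ++dx)
                result += sm[ty + dy][tx + dx];
        result /= 9.0f;
    }
    __syncthreads();

    if (inner) {
        sm[ty][tx] = result;                               // expose result to halo threads
        out[gy * widthP + gx] = result;                    // WR of the slice's own data
    }
    __syncthreads();

    // Halo threads copy the freshly computed boundary results into the padding
    // of the adjacent slices so the copies are up to date for the next
    // iteration (the uncoalesced writes of this scenario).
    if (!inner) {
        int srcX = min(max(tx, 1), TILE);                  // nearest inner element
        int srcY = min(max(ty, 1), TILE);
        int dstX = gx + (tx == 0 ? -1 : (tx == SLICE - 1 ? 1 : 0));
        int dstY = gy + (ty == 0 ? -1 : (ty == SLICE - 1 ? 1 : 0));
        if (dstX >= 0 && dstX < widthP && dstY >= 0 && dstY < heightP)
            out[dstY * widthP + dstX] = sm[srcY][srcX];
    }
}
```

This would be launched with one 18x18 block per slice, i.e. a grid of (widthP/18, heightP/18) blocks.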