gld coalesced = 0, but addresses are aligned!

claudiotainen · March 20, 2010, 12:26pm

Hi, I have a NxDIM matrix, storing N DIM-dimensional data items. I’m trying to write a kernel that computes the distances from each data item to other DIM-dimensional points, stored in a second matrix.
I have created a grid of blocks, and each thread first of all loads one cell of the NxDIM matrix (stored in global device memory) into a shared memory location. Now my problem is that the profiler shows that ALL loads from global memory are uncoalesced, but printing the actual pointers values used to access the global memory reveals that the addresses are indeed aligned to 4 bytes. This is what I get if I store the pointers to another NxDIM matrix and then print it:

2162688 2162692 2162696 2162700 2162704 2162708 2162712 2162716 2162720 2162724 2162728 2162732 2162736 2162740 2162744 2162748
2162752 2162756 2162760 2162764 2162768 2162772 2162776 2162780 2162784 2162788 2162792 2162796 2162800 2162804 2162808 2162812
…

The starting matrix address is 2162688, which makes me think that all loads ought to be coalesced, actually. I’m running my code on a 8600M GT (compute cap. 1.1).
What could be wrong with my app?

Thank you in advance, just let me know if you need any more details…

avidday · March 20, 2010, 12:41pm

Coalescing doesn’t just require that the load/store addresses are type size aligned, it also requires that half-warps load/store from contiguous segments of 16 * sizeof(type). What you have posted seems to confirm that the first criteria is satisfied, but what about the second?

claudiotainen · March 20, 2010, 1:32pm

This is how I allocate the matrix in device space:

cudaMallocPitch((void**)&dataset, &d_pitch, dpadd*sizeof(T), nitpadd)

where dpadd is a multiple of 16 (exactly 16, in this case) and T is float.

It should be fine and, AFAIK, there is also some “overabundance” of padding, in the sense that since I’m using pitched memory I could have done without manually padding DIM to a multiple of 16…

Am I getting it right?

avidday · March 20, 2010, 1:47pm

What about the load/store patterns in the device code? The GPU can only coalesce if they match the access patterns described in the programming guide.

claudiotainen · March 20, 2010, 3:23pm

Again, they look fine to me…

Here’s how I compute the index of the row to be accessed:

// computing item index

i = threadIdx.x + (blockIdx.x * gridDim.y + blockIdx.y) * BLOCK_ROWS;

// load datum into shared memory (each thread loads its dimension)	

datum[threadIdx.x*dpadd+threadIdx.y] = *((float*)((char*)dataset + i*d_pitch + threadIdx.y*sizeof(float)));

Note that I have blocks of BLOCK_ROWS rows, each with dpadd ( = 16 ) columns, so that the threads in a row access a row of the NxDIM matrix; besides, they should form a half-warp, thus coalescing memory accesses. Those thread blocks are organized in a grid so that grid_sizegrid_sizeBLOCK_ROWS = N + padd (a padding added so that no thread accesses any address that lies outside the matrix).

avidday · March 20, 2010, 4:06pm

I wouldn’t like to guess what the compiler makes of that. What type is dataset declared as?

claudiotainen · March 20, 2010, 5:00pm

Uh it’s just a plain float.
By the way, I’ve been profiling an older version of the program, and I’ve noticed with great disappointment that all loads are coalesced!
The difference is that, instead of creating a 2D grid of blocks, each processing BLOCK_ROWS data items, that version creates a 1d grid of blocks, each processing a single data items. So basically I had a grid of Nx1 blocks, which couldn’t process large dataset. But still, that was coalesced, and I can’t figure out why…

avidday · March 20, 2010, 5:55pm

What is d_pitch ?

claudiotainen · March 20, 2010, 6:07pm

That’s the pitch value returned from cudaMallocPitch; I pass it as a parameter to my kernel so it can perform aligned memory accesses, as shown in the programming guide

avidday · March 20, 2010, 6:32pm

OK in that case there is no way that can coalesce. The only possible way that could be coalesced (I think) was if d_pitch was sizeof(float). Threads are ordered in column major order (blockIdx.x is the fastest varying dimension within a block). With the *char cast, it would have to be sizeof(float) for each thread in a half warp to read from contiguous addresses.

I must say that is probably the most convoluted and ugly indexing code I think I have ever seen. I must have stared at it for about 5 minutes and I still don’t understand what it is you are trying to do…

claudiotainen · March 20, 2010, 7:12pm

Well I’ve managed to solve the problem, now I get all loads coalesced. There was a major problem in my indexing scheme: I assumed threads were organized in row-major order (I guess you mean threadIdx.x - not blockIdx.x - is the fastest varying dimension within a block, but ok I got the point). The old version did manage to coalesce all loads because blocks were 1x16, so all threads were “assigned” to the same half-warp.

So a big thanks goes to you for the patience you have shown (mea culpa). As for the ugly indexing scheme, I am trying to handle LARGE datasets, with indices spanning in quite a wide range, and that mapping was a way I’ve come up with to handle them while still mantaining decent levels of occupancy. Any suggestion is highly welcome.

Thank you again, have a nice day ;)

Topic		Replies	Views
Correct understanding coalesced memory loading? CUDA Programming and Performance	7	5413	July 30, 2008
Coalesced Loads Question CUDA Programming and Performance	4	1965	July 7, 2008
Uncoalesced reads; Coalesced writes Same access pattern; differenct coalesced I/O outcome? CUDA Programming and Performance	5	3304	December 12, 2011
Coalesced Memory access related doubt CUDA Programming and Performance	13	2238	December 9, 2010
Misaligned starting address for memory coalescing CUDA Programming and Performance	4	3669	March 31, 2011
Uncoalesced on matrix by vector multiplication CUDA Programming and Performance	3	8041	June 24, 2009
Need help on non-coalesced access CUDA Programming and Performance	0	1167	May 9, 2009
Isn't that Coalesced?! writing to global memory in a coalesced way CUDA Programming and Performance	9	10286	June 28, 2009
Coalescing - beginner question CUDA Programming and Performance	10	1884	June 23, 2010
How to understand the alignment of 2D array and fully coalesce of the memory access CUDA Programming and Performance	7	3645	July 27, 2016

gld coalesced = 0, but addresses are aligned!

Related topics