gld coalesced = 0, but addresses are aligned!

Hi, I have a NxDIM matrix, storing N DIM-dimensional data items. I’m trying to write a kernel that computes the distances from each data item to other DIM-dimensional points, stored in a second matrix.
I have created a grid of blocks, and each thread first of all loads one cell of the NxDIM matrix (stored in global device memory) into a shared memory location. Now my problem is that the profiler shows that ALL loads from global memory are uncoalesced, but printing the actual pointers values used to access the global memory reveals that the addresses are indeed aligned to 4 bytes. This is what I get if I store the pointers to another NxDIM matrix and then print it:

2162688 2162692 2162696 2162700 2162704 2162708 2162712 2162716 2162720 2162724 2162728 2162732 2162736 2162740 2162744 2162748
2162752 2162756 2162760 2162764 2162768 2162772 2162776 2162780 2162784 2162788 2162792 2162796 2162800 2162804 2162808 2162812

The starting matrix address is 2162688, which makes me think that all loads ought to be coalesced, actually. I’m running my code on a 8600M GT (compute cap. 1.1).
What could be wrong with my app?

Thank you in advance, just let me know if you need any more details…

Coalescing doesn’t just require that the load/store addresses are type size aligned, it also requires that half-warps load/store from contiguous segments of 16 * sizeof(type). What you have posted seems to confirm that the first criteria is satisfied, but what about the second?

This is how I allocate the matrix in device space:

cudaMallocPitch((void**)&dataset, &d_pitch, dpadd*sizeof(T), nitpadd)

where dpadd is a multiple of 16 (exactly 16, in this case) and T is float.

It should be fine and, AFAIK, there is also some “overabundance” of padding, in the sense that since I’m using pitched memory I could have done without manually padding DIM to a multiple of 16…

Am I getting it right?

What about the load/store patterns in the device code? The GPU can only coalesce if they match the access patterns described in the programming guide.

Again, they look fine to me…

Here’s how I compute the index of the row to be accessed:

// computing item index

i = threadIdx.x + (blockIdx.x * gridDim.y + blockIdx.y) * BLOCK_ROWS;

// load datum into shared memory (each thread loads its dimension)	

datum[threadIdx.x*dpadd+threadIdx.y] = *((float*)((char*)dataset + i*d_pitch + threadIdx.y*sizeof(float)));

Note that I have blocks of BLOCK_ROWS rows, each with dpadd ( = 16 ) columns, so that the threads in a row access a row of the NxDIM matrix; besides, they should form a half-warp, thus coalescing memory accesses. Those thread blocks are organized in a grid so that grid_sizegrid_sizeBLOCK_ROWS = N + padd (a padding added so that no thread accesses any address that lies outside the matrix).

I wouldn’t like to guess what the compiler makes of that. What type is dataset declared as?

Uh it’s just a plain float.
By the way, I’ve been profiling an older version of the program, and I’ve noticed with great disappointment that all loads are coalesced!
The difference is that, instead of creating a 2D grid of blocks, each processing BLOCK_ROWS data items, that version creates a 1d grid of blocks, each processing a single data items. So basically I had a grid of Nx1 blocks, which couldn’t process large dataset. But still, that was coalesced, and I can’t figure out why…

What is d_pitch ?

That’s the pitch value returned from cudaMallocPitch; I pass it as a parameter to my kernel so it can perform aligned memory accesses, as shown in the programming guide

OK in that case there is no way that can coalesce. The only possible way that could be coalesced (I think) was if d_pitch was sizeof(float). Threads are ordered in column major order (blockIdx.x is the fastest varying dimension within a block). With the *char cast, it would have to be sizeof(float) for each thread in a half warp to read from contiguous addresses.

I must say that is probably the most convoluted and ugly indexing code I think I have ever seen. I must have stared at it for about 5 minutes and I still don’t understand what it is you are trying to do…

Well I’ve managed to solve the problem, now I get all loads coalesced. There was a major problem in my indexing scheme: I assumed threads were organized in row-major order (I guess you mean threadIdx.x - not blockIdx.x - is the fastest varying dimension within a block, but ok I got the point). The old version did manage to coalesce all loads because blocks were 1x16, so all threads were “assigned” to the same half-warp.

So a big thanks goes to you for the patience you have shown (mea culpa). As for the ugly indexing scheme, I am trying to handle LARGE datasets, with indices spanning in quite a wide range, and that mapping was a way I’ve come up with to handle them while still mantaining decent levels of occupancy. Any suggestion is highly welcome.

Thank you again, have a nice day ;)