I believe there is an issue with the CUDA compiler/optimizer that actually forces my particular code to be uncoalesced, even though the data is structured in such a way that coalescing should occur. I can, in fact, force the issue and make it be coalesced, but it is a total hack to make that be the case. Can someone either a) confirm that this is a real issue or B) tell me how I don’t know what I am talking about? Either would work for me!
Some background - I have an array of complex data, called in. in is actually stored as a (float2 *).
The kernel in question needs to return the real part of in. The following kernel yields many uncoalesced reads:
global void getReals(int N, float *out, float2 *in)
{
unsigned int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int gridSize = blockDim.x * gridDim.x; // stride by a gridsize number of threads
while(idx < N)
{
out[idx] = in[idx].x;
idx += gridSize;
}
}
The following simple tweak yields all coalesced reads, is substantially faster, and of course, yields the wrong answer:
global void getReals(int N, float *out, float2 *in)
{
unsigned int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int gridSize = blockDim.x * gridDim.x; // stride by a gridsize number of threads
while(idx < N)
{
out[idx] = in[idx].x + in[idx].y * 0.00000000000001;
idx += gridSize;
}
}
So, what seems to be going on is that the compiler, in the first case, sees through my attempt at reading in a float2 and instead reads in a float (since it realizes that I never use the .y part). What this means is that successive threads read non-contiguous floats, and I get uncoalesced reads.
In the second case, I use a teeny bit of .y part - just enough to make the compiler not be too smart for its own good, and I get entirely coalesced results. Plus, the code is about 3x faster. But of course, I don’t like this solution - a) its wrong, and B) its a hack.
Any thoughts?
FWIW - I am using CUDA 2.3.0, on a 9800 GTX.
Thanks!
Eddie
ps -I have already tried this:
float2 temp; temp = in[idx]; out[idx] = temp.x; <<— Doesn’t help