Help with kernel

Hi everyone.

I’ve been wondering if the following kernel would have problems with non-coalesced accesses,
and if someone knows how I can improve that performance problem:

global
void uncompress(float *a, char *b, char *s, int N )
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int idx = (int) a[i];

if ( i < N )
{
    b[i] = s[idx];
}

}

Ted.

The load/store of idx should coalesce, but the load and store of the char won’t, both because there is no coalesced access for byte sized types, and because of the index values themselves and how much “entropy” there is in them. You might want to look at using a texture for reads from s.

Interesting, I imagine this one would have the same problem for non-coalesced accesses because of the index, but would there be a way to coalesce the accesses since it is a float type.

global

void writeChangeLocations(float *x, float *c, int N )

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

float Cvalue = 1.0;

int idx = x[i];

if ( i < N && idx != 0)

{

    c[idx] = Cvalue;

}

}

Ted.

In this one, the index read should coalesce, the write won’t. Coalescing the write would require sequential idx values, and I am guessing that isn’t hat you have

If it is almost what you have (so there is some order or manageable range in the idx values a block will get), then one way that you might be able to at least partiailly engineer it is to use a shared memory buffer for the output. If a block chooses an output idx range, and puts stored values into the output buffer in order, then a warp or half warp of threads can then do the write in coalesce chunks to the indexes that need to be stored. Threads that have to write outside the range of the block would still be uncoalesced, but even 50% coalesced would be better than nothing.

This would be for GPUs with computability less than 1.2 though right?

GPUs with computability of 1.2 or above would coalesce the writes, no?

Ted

Probably not. Compute 1.2 and above can add an additional transaction to a load or store request in cases where coalescing rules would otherwise force full serialisation of the load or store. But that is all it can do, So if your half warp write needs to spread data over a wide range of indices it will still be serialized.

Hmm I checked in the profiler, and all the loads and stores are coalesced. How reliable is the profiler for this?