I am tuning some CUDA code at the moment. I made some changes and was expecting to see a large speedup, but it made no difference.
I have some arrays of data that I am processing with blocks of threads. Successive threads process non-sequential entries in the arrays, looked up through an itemIndex array that lists the items to be processed. Something like this…
__global__ void MyFunction(int *itemIndex, int itemCount, int *d_p1, int *d_p2,
                           float *d_length, float *d_percent, float *d_precalc,
                           float *d_output)
{
    int i = (blockIdx.x * THREADSPERBLOCK) + threadIdx.x;
    if (i >= itemCount) {
        return;
    }
    int lcix = itemIndex[i]; // get the next item to process

    // Copy global data into registers
    int p1 = d_p1[lcix];
    int p2 = d_p2[lcix];
    float length = d_length[lcix];
    float percent = d_percent[lcix];
    float precalc = d_precalc[lcix];
    float result;

    // Process item
    // ... various calculations on the data

    d_output[lcix] = result;
}
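For reference, after my preprocessing the indirection should be redundant, so the access pattern I expected to be fully coalesced is essentially the direct-indexed version below (a sketch only; the calculation body is elided as above, and MyFunctionDirect is just a name I'm using here):

```
__global__ void MyFunctionDirect(int itemCount, int *d_p1, int *d_p2,
                                 float *d_length, float *d_percent,
                                 float *d_precalc, float *d_output)
{
    int i = (blockIdx.x * THREADSPERBLOCK) + threadIdx.x;
    if (i >= itemCount) {
        return;
    }
    // Thread i reads element i directly, so the 32 threads of a warp
    // touch 32 consecutive entries of each array, and the itemIndex
    // lookup disappears entirely.
    int p1 = d_p1[i];
    int p2 = d_p2[i];
    float length = d_length[i];
    float percent = d_percent[i];
    float precalc = d_precalc[i];
    float result;

    // ... various calculations on the data

    d_output[i] = result;
}
```

My understanding is that this should issue far fewer memory transactions per warp than a scattered itemIndex would, which is why I expected the reordering alone to show up in the timings.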
Now originally, the itemIndex array was somewhat random, but in practice it ended up processing alternate items {0,2,4,6, etc}, so I did some preprocessing on the arrays to make itemIndex sequential, expecting a significant speedup, but I saw no improvement at all. Just to confirm, I also tried a completely random ordering of the items, so that itemIndex was something like {18427, 3123, 23014, 15923, 8129, etc}, and it still made no difference to the execution time.
From what I’ve read, reading from global memory is slow, but when the threads of a warp read consecutive entries the accesses should be coalesced into far fewer transactions and therefore be much faster. I’m running several thousand threads, in blocks of 32 threads.
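In case my measurement itself is the problem: this is the kind of timing I mean, a minimal sketch using CUDA events rather than my exact harness (blocks and the kernel arguments are placeholders):

```
// Warm-up launch so the first-launch overhead isn't timed
MyFunction<<<blocks, THREADSPERBLOCK>>>(/* ...args... */);
cudaDeviceSynchronize();

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
MyFunction<<<blocks, THREADSPERBLOCK>>>(/* ...args... */);
cudaEventRecord(stop);
cudaEventSynchronize(stop); // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

With this the sequential and random itemIndex orderings still come out essentially identical for me.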
Any ideas? Am I correct that in-order access should be faster?