Hey all. I’ve been pounding away at this performance problem for the last 10 hours, and it’s just baffling to me. I can’t seem to figure out what’s going on.
The gist of it is this. I have a kernel organized into four sections:
Part A - 32 threads read data out of global memory in a coalesced fashion, and store the result in shared memory
Part B - 7 threads read the data out of shared memory, perform computations, and accumulate a result (due to the data structure involved, 7 threads right now keeps the code clean and simple)
Part C - 7 threads store the accumulated result in another array in shared memory
Part D - Thread 0 looks at the 7 prior results, finds the best one, and writes a float result back to global memory.
I’ve found that if I omit either Part B or Part C, my code runs roughly 3x faster than a naive kernel I coded earlier. (See the sketch after the pseudocode below for how I toggle the parts on and off.)
But if I include both Part B and Part C, the code slows down dramatically, by as much as 10x! What’s really strange is that Part C is just a single shared memory write. That’s it. There’s no obvious reason it should hurt performance that much.
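To be clear about what “faster” means: I’m comparing wall-clock kernel times, nothing fancier than the usual CUDA event pattern around the launch. A minimal sketch (kernel arguments elided, the same as in the pseudocode below):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<1000, 32>>>(/* about 10 arguments */);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);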
The actual source code computes correct results, but it’s rather complicated and ugly, so I’ll give the pseudocode:
__global__ void myKernel(/* about 10 parameters */)
{
    __shared__ float energy[32];
    __shared__ float shared_geoMap[7][64 + 1];
    float accumulatedEnergy = 0.0f;
    // about 10 other local (register) variables

    for (/* tiling loop; manages and re-uses shared_geoMap[][] */) {
        // Part A
        for (/* inner loop */) {
            // read data from global memory in a coalesced fashion,
            // store it in shared memory
        }

        // Part B
        if (threadIdx.x < 7) {
            for (int i = 0; i < /* appropriate loop limit */; i++) {
                accumulatedEnergy += exp(-0.5f * (shared_geoMap[threadIdx.x][i] / sigma_squared));
            }
        }
    }

    // Part C
    if (threadIdx.x < 7)
        energy[threadIdx.x] = accumulatedEnergy;

    // Part D
    if (threadIdx.x == 0) {
        // look at the 7 possible energy[] values, choose the best one,
        // and write one float value into global memory
    }
}
int main()
{
    // load about 64MB of data from file, store it in GPU global memory
    loadDataFromFile("Filename.dat");

    myKernel<<<1000, 32>>>(/* about 10 arguments */);
}
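For the “omit Part B / omit Part C” experiments, nothing clever is going on: the parts simply get compiled in or out. A sketch of what that looks like inside the kernel above (the INCLUDE_PART_B / INCLUDE_PART_C macro names are made up for this post):

    // e.g. nvcc -DINCLUDE_PART_B=1 -DINCLUDE_PART_C=0 ...
    #if INCLUDE_PART_B
        if (threadIdx.x < 7) {
            for (int i = 0; i < /* appropriate loop limit */; i++) {
                accumulatedEnergy += exp(-0.5f * (shared_geoMap[threadIdx.x][i] / sigma_squared));
            }
        }
    #endif

and likewise for Part C:

    #if INCLUDE_PART_C
        if (threadIdx.x < 7)
            energy[threadIdx.x] = accumulatedEnergy;
    #endif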
Again, I’ve been at this for many hours. I’ve tried to reduce the code as much as possible to simplify things and isolate the problem, but this is as reduced as I can get it. I’ve also tried the CUDA Visual Profiler, and it seems to give me two clues. First, the divergent branch count is large. Second, the counter labeled “gst request”, which the documentation says is the number of global memory store requests, gives some odd results.
If I leave out either Part B or Part C, I get:
    divergent branches: about 336
    gst request: around 6 million
If I include both Part B and Part C, I get:
    divergent branches: about 2900
    gst request: around 74 million
That just baffles me. Part C is simply 7 threads writing in parallel to shared memory. A single if (threadIdx.x < 7) should diverge at most once per warp each time it executes, so with 1000 one-warp blocks I’d naively expect Part C to add on the order of a thousand divergent branches at most, and zero global stores. So why would that one piece of code add ~2600 divergent branches and 68 million more global memory store requests, especially when Part C never even touches global memory?
Any ideas would be welcome at this point. I seem to have exhausted all debugging scenarios I can think of on my end.