Baffling problem leads to 10x worse results.

Hey all. I’ve been pounding away at this performance problem for the last 10 hours. The problem is just baffling to me. I can’t seem to figure out what’s going on.

The gist of it is this. I have a kernel organized into four sections:

Part A - 32 threads read data out of global memory in a coalesced fashion, and store the result in shared memory

Part B - 7 threads read the data out of shared memory, perform computations, and accumulate a result (due to the data structure involved, 7 threads right now keeps the code clean and simple)

Part C - 7 threads store the accumulated result in another array in shared memory

Part D - Thread 0 looks at the 7 prior results, finds the best one, and writes a float result back to global memory.

I’ve found that if I omit either Part B or Part C, then my code performs roughly 3x faster than a naive kernel I coded earlier.

But if I include both Part B and Part C, then the code slows down dramatically, as much as 10x! What’s really strange is that Part C is just a single shared memory write. That’s it. There doesn’t seem to be any reason why it would slow things down so dramatically.

The source code calculates correctly, but it is rather complicated and ugly. So I’ll give the pseudocode:

__global__ void myKernel(about 10 parameters) {

	__shared__ float energy[32];
	__shared__ float shared_geoMap[7][64 + 1];

	float accumulatedEnergy = 0;
	about 10 other registers;

	for loop to tile.  Used to manage and re-use shared_geoMap[][] {

		//Part A
		inner for loop {
			Data is read from global memory, stored in shared memory.
		}

		//Part B
		if (threadIdx.x < 7) {
			for (i = 0; i < appropriate loop limit; i++) {
				accumulatedEnergy += exp(-0.5f * (shared_geoMap[threadIdx.x][i] / sigma_squared));
			}
		}
	}

	//Part C
	if (threadIdx.x < 7)
		energy[threadIdx.x] = accumulatedEnergy;

	//Part D
	if (threadIdx.x == 0) {
		//Look at the 7 possible energy[] values.  Choose the best one.
		//Write one float value into global memory.
	}
}

int main() {

	//Load about 64MB of data from file, store it in GPU global memory.
	loadDataFromFile("Filename.dat");

	myKernel<<<1000, 32>>>(about 10 arguments);
}

Again, I’ve been at this for many hours. I’ve tried to reduce the code as much as possible to simplify things and discover the problem, but this is as reduced as I can get it. I’ve also tried using the CUDA Visual Profiler. It seems to give me two clues. First, divergent branching is large. Second, in the section labeled “gst request”, which the documentation states is the number of global memory store requests, I get some odd results.

If I leave out either Part B or Part C, then I get:

divergent branches: about 336

gst request: around 6 million

If I put in both Part B and Part C, then I get:

divergent branches: about 2900

gst request: around 74 million

That just baffles me. Part C is simply writing 7 items in parallel to shared memory. So why would that one piece of code add ~2600 divergent branches and 68 million more global memory store requests (especially when Part C doesn’t even write to global memory)?

Any ideas would be welcome at this point. I seem to have exhausted all debugging scenarios I can think of on my end.

The open64 compiler has rather aggressive analysis and code-removal optimizations. Eliminating Part C (a memory store) probably removes a dependency for much or even all of the code that precedes it, allowing the compiler to optimize a lot of “dead” code away. You should be able to confirm this by having a look at the PTX and cubin output from the compiler. Even if you can’t follow the assembler, the instruction and register counts between code including Part C and without it should make it pretty obvious what is happening.
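If you want to time the kernel without Part C while still keeping Part B alive, one trick is to make the store conditional on a value the compiler cannot see at compile time. A minimal sketch only, not your code; the kernel, sink, and keepAlive names are made up for illustration:

	__global__ void deadCodeDemo(const float *in, float *sink, int n, int keepAlive)
	{
		float acc = 0.0f;

		// Stand-in for Part B: an expensive loop whose only consumer is the store below.
		if (threadIdx.x < 7)
			for (int i = 0; i < n; ++i)
				acc += exp(-0.5f * in[i]);

		// Stand-in for Part C: with keepAlive always passed as 0 this store never
		// executes, but the compiler must assume it might, so acc and the loop
		// feeding it cannot be removed as dead code. Delete this store entirely
		// and the compiler is free to throw away the whole loop above.
		if (keepAlive && threadIdx.x < 7)
			sink[blockIdx.x * blockDim.x + threadIdx.x] = acc;
	}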

Do you use exp in your real code? Do you use division? And 32 is most likely not a proper block size, for many reasons. Also consider redesigning your program so that the work in for (i = 0; i < appropriate loop limit; i++) is spread across threads.
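For the exp and division points, something like this is what I mean. A sketch only, reusing the pseudocode’s names; inv_sigma_squared and loopLimit are stand-ins:

	// Hoist the division out of the loop: multiply by the reciprocal instead.
	float inv_sigma_squared = 1.0f / sigma_squared;

	if (threadIdx.x < 7) {
		for (int i = 0; i < loopLimit; i++) {
			// __expf() is the fast, lower-precision intrinsic; expf() is the
			// single-precision library call. Either avoids double-precision exp().
			accumulatedEnergy += __expf(-0.5f * shared_geoMap[threadIdx.x][i] * inv_sigma_squared);
		}
	}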

  • Yes, exp is in the real code.

  • It also uses division (for the test, it was just dividing by 1, but I should change that so it multiplies by the reciprocal)

  • I’m aware 32 is not a good block size. However, my goal has been to start with something conceptually simple and tweak it from there. A block size of 32 threads maps cleanly onto this algorithm; increasing it to 64 would add complications.

  • The same goes for the for loop optimization. If I can get my current algorithm performing correctly, I’ll start tweaking it.

However, I’m still at a loss. All of what you mentioned is in my Part B code. If I compile Parts A, B, and D (omitting Part C), the performance is still very desirable.

I’ve been looking at the .ptx, and compiling various scenarios. But I’m not seeing anything out of the ordinary in the ptx when I vary the code.

I definitely appreciate the help right now. It’s just maddening knowing that when I compile about 98% of the code needed, I get the performance results I want. Any time I try to add in that seemingly inconsequential last 2%, my performance suffers greatly. And I just don’t know what to do next…

Most likely, as avidday said, the compiler just removes the computation of accumulatedEnergy if you remove the code that uses it. That bad performance may be normal for this code. exp() uses double, by the way; use expf() or __expf() for lower precision. Your code, with 7 working threads and a block size of 32, probably runs about 8 times slower than code that keeps 32 threads busy with a more proper block size. And exp() is slow too.

Don’t you need a __syncthreads() after Part A?
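Something along these lines, against the pseudocode above. This is only a sketch; the “...” stand for your existing Part A and Part B bodies, and the second barrier matters if the next tile overwrites data that threads 0-6 are still reading:

	for loop to tile {
		//Part A: all 32 threads fill shared_geoMap for this tile
		...
		__syncthreads();	// tile must be fully written before Part B reads it

		//Part B: threads 0-6 accumulate from shared_geoMap
		...
		__syncthreads();	// don't let the next iteration's loads clobber data still being read
	}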

Lev,

You nailed it.

  • exp() is by far the biggest drag on the algorithm.

  • The compiler is dropping the exp() calculation if I remove either Part B (obviously, since exp() is in that part) or Part C (since accumulatedEnergy wouldn’t be used).

  • Replacing the division with multiplication by the reciprocal sped things up 15%.

  • Having exp() run on only threads 0-6 really slowed things down. Moving the exp() calculation up into Part A, which uses all 32 threads, sped things up another roughly 3x (see the sketch after this list).

  • Replacing exp() with __expf() improved things by roughly another 3x on top of that.
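In case it helps anyone else, here is roughly what the restructured kernel looks like now. This is a simplified sketch rather than my real code: the parameter names, the tile/data layout, and the fminf() “best” test are stand-ins.

	__global__ void myKernelRestructured(const float *geoMap, float *out,
	                                     int numTiles, float sigma_squared)
	{
		__shared__ float energy[32];
		__shared__ float shared_expMap[7][64 + 1];

		float accumulatedEnergy = 0.0f;

		// Hoist the division out of the loops: multiply by the reciprocal instead.
		const float inv_sigma_squared = 1.0f / sigma_squared;

		for (int tile = 0; tile < numTiles; ++tile) {
			// Part A: all 32 threads load the 7x64 tile in a coalesced sweep and
			// apply the fast __expf() intrinsic on the way into shared memory, so
			// the transcendental work is spread over the whole block, not 7 threads.
			for (int idx = threadIdx.x; idx < 7 * 64; idx += 32) {
				float v = geoMap[tile * 7 * 64 + idx];
				shared_expMap[idx / 64][idx % 64] = __expf(-0.5f * v * inv_sigma_squared);
			}
			__syncthreads();	// tile fully written before Part B reads it

			// Part B: threads 0-6 now only sum precomputed values.
			if (threadIdx.x < 7)
				for (int i = 0; i < 64; ++i)
					accumulatedEnergy += shared_expMap[threadIdx.x][i];
			__syncthreads();	// don't overwrite the tile while it is still being read
		}

		// Part C: store the 7 accumulated results.
		if (threadIdx.x < 7)
			energy[threadIdx.x] = accumulatedEnergy;
		__syncthreads();

		// Part D: thread 0 picks the "best" of the 7 results (fminf is a stand-in
		// for whatever "best" means in the real code) and writes one float out.
		if (threadIdx.x == 0) {
			float best = energy[0];
			for (int k = 1; k < 7; ++k)
				best = fminf(best, energy[k]);
			out[blockIdx.x] = best;
		}
	}

The main change is that the __expf() work rides along with the coalesced loads in Part A, so Part B is reduced to a cheap summation.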

A lot of this seems obvious now; I feel kind of dumb for not spotting it earlier.

Also, I made some of these optimizations to my naive algorithm, and now my naive algorithm is performing better. Argh! :P And I really thought my second algorithm was clever in how it uses coalescing. Oh well, I’m much happier now that I understand what’s going on.

Thanks again for the help.

Do not forget about __syncthreads() after filling shared memory, as HiRex said.