Serialized threadIdx.x? Weird non-parallel behavior


My entire CUDA experience is based on my current project, which has been ongoing for two months. I’ve read the documentation, and anything else I’ve learned has been through trial and error. I only mention this to describe my “CUDA knowledge level.”

I’m hoping someone can help me even though I won’t be able to post much code. I’m not talking much about this project yet.

The deal is, I have a kernel that does computation on a long series of monotonically increasing 128-bit numbers. I pass a starting 128-bit value, then have each CUDA thread compute its 128-bit number by adding its thread id to the initial number.
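A minimal sketch of the per-thread add described above, written as CUDA C++. This is not the actual project code: the `U128` struct, the `add_offset` helper, the `compute` kernel, and the choice of four 32-bit limbs stored least significant first are all assumptions made for illustration. The helper propagates the carry so that adding a thread id to a limb near overflow still gives the right 128-bit result.

```cpp
#include <cstdint>

// Lets the helper also compile as plain C++ for host-side checks.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical 128-bit value: four 32-bit limbs, least significant first.
struct U128 { uint32_t limb[4]; };

// Returns base + offset, carrying through all four limbs
// (wraps around on overflow past 128 bits).
__host__ __device__ inline U128 add_offset(U128 base, uint32_t offset) {
    U128 out;
    uint64_t carry = offset;
    for (int i = 0; i < 4; ++i) {
        uint64_t sum = static_cast<uint64_t>(base.limb[i]) + carry;
        out.limb[i] = static_cast<uint32_t>(sum);  // low 32 bits
        carry = sum >> 32;                         // 0 or 1 after the first limb
    }
    return out;
}

#ifdef __CUDACC__
// Each thread derives its own number from the shared starting value.
__global__ void compute(U128 start, U128 *results, unsigned int n) {
    unsigned int absThreadId = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (absThreadId < n) {
        results[absThreadId] = add_offset(start, absThreadId);
    }
}
#endif
```

Passing the starting value by value as a kernel argument means every thread reads the same input and writes only its own output slot, so the add itself gives threads nothing to contend over.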

The weird thing is, adding the thread id to the initial number is really slow. Stranger still, when I reduce the number of threads that actually do the add, I get better performance, suggesting that something is serializing. I’m sorry I can’t post the full code, but here are some critical snippets:

absThreadId = (blockIdx.x * blockDim.x) + threadIdx.x;

/* This if-else is for debugging only. Normally all threads should add their absThreadId. */
if (absThreadId < 1)
	counter3 = absThreadId;
else
	counter3 = 1;


/* some computations */

counter3 += initial_value[3];

/* more computations */

So, the threshold in the if statement controls how many threads actually add their absThreadId. Here are my timings for the kernel:

Adding threads :: time (ms)
 1 :: 0.047
 4 :: 0.057
 8 :: 0.076
12 :: 0.097
16 :: 0.116

I’m sure that the problem is in the parts of the code I’m not showing you, but maybe somebody has an idea that will get me looking in the right direction, or lead me to post other parts of the code that are more useful.


Never mind. I figured it out. Some stupid caching problem.