Kernel cannot utilize the full hardware resources?

I was doing a simple test to see how many concurrent thread blocks can run on a Tesla C1060.

The kernel is very straightforward: after launch, thread (0,0,0) of each thread block sets a flag; block (0,0) then checks whether all flags are set, and if so the kernel finishes.

Since the C1060 has 30 SMs, each supporting 1024 active threads, I assumed I could have 120 concurrent thread blocks of 256 threads each. But if I run with this execution configuration, i.e. <<<120,256>>>, the kernel hangs forever; the flags for blocks (90,0) to (119,0) are never set. However, if I run with <<<120,192>>>, the kernel finishes smoothly. In that case, though, the number of concurrently active threads per SM is 192 x 4 = 768, which is only 75% of 1024, the maximum number of active threads supported. Strangely enough, if I run with <<<240,128>>>, the kernel exits correctly, which means the device can run kernels using the full hardware resources :wacko:
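For reference, these per-device limits can be queried at run time instead of being hard-coded. A minimal host-side sketch (this assumes a CUDA runtime new enough to expose maxThreadsPerMultiProcessor in cudaDeviceProp):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);
	// On a C1060 this should report 30 SMs and 1024 threads per SM,
	// i.e. 30 x 1024 = 30720 resident threads = 120 blocks of 256 threads.
	printf("SMs: %d, threads/SM: %d, resident threads: %d\n",
	       prop.multiProcessorCount,
	       prop.maxThreadsPerMultiProcessor,
	       prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
	return 0;
}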

Any clues?

1024 threads can be queued/allocated per MP, but only 32 of them (one warp) are executing at any given time, so physically the maximum number of threads executing during any 4 cycles is 30 x 32. The 1024 threads (32 warps) are shuffled and timesliced; you don’t know their order of execution. They are a pool of threads ready to run, from which the MP can select a warp that is not waiting on a memory fetch or __syncthreads().
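Putting concrete numbers on that (a worked example for the C1060, not from the reply above):

30 MPs x 1024 resident threads/MP = 30720 threads the chip can hold at once
30 MPs x 32 threads issuing/MP = 960 threads physically executing at any instant

Both counts get called “active”, which is exactly the ambiguity in this thread.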

Thanks for the reply. I should define “active” for my problem: by “active” I mean a thread that has started but has not yet finished execution, even though it may currently be swapped out waiting for a memory load or a synchronization. I want to know how many such blocks/threads can be running concurrently. In my case, <<<120,256>>> seems not to be possible, while <<<90,256>>> and <<<240,128>>> are okay.

determine maximal concurrent occupancy of GT200

It is pitch black. You are likely to be eaten by a grue.

Seriously, the answer is “it depends on a whole lot of things in really obscure ways.” It is not easy to figure out, and there is no good way to even use this knowledge right now.

Could you post the source code for your kernel?

-Mark Roulo

Er…okay. Can we expect something not that dark from GT300? :rolleyes:

Sorry, I don’t have the code right now. It is actually very simple; you could write it in 10 minutes. Basically, you cudaMalloc() a chunk of memory with one element per thread block, then pass the address to the kernel. In the kernel, thread (0,0,0) of each thread block sets its block’s location in that memory; block (0,0) then checks whether all locations are set, and if so sets a “go” flag to notify all thread blocks to exit.

It is your description of “block(0,0) checking to see if all locations are set” that I find interesting.

Blocks don’t do anything. Threads do.

Do you mean that you have a kernel like this:

__global__ void hi(volatile unsigned int *scribble)	// volatile: re-read the flags from memory on every pass
{
	if (threadIdx.x == 0)
		scribble[blockIdx.x] = 1;	// one flag per block

	if (blockIdx.x == 0)
	{
		for (int i = 0; i < gridDim.x; ++i)	// one flag per block, so gridDim.x, not blockDim.x
		{
			while (scribble[i] == 0)
				;
		}
	}
}

and then launch like this:

hi<<<120, 256>>>(theMemory);

you hang, but if you launch like this:

hi<<<240, 128>>>(theMemory);

it works?

Have you zero-ed the memory you allocated with cudaMalloc()?

Yes, by saying block(0,0), I refer to threads in block(0,0).

A __syncthreads() and the “go” flag set by block (0,0) are also needed in the kernel.

And, yes, I do call cudaMemset() before launching the kernel.
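For anyone who wants to reproduce this, here is a minimal self-contained sketch of the whole test as described in this thread (a reconstruction, not my original code; the names scribble and go are just illustrative). The volatile qualifiers keep the compiler from caching the flags in registers, and the wait loop is run by every thread of block (0,0) so that no warp diverges into a spin that starves its own sibling threads:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void hi(volatile unsigned int *scribble, volatile unsigned int *go)
{
	// thread (0,0,0) of each block raises this block's flag
	if (threadIdx.x == 0)
		scribble[blockIdx.x] = 1;
	__syncthreads();

	if (blockIdx.x == 0)
	{
		// every thread of block (0,0) waits until all blocks have checked in;
		// keeping the whole block on one path avoids intra-warp divergent spins
		for (int i = 0; i < gridDim.x; ++i)
			while (scribble[i] == 0)
				;
		*go = 1;	// redundant store by each thread of block (0,0); harmless
	}

	// all blocks spin until block (0,0) raises "go"
	while (*go == 0)
		;
}

int main()
{
	const int blocks = 240, threads = 128;	// try <<<120,256>>> to reproduce the hang
	unsigned int *scribble, *go;
	cudaMalloc((void **)&scribble, blocks * sizeof(unsigned int));
	cudaMalloc((void **)&go, sizeof(unsigned int));
	cudaMemset(scribble, 0, blocks * sizeof(unsigned int));
	cudaMemset(go, 0, sizeof(unsigned int));
	hi<<<blocks, threads>>>(scribble, go);
	cudaDeviceSynchronize();	// returns only once every block's flag has been seen
	printf("kernel returned\n");
	cudaFree(scribble);
	cudaFree(go);
	return 0;
}

Note that the hang is the measurement itself: a block that never becomes resident never sets its flag, so block (0,0) spins forever, which is what happens here with <<<120,256>>>.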