I was doing a simple test to see how many concurrent thread blocks can run on a Tesla C1060.
The kernel is very straightforward: after launching, thread (0,0,0) of each thread block sets a flag; block (0,0) then checks whether all flags are set, and if so the kernel finishes.
Since the C1060 has 30 SMs, and each can support 1024 active threads, I assumed that I can have 120 concurrent thread blocks of 256 threads each. But if I run with this execution configuration, i.e., <<<120,256>>>, the kernel hangs forever: the flags for blocks (90,0) to (119,0) are never set. However, if I run with <<<120,192>>>, the kernel finishes smoothly. But in this case, the number of concurrent active threads per SM is 192 x 4 = 768, which is only 75% of 1024, the maximum number of active threads supported. Strangely enough, if I run with <<<240,128>>>, the kernel exits correctly, which suggests the device can run a kernel that uses the full hardware resources :wacko:
1024 threads can be queued/allocated per MP. Only 32 of them (a warp) are executing at any given time, so physically the maximum number of threads executing during any 4 cycles is 30 x 32 = 960. The 1024 threads (32 warps) are shuffled and timesliced; you don't know their order of execution. They are a pool of ready threads from which the MP can select a warp that is not waiting on a memory fetch or a __syncthreads().
Thanks for the reply. I should define "active" in my problem: by "active" I mean a thread that has started but has not finished its execution, although it may be swapped out while waiting for a memory load or synchronization. I wanted to know how many such blocks/threads can be running concurrently. In my case, it seems <<<120,256>>> is not possible, while <<<90,256>>> and <<<240,128>>> are okay.
It is pitch black. You are likely to be eaten by a grue.
Seriously, the answer is “it depends on a whole lot of things in really obscure ways.” It is not easy to figure out, and there is no good way to even use this knowledge right now.
Sorry, I don't have the code right now. It is actually very simple; you can write it in 10 minutes. Basically, you cudaMalloc() a chunk of memory with one element per thread block, then pass its address to the kernel. In the kernel, thread (0,0,0) of each thread block sets its own location in that memory; block (0,0) then checks whether all locations are set, and if so, block (0,0) sets a "go" flag to notify all thread blocks to exit.
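A minimal sketch of what that description amounts to (my reconstruction, not the original code — the kernel and parameter names are made up; `flags` and `go` are assumed to be zero-initialized with cudaMemset before launch, and `volatile` is needed so the spinning reads are not cached in registers):

```cuda
// Hypothetical reconstruction of the test kernel described above.
// flags: one int per thread block, all zero at launch.
// go:    a single int, zero at launch.
__global__ void block_count_test(volatile int *flags, volatile int *go)
{
    if (threadIdx.x == 0)
        flags[blockIdx.x] = 1;        // announce that this block is running

    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Block (0,0) spins until every block has set its flag.
        // If some block never becomes resident, this loop never exits.
        for (int b = 0; b < gridDim.x; b++)
            while (flags[b] == 0)
                ;
        *go = 1;                      // release all blocks
    }

    while (*go == 0)                  // every block waits for the "go" flag
        ;
}
```

Launched as, e.g., `block_count_test<<<120, 256>>>(d_flags, d_go)`. The point of the test is exactly the failure mode visible in the code: block (0,0) only returns if all gridDim.x blocks run concurrently, so the kernel deadlocks whenever the grid does not fit on the device at once.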