[SOLVED] Concurrent Kernel Execution

I have a quadro k4000 GPU (compute capability 3.0) with 2 monitors attached. When I open the Nvidia control panel, it says somewhere less than 10% of the GPU is in use without running any GPGPU code.

I am attempting to run a kernel I wrote with 4 copies. Initially I just copied the kernel call 4 times, but was not able to see any concurrent execution in the Nvidia Profiler, i.e., all of the calls executed sequentially. To cut through some long reasoning, I decided to write a dummy kernel that takes 2 int arrays, adds them, and stores them in a third. Here is that kernel.

__global__ void dummy_kernel(int *a, int *b, int *c, int size)
	int tidx = threadIdx.x;
	for (int jj = 0; jj < 1024*8; ++jj)
		int temp = jj % 64;
		for (int ii = tidx; ii < size; ii += 32)
			if (ii + temp >= size)
				temp = 0;
			c[ii] = a[ii] + b[ii];

This is called as:

cudaStream_t cuda0, cuda1, cuda2, cuda3;
	cudaStreamCreateWithFlags(&cuda0, cudaStreamNonBlocking);
	cudaStreamCreateWithFlags(&cuda1, cudaStreamNonBlocking);
	cudaStreamCreateWithFlags(&cuda2, cudaStreamNonBlocking);

	fprintf(stdout, "Starting GPU code (with dummy kernel).\n");
	cudaEvent_t start, stop;
	clock_t begin = clock();

	my_kernel    << < 400, 32, 0, cuda0 >> >(parameter_list);
	dummy_kernel << < 400, 32, 0, cuda1 >> >(a1, b1, c1, 1024);
	dummy_kernel << < 400, 32, 0, cuda2 >> >(a2, b2, c2, 1024);
	dummy_kernel << < 400, 32, 0, cuda3 >> >(a3, b3, c3, 1024);


When I compiled and ran this in the NVidia profiler, there was very little execution overlap between kernels. Am I doing something wrong with how I set up the streams or is there something else that I need to do? The GPU Utilization reported by the NVidia Control Panel jumps to 100%. The profiler reports that each kernel has a max Theoretical occupancy of 25%. I assume that means each kernel was launched in its own SMX (the Quadro K4000 has 4 SMXs).

I am compiling with Visual Studio 2013 with the -default-stream per-thread command line flag.

concurrent kernel execution is hard to witness. If you launch a large kernel, you probably won’t see any overlap.

Your kernels are launching 400 blocks. Those blocks will “fill up” pretty much any currently available GPU, preventing the 400 blocks from the next kernel from starting until most of the previous 400 blocks have finished.

The is a concurrent kernels sample app that you can study if you wish.

What stopped me from believing that is when I run


by itself, the profiler says that Theoretical occupancy is 25% and achieved occupancy is 25%.

I just changed the execution of my_kernel to:

dim3 threads(32, 4);
my_kernel << <448, threads >> >(parameter_list);

The theoretical max occupancy jumped to 75% and the achieved jumped to 74%. What would hold it back from a theoretical max of 100%? The only thing I can think of off the top of my head is that I have monitors attached to my GPU.

As @txbob says, it’s usually hard to witness concurrent kernels in action except on the border between grids.

However, if your grids and blocks are unusually small and don’t require much shared memory you can observe dense concurrent kernel execution.

Here’s an example showing 6 different “dummy” kernels each grabbing half of an sm_50 SMM (32KB) on a 3 SMM Quadro K620:

Each vertical slice of the NVVM timeline shows up to 6 kernels being executed. Sometimes 6 different kernels appear in one slice but the scheduler appears to bias toward launching more kernels of the same kind until it eventually must service older enqueued kernels… thus the interesting striping.

Indeed, small threadblocks will tend to limit your occupancy. Just because you are running at 25% occupancy does not mean that you have 75% of the GPU remaining to be used. If it were that simple then the scheduler would simply schedule more work.

A low occupancy often means you’ve run into an occupancy limit. Some typical limits are based on resources, including threads, warps, blocks, registers, shared memory, etc. as they pertain to the utilization of the resources available on an SM.

For the particular case of small threadblocks, you will run into the threadblock limit (e.g. 16) before you run into the warp or thread limit (64 warps, or 2048 threads). So because you have chosen 32 threads per block, and the SM is limited to a max of 16 blocks, you are running at 512 threads even though the theoretical max thread load would be 2048 – so you end up with 25% occupancy.

But this does not mean that you have 75% available. In fact you have used up the available limit of block scheduling, and no more blocks (of any kind, from any kernel) can be scheduled, until some of those 16 retire.

Okay, so it seems that by executing a small number of threads per block, I am wasting resources. If I were to bump the number of threads per block from 32 to 128, I should be able to achieve 100% Occupancy

So I modified the execution to look like:

dim3 threads(32, YDIM);
	dim3 blocks(8, 8);

my_kernel << <blocks, threads,0,cuda0 >> >(Parameters);

So I changed YDIM from 1 to 2 to 3 to 4. I got the following Utilization
YDIM Theoretical Max. Achieved
1 25% 25%
2 50% 50%
3 75% 73.5%
4 75% 75%

It would appear that from 1-3 the correct pattern is holding, but at 4 (which should be 100%), it breaks down.

Do you know what might cause this?

It is possible that something in the kernel itself is causing the problem. The kernel is set up to assume that it is being executed with 32 threads per block all in the x dimension. With a y dimension > 1, threads 0,32,64, … will all operate on the same data and do the same work (similarly for threads 1, 33,… and 2, 34, … , etc). Another possibility is that I am making calls to __sad which apparently can only do 32 at a time, as opposed to 160 for the normal int sum/diff operator. However, there is enough other work in that loop that the other warps should be able to do something else.

Is there a good tutorial on how to use the profiler?

Actually, it looks like the registers per thread usage was causing problems. I was using 33 registers per thread. At 65536 registers per block, and a max of 2048 threads per SMX, this comes out to 32 registers per thread. By having 33 registers per thread, it could not launch 2048 threads in an SMX.

Information for someone else who has a similar problem:
This was accomplished in Visual Studio by setting the properties for the CUDA file.

  1. Right click the file in Solution Explorer and select properties.
  2. Go to Configuration Properties -> Cuda C/C++ -> Device.
  3. In the line “Max Used Register” change to 32.

From a command line (inference based on information produced by Visual Studio):

  1. When compiling the particular file, pass it a flag -maxrregcount 32.
    I do not believe there would be a difference between Linux, Windows and Mac.

Error in previous post:
In my previous post I said that when I launched a YDIM of 4, I achieved an occupancy of 75%, it should have been 53.9%.

presumably you’ve run into some other occupancy limit. It’s difficult to say without actually having a complete code to analyze and the time/motivation to do it.

But for example, your Quadro K4000 has 4 SMs, each of which can “carry” 16 threadblocks, for a total theoretical carrying capacity of 64. It just so happens you are launching 64 threadblocks:

dim3 blocks(8, 8);

You might want to try launching more blocks. There may be launch scheduling or tail effects that are preventing full occupancy for such a small grid, especially if the duration of thread execution is short.

You might also be running into a registers limit. The total number of registers available on the SM is 64K 32-bit registers:


You can use the compiler (-Xptxas -v) to spit out the register per thread usage for each kernel, and/or the profiler will tell you the register per thread usage. At a maximum complement (2048) of threads, the SM would only have up to 32 registers per thread. If your kernel uses about 32 registers per thread or more than 32 registers per thread, this would be an occupancy limit that would prevent you from reaching “full” occupancy. If you’re just experimenting with things, you can attempt to force the compiler to use fewer registers per thread, to see if your reported occupancy improves.

These topics have been covered many times, in many forums, so a little google searching will likely turn up interesting things to read.

A google search on “nvidia profiler tutorial” turned this up in the first page of hits: