I can't realize the kernel concurrent with Hyper-Q

I have tried to profile my application with HyperQ, and I also see the simpleHyperQ in cuda samples, I don’t know my understand to this technology is right or not. I have analysed my application but find no kernel concurrent excution
The following is my test code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "helper_cuda.h"

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void init(float *p, int m, int n)
	for (int i = 0; i < m; i++) {
		for (int j = 0; j < n; j++) {
			p[i * n + j] = i * n + j;

void printArray(float *p, int m, int n)
	for (int i = 0; i < m; i++) {
		for (int j = 0; j < n; j++) {
			printf("%f\t", p[i * n + j]);

__global__ void addSub_kernel(float* dev_Xl, float* dev_Xr, int width, int height, char opera)
	unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
	if ((xIndex < width) && (yIndex < height))
		unsigned int index_in = yIndex * width + xIndex;
		if (opera == '+')
			dev_Xl[index_in] += dev_Xr[index_in];
		else if (opera == '-')
			dev_Xl[index_in] = dev_Xl[index_in] - dev_Xr[index_in];

__global__ void sumReduction_kernel(float* out, float* in, int m, int n)
	extern __shared__ float temp1[];
	float sum = 0.0;
	if (blockIdx.x < 8) {
		for (int i = threadIdx.x; i < n; i += blockDim.x)
			int index = blockIdx.y * n + i;
			sum += in[index];
		temp1[threadIdx.x] = sum;

	for (int offset = blockDim.x / 2; offset > 0; offset >>= 1)
		if (threadIdx.x < offset) {
			temp1[threadIdx.x] += temp1[threadIdx.x + offset];
	if (threadIdx.x == 0)
		out[blockIdx.y] = temp1[0] / n;

int main()
	const int T = 6;
	int m = 2000; 
	int n = 4096;
	int mem_size = sizeof(float) * m * n;
	float *h_A = (float*)malloc(mem_size);
	float *h_B = (float*)malloc(mem_size);
	init(h_A, m, n);
	init(h_B, m, n);
	float *d_A, *d_B;
	checkCudaErrors(cudaMalloc((void**)&d_A, mem_size));
	checkCudaErrors(cudaMalloc((void**)&d_B, mem_size));
	checkCudaErrors(cudaMemcpy(d_A, h_A, mem_size, cudaMemcpyHostToDevice));
	checkCudaErrors(cudaMemcpy(d_B, h_B, mem_size, cudaMemcpyHostToDevice));
	dim3 grid(8, 2000);
	dim3 block(512, 1);
	int sharedSize = block.x * sizeof(float);
	float *d_C;
	int mem_size_C = sizeof(float) * m;
	checkCudaErrors(cudaMalloc((void**)&d_C, mem_size_C));
	float totalTime = 0.0;
	cudaStream_t *streams = (cudaStream_t *)malloc(sizeof(cudaStream_t) * T);
	cudaEvent_t start, stop;
	for (int i = 0; i < T; i++)

	for (int i = 0; i < T; i++)
		cudaEventRecord(start, 0);
		//addSub_kernel << <grid, block >> > (d_A, d_B, n, m, '+');
		addSub_kernel << <grid, block, 0, streams[i] >> > (d_A, d_B, n, m, '+');
		//sumReduction_kernel << <grid, block, sharedSize >> > (d_C, d_A, m, n);
		sumReduction_kernel << <grid, block, sharedSize, streams[i] >> > (d_C, d_A, m, n);
		cudaEventRecord(stop, 0);
		float elapsedTime;
		cudaEventElapsedTime(&elapsedTime, start, stop);
		totalTime += elapsedTime;
		printf("the %dth computation time use = %.3fms\n", i, elapsedTime);
		float *h_C = (float*)malloc(mem_size_C);
		checkCudaErrors(cudaMemcpyAsync(h_C, d_C, mem_size_C, cudaMemcpyDeviceToHost, streams[i]));

	for (int i = 0; i < T; i++)

	printf("all the computation had done!\n");
	printf("total time use = %.3fms\n", totalTime);
	return 0;


Any suggestion is appreciated

Your application won’t have any possibility to run kernels concurrently. What were you expecting to run concurrently?

I’m so sorry, I have been wanted to realize the concurrent of the two kernels with Hyper-Q.But I suddenly find that the execution of the second kernel need the result of the first kernel, so there is no possibility to run the two kernels concurrently. Is this right? maybe I’m too obsession on the problem of optimization with Hyper-Q, Thank you for your reply…

No, but how can I realize the parallelism of different streams like the result of the ‘simpleHyperQ’ in CUDA Samples. In this example, there is no concurrent execution between the two kernels, right?

For a given iteration of the for-loop, the two kernel calls within that loop iteration will not run concurrently with each other, because they are launched into the same stream. Stream semantics dictate that all work issued to a given stream will be executed in-order (serialized)


When considering the kernels associated with a given loop iteration (say, i) and another loop iteration (say, i+1), these kernels cannot run concurrently because there is an intervening synchronizing call:

.		addSub_kernel << <grid, block, 0, streams[i] >> > (d_A, d_B, n, m, '+');
		//sumReduction_kernel << <grid, block, sharedSize >> > (d_C, d_A, m, n);
		sumReduction_kernel << <grid, block, sharedSize, streams[i] >> > (d_C, d_A, m, n);
		cudaEventRecord(stop, 0);
		cudaEventSynchronize(stop);  //this call forces the CPU thread to halt, until the stop event occurs

The cudaEventSynchronize() call is a CPU/GPU synchronizing call. It forces the CPU thread to wait at that point, until the indicated event has occurred, meaning until the GPU activity has progressed to the point of the event. Therefore, the GPU must complete both kernels in that loop iteration, before the CPU thread can proceed to the next iteration. Therefore there is no possibility for the kernels of a given iteration to overlap with the next (or any other) iteration.

If your desire is to witness concurrent kernel execution, I would recommend studying the concurrentKernels sample code. In addition to the requirement for correct stream usage and (lack of) synchronization, actually witnessing concurrent kernel execution requires kernels with relatively small resource utilization, such that 2 (or more) kernels can “occupy” the GPU at the same time. The GPU has a number of resources required for kernel execution, including, but not limited to:

-register space
-shared memory space
-instantaneous block capacity
-instantaneous thread capacity

If a given kernel launch saturates any of the above limits (or possibly others) it will prevent any subsequent kernel from beginning to use the GPU, until sufficient resources free up. Let’s take one example. Your kernels have the following block dimensions:

dim3 grid(8, 2000);

That is 16,000 blocks, in total. The (maximum, i.e. upper bound) instantaneous block capacity of a GPU is equal to the maximum number of blocks per SM, multiplied by the number of SMs in your GPU. Both of these are available either from deviceQuery or from the programming guide tables for your GPU. In any event these numbers might both be in the range of 16, for example. In that case, your GPU would have a block capacity of 16x16 = 256 blocks that could be available at any instant to a warp scheduler. Other blocks beyond that will wait in a queue, until “block space” frees up due to retiring blocks. Therefore we see that your kernels of 16000 blocks will “fill up” any current GPU (even Volta, GV100, with 80SMs would probably only support a maximum of ~2500 blocks at any moment), and will prevent, for the majority of kernel execution time, the possibility of any blocks from other kernels that are launched later, to execute.

concurrent kernel execution in practice is quite hard to witness, and requires carefully crafted code to achieve.

A related thought that people sometimes have is that once they get their CUDA code running they often begin to think about running multiple copies, in “parallel” streams, believing that this will magically allow the GPU to do twice as much work, with no thought or awareness given to the above considerations. This is of course a completely misguided model. If the underlying thesis were true, then we could carry it out ad-infinitum, and put an arbitrarily large amount of work on the GPU, and break it up into streams, with no limit to the achievable performance. Such a conclusion immediately shows the fallacy intrinsic to this thought process.

For any kernel that sufficiently utilizes GPU resources, concurrency is generally not possible, and would offer no significant improvement in performance anyway. If your kernel does not sufficiently use resources, the CUDA programmer’s first task is to expose more parallelism. Concurrency is a less efficient, last-gasp methodology to attempt to extract performance from the GPU when the quantum of work being issued is effectively, “too small”.

First sorry for I have not seen the reply in time and thank you for your detailed reply. I am not so clear about the block capacity of the device. I have seen there is 10 SMs on the device use deviceQuery, but which one represent the numbers of blocks per SM(gtx 1060).

In my cuda application I always use one thread correspond to one element and just do once computation, now I don’t know if this way is a suitable way to deal with one question.(for as you have say, the block number in all my application had over the maximum block capacity). it will be better if there is a loop for one thread to read multiple matrix element?

Finally, I have multiple datasets, I have realize the parallel computation of one dataset, if i want to compute all datasets at the same time.Can yuo give me some suggestion about the block and thread allocation, I am a little confused about the device how to select different dataset do computation.
Thank you once again~

Get the compute capability of your device from deviceQuery (it is 6.1). Then refer to this table:


Look at the row entitled:

“Maximum number of resident blocks per multiprocessor”

Then look at the column with your compute capability.

So its 32

There is nothing wrong with launching a kernel with more blocks than the maximum instantaneous capacity. In fact, in CUDA, that may be a good thing.

You may want to study GPU execution architecture and performance optimization.


OK, I almost got what you have said.I will try to do the performance optimization.

Thanks a lot