Why can't streams run concurrently?

I’m working on a ray tracer using CUDA 8.0, and the kernel launch code is:

extern "C" float CudaRender(CudaScene& scene, int w, int h, CudaVec* output)
{
	if (output == nullptr) return 0.0f;
	if (scene.geolist.size() == 0 && scene.spherelist.size() == 0) return 0.0f;

	CudaVec* dev_result = nullptr;
	cudaMalloc(&dev_result, sizeof(CudaVec)*w*h);

	//CudaVec* dev_result_temp = nullptr;
	//cudaMalloc(&dev_result_temp, sizeof(CudaVec)*w*h*SampleNum);

	float elapsed = 0;
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord(start, 0);
	cudaStream_t stream[CUDA_XN*CUDA_YN];

	// dev_randstates is already generated using curand_init in another kernel.
	for (int i = 0; i < CUDA_XN; i++) {
		for (int j = 0; j < CUDA_YN; j++) {
			cudaStreamCreate(&stream[i + CUDA_XN * j]);
			// one kernel per image tile, each launched in its own stream
			CudaMonteCarloRender<<<dim3(w/CUDA_XN, h/CUDA_YN), SampleNum, 0, stream[i + CUDA_XN * j]>>>(
				w/CUDA_XN*i, h/CUDA_YN*j,
				dev_geos, scene.geolist.size(),       // all the triangles and the number of triangles
				dev_spheres, scene.spherelist.size(), // all the spheres and the number of spheres
				scene.camera, w, h, dev_result, dev_randstates);
		}
	}

	for (int i = 0; i < CUDA_XN*CUDA_YN; i++)
		cudaStreamSynchronize(stream[i]);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&elapsed, start, stop);
	printf("The elapsed time in gpu was %.2f ms\n", elapsed);

	cudaMemcpy(output, dev_result, sizeof(CudaVec)*w*h, cudaMemcpyDeviceToHost);

	for (int i = 0; i < CUDA_XN*CUDA_YN; i++)
		cudaStreamDestroy(stream[i]);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(dev_result);

	return elapsed;
}

According to the programming guide, kernels should run concurrently on the device, just as the concurrentKernels sample demonstrates. But the program above doesn’t work on my computer. I use VS2015 and CUDA 8.0 on a GTX 1050 Ti card. I don’t know how to insert an image here, but in the profiler it looks like this:

stream14 ====
stream15     ====
stream16         ====
stream17             ====
stream18                 ====
stream19                     ====
stream20                           ====
stream21                                ====

Streams are created, but they don’t run in parallel. The grid size is [30, 30, 1] and the block size is [8, 1, 1]. I think this is definitely not too large for my card. So why is this? I’ve read the concurrentKernels sample and it works for me, but I can’t figure out what’s wrong with this.


30x30 = 900 blocks is enough to fill up many GPUs and prevent much overlap of kernels. The concurrentKernels sample code does not launch kernels that large. The GPU does not have infinite resources.

Also, a block size of 8 threads is relatively inefficient for GPU processing. You might want to investigate whether you can get that up to 64 or larger.

Since the warp size is 32, a block size of 8 leaves 3/4 of each warp's lanes idle, so 3/4 of the GPU's execution resources go unused.

The maximum number of resident blocks per multiprocessor is 32, and this is independent of block size.

The 1050 Ti has 6 SMs. 6 SMs x 32 blocks is a maximum load of 192 blocks. A GPU would need more than 28 SMs before a kernel launch of 900 blocks would not “fill” it.

I claim that a kernel launch of 900 blocks could easily fill that GPU.

Even if only the warp limit applied, it is 64 warps per SM. With 8-thread blocks each block occupies a single warp, so that means at most 384 blocks for that GPU before it is “full”.

I don’t disagree with you. I just pointed out to the topic starter that he can’t use all GPU resources with an 8-wide block, so that he doesn’t take your suggestion to use larger blocks lightly.