Why streams can't run concurrently

Hamakaze · March 20, 2018, 11:45am

I’m working on a ray tracer using CUDA 8.0 and the call of the kernel is:

extern "C" float CudaRender(CudaScene& scene, int w, int h, CudaVec* output)
{
	printf("CudaRender\n");
	fflush(stdout);
	if (output == nullptr) return 0.0;
	if (scene.geolist.size() == 0 && scene.spherelist.size() == 0) return 0.0;

	CudaVec* dev_result = nullptr;
	cudaMalloc(&dev_result, sizeof(CudaVec)*w*h);

	//CudaVec* dev_result_temp = nullptr;
	//cudaMalloc(&dev_result_temp, sizeof(CudaVec)*w*h*SampleNum);

	float elapsed = 0;
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord(start, 0);
	fflush(stdout);
	cudaStream_t stream[CUDA_XN*CUDA_YN];

        // dev_randstates is already generated using curand_init in another kernel.
	for (int i = 0; i < CUDA_XN; i++)
	{
		for (int j = 0; j < CUDA_YN; j++)
		{
			cudaStreamCreate(&stream[i + CUDA_XN * j]);
			CudaMonteCarloRender<<<dim3(w / CUDA_XN, h / CUDA_YN), SampleNum, 0, stream[i + CUDA_XN * j]>>>(
				w/ CUDA_XN*i, h/ CUDA_YN*j,
				dev_geos, scene.geolist.size(), // all the triangles and the number of triangles
				dev_spheres, scene.spherelist.size(), // all the spheres and the number of spheres
				scene.camera, w, h, dev_result, dev_randstates);
		}
	}
	cudaDeviceSynchronize();

	for (int i = 0; i < CUDA_XN*CUDA_YN; i++)
	{
		cudaStreamDestroy(stream[i]);
	}
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&elapsed, start, stop);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	printf("The elapsed time in gpu was %.2f ms\n", elapsed);

	fflush(stdout);
	printf("Rendered\n");
	fflush(stdout);
	cudaMemcpy(output, dev_result, sizeof(CudaVec)*w*h, cudaMemcpyDeviceToHost);

	cudaFree(dev_result);

	return elapsed;
}

According to the programming guide, kernels should run concurrently on the device, just as they do in the concurrentKernels sample. But the kernels in the program above don’t overlap on my computer. I use VS2015 and CUDA 8.0 on a GTX 1050 Ti card. I don’t know how to insert an image here, but in the profiler it looks like this:

Stream
default
stream14 ====
stream15     ====
stream16         ====
stream17             ====
stream18                 ====
stream19                     ====
stream20                           ====
stream21                                ====

Streams are created but they don’t run in parallel. The grid size is [30, 30, 1] and the block size is [8, 1, 1]. I think this is definitely not too large for my card. So why is this? I’ve read the concurrentKernels sample and it works for me, but I can’t figure out what’s wrong with this.

Thanks!!

Robert_Crovella · March 20, 2018, 4:39pm

30x30 = 900 blocks is enough to fill up many GPUs, and prevent much overlap of kernels. The concurrentKernel sample code does not launch kernels that are that large. The GPU does not have infinite resources.

Also, a block size of 8 threads is relatively inefficient for GPU processing. You might want to investigate ways to see if you can get that up to 64 or larger.

BulatZiganshin · March 21, 2018, 9:24pm

Since the warp size is 32, a block size of 8 doesn’t use 3/4 of the GPU’s resources at all.
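To put a number on that, here is a rough host-side sketch in plain C++. It assumes only the 32-wide warp mentioned above; the helper names `warpsPerBlock` and `laneUtilization` are made up for illustration and are not part of the CUDA API.

```cpp
#include <cassert>

// Assumption: warp size of 32, as stated in this thread.
constexpr int kWarpSize = 32;

// Number of warps occupied by one block (a partial warp still counts as a full one).
int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Fraction of the occupied warp lanes that actually hold a thread.
double laneUtilization(int threadsPerBlock) {
    const int lanes = warpsPerBlock(threadsPerBlock) * kWarpSize;
    return static_cast<double>(threadsPerBlock) / lanes;
}
```

With these helpers, `laneUtilization(8)` is 0.25 (one warp with 8 of 32 lanes in use), while `laneUtilization(64)` is 1.0.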

Robert_Crovella · March 22, 2018, 1:08am

The maximum number of resident blocks per multiprocessor is 32:

Programming Guide :: CUDA Toolkit Documentation

This is independent of block size.

The 1050 Ti has 6 SMs. 6 SMs x 32 blocks is a maximum load of 192 blocks. A GPU would need more than 28 SMs before a kernel launch of 900 blocks would not “fill” it.

I claim that a kernel launch of 900 blocks could easily fill that GPU.

Even if we only had the warp limit, it is 64 warps per multiprocessor. A block of 8 threads still occupies one full warp, so that means 6 SMs x 64 blocks = 384 blocks max for that GPU, before it is “full”.
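The arithmetic above can be sketched as a small host-side C++ helper. This is only a rough model: it looks at the block and warp limits quoted in this post and ignores registers and shared memory, which can lower the real number further. The `SmLimits` struct and the function names are hypothetical, not CUDA API.

```cpp
#include <algorithm>
#include <cassert>

// Per-GPU limits, filled in by hand from the figures quoted in this thread.
struct SmLimits {
    int smCount;         // number of multiprocessors (6 for the 1050 Ti)
    int maxBlocksPerSm;  // resident block limit per SM (32)
    int maxWarpsPerSm;   // resident warp limit per SM (64)
};

// Blocks that fit on one SM, considering only the block and warp limits.
int residentBlocksPerSm(const SmLimits& lim, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    const int warps = (threadsPerBlock + 31) / 32;  // assumes 32-wide warps
    return std::min(lim.maxBlocksPerSm, lim.maxWarpsPerSm / warps);
}

// Blocks that can be resident on the whole GPU at once.
int residentBlocksOnGpu(const SmLimits& lim, int threadsPerBlock) {
    return lim.smCount * residentBlocksPerSm(lim, threadsPerBlock);
}

// How many times the GPU must be refilled to run a grid of gridBlocks blocks.
int wavesNeeded(const SmLimits& lim, int threadsPerBlock, int gridBlocks) {
    const int resident = residentBlocksOnGpu(lim, threadsPerBlock);
    assert(resident > 0);
    return (gridBlocks + resident - 1) / resident;
}
```

For `SmLimits{6, 32, 64}` and 8-thread blocks this gives 32 blocks per SM and 192 on the GPU, so a single 900-block kernel already needs 5 full waves, leaving little room for a second kernel to overlap.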

BulatZiganshin · March 22, 2018, 1:26am

I don’t disagree with you. I just mentioned to the topic starter that he can’t use all of the GPU’s resources with 8-wide blocks, so that he doesn’t take your suggestion to use larger blocks lightly.
