cost for launching (a lot of) CUDA kernels

Hi,

I’m writing an application that needs to launch many kernels (163840 in total). There are 512 different tasks, which means I have 320 kernel launches per task; within a task the launches are not independent and must be executed in order. I use streams to allow the 512 tasks to be processed in parallel. Here is a minimal code snippet (which actually does nothing):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <cuda.h>

__global__ void test_kernel() {}

int main()
{
	clock_t t = clock();

	// one stream per task, 320 in-order launches per stream
	cudaStream_t s[ 512 ];
	for( int i = 0; i < 512; i++ )
	{
		cudaStreamCreate( &s[ i ] );
		for( int j = 0; j < 320; j++ )
			test_kernel<<< dim3( 20, 5 ), dim3( 32, 16 ), 0, s[ i ] >>>();
	}

	printf( "initialization done after %fs\n", ( clock() - t ) / ( float ) CLOCKS_PER_SEC ); fflush( stdout );

	cudaThreadSynchronize();

	printf( "computation done after %fs\n", ( clock() - t ) / ( float ) CLOCKS_PER_SEC ); fflush( stdout );

	return 0;
}

Output:

initialization done after 0.906000s

computation done after 0.922000s

I always thought that CUDA calls are executed in the background, but instead the whole initialization takes about 1 second. If the kernel has additional parameters, the situation is even worse. The call to cudaThreadSynchronize finishes (almost) immediately. It seems that queuing the kernel launches takes a lot of time. If I remove the kernel launches, the program finishes after a couple of milliseconds.

The kernel itself is a rather short program (in the example above it is even empty), so the total time for the initialization seems to be longer than the time for the actual execution of the kernels.
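To check where the time goes, I plan to separate the host-side queuing time from the GPU-side execution time with CUDA events around the launch loop, roughly like this (just a sketch with a single stream; the 1000 launches are an arbitrary number):

#include <stdio.h>
#include <time.h>

#include <cuda.h>

__global__ void test_kernel() {}

int main()
{
	cudaEvent_t start, stop;
	cudaEventCreate( &start );
	cudaEventCreate( &stop );

	clock_t t = clock();
	cudaEventRecord( start, 0 );

	// queue the launches in the default stream
	for( int i = 0; i < 1000; i++ )
		test_kernel<<< dim3( 20, 5 ), dim3( 32, 16 ) >>>();

	// host-side time spent just queuing the launches
	printf( "host-side queuing: %fs\n", ( clock() - t ) / ( float ) CLOCKS_PER_SEC );

	cudaEventRecord( stop, 0 );
	cudaEventSynchronize( stop );

	// GPU-side time between the two events, i.e. actual execution
	float ms = 0.0f;
	cudaEventElapsedTime( &ms, start, stop );
	printf( "GPU-side execution: %fs\n", ms / 1000.0f );

	cudaEventDestroy( start );
	cudaEventDestroy( stop );
	return 0;
}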

Does anyone have an idea how to speed up the process?

Thanks,

porst17

As far as I remember, the launch queue is something like 20 kernels deep. So after 20 kernel launches, each subsequent launch has to wait for the (N-20)th kernel to finish.
So cudaThreadSynchronize only has to wait for about 20 kernels to be launched and finished.

Thanks for your reply.

Hmmm … 20 kernel launches is not that much, especially with a total of 163840 launches. But I still wonder why the kernel launches take so long. What is the real problem here? What happens if the queue already contains 20 elements and I launch another kernel? Does the application block until one kernel has finished? In that case most of the time is spent on (synchronization) communication between host and device, isn’t it?

Regards,
porst17

As far as I understand: yes. Kernel launches themselves are quite quick; I believe I have seen numbers like 5 microseconds floating around. cudaThreadSynchronize is quite expensive, and if you run out of room in the queue you get implicit cudaThreadSynchronize-style waits.

You can test how deep the queue is, I think, by printf-ing the elapsed time and the kernel number after each launch. The launches will be very fast up to a certain number; if your kernels take a long time, that number is the queue depth :)
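Roughly something like this (untested sketch; the loop bound in the kernel and the 64 launches are arbitrary, the kernel just has to run long enough for a full queue to become visible):

#include <stdio.h>
#include <time.h>

#include <cuda.h>

// kernel that burns time so the launch queue can fill up
__global__ void slow_kernel()
{
	volatile float x = 0.0f;
	for( int i = 0; i < 10000000; i++ )
		x += 1.0f;
}

int main()
{
	for( int i = 0; i < 64; i++ )
	{
		clock_t t = clock();
		slow_kernel<<< 1, 1 >>>();
		// as long as the queue has room, the launch returns almost
		// immediately; once it is full, the launch blocks until a
		// queued kernel finishes, so the printed time jumps
		printf( "launch %d took %fs\n", i, ( clock() - t ) / ( float ) CLOCKS_PER_SEC );
	}
	cudaThreadSynchronize();
	return 0;
}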
