Why do CUDA kernel calls take so long?

I have a set of very small kernels in my CUDA code.

The reason they are so small is that I need global synchronization between operations, so I split the code into separate kernels to achieve it. This way, for instance, you make sure all the blocks of threads of one kernel finish before the blocks of the next one start.
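To illustrate the launch pattern, the host side looks roughly like this (a simplified sketch; step_4a is just a placeholder name for the step before step_4b, which is shown in full below):

#include <cuda_runtime.h>

// Placeholder stubs standing in for two consecutive steps of the algorithm
// (the real step_4b is shown below).
__global__ void step_4a() { /* ... */ }
__global__ void step_4b() { /* ... */ }

int main()
{
	const int n = 1024, n_threads = 64, n_blocks = n / n_threads;

	// Consecutive launches into the same (default) stream execute in order:
	// all blocks of step_4a finish before any block of step_4b starts,
	// which gives the grid-wide synchronization between operations.
	step_4a<<<n_blocks, n_threads>>>();
	step_4b<<<n_blocks, n_threads>>>();
	cudaDeviceSynchronize();
	return 0;
}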

Here is an example of a kernel I use:

// column_of_zero_at_row, column_of_prime_at_row, column_of_star_at_row,
// cover_row, cover_column, found and goto_5 are __device__ globals declared elsewhere.
__global__ void step_4b(){
	int l = blockDim.x * blockIdx.x + threadIdx.x;

	int c0 = column_of_zero_at_row[l];
	if (c0>=0)
	{
		column_of_prime_at_row[l] = c0;
		int c = column_of_star_at_row[l];
		if (c >= 0) {
			cover_row[l] = 1;
			cover_column[c] = 0;
			found = false;
		}
		else
			goto_5 = true;
	}

	column_of_zero_at_row[l] = -1;
}

The problem is that these kernels take 50 to 150 us to run, while all they do is a handful of global memory reads and writes. Assuming a global memory latency of about 0.5 us, that would give at most 4 us for 8 accesses, even assuming they are all performed serially.

So the time the kernels take can only be due to kernel call overhead and synchronization. Why does it take so long? Since the kernels are run many times, I would expect them to be compiled and loaded into GPU memory once, so that each call would just take a simple command. Regarding synchronization, pending writes to global memory need to complete, but that is about it.

We also tried using dynamic parallelism to take the operating system out of the loop, but there is little or no improvement.

Windows WDDM can introduce considerable overhead, possibly up to 50 us per kernel call.
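
If you want to confirm which driver model your device is running under, something along these lines should work (checking device 0 via cudaDeviceGetAttribute):

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
	// tcc == 1 means the TCC driver model, tcc == 0 means WDDM
	// (the only option for GeForce parts under Windows).
	int tcc = 0;
	cudaDeviceGetAttribute(&tcc, cudaDevAttrTccDriver, 0);
	printf("Device 0 driver model: %s\n", tcc ? "TCC" : "WDDM");
	return 0;
}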

I also don’t think your assumptions about memory latency, and how you compute kernel performance from them, make much sense without a lot more data. Even a small kernel with just a few operations can be bandwidth-bound, not latency-bound, depending on the exact access pattern (impossible to determine from your code) and the scope of the grid (impossible to determine from what you have described). So asking others to explain something that may not be true, with incomplete data, may not be very productive.

OK, I want to rephrase the question:

I have a set of very small kernels in my CUDA code.

The reason they are so small is that I need global synchronization between operations, so I split the code into separate kernels to achieve it. This way, for instance, you make sure all the blocks of threads of one kernel finish before the blocks of the next one start.

I measured the time it takes to run an empty kernel to get an estimate of the kernel launch overhead. I got about 8 us. Why does it take so long? Since the kernels are run many times, I would expect them to be compiled and loaded into GPU memory once, so that each call would just take a simple command. Regarding synchronization, pending writes to global memory need to complete, but that is about it. Also, we use dynamic parallelism to take the operating system out of the loop.

Here is the code I used to measure the time to run the empty kernel:

// Launch_overhead -- 17-7-2017
// To test the kernel/block launch overhead.
// Uses dynamic parallelism (compile with relocatable device code, -rdc=true, and link cudadevrt)

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <device_functions.h>
#include <stdlib.h>
#include <stdio.h>


const int n = 1024;
const int n_threads = 64;
const int n_blocks = n / n_threads;

// -------------------------------------------------------------------------------------
// Device code
// -------------------------------------------------------------------------------------

__global__ void kernel(){
}

// Returns the current time from the global nanosecond timer (%globaltimer)
__device__ inline long long int d_get_globaltime(void) {
	long long int ret;
	asm volatile ("mov.u64 %0, %%globaltimer;" : "=l"(ret));
	return ret;
}

// Returns the timer period in milliseconds (the global timer ticks in nanoseconds)
__device__ inline double d_get_timer_period(void) {
	return 1.0e-6;
}

__global__ void Algorithm()
{
	long long time_start = d_get_globaltime();
		
	// 1000 back-to-back child-kernel launches; with no synchronization inside
	// the loop this measures only the launch/queueing cost, not completion.
	for (int i = 0; i < 1000; i++) {
		kernel<<<n_blocks, n_threads>>>();
		// cudaDeviceSynchronize(); This doubles the time!
	}

	long long time_stop = d_get_globaltime();

	printf("Total time(ms) \t %g\n", d_get_timer_period() * (time_stop - time_start));
}

int main()
{
	Algorithm<<<1, 1>>>();
	cudaDeviceSynchronize();
}

This was run on a Windows laptop with Visual Studio and a GeForce 930M GPU, but we also tried different PCs and GPUs with similar results.
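
For comparison, here is a rough host-side version of the same measurement using CUDA events (the elapsed time is the device-side time to get through 1000 back-to-back empty grids, so it reflects the per-launch turnaround on the GPU rather than the cost of the host API call):

// Host-side variant of the launch-overhead test, timed with CUDA events.
#include <cuda_runtime.h>
#include <stdio.h>

const int n = 1024;
const int n_threads = 64;
const int n_blocks = n / n_threads;

__global__ void kernel() {}

int main()
{
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start);
	for (int i = 0; i < 1000; i++)
		kernel<<<n_blocks, n_threads>>>();
	cudaEventRecord(stop);
	cudaEventSynchronize(stop);

	float total_ms = 0.0f;
	cudaEventElapsedTime(&total_ms, start, stop);
	// With 1000 launches, the total in ms is numerically the per-launch time in us.
	printf("Total time(ms) \t %g\n", total_ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	return 0;
}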