Kernel overhead time on Jetson TX1?

Hi everyone!

I am evaluating CUDA performance on the Jetson TX1.

The goal of the project is to repeat operations within a short time (less than 4 us).

We measured the loading (launch) time of an empty kernel function, with the Ubuntu GUI stopped.

Ave: 0.658500 ms

This time seems longer than I expected.

Reference site:

Does my board have a problem?

Has anyone ever measured it?

The test code is attached below.

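#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
//#include <cutil.h>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>
#include <helper_timer.h>

StopWatchInterface *timer = NULL;

// Create the timer on first use, then reset and start it.
void startTimer()
{
	if (timer == NULL)
		sdkCreateTimer(&timer);
	sdkResetTimer(&timer);
	sdkStartTimer(&timer);
}

// Stop the timer and print the elapsed time (ms) with a label.
void endTimer(const char *str)
{
//	cudaThreadSynchronize();
	sdkStopTimer(&timer);
	float elapsed_time = sdkGetTimerValue(&timer);
	printf("[%f] - %s\n", elapsed_time, str);
}

// Stop the timer and return the elapsed time (ms).
float getEndTimer(void)
{
//	cudaThreadSynchronize();
	sdkStopTimer(&timer);
	return sdkGetTimerValue(&timer);
}

// Empty kernel: only the launch overhead is being measured.
__global__ void kernel(int x, int y, int z)
{
}

int main(int argc, char** argv) {

	float min = 0, max = 0;
	float sum = 0;
	float elapsed_time;
	int count = 10;

	// The first launch includes one-time CUDA initialization, so it is timed separately.
	// NOTE: the launch configuration did not survive in this post; <<<1, 1>>> is a placeholder.
	startTimer();
	kernel<<<1, 1>>>(1, 2, 3);
	endTimer("first call");

	for(int i = 0; i < count; i++)
	{
		startTimer();
		kernel<<<1, 1>>>(1, 2, 3);
		elapsed_time = getEndTimer();

		sum += elapsed_time;

		if(i == 0)
			min = max = elapsed_time;
		else if(elapsed_time > max)
			max = elapsed_time;
		else if(elapsed_time < min)
			min = elapsed_time;
		printf("%f\n", elapsed_time);
	}

	printf("count:%d max:%f min:%f ave:%f\n", count, max, min, sum/count);

	sdkDeleteTimer(&timer);
	return 0;
}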


Hi hoonix,

Please fix the CPU/GPU/memory clocks at their maximum frequencies and try again. You can refer to the script attached in the following thread:
You can also refer to the TX1 wiki:


Thanks kayccc

This script is effective.

But the result is still not at a satisfactory level.

At present the average is about 128 us.

According to the reference URL, shouldn't it be less than 10 us?

Have you ever tested it on the Jetson TX1?


Hi everyone!!!

We are still testing the kernel loading time.

The kernel loading time is much better than I thought.

Are there any official benchmark tools?

I want to test whether my system is normal.



If you want to measure the pure kernel loading time, make sure that all launched jobs can fit onto the hardware at the same time.
Otherwise, expect a longer execution time, because some jobs have to wait for hardware to become available.

Query the hardware capacity:


Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
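
For anyone following along, these limits can also be read from a program. The sketch below is only an illustration, not the code used in this reply: it prints the same two limits through cudaGetDeviceProperties and then asks the occupancy API how many blocks of an empty kernel can be resident at once. Device index 0 and the block size of 1024 are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel with the same signature as the one in the test code.
__global__ void kernel(int x, int y, int z) {}

int main() {
    const int device = 0;  // assumption: the integrated GPU is device 0
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Max dimension size of a thread block (x,y,z): (%d, %d, %d)\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max dimension size of a grid size (x,y,z): (%d, %d, %d)\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Multiprocessors: %d, max threads per multiprocessor: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor);

    // How many blocks of this kernel can be resident on one multiprocessor
    // at the chosen block size (assumption: 1024 threads per block).
    const int blockSize = 1024;
    int blocksPerSM = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, blockSize, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Resident blocks at block size %d: %d per multiprocessor, %d in total\n",
           blockSize, blocksPerSM, blocksPerSM * prop.multiProcessorCount);
    return 0;
}
```

A grid larger than the reported total cannot run all at once, so the extra blocks wait for earlier ones to finish.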

Synchronizing also costs waiting time, so I left the synchronize calls commented out and changed the launch configuration:

line21: //  cudaThreadSynchronize();
line29: //  cudaThreadSynchronize(); 
line56:     kernel<<<63,1024>>>(1,2,3);

count:10 max:0.032000 min:0.024000 ave:0.026400
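
A note on reading numbers like these: a kernel launch is asynchronous, so a timer stopped without a synchronize only measures how long the launch call takes to return, while a timer stopped after a synchronize also includes the wait for the kernel to finish. Here is a minimal sketch that reports both. It assumes an empty kernel, a placeholder <<<1, 1>>> launch configuration and 1000 iterations, none of which come from the posts above.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: any measured time is launch and synchronization overhead.
__global__ void kernel(int x, int y, int z) {}

using Clock = std::chrono::steady_clock;

// Convert a time interval to microseconds.
static double toUs(Clock::duration d) {
    return std::chrono::duration<double, std::micro>(d).count();
}

int main() {
    const int iterations = 1000;  // assumption: enough repeats to smooth out jitter

    // Warm-up launch so one-time CUDA initialization is not counted.
    kernel<<<1, 1>>>(1, 2, 3);  // <<<1, 1>>> is a placeholder configuration
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "warm-up failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    double launchOnlyUs = 0.0;     // time until the launch call returns
    double launchAndWaitUs = 0.0;  // time until the kernel has finished
    for (int i = 0; i < iterations; i++) {
        Clock::time_point t0 = Clock::now();
        kernel<<<1, 1>>>(1, 2, 3);
        Clock::time_point t1 = Clock::now();  // launch returned; kernel may still be pending
        cudaDeviceSynchronize();
        Clock::time_point t2 = Clock::now();  // kernel has completed
        launchOnlyUs += toUs(t1 - t0);
        launchAndWaitUs += toUs(t2 - t0);
    }

    printf("iterations:%d launch-only ave:%f us launch+sync ave:%f us\n",
           iterations, launchOnlyUs / iterations, launchAndWaitUs / iterations);
    return 0;
}
```

Which of the two figures matters depends on whether the host needs the result before it continues. For a repeat-within-4-us goal, the launch+sync figure is probably the relevant one.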