Kernel overhead time on Jetson TX1?

hoonix · January 24, 2017, 7:06am

Hi everyone!

I am evaluating CUDA performance on the Jetson TX1.

The goal of the project is to repeat an operation within a very short time (less than 4 us).

We measured the launch time of an empty kernel function (with the Ubuntu GUI stopped).

Ave: 0.658500 ms

This seems longer than I expected.

Reference site:
https://www.cs.virginia.edu/~mwb7w/cuda_support/kernel_overhead.html

Does my board have a problem?

Has anyone ever measured it?

The test code is attached below.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
//#include <cutil.h>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>
#include <helper_timer.h>

StopWatchInterface *timer = NULL;

void startTimer()
{
	sdkResetTimer(&timer);
	sdkStartTimer(&timer);
}

void endTimer(const char*str)
{
//	cudaThreadSynchronize();
	sdkStopTimer(&timer);
	float elapsed_time = sdkGetTimerValue(&timer);
	printf("[%f] - %s\n", elapsed_time, str);
}

float getEndTimer(void)
{
	cudaThreadSynchronize(); // deprecated; cudaDeviceSynchronize() is the current equivalent
	sdkStopTimer(&timer);
	return sdkGetTimerValue(&timer);
}

__global__ void kernel(int x, int y, int z)
{
	
}


int main(int argc, char** argv) {

    float min, max;
    float sum = 0;
    float elapsed_time;
    int count = 10;

	sdkCreateTimer(&timer);

	startTimer();
	kernel<<<750,1024>>>(1,2,3); 
	endTimer("first call");

	for(int i = 0; i < count; i++)
	{
		startTimer();
		kernel<<<750,1024>>>(1,2,3); 
		elapsed_time = getEndTimer();

		sum += elapsed_time;

		if(i == 0)
			min = max = elapsed_time;
		else if(elapsed_time > max)
			max = elapsed_time;
		else if(elapsed_time < min)
			min = elapsed_time;
		printf("%f\n", elapsed_time);
	}

	printf("count:%d max:%f min:%f ave:%f\n", count, max, min, sum/count);

	sdkDeleteTimer(&timer);
}

kayccc · January 24, 2017, 7:16am

Hi hoonix,

Please lock the CPU, GPU, and memory clocks to their maximum frequencies and try again. You can use the script attached in the following thread:
https://devtalk.nvidia.com/default/topic/970628/jetson-tx1/speed-of-gie-on-tx1-is-unexpected-/post/4994855/#4994855
Alternatively, see the TX1 wiki:
http://elinux.org/Jetson/TX1_Controlling_Performance

Thanks

hoonix · January 24, 2017, 8:10am

Thanks kayccc

The script helped.

But the result is still not satisfactory.

The average is now about 128 us.

According to the reference URL, shouldn't it be less than 10 us?

Have you ever tested this on the Jetson TX1?

Thanks.

hoonix · January 26, 2017, 2:10am

Hi everyone!!!

We are still testing the kernel loading time.

The kernel loading time is much better than I thought.

Are there any official benchmark tools?

I want to check whether my system is behaving normally.

Thanks.

AastaLLL · January 26, 2017, 2:45am

Hi,

If you want to measure the pure kernel launch time, make sure all the launched blocks can be resident on the hardware at the same time.
Otherwise the measurement also includes time spent waiting for hardware to become available, so a longer time is expected.

Query the hardware capacity:

./NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery/deviceQuery

Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
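
The limits printed by deviceQuery do not say how many blocks can actually be resident at once. As a rough sketch, that number can be estimated at run time with the CUDA occupancy API. The helper name `max_resident_blocks` is made up for this illustration, and the kernel is assumed to be the same empty kernel as in the test code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same empty kernel as in the test code from the first post.
__global__ void kernel(int x, int y, int z) {}

// Hypothetical helper: returns how many blocks of `block_size` threads can be
// resident on the whole GPU at once for `kernel`, or -1 if a CUDA call fails.
int max_resident_blocks(int block_size)
{
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;

    // Resident blocks per multiprocessor for this kernel and block size,
    // with no dynamic shared memory.
    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, kernel, block_size, 0) != cudaSuccess)
        return -1;

    printf("%s: %d SMs, %d resident block(s) per SM at %d threads/block\n",
           prop.name, prop.multiProcessorCount, blocks_per_sm, block_size);
    return blocks_per_sm * prop.multiProcessorCount;
}
```

Calling `max_resident_blocks(1024)` from `main` gives an upper bound for the grid size. A grid larger than the returned value cannot start all at once, so the remaining blocks wait for earlier ones to finish.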

Synchronization also adds waiting time to the measurement. With the synchronize calls commented out and a smaller grid:

line21: //  cudaThreadSynchronize();
line29: //  cudaThreadSynchronize(); 
line56:     kernel<<<63,1024>>>(1,2,3);

count:10 max:0.032000 min:0.024000 ave:0.026400
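
The timer values are in milliseconds, so the average above is about 26 us per launch. To see how much of a measurement is launch cost and how much is waiting in the synchronize call, both can be timed separately. The following is only a sketch: the function names are made up for this illustration, it uses `std::chrono` instead of the SDK timer helpers, and it assumes the iteration count is small enough that the launch queue does not fill up and block the host.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

using host_clock = std::chrono::steady_clock;

// Average host-side cost (us) of issuing one asynchronous launch.
// The GPU is not waited for inside the timed region.
double launch_only_us(int grid, int block, int iters)
{
    empty_kernel<<<grid, block>>>();   // warm-up: context creation, module load
    cudaDeviceSynchronize();

    auto t0 = host_clock::now();
    for (int i = 0; i < iters; ++i)
        empty_kernel<<<grid, block>>>();
    auto t1 = host_clock::now();
    cudaDeviceSynchronize();           // drain the queue outside the timed region

    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

// Average time (us) of one launch followed by a full device synchronize,
// which is what the original loop measured.
double launch_and_sync_us(int grid, int block, int iters)
{
    empty_kernel<<<grid, block>>>();   // warm-up
    cudaDeviceSynchronize();

    auto t0 = host_clock::now();
    for (int i = 0; i < iters; ++i) {
        empty_kernel<<<grid, block>>>();
        cudaDeviceSynchronize();
    }
    auto t1 = host_clock::now();

    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}
```

Comparing, for example, `launch_only_us(63, 1024, 10)` with `launch_and_sync_us(750, 1024, 10)` shows how the grid size and the synchronize call each contribute to the total.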
