Performance spikes on Jetson TX1 using CUDA multithreading

Calling the function below on a Jetson TX1 causes performance spikes when it runs on a worker thread, but timing is stable when it runs on the main thread. Removing cudaDeviceSynchronize() also gives stable timing on the thread, but most probably only because the kernel execution time is then no longer being measured.
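For reference, the same work could also be timed with CUDA events, which measure on the GPU timeline and are unaffected by host-thread scheduling. A minimal sketch (untested on the TX1; it would have to live in estimate.cu, since it launches the kernel directly):

cudaEvent_t evStart, evStop;
cudaEventCreate(&evStart);
cudaEventCreate(&evStop);

cudaEventRecord(evStart);                    // mark start on the GPU timeline
for (int i = 0; i < 20; i++)
    estimateKernel<<<50, 500>>>();
cudaEventRecord(evStop);                     // mark end on the GPU timeline
cudaEventSynchronize(evStop);                // wait for the stop event to complete

float ms = 0.0f;
cudaEventElapsedTime(&ms, evStart, evStop);  // GPU-side elapsed time in milliseconds
printf("Kernel-only timing: %f ms\n", ms);

cudaEventDestroy(evStart);
cudaEventDestroy(evStop);

If these numbers stay flat while the std::chrono numbers spike, the jitter is on the host side (thread scheduling) rather than in the kernel itself.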

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

void estimateStep(); // defined in estimate.cu

std::thread t;
std::atomic<bool> stop{false};

void Start()
{
    while (true)
    {
        if (stop)
            break;

        auto tick1 = std::chrono::duration_cast<std::chrono::microseconds>(
                         std::chrono::high_resolution_clock::now().time_since_epoch()).count();

        for (int i = 0; i < 20; i++)
            estimateStep();

        auto tock1 = std::chrono::duration_cast<std::chrono::microseconds>(
                         std::chrono::high_resolution_clock::now().time_since_epoch()).count();

        std::cout << "ICP Timing: " << ((tock1 - tick1) / 1000.0f) << std::endl;
    }
}

extern "C" int Start_RS()
{
    t = std::thread(Start); // Adding t.join() after this, or any other blocking call, gives stable timing as well
    //Start(); // Calling this directly (on the main thread) also gives stable timing
    return 0;
}

extern "C" int Stop_RS()
{
    stop = true;
    t.join();
    return 0;
}

//estimate.cu

__global__ void estimateKernel()
{
    //Test: empty kernel body
}

void estimateStep()
{
    estimateKernel<<<50, 500>>>();
    cudaDeviceSynchronize(); // block the calling CPU thread until the kernel completes
}

//Timing (ms): Calling the Start function on a worker thread
ICP Timing: 0.651
ICP Timing: 0.859
ICP Timing: 9.032
ICP Timing: 66.029
ICP Timing: 57.577
ICP Timing: 64.601
ICP Timing: 60.967
ICP Timing: 64.266
ICP Timing: 48.34
ICP Timing: 35.074
ICP Timing: 33.066
ICP Timing: 39.791
ICP Timing: 0.664
ICP Timing: 0.623
ICP Timing: 0.619

//Timing (ms): Calling the Start function from main
ICP Timing: 4.565
ICP Timing: 3.899
ICP Timing: 2.383
ICP Timing: 4.519
ICP Timing: 2.559
ICP Timing: 5.237
ICP Timing: 2.647
ICP Timing: 3.334
ICP Timing: 3.653
ICP Timing: 5.125
ICP Timing: 3.482
ICP Timing: 2.377
ICP Timing: 4.581
ICP Timing: 2.592
ICP Timing: 5.603
ICP Timing: 2.374
ICP Timing: 3.87
ICP Timing: 2.923
ICP Timing: 5.239

I have also tried maximizing CPU performance as described at Jetson/Performance - eLinux.org, but the results are the same.
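One thing I have not yet tried is pinning the worker thread to a single core, in case scheduler migration between cores is causing the jitter. Something along these lines might help (PinThreadToCore is a hypothetical helper using the POSIX affinity API, untested on the TX1):

#include <pthread.h>
#include <sched.h>
#include <thread>

// Hypothetical helper: pin a std::thread to one CPU core so the scheduler
// cannot migrate it between cores mid-measurement.
void PinThreadToCore(std::thread& th, int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(th.native_handle(), sizeof(cpu_set_t), &cpuset);
}

// Usage after creating the thread in Start_RS():
//   t = std::thread(Start);
//   PinThreadToCore(t, 0);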

Any suggestions, please?

Hi,

Looks like 'estimateKernel<<<50, 500>>>()' has room for tuning.
You can run ./deviceQuery, located at ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery, to get more GPU hardware information.
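For example, you can let the runtime suggest a block size via the occupancy API instead of hard-coding <<<50, 500>>>. A sketch, assuming CUDA 8.0's cudaOccupancyMaxPotentialBlockSize (note that 500 is not a multiple of the warp size of 32, so part of the last warp in every block is idle):

int minGridSize = 0, blockSize = 0;
// Ask the runtime for the block size that maximizes occupancy for this kernel.
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, estimateKernel, 0, 0);

int totalThreads = 50 * 500;  // keep the same total thread count as the current launch
int gridSize = (totalThreads + blockSize - 1) / blockSize;
estimateKernel<<<gridSize, blockSize>>>();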

Here are some tutorials for your reference:
Course: Udacity
Doc: Maxwell Tuning Guide :: CUDA Toolkit Documentation