Latency when running CUDA code

Hello

I am a CUDA beginner and I am having trouble optimizing an algorithm. The original algorithm takes 7.812e-6 s. I am trying to parallelize it (the function is called at least 1e9 times).
My first question is: is it worthwhile to parallelize such an algorithm?

I have coded and run a parallelized version of this program, but I haven't been able to get an execution time better than 6e-5 s. I then tried removing almost all of my code (basically one nearly empty kernel and one CUDA copy) and still couldn't get below 5e-5 s. Is there some latency inherent to the CUDA runtime environment, or is it possible to get execution times under 10e-6 s (10 microseconds)?

Thank you for your expertise.
Regards

High speed trader?

Depending on your hardware, you may have the option of sharing some system memory between the host CPU and the GPU (mapped, zero-copy memory).

If this is the case, consider running the kernel in a spinloop (be sure to disable the GPU watchdog timer), and starting the computation only when the CPU has copied the input data into the segment of mapped memory (use proper locking techniques to ensure the data is available in full).

This avoids the kernel launch overhead entirely. Finalization of the computation has to be signalled by the GPU by setting some completion flag.
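
A rough sketch of what I mean, assuming mapped (zero-copy) host memory, a single block, and a disabled watchdog; the names (persistentKernel, goFlag, doneFlag), the buffer size, and the computation are placeholders, and error checking is omitted:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void persistentKernel(volatile int *goFlag, volatile int *doneFlag,
                                 float *data, int n)
{
    // Spin until the host signals that the input data is in place.
    while (*goFlag == 0) { }

    // Placeholder computation; the real work goes here.
    int i = threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;

    __syncthreads();
    __threadfence_system();      // make the results visible to the host
    if (threadIdx.x == 0)
        *doneFlag = 1;           // completion flag for the host to poll
}

int main()
{
    const int n = 256;
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped host memory

    int *goFlag, *doneFlag;
    float *data;
    cudaHostAlloc((void **)&goFlag,   sizeof(int),       cudaHostAllocMapped);
    cudaHostAlloc((void **)&doneFlag, sizeof(int),       cudaHostAllocMapped);
    cudaHostAlloc((void **)&data,     n * sizeof(float), cudaHostAllocMapped);
    *goFlag = 0;
    *doneFlag = 0;

    int *dGo, *dDone;
    float *dData;
    cudaHostGetDevicePointer((void **)&dGo,   goFlag,   0);
    cudaHostGetDevicePointer((void **)&dDone, doneFlag, 0);
    cudaHostGetDevicePointer((void **)&dData, data,     0);

    // Launch once; the kernel sits in its spin loop waiting for work.
    persistentKernel<<<1, n>>>(dGo, dDone, dData, n);

    // "Copy" the input into the mapped segment, then release the kernel.
    // In real code, use proper synchronization so the data is available in full.
    for (int i = 0; i < n; ++i) data[i] = (float)i;
    ((volatile int *)goFlag)[0] = 1;

    while (((volatile int *)doneFlag)[0] == 0) { }    // busy-wait for the completion flag
    printf("data[1] = %f\n", data[1]);                // result is already in host memory

    cudaDeviceSynchronize();
    return 0;
}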

Only expect latency speed-ups over the CPU if your algorithm is extremely parallelizable (say, a Monte Carlo simulation where each thread computes one sample). Also, for a fair comparison, consider writing an AVX2-enabled version of the same code running on all CPU cores simultaneously. I bet it will be difficult to see the GPU being much faster in this case.

Can you show your code? 50 microseconds for an empty kernel and a copy sounds a bit on the high side. Are you using a Windows platform with the default WDDM driver by any chance? If so, you would want to switch to the TCC driver (note: not supported on consumer GPUs, as best I know) or to a Linux platform.

The minimal launch latency for an empty kernel should be around 5 microseconds with the TCC driver or Linux. Copy speed will depend on the transfer size. In general you would want to construct a processing pipeline with complete overlap between copies and kernel execution by using asynchronous copies and streams. You may need dual copy engines for best results as this facilitates simultaneous CPU->GPU and GPU->CPU copies.
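
For illustration only, a minimal sketch of such a pipeline might look like the following (pinned host buffers, one stream per chunk; the kernel process(), the chunk size, and the number of streams are arbitrary placeholders, and error checking is omitted):

#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;   // placeholder computation
}

int main()
{
    const int nStreams = 4;
    const int chunk    = 1 << 20;           // elements per chunk
    const int total    = nStreams * chunk;

    float *hIn, *hOut, *dBuf;
    cudaHostAlloc((void **)&hIn,  total * sizeof(float), cudaHostAllocDefault);  // pinned
    cudaHostAlloc((void **)&hOut, total * sizeof(float), cudaHostAllocDefault);  // pinned
    cudaMalloc((void **)&dBuf, total * sizeof(float));
    for (int i = 0; i < total; ++i) hIn[i] = (float)i;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each chunk gets its own stream, so H2D copies, kernel execution, and
    // D2H copies of different chunks can overlap (dual copy engines help here).
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(dBuf + off, hIn + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dBuf + off, chunk);
        cudaMemcpyAsync(hOut + off, dBuf + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();   // wait for all streams to finish

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(hIn); cudaFreeHost(hOut); cudaFree(dBuf);
    return 0;
}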

High CPU core clocks and full PCIe 3 support will tend to keep driver and hardware overhead as low as possible. In terms of drivers, recent drivers appear to be better optimized for various synchronizations, so it may help to use the latest driver.

On a general note I will point out that GPUs are designed as very high-throughput, not ultra low-latency devices. For tasks that require extremely low end-to-end latency, the time required for transporting data to the GPU and back may already exceed the required response time.

Hi, sorry for the delay

I am not working in high-speed trading but in biomedical imaging. My original project was a performance improvement for a ray-tracing algorithm.

I am working on Linux with a Tesla M2050.

The kernel call plus the cudaMemcpy:

HANDLE_ERROR(cudaMemcpyToSymbol(aMainEntryPoint, aP1, 3 * sizeof(float)));
HANDLE_ERROR(cudaMemcpyToSymbol(aMainExitPoint, aP2, 3 * sizeof(float)));
TestLatency<<<Blocks, Threads>>>(Axis0,
                                 Axis1,
                                 Axis2,
                                 pCudaData->aDeviceVoxelList,
                                 pCudaData->aDeviceDim,
                                 RoundingRequired,
                                 minPoint,
                                 maxPoint);
HANDLE_ERROR(cudaMemcpy(aHostVoxelList,
                        pCudaData->aDeviceVoxelList,
                        NbVoxel * sizeof(strucVoxel),
                        cudaMemcpyDeviceToHost));

HANDLE_ERROR(cudaEventRecord(CudaStop, 0));
HANDLE_ERROR(cudaEventSynchronize(CudaStop));
HANDLE_ERROR(cudaEventElapsedTime(&CudaTimeSpent, CudaStart, CudaStop));
cerr << "Execution GPU time = " << CudaTimeSpent << " (ms) " << endl;
/* free cuda memory */
HANDLE_ERROR(cudaEventDestroy(CudaStart));
HANDLE_ERROR(cudaEventDestroy(CudaStop));

Kernel code:

__global__ void TestLatency(unsigned int  Axis0,
                            unsigned int  Axis1,
                            unsigned int  Axis2,
                            strucVoxel   *pVoxel,
                            unsigned int  aDim[3],
                            bool          RoundingRequired,
                            float         minPoint,
                            float         maxPoint)
{
}

The allocation and freeing of the device buffers are done before and after the time measurement, respectively.

I have done about a hundred runs and I get an average execution time of 0.05 ms.

If you need other information let me know.

Regards

Maybe you don’t “cook” your kernel before taking the measurement.

The first kernel launch usually takes exceptionally long. So you might want to exclude the time of the first launch from your averaging process.

Christian

You're right, the first kernel launch takes about 0.1 ms. What do you mean by "cook"?

If I understand it correctly, that is 50 microseconds on average for the entire code sequence shown (multiple synchronous copy operations plus kernel launch)? I don't know exactly what time you should expect; it seems plausible, if a bit on the high side.

I do not see where CudaStart is being recorded. Make sure to place a cudaDeviceSynchronize() call before starting the measurement to ensure all prior GPU activity has finished. I also don't see where the code accounts for warm-up effects.

Regardless of platform, the first execution of anything can be a lot more costly than subsequent executions, so you would want to make at least one pass over the code before starting performance measurements. You may also want to consider a best-of-N timing approach, rather than averaging. This is what the STREAM benchmark does for example (which defaults to N=10 as I recall).
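
Roughly like this (a sketch with CUDA events; myKernel, the launch configuration, and N are placeholders):

#include <cuda_runtime.h>
#include <cstdio>
#include <cfloat>

__global__ void myKernel() { }      // stand-in for the kernel being measured

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    myKernel<<<1, 32>>>();          // warm-up launch, excluded from timing
    cudaDeviceSynchronize();        // make sure all prior GPU activity has finished

    const int N = 10;               // best-of-N, as in the STREAM benchmark
    float best = FLT_MAX;
    for (int i = 0; i < N; ++i) {
        cudaEventRecord(start, 0);
        myKernel<<<1, 32>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best) best = ms;   // keep the best (fastest) observation
    }
    printf("best of %d launches: %f ms\n", N, best);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}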

Hi

Sorry, I switched to another subject for the past few weeks. The 50 microseconds was for a single kernel launch.

As advised, I launched a batch of kernel calls (2000) and got 5 microseconds per launch (for the 20 best calls). So I guess I am getting the expected result.

Thank you for your help

Hi, I know this is a late entry. I am just starting to try using a GPU and have hit a big latency hurdle.
I'm trying to run a calculation (triangle stuff) 1000 times, and I am finding a huge delay before the calculation starts.
To give an idea, the code looks like this:

T->W[0][0] = map[0] + map[3] * T->P[0][0];
T->W[0][1] = map[1] + map[4] * T->P[0][1];
T->W[0][2] = map[2] + map[5] * T->P[0][2];
T->W[1][0] = map[0] + map[3] * T->P[1][0];
T->W[1][1] = map[1] + map[4] * T->P[1][1];
T->W[1][2] = map[2] + map[5] * T->P[1][2];
T->W[2][0] = map[0] + map[3] * T->P[2][0];
T->W[2][1] = map[1] + map[4] * T->P[2][1];
T->W[2][2] = map[2] + map[5] * T->P[2][2];

T->w[0] = (int)ceil(min(min(T->W[0][0], T->W[1][0]), T->W[2][0]));
T->w[1] = (int)ceil(min(min(T->W[0][1], T->W[1][1]), T->W[2][1]));
T->w[2] = (int)ceil(min(min(T->W[0][2], T->W[1][2]), T->W[2][2]));
T->w[3] = (int)floor(max(max(T->W[0][0], T->W[1][0]), T->W[2][0]));
T->w[4] = (int)floor(max(max(T->W[0][1], T->W[1][1]), T->W[2][1]));
T->w[5] = (int)floor(max(max(T->W[0][2], T->W[1][2]), T->W[2][2]));

I can run this code for 1000 triangles with a call like
CalcTriangle<<<grid, threads>>>(TB);
cudaDeviceSynchronize();

and each call takes 0.8 ms on a Quadro M2000.

If I run it twice within the same .cu program it takes 0.95 ms:
CalcTriangle<<<grid, threads>>>(TB);
cudaDeviceSynchronize();
CalcTriangle<<<grid, threads>>>(TB);
cudaDeviceSynchronize();

It is as if there is a big delay just when the code is loaded onto the GPU. Memory copies do not seem to add much delay, and if the code stays on the GPU this delay goes away.

I am finding other issues that I assume are related to the calculation algorithm; it runs 10x faster on the CPU (in a single thread of a 6-core CPU) than on the GPU. But this initial delay is discouraging us from using the GPU at all.

Does anyone know if what we are experiencing makes sense?

Josep

CUDA has a start-up delay, and general benchmarking principles suggest that the second time you run a function it may run faster than the first. There can be a number of reasons for this, one of which is cache behavior.

It's evident you are on Windows; if so, WDDM command batching can also affect timing.

If the code snippets are representative, I would expect the performance to be bound by memory bandwidth. If the stated sizes are representative, I would expect the entire working set to fit comfortably in CPU caches, which almost certainly have higher throughput than the memory on your low-end GPU. If what is described is the only part of the overall computation performed on the GPU, simply copying the data to the GPU and copying back the results will already be more expensive than performing the computation on the host.