CUDA OpenCL comparison

Hi,

In order to compare the performance of the two APIs, I had to dig into the event systems of both of them. At the moment, my time measurement looks like this:

CUDA:

cudaEventRecord(cu_lasEvents[0], cu_lsStream);

cudaEventSynchronize(cu_lasEvents[0]);

<<<kernel launch>>>

cudaEventRecord(cu_lasEvents[1], cu_lsStream);

cudaEventSynchronize(cu_lasEvents[1]);

cudaStreamSynchronize(cu_lsStream);

cudaEventElapsedTime(&(sTime->kernel), cu_lasEvents[0], cu_lasEvents[1]);

OpenCL:

clEnqueueNDRangeKernel(ocl_lsQueue, ocl_lsKernel, 1, NULL, &lulGlobalWorkSize, &lulLocalWorkSize, 0, NULL, &ocl_lasEvents[0]);

clWaitForEvents(3, ocl_lasEvents);

clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ocl_ulStartFunction, &lulRet);

clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ocl_ulStopKernel, &lulRet);

This gives me pretty similar results (except that memory transfers between host and device take a lot longer in CL).

But when I change CL_PROFILING_COMMAND_START to CL_PROFILING_COMMAND_QUEUED, the measured time is, for example, 4 ms longer.

What's the best way to get comparable, reliable results?

Is it right to use cudaEventSynchronize on asynchronous devices with overlapping memcpy and kernel execution?

Your CUDA code should be modified a bit:

wastekernel<<<1, 1, 0, cu_lsStream>>>();   // launch a kernel first to waste a few cycles and hide the PCIe/launch latency

cudaEventRecord(cu_lasEvents[0], cu_lsStream);   // do not sync

<<<kernel launch>>>

cudaEventRecord(cu_lasEvents[1], cu_lsStream);

cudaEventSynchronize(cu_lasEvents[1]);

cudaStreamSynchronize(cu_lsStream);   // not necessary

cudaEventElapsedTime(&(sTime->kernel), cu_lasEvents[0], cu_lasEvents[1]);
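
To be explicit about what the "waste kernel" could be, a minimal sketch (the name and the trivial body are just an assumption; any cheap kernel launched before the timed region works):

__global__ void wastekernel(void)
{
    // intentionally empty: its only job is to trigger the one-time
    // driver/PCIe launch overhead before the measured region starts
}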

For the OpenCL part, see this

It's the same latency involved in the queuing/PCIe transaction.
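
If you want to see where the extra milliseconds go, you can query all four profiling timestamps of the same event; a minimal sketch (variable names are just examples):

cl_ulong ulQueued, ulSubmit, ulStart, ulEnd;
clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &ulQueued, NULL);
clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &ulSubmit, NULL);
clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_START,  sizeof(cl_ulong), &ulStart,  NULL);
clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_END,    sizeof(cl_ulong), &ulEnd,    NULL);
// all values are device timestamps in nanoseconds:
// QUEUED -> SUBMIT -> START is host/driver/PCIe latency, START -> END is the kernel itself
printf("queue+submit latency: %.3f ms, kernel: %.3f ms\n",
       (ulStart - ulQueued) * 1e-6, (ulEnd - ulStart) * 1e-6);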

Thanks. When I use CL_PROFILING_COMMAND_START and apply your changes (except the waste kernel, since I didn't fully understand that; I'm measuring memory transactions beforehand, so maybe it's not necessary?), kernel runtimes are pretty much exactly the same on CUDA and OpenCL (for a simple vector addition).

I'm sorry for bumping my topic, but I have questions again.

  • Does it make any sense to use CPU timers like clock() for CUDA/OpenCL code to measure overall performance? I did so and received pretty weird results.

  • CUDA events can be placed anywhere in the code, but OpenCL events are attached to a specific set of functions. How can I measure the performance of non-profiling functions like clCreateBuffer etc.? (See the sketch after this list.)

  • Is there a proper way to measure the elapsed time between the beginning and the end of a whole function compiled by nvcc or cl?
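
For the second point, one option is a plain host timer; a minimal sketch, assuming C++11 std::chrono and hypothetical ocl_lsContext/lErr variables (since clCreateBuffer is a blocking host call, wall-clock time around it is meaningful):

#include <chrono>

auto t0 = std::chrono::steady_clock::now();
cl_mem d_buf = clCreateBuffer(ocl_lsContext, CL_MEM_READ_ONLY, sizeof(float) * ulCount, NULL, &lErr);
auto t1 = std::chrono::steady_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
// clCreateBuffer returns on the host when it is done, so t1 - t0 is its host-side cost;
// note that some implementations defer the real allocation until the buffer is first used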

Imagine you have something like:

void cu_addiere(float *pfA, float *pfB, float *pfC, unsigned long ulCount, unsigned long ulBlockSize)
{
    cudaStream_t cu_lsStream;
    float *d_pfA = NULL;
    float *d_pfB = NULL;
    float *d_pfC = NULL;

    EVAL_CU(cudaStreamCreate(&cu_lsStream));

    EVAL_CU(cudaMalloc((void**)&d_pfA, sizeof(float)*ulCount));
    EVAL_CU(cudaMalloc((void**)&d_pfB, sizeof(float)*ulCount));
    EVAL_CU(cudaMalloc((void**)&d_pfC, sizeof(float)*ulCount));

    EVAL_CU(cudaMemcpyAsync(d_pfA, pfA, sizeof(float)*ulCount, cudaMemcpyHostToDevice, cu_lsStream));
    EVAL_CU(cudaMemcpyAsync(d_pfB, pfB, sizeof(float)*ulCount, cudaMemcpyHostToDevice, cu_lsStream));

    addType<<<ulCount/ulBlockSize+1, ulBlockSize, 0, cu_lsStream>>>(d_pfA, d_pfB, d_pfC, ulCount);
    EVAL_CU(cudaGetLastError());

    EVAL_CU(cudaMemcpyAsync(pfC, d_pfC, sizeof(float)*ulCount, cudaMemcpyDeviceToHost, cu_lsStream));

    EVAL_CU(cudaFree(d_pfA));
    EVAL_CU(cudaFree(d_pfB));
    EVAL_CU(cudaFree(d_pfC));
    EVAL_CU(cudaStreamDestroy(cu_lsStream));

    return;
}

How can I measure the time the CPU has spent? I have often seen the CPU fully busy while the GPU idled in CUDA code.

If you're wondering about GPU execution time, you probably shouldn't rely too much on CPU timers (since many CUDA calls are asynchronous). As far as I know, using CUDA events is the best route for timing GPU execution from host code.

From the Programming Guide, section 3.2.5.6.2:

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDev + i * size, inputHost + i * size, size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDev + i * size, inputDev + i * size, size);
    cudaMemcpyAsync(outputHost + i * size, outputDev + i * size, size, cudaMemcpyDeviceToHost, stream[i]);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

As for CPU-based timing (assuming you're on Windows), you can use functions like timeGetTime, or, if you need better timing resolution, QueryPerformanceCounter (see the MSDN documentation for both).
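
A minimal QueryPerformanceCounter sketch (Windows only; the synchronization call in the comment is an assumption about how you bracket the GPU work):

#include <windows.h>

LARGE_INTEGER liFreq, liStart, liStop;
QueryPerformanceFrequency(&liFreq);       // ticks per second
QueryPerformanceCounter(&liStart);
// ... enqueue the GPU work, then block, e.g. with cudaDeviceSynchronize(), so it has really finished ...
QueryPerformanceCounter(&liStop);
double ms = 1000.0 * (double)(liStop.QuadPart - liStart.QuadPart) / (double)liFreq.QuadPart;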

Hope that helped

Actually, I'm multi-platform. Qt :)

I know how to (theoretically) measure GPU performance with the event system, but exactly the fact that much of this stuff is asynchronous, and the fact that commercial software does not care how long the GPU executes the kernel but is interested in how much (real) time a function takes in practice, leads me to want CPU timers.

clock() didn't work: weird results on CUDA; OpenCL seemed all right.
I actually thought a blocking cudaThreadSynchronize would make the CPU timer tell me the truth…
Qt timers may be difficult since I'd need to link the nvcc-compiled code against the Qt libs, but it's possible.

Unless someone has a better idea or explanation for me :)
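
For what it's worth, a portable option that avoids Qt timers and platform APIs is C++11 std::chrono; a minimal sketch, assuming the whole cu_addiere call is bracketed by blocking synchronizations:

#include <chrono>
#include <cuda_runtime.h>

cudaDeviceSynchronize();                              // make sure nothing from before is still running
auto t0 = std::chrono::steady_clock::now();

cu_addiere(pfA, pfB, pfC, ulCount, ulBlockSize);      // the whole host function: malloc, copies, kernel, free

cudaDeviceSynchronize();                              // block until all enqueued GPU work has finished
auto t1 = std::chrono::steady_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();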

When you say this, do you mean that you don't care about the GPU execution time, but only the CPU execution time? As in, the time it takes for the CPU calling thread to perform a non-blocking kernel invocation? If I understand you correctly, that seems like an odd thing to measure, as it won't have any correlation to the GPU calculation's computational complexity; the only thing you'll be measuring is the set-up time for the kernel call…

Yes, set-up time: allocating, copying, kernel launch time, and on the other side, copying back, freeing…

In my experience, this stuff takes much, much longer than the kernel itself.

In most cases you obviously can't raise the performance significantly by optimizing the kernel, but rather by optimizing the stuff around it :)

I usually use a variant of ftime, scale the problem size so that it takes several seconds, perform a large number of runs, and use the mean and standard error. This is independent of the API used, and you can time any part of the code sequence. For measuring the overhead of kernel calls (or other code sequences that are expected to be very fast), time a for-loop around the code sequence using a null kernel and a large number of iterations, then scale the timing data by the number of iterations. An alternative is to simply use a profiler.
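
A sketch of the null-kernel loop idea for launch overhead (assuming POSIX gettimeofday in place of ftime; nullKernel and the iteration count are just placeholders):

#include <stdio.h>
#include <sys/time.h>

__global__ void nullKernel(void) { }                  // hypothetical empty kernel

void measureLaunchOverhead(void)
{
    const int iterations = 100000;                    // large enough to run for a noticeable time
    timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iterations; ++i)
        nullKernel<<<1, 1>>>();                       // only the launch overhead is measured
    cudaDeviceSynchronize();                          // drain the queue before stopping the timer
    gettimeofday(&t1, NULL);
    double total_us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average launch overhead: %.2f us\n", total_us / iterations);
}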

That's (almost) what I have done before, but looping around isn't precise enough ;(

I think I will post a new topic this evening with a more appropriate title, but for now: thanks for your suggestions :)

OK, done so far… clock() is working properly now.

As expected, OpenCL has a large overhead because of its JIT compilation, which becomes relatively small once the calculation gets more complex.