CUDA OpenCL comparison

Hi,

In order to compare the performance of the two APIs, I had to dig into the event systems of both of them. At the moment, my time measurement looks like this:

CUDA:

cudaEventRecord(cu_lasEvents[0], cu_lsStream);

cudaEventSynchronize(cu_lasEvents[0]);

<<<kernel launch>>>

cudaEventRecord(cu_lasEvents[1], cu_lsStream);

cudaEventSynchronize(cu_lasEvents[1]);

cudaStreamSynchronize(cu_lsStream);

cudaEventElapsedTime(&(sTime->kernel), cu_lasEvents[0], cu_lasEvents[1]);

OpenCL:

clEnqueueNDRangeKernel(ocl_lsQueue, ocl_lsKernel, 1, NULL, &lulGlobalWorkSize, &lulLocalWorkSize, 0, NULL, &ocl_lasEvents[0]);

clWaitForEvents(3, ocl_lasEvents);

clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ocl_ulStartFunction, &lulRet);

clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ocl_ulStopKernel, &lulRet);

This gives me pretty similar results (except that memory transfers between host and device take a lot longer in CL).

But when I change CL_PROFILING_COMMAND_START to CL_PROFILING_COMMAND_QUEUED, the measured time is, for example, 4 ms longer.

What's the best way to get comparable, reliable results?

Is it right to use cudaEventSynchronize on asynchronous devices with overlapping memcpy and kernel execution?

Your CUDA code should be modified a bit:

wastekernel<<<1, 1, 0, cu_lsStream>>>();   // launch a kernel first to waste a few cycles and hide the PCIe/launch latency

cudaEventRecord(cu_lasEvents[0], cu_lsStream);   // do not sync

<<<kernel launch>>>

cudaEventRecord(cu_lasEvents[1], cu_lsStream);

cudaEventSynchronize(cu_lasEvents[1]);

cudaStreamSynchronize(cu_lsStream);   // not necessary

cudaEventElapsedTime(&(sTime->kernel), cu_lasEvents[0], cu_lasEvents[1]);
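
To be explicit about what the "waste kernel" could be, a minimal sketch (the name and the trivial body are just an assumption; any cheap kernel launched before the timed region works):

__global__ void wastekernel(void)
{
    // intentionally empty: its only job is to trigger the one-time
    // driver/PCIe launch overhead before the measured region starts
}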

For the OpenCL part, see this

It's the same latency involved in the queuing/PCIe transaction.
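
If you want to see where the extra milliseconds go, you can query all four profiling timestamps of the same event; a minimal sketch (variable names are just examples):

cl_ulong ulQueued, ulSubmit, ulStart, ulEnd;
clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &ulQueued, NULL);
clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &ulSubmit, NULL);
clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_START,  sizeof(cl_ulong), &ulStart,  NULL);
clGetEventProfilingInfo(ocl_lasEvents[0], CL_PROFILING_COMMAND_END,    sizeof(cl_ulong), &ulEnd,    NULL);
// all values are device timestamps in nanoseconds:
// QUEUED -> SUBMIT -> START is host/driver/PCIe latency, START -> END is the kernel itself
printf("queue+submit latency: %.3f ms, kernel: %.3f ms\n",
       (ulStart - ulQueued) * 1e-6, (ulEnd - ulStart) * 1e-6);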

Thanks. When I use CL_PROFILING_COMMAND_START and apply your changes (except the waste kernel, since I didn't fully understand that; I'm measuring memory transactions beforehand, so maybe it's not necessary?), kernel runtimes are pretty much exactly the same on CUDA and OpenCL (for a simple vector addition).

I'm sorry for bumping my topic, but I have questions again.

  • Does it make any sense to use CPU timers like clock() for CUDA/OpenCL code to measure overall performance? I did so and received pretty weird results.

  • CUDA events can be placed anywhere in the code, but OpenCL events are attached to a specific set of functions. How can I measure the performance of non-profiling functions like clCreateBuffer etc.? (See the sketch after this list.)

  • Is there a proper way to measure the elapsed time between the beginning and the end of a whole function compiled by nvcc or cl?
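
For the second point, one option is a plain host timer; a minimal sketch, assuming C++11 std::chrono and hypothetical ocl_lsContext/lErr variables (since clCreateBuffer is a blocking host call, wall-clock time around it is meaningful):

#include <chrono>

auto t0 = std::chrono::steady_clock::now();
cl_mem d_buf = clCreateBuffer(ocl_lsContext, CL_MEM_READ_ONLY, sizeof(float) * ulCount, NULL, &lErr);
auto t1 = std::chrono::steady_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
// clCreateBuffer returns on the host when it is done, so t1 - t0 is its host-side cost;
// note that some implementations defer the real allocation until the buffer is first used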

Imagine you have something like:

void cu_addiere(float *pfA, float *pfB, float *pfC, unsigned long ulCount, unsigned long ulBlockSize)
{
    cudaStream_t cu_lsStream;
    float *d_pfA = NULL;
    float *d_pfB = NULL;
    float *d_pfC = NULL;

    EVAL_CU(cudaStreamCreate(&cu_lsStream));

    EVAL_CU(cudaMalloc((void**)&d_pfA, sizeof(float)*ulCount));
    EVAL_CU(cudaMalloc((void**)&d_pfB, sizeof(float)*ulCount));
    EVAL_CU(cudaMalloc((void**)&d_pfC, sizeof(float)*ulCount));

    EVAL_CU(cudaMemcpyAsync(d_pfA, pfA, sizeof(float)*ulCount, cudaMemcpyHostToDevice, cu_lsStream));
    EVAL_CU(cudaMemcpyAsync(d_pfB, pfB, sizeof(float)*ulCount, cudaMemcpyHostToDevice, cu_lsStream));

    addType<<<ulCount/ulBlockSize+1, ulBlockSize, 0, cu_lsStream>>>(d_pfA, d_pfB, d_pfC, ulCount);
    EVAL_CU(cudaGetLastError());

    EVAL_CU(cudaMemcpyAsync(pfC, d_pfC, sizeof(float)*ulCount, cudaMemcpyDeviceToHost, cu_lsStream));

    EVAL_CU(cudaFree(d_pfA));
    EVAL_CU(cudaFree(d_pfB));
    EVAL_CU(cudaFree(d_pfC));
    EVAL_CU(cudaStreamDestroy(cu_lsStream));

    return;
}

How can I measure the time the CPU has spent? I have often seen the CPU fully busy while the GPU idled in CUDA code.

If you're wondering about GPU execution time, you probably shouldn't rely too much on CPU timers (since many CUDA calls are asynchronous). As far as I know, using CUDA events is the best route for timing GPU execution from host code.

From the Programming Guide, section 3.2.5.6.2:

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDev + i * size, inputHost + i * size, size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDev + i * size, inputDev + i * size, size);
    cudaMemcpyAsync(outputHost + i * size, outputDev + i * size, size, cudaMemcpyDeviceToHost, stream[i]);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

As for CPU-based timing (assuming you're on Windows), you can use functions like timeGetTime, or, if you need better timing resolution, QueryPerformanceCounter (see the MSDN documentation for both).
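
A minimal QueryPerformanceCounter sketch (Windows only; the synchronization call in the comment is an assumption about how you bracket the GPU work):

#include <windows.h>

LARGE_INTEGER liFreq, liStart, liStop;
QueryPerformanceFrequency(&liFreq);       // ticks per second
QueryPerformanceCounter(&liStart);
// ... enqueue the GPU work, then block, e.g. with cudaDeviceSynchronize(), so it has really finished ...
QueryPerformanceCounter(&liStop);
double ms = 1000.0 * (double)(liStop.QuadPart - liStart.QuadPart) / (double)liFreq.QuadPart;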

Hope that helped

Actually, I'm multi-platform. Qt :)

I know how to (theoretically) measure GPU performance with the event system, but exactly the fact that much of this stuff is asynchronous, and the fact that commercial software does not care how long the GPU executes the kernel but is interested in how much (real) time a function takes in practice, leads me to want CPU timers.

clock() didn't work: weird results on CUDA; OpenCL seemed all right.
I actually thought a blocking cudaThreadSynchronize would make the CPU timer tell me the truth…
Qt timers may be difficult since I'd need to link the nvcc-compiled code against the Qt libs, but it's possible.

Unless someone has a better idea or explanation for me :)
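
For what it's worth, a portable option that avoids Qt timers and platform APIs is C++11 std::chrono; a minimal sketch, assuming the whole cu_addiere call is bracketed by blocking synchronizations:

#include <chrono>
#include <cuda_runtime.h>

cudaDeviceSynchronize();                              // make sure nothing from before is still running
auto t0 = std::chrono::steady_clock::now();

cu_addiere(pfA, pfB, pfC, ulCount, ulBlockSize);      // the whole host function: malloc, copies, kernel, free

cudaDeviceSynchronize();                              // block until all enqueued GPU work has finished
auto t1 = std::chrono::steady_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();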

When you say this, do you mean that you don't care about the GPU execution time, but only the CPU execution time? As in, the time it takes for the CPU calling thread to perform a non-blocking kernel invocation? If I understand you correctly, that seems like an odd thing to measure, as it won't have any correlation to the GPU calculation's computational complexity; the only thing you'll be measuring is the set-up time for the kernel call…

Yes, set-up time: allocating, copying, kernel launch time, and on the other side, copying back, freeing…

In my experience, this stuff takes much, much longer than the kernel itself.

In most cases you obviously can't raise the performance significantly by optimizing the kernel, but rather by optimizing the stuff around it :)

I usually use a variant of ftime, scale the problem size so that it takes several seconds, perform a large number of runs, and use the mean and standard error. This is independent of the API used, and you can time any part of the code sequence. For measuring the overhead of kernel calls (or other code sequences that are expected to be very fast), time a for-loop around the code sequence using a null kernel and a large number of iterations, then scale the timing data by the number of iterations. An alternative is to simply use a profiler.
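
A sketch of the null-kernel loop idea for launch overhead (assuming POSIX gettimeofday in place of ftime; nullKernel and the iteration count are just placeholders):

#include <stdio.h>
#include <sys/time.h>

__global__ void nullKernel(void) { }                  // hypothetical empty kernel

void measureLaunchOverhead(void)
{
    const int iterations = 100000;                    // large enough to run for a noticeable time
    timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iterations; ++i)
        nullKernel<<<1, 1>>>();                       // only the launch overhead is measured
    cudaDeviceSynchronize();                          // drain the queue before stopping the timer
    gettimeofday(&t1, NULL);
    double total_us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("average launch overhead: %.2f us\n", total_us / iterations);
}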

That's (almost) what I have done before, but looping around isn't precise enough ;(

I think I will post a new topic this evening with a more appropriate title, but for now: thanks for your suggestions :)

OK, done so far… clock() is working properly now.

As expected, OpenCL has a large overhead because of its JIT compilation, which becomes relatively small once the calculation gets more complex.