Multi-GPU timing/profiling

How are other folks out there handling the problem of timing and profiling multi-gpu programs?

I’ve just learned after searching these forums that the cutil timers aren’t thread safe, which explains the random segfaults and glibc memory corruption errors I was getting when I added timers to my multi-GPU implementation. However, I didn’t manage to find any good solutions. What’s the alternative?
Is there a recommended high-performance timing library out there that’s pretty easy to use? I could probably get away with using the standard time.h clock() functionality for kernel timing, but I’d also like to profile all of my CUDA memcpys, and those are normally so short that clock() doesn’t have enough resolution to time them properly.

Is there a way to get the cuda profiler to generate multiple log files, one for each GPU? It wouldn’t be as nice as having an immediate print out of the time for each iteration in my program but at least I could write a script to mine the results I’m looking for.

I’m using CUDA 2.3 on 64-bit linux.


I do not know whether CUDA event timing is thread safe, but it may well be.
Check “Event Management” in the CUDA Reference Manual, page 16.

A great example of why not to use cutil, I guess (tmurray will be thrilled)…

I think the Linux version of cutil uses gettimeofday for its timers, and the timer code built on top of it isn’t thread safe. If you are using OpenMP or pthreads, then your best bet is probably to use native POSIX timers (clock_gettime), which are thread safe.
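
A minimal sketch of what a per-thread POSIX timer could look like, assuming clock_gettime with CLOCK_MONOTONIC is available (older glibc versions may need -lrt at link time). The host_timer_t type and the host_timer_* names are made up for illustration; they are not part of cutil or CUDA. Each host thread keeps its own instance, so nothing is shared between threads. Kernel launches return asynchronously, so you would have to synchronize (cudaThreadSynchronize() in CUDA 2.3) before stopping the timer, otherwise you only measure the launch overhead.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical helper type: one instance per host thread, never shared. */
typedef struct {
    struct timespec start;   /* time of the last host_timer_start() call */
    double elapsed_ms;       /* accumulated over all start/stop pairs */
} host_timer_t;

/* Zero the accumulated time. */
static void host_timer_reset(host_timer_t* t) {
    assert(t != NULL);
    t->elapsed_ms = 0.0;
}

/* Record the current monotonic time as the start point. */
static void host_timer_start(host_timer_t* t) {
    assert(t != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t->start);
}

/* Add the time since the last start to the accumulated total (milliseconds). */
static void host_timer_stop(host_timer_t* t) {
    struct timespec now;
    assert(t != NULL);
    clock_gettime(CLOCK_MONOTONIC, &now);
    t->elapsed_ms += (double)(now.tv_sec - t->start.tv_sec) * 1e3
                   + (double)(now.tv_nsec - t->start.tv_nsec) / 1e6;
}
```

In an OpenMP or pthreads program, each thread would declare its own host_timer_t on its stack, call host_timer_reset once, and wrap each region of interest in host_timer_start / host_timer_stop.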

The cutil timers are indeed not thread safe, but as far as I remember neither is the event mechanism.

I found that the best way is this:

Time your code using events. However, when you first create the events, create them per thread and do the creation in a thread-safe way. I do it at program startup: I create all the timers I need (for example prepareInputTimer, prepareOutputTimer, kernelATimer, kernelBTimer, copyResultToHostTimer) in a regular thread-safe manner (using a mutex/lock/whatever).

Once you’ve done this, you can use the event API as if you had a single GPU. I also log the timings to different log files, one log file per GPU. Works like a charm :)

You can then also compare the timings between different types of GPUs (I have a GTX280 and a C1060, and you can see the differences very clearly).
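
The one-log-file-per-GPU part could be sketched roughly like this. It is only an illustration: the function name log_gpu_timing, the file name pattern timing_gpu%d.log and the record layout are all made up here, not taken from any CUDA tool. The ms value is assumed to come from whatever timer you use (for example cudaEventElapsedTime). Since each host thread only ever writes to the file of its own GPU, no locking is needed around the file access.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: append one timing record to the log file of one GPU.
   Returns 0 on success, -1 if the log file could not be opened. */
static int log_gpu_timing(int gpu_id, const char* label, int iteration, float ms) {
    char path[64];
    FILE* f;

    assert(label != NULL);

    /* One file per GPU, e.g. timing_gpu0.log, timing_gpu1.log, ... */
    snprintf(path, sizeof(path), "timing_gpu%d.log", gpu_id);

    f = fopen(path, "a");   /* append, so records from all iterations are kept */
    if (f == NULL) {
        return -1;
    }

    /* Record layout (an assumption): iteration, timer label, milliseconds. */
    fprintf(f, "%d %s %.3f\n", iteration, label, ms);
    fclose(f);
    return 0;
}
```

Opening and closing the file on every call keeps the sketch short. In a real run you would more likely open each file once at startup and keep the FILE* next to the per-thread timers.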


I’ll read up on handling events manually and follow eyal’s suggestion, thanks.

How do you set up multiple log files, one for each GPU? I was looking at the profiler config and all I saw was a single CUDA_PROFILE_LOG environment variable. Or did you mean that you have a separate log file for each GPU using your own event timers?

I ended up kind of mimicking the cutil timer api like this:

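#include <stdlib.h>
#include <cuda_runtime.h>

typedef struct {
	cudaEvent_t start;
	cudaEvent_t stop;
	float* et;	/* accumulated elapsed time in ms; a pointer so by-value copies share it */
} cudaTimer_t;

/* Event creation is serialized across the OpenMP threads. */
void createTimer(cudaTimer_t* timer) {
	#pragma omp critical (create_timer)
	{
		cudaEventCreate(&timer->start);
		cudaEventCreate(&timer->stop);
		timer->et = (float*) malloc(sizeof(float));
		*(timer->et) = 0.0f;
	}
}

void deleteTimer(cudaTimer_t timer) {
	#pragma omp critical (delete_timer)
	{
		cudaEventDestroy(timer.start);
		cudaEventDestroy(timer.stop);
		free(timer.et);
	}
}

void startTimer(cudaTimer_t timer) {
	cudaEventRecord(timer.start, 0);
}

/* Waits for the stop event, then adds the interval to the running total. */
void stopTimer(cudaTimer_t timer) {
	cudaEventRecord(timer.stop, 0);
	cudaEventSynchronize(timer.stop);
	float tmp;
	cudaEventElapsedTime(&tmp, timer.start, timer.stop);
	*(timer.et) += tmp;
}

float getTimerValue(cudaTimer_t timer) {
	return *(timer.et);
}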

So far no problems, thanks again.

Yes, I meant my own application log files. I find them useful most of the time for figuring out where the bottlenecks are.

The Visual Profiler supports profiling multi-GPU programs. The profiler output for each GPU will be shown under a different context.

If you are using driver-level profiling, set the CUDA_PROFILE_LOG environment variable with a ‘%d’ in the file name so that a different log file is generated for each context:

export CUDA_PROFILE_LOG=cuda_profile_%d.txt

Look at the document “CUDA_Profiler_2.3.txt” included under the “doc” directory in the CUDA 2.3 toolkit for more details.

Has anyone here already tried the %d trick with CUDA_PROFILE_LOG?
It didn’t work for me: the %d always comes out as 0, and it seems to me that the devices overwrite each other’s values.

I am using CUDA 2.3 with driver 190.42 on a Linux x86/64 cluster, with a Tesla S1070 plugged into 2 nodes, so only 2 devices are visible per node.

Thanks.