PThreads or OpenMP in CUDA 4.0?

Given the new features in CUDA 4.0,
would it be better to use PThreads or OpenMP for programs where large amounts of data have to be processed by multiple GPUs?

Depending on what you are trying to do and what hardware you are trying to do it on, the appropriate answer might well be that you don’t need either.

Given the new features of CUDA 4.0…

If I understood the presentation correctly, you need neither. Based on the applications I’ve been developing, I’d go for the new feature of calling cudaSetDevice(X) from a single host thread each time you want to control another device. In my case it works well, and the host thread that drives the PCIe transfers doesn’t have to change when you switch devices.

Something like the simpleMultiGPU example:

for (int i = 0; i < GPU_N; i++) {
    // make device i the current device for this host thread
    cudaSetDevice(i);
    // asynchronously copy input from host to device (arguments are dst, src);
    // h_data[i] should be page-locked for the copy to be truly asynchronous
    cudaMemcpyAsync(d_data[i], h_data[i], N * sizeof(*d_data[i]), cudaMemcpyHostToDevice, stream[i]);
    // asynchronously launch the kernel in the same stream
    some_kernel<<<blocks, threads, 0 /* dynamic shared memory in bytes */, stream[i]>>>(d_data[i] /*, more arguments as needed */);
    // asynchronously copy results from device back to host
    cudaMemcpyAsync(h_data[i], d_data[i], N * sizeof(*d_data[i]), cudaMemcpyDeviceToHost, stream[i]);
}

// wait for all devices to finish
for (int i = 0; i < GPU_N; i++) {
    // make device i current again before synchronizing its stream
    cudaSetDevice(i);
    cudaStreamSynchronize(stream[i]);
}

Et voilà ! No OpenMP or pthreads involved.

However, I can’t tell you how to measure the execution time using cudaEvents. When I time the snippet above the same way simpleStreams is timed with cudaEvents, the elapsed time comes out as 0.

Cheers.

The programming guide says that cudaEventElapsedTime() fails if the two cudaEvents passed as arguments belong to different devices, which explains why I get zero for the elapsed time.

If I define separate start and stop cudaEvents for each device and add up the elapsed times, I get the total execution time plus the overlap between devices. The overlap is unknown, so the sum will always be larger than the actual execution time. How can I use the devices’ counters to get an accurate execution time?

Or are CPU timers the only way to get the execution time I want?
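One possible way to approach the timing question above, offered as a minimal sketch rather than a definitive answer: record a start/stop event pair per device (each pair stays on its own device, so cudaEventElapsedTime() is valid for it) and wrap the whole dispatch-and-synchronize sequence in a host timer to get the overall wall-clock time including overlap. The names scale_kernel and timeMultiGpu are hypothetical, the kernel is only a stand-in for some_kernel, std::chrono assumes a C++11-capable nvcc, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for some_kernel: doubles every element in place.
__global__ void scale_kernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

// Hypothetical helper: runs the same work on every visible GPU, prints the
// per-device event time, and returns the host wall-clock time in milliseconds.
double timeMultiGpu(int n)
{
    int gpuCount = 0;
    if (cudaGetDeviceCount(&gpuCount) != cudaSuccess) gpuCount = 0;

    const size_t bytes = (size_t)n * sizeof(float);
    std::vector<float *> h_data(gpuCount), d_data(gpuCount);
    std::vector<cudaStream_t> stream(gpuCount);
    std::vector<cudaEvent_t> start(gpuCount), stop(gpuCount);

    // Per-device setup: stream, events and buffers are created while
    // device i is current, so they all belong to device i.
    for (int i = 0; i < gpuCount; i++) {
        cudaSetDevice(i);
        cudaStreamCreate(&stream[i]);
        cudaEventCreate(&start[i]);
        cudaEventCreate(&stop[i]);
        cudaMallocHost((void **)&h_data[i], bytes);  // pinned host memory
        cudaMalloc((void **)&d_data[i], bytes);
        for (int j = 0; j < n; j++) h_data[i][j] = 1.0f;
    }

    // Host timer around dispatch + synchronization of all devices.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < gpuCount; i++) {
        cudaSetDevice(i);
        cudaEventRecord(start[i], stream[i]);
        cudaMemcpyAsync(d_data[i], h_data[i], bytes, cudaMemcpyHostToDevice, stream[i]);
        scale_kernel<<<(n + 255) / 256, 256, 0, stream[i]>>>(d_data[i], n);
        cudaMemcpyAsync(h_data[i], d_data[i], bytes, cudaMemcpyDeviceToHost, stream[i]);
        cudaEventRecord(stop[i], stream[i]);
    }
    for (int i = 0; i < gpuCount; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(stream[i]);
    }
    auto t1 = std::chrono::steady_clock::now();

    // Per-device elapsed time: both events of each pair are on the same device.
    for (int i = 0; i < gpuCount; i++) {
        cudaSetDevice(i);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start[i], stop[i]);
        printf("GPU %d: %.3f ms (device events)\n", i, ms);
        cudaEventDestroy(start[i]);
        cudaEventDestroy(stop[i]);
        cudaStreamDestroy(stream[i]);
        cudaFree(d_data[i]);
        cudaFreeHost(h_data[i]);
    }

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Calling timeMultiGpu(1 << 20) from main() would print one line per GPU and return the host-measured total. Because the devices run concurrently, the per-device times should not be summed; the host timer (or, roughly, the largest per-device time) is the closer estimate of the overall execution time.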

[edit]
[Sorry, wrong topic, but it cleared up something I didn’t know before.]