I am currently working on a multi-GPU implementation in CUDA, and I have a problem with device warm-up time on each pthread execution. I am using one Quadro FX 5600 and one Tesla D870, so the system has 3 GPUs in total, on a machine with a Core 2 Quad processor (OS: Linux, Fedora 8). In my implementation I create 3 threads, one per device, and each thread performs the following steps on its device:
- cudaMalloc
- cudaMemcpy (H->D)
- kernel execution
- cudaMemcpy (D->H)
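
To make the steps above concrete, here is a minimal sketch of the per-thread structure, not the actual code. The kernel `scale`, the size `N`, the `MAX_GPUS` limit and the `worker` function are placeholders made up for illustration; only the sequence of CUDA runtime calls follows the list above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <pthread.h>
#include <cuda_runtime.h>

#define N (1 << 20)   // hypothetical problem size
#define MAX_GPUS 8    // hypothetical upper bound on devices

// Placeholder kernel (assumption): doubles every element.
__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// One host thread per GPU; arg points to the device index.
static void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);                       // bind this thread to its GPU

    float *h = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    float *d = NULL;
    cudaMalloc((void **)&d, N * sizeof(float));                    // cudaMalloc
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);   // H->D
    scale<<<(N + 255) / 256, 256>>>(d, N);                         // kernel
    cudaError_t launch = cudaGetLastError();
    cudaError_t copy =
        cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost); // D->H

    if (launch != cudaSuccess || copy != cudaSuccess)
        printf("device %d: error: %s\n", dev,
               cudaGetErrorString(launch != cudaSuccess ? launch : copy));
    else
        printf("device %d: h[0] = %f\n", dev, h[0]);

    cudaFree(d);
    free(h);
    return NULL;
}

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count > MAX_GPUS) count = MAX_GPUS;

    pthread_t tid[MAX_GPUS];
    int ids[MAX_GPUS];
    for (int i = 0; i < count; ++i) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < count; ++i)
        pthread_join(tid[i], NULL);
    return 0;
}
```

This should build with something like `nvcc multi_gpu.cu -lpthread` (the file name is arbitrary).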
The code works, but the processing time is long, because in each new thread the CUDA device seems to need to warm up (is this correct?). Is there some way to address this problem? (I wonder whether a CUDA context could solve this.) Can anybody help me? Thank you very much in advance.
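
One way to check whether the delay really comes from initialization is to time the first CUDA runtime call separately from a later one. The sketch below assumes device 0 and uses `cudaFree(0)`, a commonly used idiom for forcing the runtime to set up its context without doing any real work; the helper `now_ms` is made up for illustration. The same timing could be placed at the top of each thread function.

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// Hypothetical helper: wall-clock time in milliseconds.
static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e3 + tv.tv_usec * 1e-3;
}

int main(void)
{
    cudaSetDevice(0);        // assumption: measure device 0

    double t0 = now_ms();
    cudaFree(0);             // first call: runtime has to create its context
    double t1 = now_ms();
    cudaFree(0);             // second call: context already exists
    double t2 = now_ms();

    printf("first call:  %.3f ms\n", t1 - t0);
    printf("second call: %.3f ms\n", t2 - t1);
    return 0;
}
```

If the first call turns out to dominate, then creating the threads once and keeping them alive across runs, rather than starting new ones each time, would presumably avoid paying that cost repeatedly, but I have not verified this.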