Device warm-up time in a multi-GPU implementation

I am working on a multi-GPU implementation in CUDA and have a problem with device warm-up time every time a pthread starts executing. The machine has a Core 2 Quad processor running Fedora 8 Linux, with one Quadro FX 5600 and one Tesla D870, so there are 3 GPUs in the system. My implementation creates 3 threads, one per device, and each thread performs the following steps on its device:

  1. cudaMalloc
  2. cudaMemcpy (H->D)
  3. kernel execution
  4. cudaMemcpy (D->H)
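
To make the steps above concrete, here is a minimal sketch of the per-thread workflow. It is not my actual code: the kernel `scale`, the buffer size `N`, and the helper `elapsed_ms` are placeholders made up for illustration. The `cudaFree(0)` call is a commonly used idiom to force the runtime to initialize on the selected device at a known point, so the start-up cost can be timed separately from the four steps. I am assuming here that the warm-up really is the initialization cost, which is exactly what I am unsure about.

```cuda
#include <cstdio>
#include <cstdlib>
#include <pthread.h>
#include <sys/time.h>
#include <cuda_runtime.h>

#define N (1 << 20)        // hypothetical number of floats per device
#define MAX_DEVICES 16     // upper bound on worker threads in this sketch

// Placeholder kernel standing in for the real computation.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Wall-clock difference between two timestamps, in milliseconds.
static double elapsed_ms(struct timeval start, struct timeval end)
{
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_usec - start.tv_usec) / 1000.0;
}

// One worker per GPU: time the initialization, then run the four steps.
static void *worker(void *arg)
{
    int dev = *(int *)arg;
    struct timeval t0, t1, t2;
    float *d = NULL;
    float *h = (float *)malloc(N * sizeof(float));
    if (!h) return NULL;
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    gettimeofday(&t0, NULL);
    cudaSetDevice(dev);
    cudaFree(0);                       // forces initialization on this device
    gettimeofday(&t1, NULL);

    if (cudaMalloc((void **)&d, N * sizeof(float)) != cudaSuccess) {   // step 1
        fprintf(stderr, "device %d: cudaMalloc failed\n", dev);
        free(h);
        return NULL;
    }
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);       // step 2
    scale<<<(N + 255) / 256, 256>>>(d, N, 2.0f);                       // step 3
    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);       // step 4
    gettimeofday(&t2, NULL);

    printf("device %d: init %.1f ms, steps 1-4 %.1f ms\n",
           dev, elapsed_ms(t0, t1), elapsed_ms(t1, t2));

    cudaFree(d);
    free(h);
    return NULL;
}

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA devices found\n");
        return 0;
    }
    if (count > MAX_DEVICES) count = MAX_DEVICES;

    pthread_t threads[MAX_DEVICES];
    int ids[MAX_DEVICES];
    for (int i = 0; i < count; ++i) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < count; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}
```

If the "init" figure dominates on every run, that would suggest the delay comes from per-thread initialization rather than from the four steps themselves.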

The code works, but processing takes a long time, apparently because the CUDA device has to warm up again in every new thread (is this correct?). Is there some way to avoid this? I wonder whether managing the CUDA context explicitly could solve it. Can anybody help me? Thank you very much in advance.