Multi-GPU programming issue: time taken to create a context?

Hi all,

I've been having a problem with multi-GPU computing and I am hoping someone can shed some light on it.

I am writing a multi-GPU application. My program loads 3 image frames every second from an external network, processes them on a GPU (or 3 GPUs in my new code), and then transfers them to another computer on another network. Previously I used a single GPU to process the three images in sequence, and this worked well. My original (single-GPU) code looked something like this. The following is just pseudo-code to illustrate my question; I have left out many commands (kernel launches, error checking, etc.).

int main()
{
    // Allocate memory on the host side
    int *frame1, *frame2, *frame3;
    int N = 1024;
    frame1 = (int *)malloc(N * sizeof(int));
    frame2 = (int *)malloc(N * sizeof(int));
    frame3 = (int *)malloc(N * sizeof(int));

    // Allocate memory on the GPU side
    int *d_frame1, *d_frame2, *d_frame3;
    cudaMalloc((void**)&d_frame1, N * sizeof(int));
    cudaMalloc((void**)&d_frame2, N * sizeof(int));
    cudaMalloc((void**)&d_frame3, N * sizeof(int));

    // Process image 1 (kernel launch omitted)
    cudaMemcpy(d_frame1, frame1, sizeof(int) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(frame1, d_frame1, sizeof(int) * N, cudaMemcpyDeviceToHost);

    // Process image 2
    cudaMemcpy(d_frame2, frame2, sizeof(int) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(frame2, d_frame2, sizeof(int) * N, cudaMemcpyDeviceToHost);

    // Process image 3
    cudaMemcpy(d_frame3, frame3, sizeof(int) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(frame3, d_frame3, sizeof(int) * N, cudaMemcpyDeviceToHost);

    return 0;
}


Now I am using 3 GPUs, processing one image on each GPU in every cycle. However, there are some issues relating to multi-GPU processing that I am not familiar with, and I am hoping someone can shed some light on the effects I am seeing. I list these issues below the code. My first attempt at a multi-GPU (3-GPU) version looks like this:

// Define the structure used as input to the multi-GPU processing
typedef struct {
    int device;   // device id
    int *frame;
    int N;
} TGPUplan8;

// Define the number of GPUs available for processing the holograms to be 3
const int GPU_COUNT = 3;
CUTThread threadID[GPU_COUNT];

static CUT_THREADPROC solverThread(TGPUplan8 *plan)
{
    // Set device
    cutilSafeCall( cudaSetDevice(plan->device) );

    int *d_frame;
    cudaMalloc((void**)&d_frame, plan->N * sizeof(int));

    // Process image (kernel launch omitted)
    cudaMemcpy(d_frame, plan->frame, sizeof(int) * plan->N, cudaMemcpyHostToDevice);
    cudaMemcpy(plan->frame, d_frame, sizeof(int) * plan->N, cudaMemcpyDeviceToHost);

    cudaFree(d_frame);
    CUT_THREADEND;
}

int main()
{
    int N = 1024 * 1024;

    TGPUplan8 plan[GPU_COUNT];
    for (int i = 0; i < GPU_COUNT; i++)
    {
        plan[i].N = N;
        plan[i].device = i;
        plan[i].frame = (int *)malloc(N * sizeof(int));
    }

    // Run threads, one per GPU
    for (int i = 0; i < GPU_COUNT; i++)
        threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, (void *)(plan + i));

    // Wait for all threads to finish
    cutWaitForThreads(threadID, GPU_COUNT);

    return 0;
}

The code works, and it does indeed appear to process each of the three images on the three cards. However, there is a long delay introduced on every cycle of the while loop: approximately 5 seconds per cycle! I have done some reading and I believe this is caused by the time taken to create the 'context' for each CPU thread. Since I am destroying the threads on every iteration of the while loop, a context must be created on every cycle, and this seems to take 5 seconds.

1> I was hoping someone could comment on all of this. I can find very little documentation on context creation. What exactly is a context, and why might it take so long to create? Does every CUDA program incur a delay like this? That seems a little strange to me. Can anyone point me to documentation or resources on context creation and the time it takes?

2> I was also hoping people could comment on solutions to this problem. The obvious one is to move the thread creation outside the while loop (this would involve rewriting the network upload and download functions so that they can be called by the individual threads). I would prefer not to do this if possible. Is there perhaps some way I can keep the code above but create a thread pool that I can call repeatedly, so that the contexts are set up only once?

Thanks to everybody in advance,