Multi-GPU programming issue: time taken to create a context?

Hi all,

I've been having a problem with multi-GPU computing and I am hoping someone can shed some light on it.

I am writing a multi-GPU application. The program loads 3 image frames every second from an external network, processes them on a GPU (or on 3 GPUs in my new code), and then transfers them to another computer on another network. Previously I used a single GPU to process the three images in sequence, and this worked well. My original (single-GPU) code looked something like this. The following is just pseudocode to illustrate my question; I have left out many commands.

int main()
{
    //Allocate memory on host side
    int *frame1, *frame2, *frame3;
    int N = 1024 * 1024;
    frame1 = (int *)malloc(N * sizeof(int));
    frame2 = (int *)malloc(N * sizeof(int));
    frame3 = (int *)malloc(N * sizeof(int));

    //Allocate memory on GPU side
    int *d_frame1, *d_frame2, *d_frame3;
    cudaMalloc((void**)&d_frame1, N * sizeof(int));
    cudaMalloc((void**)&d_frame2, N * sizeof(int));
    cudaMalloc((void**)&d_frame3, N * sizeof(int));

    while(!endofvideo)
    {
        download_new_frames_from_external_network(frame1, frame2, frame3);

        //Process image1
        cudaMemcpy(d_frame1, frame1, sizeof(int) * N, cudaMemcpyHostToDevice);
        GPU_process(d_frame1);
        cudaMemcpy(frame1, d_frame1, sizeof(int) * N, cudaMemcpyDeviceToHost);

        //Process image2
        cudaMemcpy(d_frame2, frame2, sizeof(int) * N, cudaMemcpyHostToDevice);
        GPU_process(d_frame2);
        cudaMemcpy(frame2, d_frame2, sizeof(int) * N, cudaMemcpyDeviceToHost);

        //Process image3
        cudaMemcpy(d_frame3, frame3, sizeof(int) * N, cudaMemcpyHostToDevice);
        GPU_process(d_frame3);
        cudaMemcpy(frame3, d_frame3, sizeof(int) * N, cudaMemcpyDeviceToHost);

        upload_processed_frames_to_external_network(frame1, frame2, frame3);
    }

    //Free device memory once the video is finished
    cudaFree(d_frame1);
    cudaFree(d_frame2);
    cudaFree(d_frame3);

    return 0;
}

Now I am using 3 GPUs, processing one image on each in every cycle. However, there are some issues relating to multi-GPU processing that I am not familiar with, and I am hoping someone out there can shed some light on the effects I am seeing. I list these issues under the code below. My first attempt at a multi-GPU (3-GPU) version looks like this:

// Define structure to be used as input to multi-GPU processing
typedef struct {
    //Device id
    int device;
    int *frame;
    int N;
} TGPUplan8;

// Define the number of GPUs available for processing the holograms to be 3
const int GPU_COUNT = 3;
CUTThread threadID[GPU_COUNT];

static CUT_THREADPROC solverThread(TGPUplan8 *plan)
{
    //Set device
    cutilSafeCall( cudaSetDevice(plan->device) );

    int *d_frame;
    cudaMalloc((void**)&d_frame, plan->N * sizeof(int));

    //Process image
    cudaMemcpy(d_frame, plan->frame, sizeof(int) * plan->N, cudaMemcpyHostToDevice);
    GPU_process(d_frame);
    cudaMemcpy(plan->frame, d_frame, sizeof(int) * plan->N, cudaMemcpyDeviceToHost);
    cudaFree(d_frame);

    CUT_THREADEND;
}

int main()
{
    int N = 1024 * 1024;

    TGPUplan8 plan[GPU_COUNT];
    for(int i = 0; i < GPU_COUNT; i++)
    {
        plan[i].N = N;
        plan[i].device = i;
        plan[i].frame = (int *)malloc(N * sizeof(int));
    }

    while(!endofvideo)
    {
        download_new_frames_from_external_network(plan[0].frame, plan[1].frame, plan[2].frame);

        //Run one worker thread per GPU, then wait for all of them to finish
        for(int i = 0; i < GPU_COUNT; i++)
        {
            threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, (void *)(plan + i));
        }
        cutWaitForThreads(threadID, GPU_COUNT);

        upload_processed_frames_to_external_network(plan[0].frame, plan[1].frame, plan[2].frame);
    }

    return 0;
}

The code works and it does indeed appear to process each of the three images on the three cards. However, a long delay is introduced into every cycle of the while loop: approximately 5 seconds per cycle! I have done some reading and I believe this is caused by the time taken to create the CUDA 'context' for each CPU thread. Since the threads are destroyed on every pass through the while loop, a new context must be created on every cycle, and this seems to take 5 seconds.
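To check this, I timed the first CUDA call made in a fresh process, along the lines of the sketch below. This is just illustrative code I put together; cudaFree(0) is a commonly used no-op that forces the runtime to create a context if one does not exist yet, and the gettimeofday timing assumes Linux.

#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

//Measure how long the implicit context creation takes on one device
double time_context_creation(int device)
{
    struct timeval t0, t1;
    cudaSetDevice(device);   //cheap: just records which device to use
    gettimeofday(&t0, NULL);
    cudaFree(0);             //first real CUDA call: the context is created here
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
}

int main()
{
    printf("context creation took %.3f s\n", time_context_creation(0));
    return 0;
}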

1> I was hoping someone could comment on all of this. I can find very little documentation on context creation. What exactly is a context, and why might it take so long to create? Does every CUDA program incur this 5-second delay? That seems a little strange to me. Can anyone point me to documentation or resources on context creation and the time it takes?
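From the little documentation I have found, the driver API exposes contexts explicitly, while the runtime API creates one behind the scenes on the first CUDA call in a thread. My (possibly wrong) mental model is roughly this driver API sketch:

#include <cuda.h>   //driver API

int main()
{
    CUdevice  dev;
    CUcontext ctx;

    cuInit(0);              //initialise the driver API
    cuDeviceGet(&dev, 0);   //get a handle to GPU 0

    //Explicit version of what the runtime does implicitly: create a
    //context (the per-device state: address space, allocations, module
    //code, ...) and make it current on this CPU thread. As far as I can
    //tell, this creation step is where the time goes.
    cuCtxCreate(&ctx, 0, dev);

    //... cuMemAlloc / kernel launches would go here ...

    cuCtxDestroy(ctx);      //tear the context down again
    return 0;
}

If that picture is right, then a context is created per CPU thread, which would explain why spawning fresh threads every cycle pays the cost every time.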

2> Also, I was hoping people could comment on solutions to this problem. The obvious one for me is to move the thread creation outside the while loop (this would involve rewriting the network upload and download functions so that they can be called by the individual threads). I would prefer not to do this if possible. Is there perhaps some way I can keep the structure above and create a thread pool that I can call repeatedly, so that the contexts are only set up once?
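Something like the sketch below is what I have in mind; gpu_worker and the condition-variable handshake are just my guess at how such a pool could look (pthreads assumed). Each worker calls cudaSetDevice once, so its context is created once at startup, and it then sleeps until main hands it a new frame:

#include <pthread.h>
#include <cuda_runtime.h>

extern void GPU_process(int *d_frame);   //same processing routine as above

typedef struct {
    int device;                 //GPU this worker owns
    int *frame;                 //host frame to process this cycle
    int N;
    int has_work, done, quit;
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
} Worker;

static void *gpu_worker(void *arg)
{
    Worker *w = (Worker *)arg;
    cudaSetDevice(w->device);   //context is created ONCE, here
    int *d_frame;
    cudaMalloc((void **)&d_frame, w->N * sizeof(int));

    for(;;)
    {
        //Sleep until main signals that a new frame is ready (or quit)
        pthread_mutex_lock(&w->mtx);
        while(!w->has_work && !w->quit)
            pthread_cond_wait(&w->cv, &w->mtx);
        if(w->quit) { pthread_mutex_unlock(&w->mtx); break; }
        w->has_work = 0;
        pthread_mutex_unlock(&w->mtx);

        //Same per-frame work as solverThread, but no context setup cost
        cudaMemcpy(d_frame, w->frame, w->N * sizeof(int), cudaMemcpyHostToDevice);
        GPU_process(d_frame);
        cudaMemcpy(w->frame, d_frame, w->N * sizeof(int), cudaMemcpyDeviceToHost);

        //Tell main this frame is finished
        pthread_mutex_lock(&w->mtx);
        w->done = 1;
        pthread_cond_signal(&w->cv);
        pthread_mutex_unlock(&w->mtx);
    }
    cudaFree(d_frame);
    return NULL;
}

Main would create the three workers before the while loop, and on each cycle set has_work = 1, signal each worker, and then wait on each worker's cv until done == 1 before uploading. The network download/upload functions would stay in main, which is what I want.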

Thanks to everybody in advance,
Best,
Bryan

Anybody?