Allocating memory on multiple GPUs

When using multiple GPUs, is there a more efficient way to allocate and deallocate memory than doing it inside the CUTTHREAD function?

I have the following code structure, which I expect is typical of scientific CUDA code.

[codebox]for (int i = 0; i < MAXTIMESTEPS; i++)
{
    // launch configurations omitted here for brevity
    function1<<<grid, block>>>();
    function2<<<grid, block>>>();
    function3<<<grid, block>>>();
}[/codebox]

When I use multiple GPUs, the calls to functions 1, 2, and 3 are replaced by calls to three CUTTHREAD functions. Each one allocates memory on its GPU (selected by the cut thread id), transfers data from host to device, lets the GPU do the work, transfers the results back into host memory, and deallocates the GPU memory. Since these cut thread functions are called thousands of times, that is a lot of allocating and deallocating of memory.
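Concretely, each per-timestep thread routine does roughly the following (the names here are illustrative, not my actual code):

[codebox]#include <cuda_runtime.h>

// Roughly what each per-timestep thread routine does; in the SDK this
// would be launched with cutStartThread from the multithreading helpers.
typedef struct {
    int    device;  // which GPU this thread drives
    float *h_data;  // host buffer for this GPU's share of the work
    size_t n;       // number of elements
} ThreadArgs;

__global__ void workKernel(float *d_data, size_t n); // stands in for function1/2/3

void *threadFunc(void *userData)
{
    ThreadArgs *args = (ThreadArgs *)userData;
    size_t bytes = args->n * sizeof(float);
    float *d_data;

    cudaSetDevice(args->device);          // bind this thread to its GPU
    cudaMalloc((void **)&d_data, bytes);  // allocated on every timestep...
    cudaMemcpy(d_data, args->h_data, bytes, cudaMemcpyHostToDevice);
    workKernel<<<(args->n + 255) / 256, 256>>>(d_data, args->n);
    cudaMemcpy(args->h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);                     // ...and freed on every timestep
    return NULL;
}[/codebox]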

Can I not allocate the memory on each GPU just once at the start of the host code, perform the loop (transferring data to and from each GPU as the computation evolves, referring to the variables already allocated on each GPU), and deallocate the GPU memory just once at the end of the host code?

To get reasonable performance, you need to create your threads once, outside the loop, and dispatch work to them inside the loop. Creating and destroying threads inside a loop is always a bad idea.
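Here is a minimal sketch of that structure, assuming one pthread per GPU (all names are illustrative). The key point is that in CUDA a context, and every device pointer created in it, belongs to the CPU thread that created it, so each worker thread has to stay alive for the whole run:

[codebox]#include <cuda_runtime.h>
#include <pthread.h>

#define MAXTIMESTEPS 10000
#define NGPUS        2

__global__ void function1(float *d, size_t n);  // your kernels
__global__ void function2(float *d, size_t n);
__global__ void function3(float *d, size_t n);

typedef struct { int device; float *h_data; size_t n; } WorkerArgs;

static void *worker(void *p)
{
    WorkerArgs *a = (WorkerArgs *)p;
    size_t bytes = a->n * sizeof(float);
    dim3 grid((a->n + 255) / 256), block(256);
    float *d_data;

    cudaSetDevice(a->device);             // once: bind thread to its GPU
    cudaMalloc((void **)&d_data, bytes);  // once: allocate
    cudaMemcpy(d_data, a->h_data, bytes, cudaMemcpyHostToDevice);

    for (int i = 0; i < MAXTIMESTEPS; i++) {
        function1<<<grid, block>>>(d_data, a->n);
        function2<<<grid, block>>>(d_data, a->n);
        function3<<<grid, block>>>(d_data, a->n);
        // if the GPUs must exchange data each step, copy the boundary
        // region to the host here and synchronize the threads with a barrier
    }

    cudaMemcpy(a->h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);                     // once: free
    return NULL;
}

int main(void)
{
    pthread_t  tid[NGPUS];
    WorkerArgs args[NGPUS];
    // ... fill in args[g].device = g, args[g].h_data, args[g].n ...
    for (int g = 0; g < NGPUS; g++)
        pthread_create(&tid[g], NULL, worker, &args[g]);
    for (int g = 0; g < NGPUS; g++)
        pthread_join(tid[g], NULL);
    return 0;
}[/codebox]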

MisterAnderson42 has a really great C++ helper class, called “GPUWorker,” to facilitate using multiple GPUs in this way:

http://forums.nvidia.com/index.php?showtopic=66598

I highly recommend using it if you can, as it does all the ugly CPU thread synchronization for you.
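From memory, usage looks something like the sketch below; treat the exact interface as approximate and check the linked thread. Here `launchFunction1` is a hypothetical host wrapper that launches your kernel and returns cudaGetLastError(), since the queued calls must return a cudaError_t:

[codebox]#include "GPUWorker.h"
#include <boost/bind.hpp>
#include <cuda_runtime.h>

#define MAXTIMESTEPS 10000

// hypothetical host-side kernel-launch wrapper returning an error code
cudaError_t launchFunction1(float *d_data, size_t n);

int main(void)
{
    // one worker thread per GPU; each thread owns that GPU's context
    GPUWorker gpu0(0), gpu1(1);

    size_t n = 1 << 20, bytes = n * sizeof(float);
    float *d0 = NULL, *d1 = NULL;

    // call() runs the operation in the worker thread and blocks until done,
    // so the allocation happens in the thread that owns the context
    gpu0.call(boost::bind(cudaMalloc, (void **)&d0, bytes));
    gpu1.call(boost::bind(cudaMalloc, (void **)&d1, bytes));

    for (int i = 0; i < MAXTIMESTEPS; i++) {
        // callAsync() queues without blocking, so both GPUs run concurrently
        gpu0.callAsync(boost::bind(launchFunction1, d0, n));
        gpu1.callAsync(boost::bind(launchFunction1, d1, n));
        gpu0.sync();  // wait for the queued work to finish
        gpu1.sync();
    }

    gpu0.call(boost::bind(cudaFree, d0));
    gpu1.call(boost::bind(cudaFree, d1));
    return 0;
}[/codebox]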

It’s not the creation of the threads; they are created as an array, as in the multiGPU example in the CUDA 2 SDK. It’s the constant allocating and deallocating of memory on the GPUs. Maybe, though, the allocation can be done outside the loops, with only the data passed to the GPU arrays inside the loops. Thanks for that idea.

And the GPUWorker class looks very useful. Where can I get the Boost libraries?

Yep, it’s a handy way to do a master/slave approach to multi-GPU programming.

Boost is available at http://www.boost.org/ . If you are on Linux, your distro most likely has a package for it in the package manager, if it is not already installed.