How to malloc an array in multiple gpus?

czhen · April 2, 2023, 3:44pm

I have 8 GPUs, and the array I want to allocate is much larger than the memory of any single GPU, although the combined memory of all 8 GPUs could hold it. How can I allocate it across these GPUs? Should I use UM (Unified Memory) or some other approach?

njuffa · April 2, 2023, 5:39pm

It’s unclear what the use case looks like. The classical way of dealing with matrices larger than the memory attached to a single processor is to use out-of-core techniques, and these can be parallelized. In this case each GPU would work on its own independently allocated chunk(s) of data, rather than a single unified matrix. See, for example:

Wesley C. Reiley and Robert A. van de Geijn, “POOCLAPACK: Parallel Out-of-Core Linear Algebra Package”, Technical Report, University of Texas Department of Computer Science, Nov. 1999

Although I do not have hands-on experience with out-of-core techniques, I think it likely that these traditional methods still have merit from a performance perspective. Depending on the specifics of your use case, you may not even have to set up anything manually, but may be able to rely on the cuBLASXt API provided by NVIDIA:

rs277 · April 2, 2023, 6:57pm

Possibly this resource is helpful.

Edited: replaced the link to the PDF with a link to the page containing both the video and the PDF.

czhen · April 3, 2023, 1:36am

Thanks. My use case is to do some simple calculations but in different dimensions.

For example, the data is a huge two-dimensional matrix, and I want to perform an FFT along both dimensions. Does that mean a single unified matrix is necessary and efficient?

Robert_Crovella · April 3, 2023, 2:01am

In an ordinary CUDA setting, there isn’t any way to get a single pointer that references data on separate GPUs. You would need at a minimum one pointer per GPU. Once you are thinking that way, there are many questions and even tutorials and training courses on how to do multi-GPU computing.
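As a rough illustration of the "one pointer per GPU" approach, here is a minimal sketch that splits a logical array of `n` floats into one `cudaMalloc` allocation per visible device. The helpers `chunk_elems` and `chunk_offset`, the `CHECK` macro, and the array size are assumptions made up for this example; they are not from the thread or from any library.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: number of elements device `dev` owns when `n`
// elements are split as evenly as possible across `ndev` devices.
size_t chunk_elems(size_t n, int ndev, int dev) {
    size_t base = n / ndev;
    size_t rem  = n % ndev;
    return base + (static_cast<size_t>(dev) < rem ? 1 : 0);
}

// Hypothetical helper: global index of the first element owned by `dev`.
size_t chunk_offset(size_t n, int ndev, int dev) {
    size_t base = n / ndev;
    size_t rem  = n % ndev;
    size_t d    = static_cast<size_t>(dev);
    return d * base + (d < rem ? d : rem);
}

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

int main() {
    int ndev = 0;
    CHECK(cudaGetDeviceCount(&ndev));

    const size_t n = size_t(1) << 26;            // illustrative size only
    std::vector<float*> d_chunk(ndev, nullptr);  // one device pointer per GPU

    for (int dev = 0; dev < ndev; ++dev) {
        CHECK(cudaSetDevice(dev));               // later calls target this GPU
        size_t count = chunk_elems(n, ndev, dev);
        CHECK(cudaMalloc(&d_chunk[dev], count * sizeof(float)));
        CHECK(cudaMemset(d_chunk[dev], 0, count * sizeof(float)));
        std::printf("GPU %d holds elements [%zu, %zu)\n", dev,
                    chunk_offset(n, ndev, dev),
                    chunk_offset(n, ndev, dev) + count);
    }

    for (int dev = 0; dev < ndev; ++dev) {
        CHECK(cudaSetDevice(dev));
        CHECK(cudaFree(d_chunk[dev]));
    }
    return 0;
}
```

Kernels launched on a given GPU would then receive `d_chunk[dev]` together with the chunk's global offset, so that each device can translate its local indices back into positions in the logical array.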

In a UM setting where the concurrentManagedAccess property is true, you can allocate a single array that is (roughly speaking) limited by the amount of CPU memory you have, rather than by the amount of GPU memory you have. This is often referred to as oversubscription, in a single GPU (UM) setting. In a multi-GPU setting of the type I mentioned, and including an additional property that all the visible GPUs can be put into a peer relationship with each other, then the UM “oversubscription” allocation can be accessed on any GPU, using demand-paging migration of data (or, perhaps, partial prefetching). In that way portions of the large array could be migrated to or prefetched to separate GPUs. There would only be one pointer to reference the data, but of course individual pointer arithmetic would need to happen on each GPU if each GPU were accessing a separate segment of the array.
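The single-pointer UM approach described above might look roughly like the following sketch. It relies on demand paging only; explicit prefetching with `cudaMemPrefetchAsync` is left out because its signature differs between CUDA versions. The kernel `fill_segment`, the helpers `segment_begin` and `segment_count`, and the array size are hypothetical names and values chosen for illustration. The sketch checks `concurrentManagedAccess` on every device but does not verify the peer relationship mentioned above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Grid-stride kernel: writes `value` into one segment of the array.
__global__ void fill_segment(float* seg, size_t count, float value) {
    size_t stride = size_t(blockDim.x) * gridDim.x;
    for (size_t i = size_t(blockIdx.x) * blockDim.x + threadIdx.x;
         i < count; i += stride)
        seg[i] = value;
}

// Hypothetical helpers: even split, the last device takes the remainder.
size_t segment_begin(size_t n, int ndev, int dev) {
    return (n / ndev) * dev;
}
size_t segment_count(size_t n, int ndev, int dev) {
    size_t base = n / ndev;
    return dev == ndev - 1 ? n - base * (ndev - 1) : base;
}

int main() {
    int ndev = 0;
    CHECK(cudaGetDeviceCount(&ndev));
    for (int dev = 0; dev < ndev; ++dev) {
        int cma = 0;
        CHECK(cudaDeviceGetAttribute(&cma, cudaDevAttrConcurrentManagedAccess, dev));
        if (!cma) {
            std::fprintf(stderr, "GPU %d lacks concurrentManagedAccess\n", dev);
            return 1;
        }
    }

    const size_t n = size_t(1) << 24;   // illustrative; a real case would be far larger
    float* data = nullptr;              // the single pointer shared by all GPUs
    CHECK(cudaMallocManaged(&data, n * sizeof(float)));

    // Each GPU touches only its own segment, so pages migrate to that GPU on demand.
    for (int dev = 0; dev < ndev; ++dev) {
        CHECK(cudaSetDevice(dev));
        fill_segment<<<256, 256>>>(data + segment_begin(n, ndev, dev),
                                   segment_count(n, ndev, dev), float(dev));
        CHECK(cudaGetLastError());
    }
    for (int dev = 0; dev < ndev; ++dev) {
        CHECK(cudaSetDevice(dev));
        CHECK(cudaDeviceSynchronize());
    }

    // The host reads through the same pointer; pages migrate back on demand.
    bool ok = true;
    for (int dev = 0; dev < ndev; ++dev)
        ok = ok && data[segment_begin(n, ndev, dev)] == float(dev);
    std::printf("segments written as expected: %s\n", ok ? "yes" : "no");

    CHECK(cudaFree(data));
    return 0;
}
```

Note that the pointer arithmetic (`data + segment_begin(...)`) happens per GPU, as described above. Whether this performs well depends entirely on how disjoint the per-GPU access patterns are.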

Note that CUFFT already has a facility to use multiple GPUs for large FFT work.
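For reference, a minimal sketch of the cuFFT multi-GPU (cufftXt) workflow for a 2D complex-to-complex transform is shown below. It assumes at least two GPUs that cuFFT accepts as a multi-GPU configuration, uses devices 0 and 1, and picks a small 1024 x 1024 size purely for illustration. The helper `dc_only` and the `CUFFT_CHECK` macro are hypothetical names invented for this example. As far as I understand, the data still has to fit in the combined memory of the participating GPUs.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cufftXt.h>

#define CUFFT_CHECK(call)                                            \
    do {                                                             \
        cufftResult r_ = (call);                                     \
        if (r_ != CUFFT_SUCCESS) {                                   \
            std::fprintf(stderr, "%s failed with cufftResult %d\n",  \
                         #call, int(r_));                            \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical helper: a constant input should transform to a single
// nonzero bin at index 0 with value `expected_dc`.
bool dc_only(const std::vector<cufftComplex>& out, float expected_dc, float tol) {
    if (out.empty()) return false;
    if (std::fabs(out[0].x - expected_dc) > tol || std::fabs(out[0].y) > tol)
        return false;
    for (size_t i = 1; i < out.size(); ++i)
        if (std::fabs(out[i].x) > tol || std::fabs(out[i].y) > tol) return false;
    return true;
}

int main() {
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev < 2) {
        std::fprintf(stderr, "this sketch needs at least 2 GPUs\n");
        return 1;
    }
    int gpus[2] = {0, 1};                 // assumption: use devices 0 and 1
    const int nx = 1024, ny = 1024;       // illustrative size only

    cufftHandle plan;
    CUFFT_CHECK(cufftCreate(&plan));
    CUFFT_CHECK(cufftXtSetGPUs(plan, 2, gpus));
    size_t work_size[2];                  // one work-area size per GPU
    CUFFT_CHECK(cufftMakePlan2d(plan, nx, ny, CUFFT_C2C, work_size));

    std::vector<cufftComplex> h(size_t(nx) * ny);
    for (auto& v : h) { v.x = 1.0f; v.y = 0.0f; }   // constant input

    // cuFFT decides how the matrix is distributed across the GPUs.
    cudaLibXtDesc* d = nullptr;
    CUFFT_CHECK(cufftXtMalloc(plan, &d, CUFFT_XT_FORMAT_INPLACE));
    CUFFT_CHECK(cufftXtMemcpy(plan, d, h.data(), CUFFT_COPY_HOST_TO_DEVICE));
    CUFFT_CHECK(cufftXtExecDescriptorC2C(plan, d, d, CUFFT_FORWARD));
    CUFFT_CHECK(cufftXtMemcpy(plan, h.data(), d, CUFFT_COPY_DEVICE_TO_HOST));

    std::printf("result is DC-only: %s\n",
                dc_only(h, float(nx) * float(ny), 1.0f) ? "yes" : "no");

    CUFFT_CHECK(cufftXtFree(d));
    CUFFT_CHECK(cufftDestroy(plan));
    return 0;
}
```

This would be built with something like `nvcc example.cu -lcufft`. The point of the sketch is that the library, not the user, manages the per-GPU buffers and the inter-GPU data exchange needed between the two passes of a 2D FFT.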

What will make the “single UM allocation” “efficient” is if your access patterns from each GPU are carefully controlled, and appropriate prefetching is done. The question cannot be answered and efficiency cannot be ascertained in the absence of any code or description of actual code behavior.

I would not expect it to be trivial to manage efficient (data access pattern) behavior for large multidimensional FFT work, so if it were me, I would certainly start with CUFFT and benchmark against CUFFT if I wanted to see if I could do something better.

