I have 8 GPUs, and the array I want to malloc is much larger than the memory of any single GPU, but the combined memory of the 8 GPUs can hold it. How can I allocate it across these GPUs? With UM, or some other way?
It’s unclear what the use case looks like. The classical way of dealing with matrices larger than the memory attached to a single processor is to use out-of-core techniques, and these can be parallelized. In this case each GPU would work on its own independently allocated chunk(s) of data, rather than a single unified matrix. See, for example:
Wesley C. Reiley and Robert A. van de Geijn, “POOCLAPACK: Parallel Out-of-Core Linear Algebra Package”, Technical Report, University of Texas Department of Computer Science, Nov. 1999
Although I do not have hands-on experience with out-of-core techniques, I think it likely that these traditional methods still have merit from a performance perspective. Depending on the specifics of your use case, you may not even have to set up anything manually, but may be able to rely on the cuBLASXt API provided by NVIDIA:
Possibly this resource is helpful.
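To make the cuBLASXt suggestion concrete, here is a minimal sketch of a multi-GPU GEMM. The matrix size and the 8-GPU device list are assumptions for illustration, and error checking is omitted; it needs a CUDA toolkit to build.

```cuda
#include <cstdlib>
#include <cublasXt.h>

int main() {
    const size_t n = 32768;              // hypothetical matrix dimension
    int devices[8] = {0,1,2,3,4,5,6,7};  // the 8 GPUs from the question

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);
    cublasXtDeviceSelect(handle, 8, devices);  // let cuBLASXt use all 8 GPUs

    // A, B, C live in host memory; cuBLASXt streams tiles to the GPUs,
    // so no single device ever needs to hold a full matrix.
    float *A = (float*)malloc(n * n * sizeof(float));
    float *B = (float*)malloc(n * n * sizeof(float));
    float *C = (float*)malloc(n * n * sizeof(float));
    const float alpha = 1.0f, beta = 0.0f;

    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```

The key point is that the operands stay in host memory and cuBLASXt handles the tiling and device scheduling internally.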
Thanks. My use case involves some simple calculations, but along different dimensions.
For example, the data is a huge two-dimensional matrix, and I want to perform an FFT along both dimensions. Does that mean a single unified matrix is necessary and efficient?
In an ordinary CUDA setting, there isn’t any way to get a single pointer that references data on separate GPUs. You would need at a minimum one pointer per GPU. Once you are thinking that way, there are many questions and even tutorials and training courses on how to do multi-GPU computing.
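As a sketch of the "one pointer per GPU" approach: split the logical array into per-device chunks, each with its own allocation. The total size and the even chunking are assumptions for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t N = 1ULL << 32;  // logical element count, too big for one GPU
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    float *chunk[16] = {nullptr};        // one device pointer per GPU
    size_t per = (N + nDev - 1) / nDev;  // elements per device

    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        size_t count = (d == nDev - 1) ? N - per * d : per;
        cudaMalloc(&chunk[d], count * sizeof(float));
        // GPU d now owns elements [d*per, d*per + count), via chunk[d]
    }

    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaFree(chunk[d]);
    }
    return 0;
}
```

Any kernel launched on device d then works through chunk[d] with indices local to that chunk.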
In a UM setting where the concurrentManagedAccess property is true, you can allocate a single array that is (roughly speaking) limited by the amount of CPU memory you have, rather than by the amount of GPU memory. In a single-GPU (UM) setting this is often referred to as oversubscription. In a multi-GPU setting of the type I mentioned, with the additional property that all the visible GPUs can be put into a peer relationship with each other, that UM "oversubscription" allocation can be accessed on any GPU, using demand-paged migration of data (or, perhaps, partial prefetching). In that way, portions of the large array could be migrated or prefetched to separate GPUs. There would be only one pointer referencing the data, but of course individual pointer arithmetic would need to happen on each GPU if each GPU were accessing a separate segment of the array.
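Here is a minimal sketch of that scheme: one managed allocation, peer access enabled among all devices, and each GPU's segment prefetched to it. The array size is an assumption for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    int cma = 0;
    cudaDeviceGetAttribute(&cma, cudaDevAttrConcurrentManagedAccess, 0);
    if (!cma) return 1;  // the scheme below depends on this property

    // Enable peer access among all device pairs (assumes it is supported).
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < nDev; ++j)
            if (i != j) cudaDeviceEnablePeerAccess(j, 0);
    }

    const size_t N = 1ULL << 33;  // larger than any single GPU's memory
    float *a = nullptr;
    cudaMallocManaged(&a, N * sizeof(float));  // one pointer for everything

    // Prefetch each GPU's segment of the single array to that GPU.
    size_t per = N / nDev;
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaMemPrefetchAsync(a + (size_t)d * per, per * sizeof(float), d, 0);
        // a kernel on device d would then index into a + d*per
    }

    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    cudaFree(a);
    return 0;
}
```

Access that strays outside a device's prefetched segment still works, but triggers demand-paged migration, which is where efficiency can suffer.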
Note that CUFFT already has a facility to use multiple GPUs for large FFT work.
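For completeness, the CUFFT multi-GPU path looks roughly like this for a large 2-D C2C transform. The problem size and the 8-GPU list are assumptions, and error checking and the host-to-descriptor copy are omitted.

```cuda
#include <cufftXt.h>

int main() {
    cufftHandle plan;
    cufftCreate(&plan);

    int gpus[8] = {0,1,2,3,4,5,6,7};
    cufftXtSetGPUs(plan, 8, gpus);     // spread the plan across 8 GPUs

    size_t workSizes[8];
    const int nx = 65536, ny = 65536;  // hypothetical 2-D problem size
    cufftMakePlan2d(plan, nx, ny, CUFFT_C2C, workSizes);

    // cuFFT allocates and distributes the data across the GPUs for you.
    cudaLibXtDesc *desc;
    cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);

    // (populate desc from a host buffer via cufftXtMemcpy here)

    cufftXtExecDescriptorC2C(plan, desc, desc, CUFFT_FORWARD);

    cufftXtFree(desc);
    cufftDestroy(plan);
    return 0;
}
```

Note that the library owns the data distribution across devices, which is exactly the hard part of doing a multi-dimensional FFT by hand.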
What will make the "single UM allocation" efficient is careful control of the access pattern from each GPU, together with appropriate prefetching. Without code, or at least a description of actual code behavior, the question cannot be answered and the efficiency cannot be ascertained.
I would not expect it to be trivial to manage efficient (data access pattern) behavior for large multidimensional FFT work, so if it were me, I would certainly start with CUFFT and benchmark against CUFFT if I wanted to see if I could do something better.