cudaMalloc on the same pointer

Hello,

Let’s say I have a single declared pointer and I want to cudaMalloc it on different GPUs:

float *DevMem;
for (int i_GPU = 0; i_GPU < GPUNumber; i_GPU++)
{
 cudaSetDevice(i_GPU);
 cudaMalloc(&DevMem, numBytes); // numBytes = whatever allocation size I need
}
...

Is there any issue with proceeding like this code does?

Thanks

How would you access or free any but the last allocated chunk of memory?

https://en.wikipedia.org/wiki/Memory_leak
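
To illustrate the concern, here is a minimal sketch of what the loop as written does (numBytes stands in for whatever allocation size you intend):

float *DevMem;

cudaSetDevice(0);
cudaMalloc(&DevMem, numBytes);  // DevMem now points to an allocation on GPU 0

cudaSetDevice(1);
cudaMalloc(&DevMem, numBytes);  // DevMem is overwritten; the GPU 0 allocation becomes unreachable

cudaFree(DevMem);               // frees only the GPU 1 allocation; the GPU 0 memory is leaked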

Yes! I want to allocate/free the memory separately, since cudaMalloc and cudaFree are synchronous (just to make sure every kernel can run on its respective GPU without any blocking):

for (int i_GPU = 0; i_GPU < GPUNumber; i_GPU++)
{
 cudaSetDevice(i_GPU);
 cudaMalloc(&DevMem, numBytes); // numBytes = whatever allocation size I need
}
for (int i_GPU = 0; i_GPU < GPUNumber; i_GPU++)
{
 cudaSetDevice(i_GPU);
 // ... execute some kernels asynchronously
}
for (int i_GPU = 0; i_GPU < GPUNumber; i_GPU++)
{
 cudaSetDevice(i_GPU);
 cudaFree(DevMem);
}

How can a single pointer stored in DevMem simultaneously point to GPUNumber different memory allocations (assuming GPUNumber > 1)? Hint: It can’t. You want one pointer per allocation, e.g. an array of GPUNumber pointers.

I think that with the cudaSetDevice() function you can allocate separate memory in different contexts (GPUs), can’t you? I just want to make sure.

Your question is about keeping track of the allocations for later use. A single pointer cannot keep track of more than one allocation.

OK. But do you know how to avoid the synchronizing effect of cudaMalloc and cudaFree when I want to execute kernels asynchronously across different GPUs?

float *d[4];
cudaSetDevice(0);
cudaMalloc(&d[0], …);
cudaSetDevice(1);
cudaMalloc(&d[1], …);
cudaSetDevice(2);
cudaMalloc(&d[2], …);
cudaSetDevice(3);
cudaMalloc(&d[3], …);

cudaSetDevice(0);
kernel<<<…>>>(d[0], …);
cudaSetDevice(1);
kernel<<<…>>>(d[1], …);
cudaSetDevice(2);
kernel<<<…>>>(d[2], …);
cudaSetDevice(3);
kernel<<<…>>>(d[3], …);

cudaSetDevice(0);
cudaFree(d[0]);
cudaSetDevice(1);
cudaFree(d[1]);
cudaSetDevice(2);
cudaFree(d[2]);
cudaSetDevice(3);
cudaFree(d[3]);

As indicated already, you would want to use 4 separate pointers for the above example.
Feel free to convert any of the above to loops, according to your knowledge of C++ programming.
But don’t change the order of operations.
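
For reference, the same sequence written as loops might look like this (a sketch only; numBytes and the kernel arguments are placeholders, and the order of the three loops is preserved):

float *d[4];

for (int i = 0; i < 4; i++){
  cudaSetDevice(i);
  cudaMalloc(&d[i], numBytes);}

for (int i = 0; i < 4; i++){
  cudaSetDevice(i);
  kernel<<<...>>>(d[i], ...);}

for (int i = 0; i < 4; i++){
  cudaSetDevice(i);
  cudaFree(d[i]);}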

Thanks Robert. But the thing is, the number of pointers will vary at runtime (depending on user input), and each allocation will have a different size. Can you suggest the best solution for that?

All pointers have the same size (typically 64 bits / 8 bytes); it is the allocations they point to that can differ in size. The number of pointers required will vary with the number of GPUs, so allocate the array of pointers dynamically, based on GPUNumber.
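
If GPUNumber should simply track how many GPUs are installed, it can be queried at runtime (a sketch using cudaGetDeviceCount()):

int GPUNumber = 0;
cudaGetDeviceCount(&GPUNumber);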

I’ve given you all the CUDA specific knowledge needed, already. It’s just C++ programming now.

int num_pointers;
// your code sets the above variable to something at runtime

int *sizes = new int[num_pointers];
// your code fills in the size of each allocation in bytes

float **d = new float*[num_pointers];

for (int i = 0; i<num_pointers; i++){
  cudaSetDevice(i);
  cudaMalloc(&d[i], sizes[i]);}


for (int i = 0; i<num_pointers; i++){
  cudaSetDevice(i);
  kernel<<<...>>>(d[i], ...);}

for (int i = 0; i<num_pointers; i++){
  cudaSetDevice(i);
  cudaFree(d[i]);}
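
One non-CUDA housekeeping detail: the host-side arrays created with new above should also be released once everything is freed, e.g.

delete [] sizes;
delete [] d;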

OK, so just use a double pointer (an array of pointers). Thanks Rob!