I run code which launches a kernel on four GPUs.
I would like to create an array of size four, each of whose elements is a pointer into a different device's global memory.
The host will use this array to copy to and from the four different devices’ global memories.
The usual code pattern when using multiple GPUs is to do the following on each device (I’ll do it for instance for GPU #0):
cudaSetDevice(0);
cudaStreamCreate(&str0);
cudaMalloc((void **)&d_0,N*sizeof(int));
cudaMallocHost((void **)&h_0,N*sizeof(int));
I can make three more copies of the above code, with str1, str2, str3, d_1, d_2, d_3, and so on.
But instead, I would like to do something like:
for (int i = 0; i < deviceCount; i++) {
    cudaSetDevice(i);
    cudaStreamCreate(&stream[i]);
    cudaMalloc((void **)&d_mem[i], N * sizeof(int));
    cudaMallocHost((void **)&h_mem[i], N * sizeof(int));
}
where deviceCount is 4 and where d_mem is declared as:
char **d_mem;
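For concreteness, here is the whole allocation step in one place, reflecting what I am currently doing (plain malloc for both pointer arrays; error checking omitted for brevity):

```
int deviceCount;
cudaGetDeviceCount(&deviceCount);   /* 4 on my machine */

cudaStream_t *stream = (cudaStream_t *)malloc(deviceCount * sizeof(cudaStream_t));
char **d_mem = (char **)malloc(deviceCount * sizeof(char *));   /* device pointers   */
char **h_mem = (char **)malloc(deviceCount * sizeof(char *));   /* pinned host ptrs  */

for (int i = 0; i < deviceCount; i++) {
    cudaSetDevice(i);
    cudaStreamCreate(&stream[i]);
    cudaMalloc((void **)&d_mem[i], N * sizeof(int));
    cudaMallocHost((void **)&h_mem[i], N * sizeof(int));
}
```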
Unfortunately, I am seeing some incorrect behavior (I'm not sure whether what I describe below is the cause).
First question: can I use standard malloc to allocate the d_mem array itself, as follows:
d_mem = (char **)malloc(deviceCount * sizeof(char *));
while still allocating its elements with cudaMalloc inside the loop, as written above: cudaMalloc((void **)&d_mem[i], N * sizeof(int));
Or do I need to allocate d_mem itself with cudaMalloc, like:
`cudaMalloc((void **)&d_mem, deviceCount * sizeof(char *));`
If the answer is the latter, then given that I need to select a device before issuing any CUDA command, which device should I pass to cudaSetDevice before that cudaMalloc call?
(I am currently doing the former.)
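To make the two alternatives explicit:

```
/* Former (what I do now): d_mem itself lives in host memory,
   and only its elements point into the devices' global memories. */
d_mem = (char **)malloc(deviceCount * sizeof(char *));

/* Latter: d_mem itself would live in the global memory of whichever
   device is current -- hence my question about cudaSetDevice. */
cudaMalloc((void **)&d_mem, deviceCount * sizeof(char *));
```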
Second question: what about the array h_mem in the above code, whose elements are pointers to pinned host memory? Should I allocate h_mem itself using malloc or using cudaMallocHost, like:
`cudaMallocHost((void **)&h_mem, deviceCount * sizeof(char *));`
If the answer is the latter, then again, given that I need to select a device before issuing any CUDA command, which device should I pass to cudaSetDevice before that cudaMallocHost call?
(I am currently doing the former.)
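For reference, this is roughly how the host then uses both arrays for the copies mentioned at the top (again assuming the former, i.e. plain malloc, for both d_mem and h_mem):

```
for (int i = 0; i < deviceCount; i++) {
    cudaSetDevice(i);
    /* The host itself dereferences d_mem[i] and h_mem[i] here. */
    cudaMemcpyAsync(d_mem[i], h_mem[i], N * sizeof(int),
                    cudaMemcpyHostToDevice, stream[i]);
    /* ... kernel launch on device i ... */
    cudaMemcpyAsync(h_mem[i], d_mem[i], N * sizeof(int),
                    cudaMemcpyDeviceToHost, stream[i]);
}
```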