Multi-GPU array

I run code which launches a kernel on four GPUs.

I would like to create an array of size four, each of whose elements is a pointer into a different device's global memory.

The host will use this array to copy to and from the four different devices’ global memories.

The usual code pattern when using multiple GPUs is to do the following on each device (shown here, for instance, for GPU #0):

cudaSetDevice(0);
cudaStreamCreate(&str0);
cudaMalloc((void **)&d_0,N*sizeof(int));
cudaMallocHost((void **)&h_0,N*sizeof(int));

I can make three more copies of the above code, with str1, str2, str3, d_1, d_2, d_3, and so on.

But instead, I would like to do something like:

for (int i=0; i<deviceCount; i++){
    cudaSetDevice(i);
    cudaStreamCreate(&stream[i]);
    cudaMalloc((void **)&d_mem[i],N*sizeof(int));
    cudaMallocHost((void **)&h_mem[i],N*sizeof(int));
}

where deviceCount is 4 and where d_mem is defined as:

char **d_mem;

Unfortunately, I am getting some incorrect behavior (I'm not sure whether what I describe below is the cause).

First question: can I use standard malloc to allocate space for d_mem, as follows:

d_mem=(char **)malloc(sizeof(char *)*deviceCount);

while still calling cudaMalloc for its elements as written above, i.e. cudaMalloc((void **)&d_mem[i],N*sizeof(int))?

Or do I need to allocate d_mem itself with cudaMalloc, like:

cudaMalloc((void **)&d_mem,sizeof(char *)*deviceCount);

If the answer is the latter, then since I need to select a device before issuing any CUDA command, which device should I choose in cudaSetDevice before the command cudaMalloc((void **)&d_mem,sizeof(void *)*deviceCount)?

(I am currently doing the former).

Second question: what about the array h_mem in the above code (whose elements are pointers to pinned host memory)? Should I allocate memory for this array using malloc or cudaMallocHost?

If the answer is the latter, then since I need to select a device before issuing any CUDA command, which device should I choose in cudaSetDevice before the command cudaMallocHost((void **)&h_mem,sizeof(void *)*deviceCount)?

(I am currently doing the former).

It turns out my problem was caused by something else (related to synchronizing streams).

But I am still interested in the answers to the above questions.

One could do it like this:

// host-side array that holds one device pointer per GPU
int** h_pointersToDeviceArrays = (int**) malloc(sizeof(int*) * deviceCount);
for (int i=0; i<deviceCount; i++){
    cudaSetDevice(i);
    cudaStreamCreate(&stream[i]);
    // allocate the per-device array on GPU i and store its pointer on the host
    cudaMalloc((void **)&h_pointersToDeviceArrays[i],N*sizeof(int));
}

// repeat the following for each GPU that should have access to the pointer array
int** d_pointersToDeviceArrays;
cudaMalloc((void**)&d_pointersToDeviceArrays, sizeof(int*) * deviceCount);
cudaMemcpy(d_pointersToDeviceArrays, h_pointersToDeviceArrays, sizeof(int*) * deviceCount, cudaMemcpyHostToDevice);

//use d_pointersToDeviceArrays in kernel
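
Note that a kernel running on one GPU can read the pointer values stored in its copy of d_pointersToDeviceArrays, but dereferencing a pointer that refers to another GPU's memory additionally requires peer access between the devices. A rough sketch (hardware permitting; ideally check cudaDeviceCanAccessPeer first):

// enable peer access so a kernel on GPU i may dereference
// pointers into the other GPUs' global memory
for (int i=0; i<deviceCount; i++){
    cudaSetDevice(i);
    for (int j=0; j<deviceCount; j++){
        if (j != i)
            cudaDeviceEnablePeerAccess(j, 0);   // flags must currently be 0
    }
}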

//use pointers in h_pointersToDeviceArrays to deallocate device arrays
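
A minimal sketch of that cleanup step (error checking omitted):

// free the device-side copy of the pointer array on the device that allocated it
// (in the snippet above that is the last device the loop selected)
cudaSetDevice(deviceCount - 1);
cudaFree(d_pointersToDeviceArrays);

// free each per-device data buffer on the GPU that owns it, along with its stream
for (int i=0; i<deviceCount; i++){
    cudaSetDevice(i);
    cudaFree(h_pointersToDeviceArrays[i]);
    cudaStreamDestroy(stream[i]);
}

free(h_pointersToDeviceArrays);   // host-side copy of the pointer array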

Whether you should prefer cudaMallocHost over malloc depends on your program: the availability of pinned memory, whether you need asynchronous transfers, and so on.
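
For the second question, a sketch using the names from the question (error checking omitted): the pointer array h_mem itself is ordinary host data, just like h_pointersToDeviceArrays above, so plain malloc works for it; cudaMallocHost matters for the buffers it points to, since pinned memory is what lets cudaMemcpyAsync overlap transfers with work on the streams.

// the array of host pointers is ordinary (pageable) host memory
char **h_mem = (char **) malloc(sizeof(char *) * deviceCount);

for (int i=0; i<deviceCount; i++){
    cudaSetDevice(i);
    // pinned buffer used to stage transfers to/from device i
    cudaMallocHost((void **)&h_mem[i],N*sizeof(int));
}

// example: asynchronous copies between the pinned host buffers and the device arrays
for (int i=0; i<deviceCount; i++){
    cudaSetDevice(i);
    cudaMemcpyAsync(d_mem[i], h_mem[i], N*sizeof(int),
                    cudaMemcpyHostToDevice, stream[i]);
}

If all of your copies are synchronous cudaMemcpy calls, pageable buffers from plain malloc also work; pinning mainly buys you overlap and higher transfer bandwidth.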