Dear All,

I have a problem with my multi-GPU implementation.

Here is a snippet of my code:

```
// Launch the SpMV kernel on each device, then synchronize.
for (devID = 0; devID < N_GPU; devID++) {
    cudaSetDevice(devID);
    spMV_bdia_gh_kernel_M<<<grid_M, threads_M>>>(d_data[devID], d_X[devID],
        d_offsets[devID], WS, k, N, mtxBdiaSize, devID, n_blocks_M, d_V_M[devID]);
    cudaDeviceSynchronize();
}
// Then run the AXPY on each device with its own cuBLAS handle.
for (devID = 0; devID < N_GPU; devID++) {
    cudaSetDevice(devID);
    cublasSaxpy(handle_M[devID], N_M, &minus, d_V_M[devID], 1, d_R_M[devID], 1);
}
```

and this is the profiling result:

```
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
3.07418s 6.3488ms (65536 1 1) (32 1 1) 12 112B 0B - - Tesla K20Xm (0) 1 7 spMV_bdia_gh_kernel_M(float const *, float const *, int const *, int, int, int, int, int, int, float*) [547]
3.08057s 6.3687ms (65536 1 1) (32 1 1) 12 112B 0B - - Tesla K20Xm (1) 2 14 spMV_bdia_gh_kernel_M(float const *, float const *, int const *, int, int, int, int, int, int, float*) [561]
3.08698s 131.78us (8192 1 1) (256 1 1) 20 0B 0B - - Tesla K20Xm (0) 1 7 void axpy_kernel_val<float, float, int=0>(cublasAxpyParamsVal<float, float, float>) [567]
3.08701s 132.52us (8192 1 1) (256 1 1) 20 0B 0B - - Tesla K20Xm (1) 2 14 void axpy_kernel_val<float, float, int=0>(cublasAxpyParamsVal<float, float, float>) [573]
```

I was expecting both GPUs to run in parallel, but profiling with `--print-gpu-trace` shows that the kernels run serially: each kernel on device 1 starts only after the corresponding kernel on device 0 has finished.
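For reference, my understanding is that kernel launches are asynchronous with respect to the host, so I assumed a pattern like the following would let the kernels overlap across devices (a simplified, self-contained sketch with a dummy kernel standing in for my real one, not my actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel standing in for spMV_bdia_gh_kernel_M.
__global__ void dummy_kernel(float *out) {
    out[threadIdx.x] = (float)threadIdx.x;
}

int main() {
    int nGpu = 0;
    cudaGetDeviceCount(&nGpu);
    if (nGpu > 8) nGpu = 8;

    float *d_out[8];
    // Launch on every device first: each launch should return
    // immediately and just queue work on that device.
    for (int devID = 0; devID < nGpu; devID++) {
        cudaSetDevice(devID);
        cudaMalloc(&d_out[devID], 32 * sizeof(float));
        dummy_kernel<<<1, 32>>>(d_out[devID]);
    }
    // Synchronize afterwards in a separate loop, so the kernels on
    // all devices can execute concurrently.
    for (int devID = 0; devID < nGpu; devID++) {
        cudaSetDevice(devID);
        cudaDeviceSynchronize();
        cudaFree(d_out[devID]);
    }
    printf("done\n");
    return 0;
}
```

Is this understanding correct, and if so, why doesn't my code above behave that way?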

Anyone know why?

Thanks