Using Multiple GPUs Turns Out to Run Serially

Dear All,

I have a problem with implementing multiple GPUs.
Here is my code snippet:

for (devID = 0; devID < N_GPU; devID++) {
    cudaSetDevice(devID);
    spMV_bdia_gh_kernel_M <<< grid_M, threads_M >>> (d_data[devID], d_X[devID], d_offsets[devID], WS, k, N, mtxBdiaSize, devID, n_blocks_M, d_V_M[devID]);
    cudaDeviceSynchronize();
}
for (devID = 0; devID < N_GPU; devID++) {
    cudaSetDevice(devID);
    cublasSaxpy(handle_M[devID], N_M, &minus, d_V_M[devID], 1, d_R_M[devID], 1);
}

and this is the profiling result:

Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name

3.07418s  6.3488ms          (65536 1 1)        (32 1 1)        12      112B        0B         -           -  Tesla K20Xm (0)         1         7  spMV_bdia_gh_kernel_M(float const *, float const *, int const *, int, int, int, int, int, int, float*) [547]
3.08057s  6.3687ms          (65536 1 1)        (32 1 1)        12      112B        0B         -           -  Tesla K20Xm (1)         2        14  spMV_bdia_gh_kernel_M(float const *, float const *, int const *, int, int, int, int, int, int, float*) [561]
3.08698s  131.78us           (8192 1 1)       (256 1 1)        20        0B        0B         -           -  Tesla K20Xm (0)         1         7  void axpy_kernel_val<float, float, int=0>(cublasAxpyParamsVal<float, float, float>) [567]
3.08701s  132.52us           (8192 1 1)       (256 1 1)        20        0B        0B         -           -  Tesla K20Xm (1)         2        14  void axpy_kernel_val<float, float, int=0>(cublasAxpyParamsVal<float, float, float>) [573]

I was expecting both GPUs to run in parallel, but profiling with --print-gpu-trace shows that the GPUs run serially.

Anyone know why?

Thanks

I believe the host is waiting for GPU 0 to finish before it gets to GPU 1, due to the cudaDeviceSynchronize();

Also, it is good practice to create different streams for different GPUs. But first, try taking that cudaDeviceSynchronize() out of the first loop; then, after both GPUs have been given their work, apply a cudaDeviceSynchronize() to each one, with the corresponding cudaSetDevice() before each call. A sketch of that pattern follows below.
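
For anyone who lands here later, here is a minimal sketch of that pattern, reusing the names from the snippet above (the kernel arguments, handle_M, N_M, and N_GPU are assumed to be declared as in the original code):

// 1. Launch the kernel on every device first. Kernel launches are
//    asynchronous with respect to the host, so this loop returns
//    right away and both GPUs start working at roughly the same time.
for (devID = 0; devID < N_GPU; devID++) {
    cudaSetDevice(devID);
    spMV_bdia_gh_kernel_M <<< grid_M, threads_M >>> (d_data[devID], d_X[devID], d_offsets[devID], WS, k, N, mtxBdiaSize, devID, n_blocks_M, d_V_M[devID]);
}

// 2. Only now block the host, once per device, after all the
//    launches have been issued.
for (devID = 0; devID < N_GPU; devID++) {
    cudaSetDevice(devID);
    cudaDeviceSynchronize();
}

// 3. Issue the cuBLAS calls as before; each handle is bound to the
//    device that was current when cublasCreate() was called on it.
for (devID = 0; devID < N_GPU; devID++) {
    cudaSetDevice(devID);
    cublasSaxpy(handle_M[devID], N_M, &minus, d_V_M[devID], 1, d_R_M[devID], 1);
}

Note that the cuBLAS calls are also asynchronous with respect to the host, so the synchronization loop could even be moved after them if nothing on the host needs the kernel results in between.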

Wow, thanks a lot! I had tried that before and it wasn't working, but it's working now. I wonder why. LOL

solved, thanks :)