Segmentation fault on Multi-GPU Implementation

Dear all,

I have a problem with a multi-GPU implementation that combines a cuBLAS call with my own kernel.

If I use only the cuBLAS call, it works:
for(devID = 0; devID < N_GPU; devID++) {
    cudaSetDevice(devID);
    cublasScopy(handle_M[devID], N_M, d_B_M[devID], 1, d_R_M[devID], 1);
}

If I use only my own kernel, it also works:
for(devID = 0; devID < N_GPU; devID++) {
    cudaSetDevice(devID);
    spMV_bdia_gh_kernel_M <<< grid_M, threads_M >>> (d_data, d_X, d_offsets, WS, k, N, mtxBdiaSize, devID, n_blocks_M, d_V_M[devID]);
    cudaDeviceSynchronize();
}

But if I combine both of them, I get the error “Segmentation fault (core dumped)”:
for(devID = 0; devID < N_GPU; devID++) {
    cudaSetDevice(devID);
    cublasScopy(handle_M[devID], N_M, d_B_M[devID], 1, d_R_M[devID], 1);
    spMV_bdia_gh_kernel_M <<< grid_M, threads_M >>> (d_data, d_X, d_offsets, WS, k, N, mtxBdiaSize, devID, n_blocks_M, d_V_M[devID]);
    cudaDeviceSynchronize();
}
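
To narrow this down, here is a minimal sketch of the same loop pattern with every call checked. It is not my real code: scale_kernel is a hypothetical stand-in for spMV_bdia_gh_kernel_M, and the buffer names, the problem size, and the CUDA_CHECK / CUBLAS_CHECK macros are assumptions made up for the illustration. A segmentation fault is a host-side crash, so I would expect the cause to be an invalid host pointer (for example a handle or pointer array entry that was never initialized) rather than the kernel itself; checking each return value should at least show which call is the last one to succeed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Abort with file/line information if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Abort with file/line information if a cuBLAS call fails.
#define CUBLAS_CHECK(call)                                            \
    do {                                                              \
        cublasStatus_t st_ = (call);                                  \
        if (st_ != CUBLAS_STATUS_SUCCESS) {                           \
            fprintf(stderr, "cuBLAS error %d at %s:%d\n",             \
                    (int)st_, __FILE__, __LINE__);                    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for spMV_bdia_gh_kernel_M: v[i] = 2 * x[i].
__global__ void scale_kernel(const float *x, float *v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = 2.0f * x[i];
}

// Runs copy + kernel on every visible GPU; returns the number of wrong results.
int run_on_all_devices(int n) {
    int n_gpu = 0;
    CUDA_CHECK(cudaGetDeviceCount(&n_gpu));

    std::vector<cublasHandle_t> handle(n_gpu);
    std::vector<float *> d_B(n_gpu), d_R(n_gpu), d_V(n_gpu);
    std::vector<float> h_B(n, 1.0f), h_V(n);
    size_t bytes = n * sizeof(float);

    // Per-device setup: the handle and buffers are created while that device is current.
    for (int dev = 0; dev < n_gpu; dev++) {
        CUDA_CHECK(cudaSetDevice(dev));
        CUBLAS_CHECK(cublasCreate(&handle[dev]));
        CUDA_CHECK(cudaMalloc(&d_B[dev], bytes));
        CUDA_CHECK(cudaMalloc(&d_R[dev], bytes));
        CUDA_CHECK(cudaMalloc(&d_V[dev], bytes));
        CUDA_CHECK(cudaMemcpy(d_B[dev], h_B.data(), bytes, cudaMemcpyHostToDevice));
    }

    // Same pattern as the failing loop, with every step checked.
    for (int dev = 0; dev < n_gpu; dev++) {
        CUDA_CHECK(cudaSetDevice(dev));
        CUBLAS_CHECK(cublasScopy(handle[dev], n, d_B[dev], 1, d_R[dev], 1));
        scale_kernel<<<(n + 255) / 256, 256>>>(d_R[dev], d_V[dev], n);
        CUDA_CHECK(cudaGetLastError());        // reports launch errors
        CUDA_CHECK(cudaDeviceSynchronize());   // reports execution errors
    }

    // Copy results back, verify, and release per-device resources.
    int mismatches = 0;
    for (int dev = 0; dev < n_gpu; dev++) {
        CUDA_CHECK(cudaSetDevice(dev));
        CUDA_CHECK(cudaMemcpy(h_V.data(), d_V[dev], bytes, cudaMemcpyDeviceToHost));
        for (int i = 0; i < n; i++)
            if (h_V[i] != 2.0f) mismatches++;
        CUBLAS_CHECK(cublasDestroy(handle[dev]));
        CUDA_CHECK(cudaFree(d_B[dev]));
        CUDA_CHECK(cudaFree(d_R[dev]));
        CUDA_CHECK(cudaFree(d_V[dev]));
    }
    return mismatches;
}

int main() {
    int mismatches = run_on_all_devices(1024);
    printf("mismatches: %d\n", mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

This should build with something like nvcc file.cu -lcublas. If the checked version of the real loop still crashes without printing an error, running it under a host debugger such as gdb would probably be the next step, since the macros only catch errors that CUDA and cuBLAS report themselves.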

How could this happen?

Thanks,

How can I delete this topic? I asked the wrong question.