I’m having some difficulty in getting cuBLAS kernels to run in parallel. I’m using sgemm and hgemm kernels at relatively small sized matrices (say 800x400). NVVP reports that the kernels are running serially and have a theoretical occupancy of 50%. No problem, I figure I’ll just use Streams to parallelize since I’ve several such matrices to compute. However, no matter what I try, the kernals that cuBLAS generates are serialized.
I figured a simple setup as below would get parallel execution, but alas it is serial both under Cuda 6.5 on Tegra K1 and 7.0 on Tegra X1.
// Get cuBLAS handle, etc.
check_error(cudaMallocHost(&source, 500*500*3*sizeof(float)));
cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);
// fill it with some data
// ...
// Space on the GPU.. Avoid complications on zero copy for now.
check_error(cudaMalloc(&firstMat, 500*500*3*sizeof(float)));
check_error(cudaMalloc(&secondMat, 500*500*3*sizeof(float)));
check_error(cudaMalloc(&result1, 500*500*3*sizeof(float)));
check_error(cudaMalloc(&result2, 500*500*3*sizeof(float)));
//Do some ASync MemCopies
check_error(cudaMemcpyAsync(firstMat, source, 500*500*3*sizeof(float), cudaMemcpyHostToDevice, stream1));
check_error(cudaMemcpyAsync(secondMat, source, 500*500*3*sizeof(float), cudaMemcpyHostToDevice, stream2));
for(i=0; i<20; i++) {
cublasSetStream(handle, stream1);
check_error(cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 500, 500, 500, &ALPHA, firstMat, 500, secondMat, 500, &BETA, result1, 500));
cublasSetStream(handle, stream2);
check_error(cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 500, 500, 500, &ALPHA, firstMat, 500, secondMat, 500, &BETA, result2, 500));
}
Instead of parallel execution, I get kernels running on different streams but serialized. I even tried a variation of this example using threads and the --default-stream per-thread option for streams, but I get serial execution also. Although in that case, it uses the default stream as well as new streams for some reason rather than solely using the stream per thread. :-(
I must be missing something obvious? Please help!