Concurrent kernels on Kepler

While this question is related to nvprof, that forum tends to be inactive, so I am posting the question here. (Mods: feel free to move it, but I think this is a more general question.)

While profiling a batch of cuBLAS kernels, each preceded by cublasSetStream, with nvprof (--print-gpu-trace), I am unsure how to interpret the profiling results.

I have been able to correctly test the simple Hyper-Q example and that ran correctly on my machine, but when I profile other applications based on the same idea the output looks different.

Essentially I have each stream execute a series of 5 cuBLAS calls, and the serial order within each stream ID must be maintained. It looks like this:

float t_c;
	for(int i=0;i<num_streams;i++){

		cur = cublasSetStream_v2(handle, streams[i]);
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSetStream returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		cur = cublasSgemv_v2(handle,CUBLAS_OP_T,sub_n,sub_m,&alpha,D_subA,sub_n,D_q+i*sub_n,1,&beta,D_temp+i*sub_m,1);//A*q into tempvecR
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSgemv returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		cur = cublasSgemv_v2(handle,CUBLAS_OP_T,N,N,&alpha,D_IUL,N,D_temp+i*sub_m,1,&beta,D_temp2+i*sub_m,1);//(inv(U)*inv(L))*tempvecR
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSgemv returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		cur = cublasSgemv_v2(handle,CUBLAS_OP_N,sub_n,sub_m,&alpha,D_subA,sub_n,D_temp2+i*sub_m,1,&beta,D_xresult+i*sub_n,1);//A'*temp into this stream's slice of D_xresult
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSgemv returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		t_c= -(1.0f/(_rho*_rho));
		cur = cublasSscal_v2(handle,sub_n,&t_c,D_xresult+i*sub_n,1);//scale this stream's slice by -1/rho^2
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSscal returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		t_c= 1.0f/_rho;
		cur = cublasSaxpy_v2(handle,sub_n,&t_c,D_q+i*sub_n,1,D_xresult+i*sub_n,1);//D_xresult = q/rho + D_xresult
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSaxpy returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}
	}

	err = cudaDeviceSynchronize();//wait for all streams to finish
	if(err!=cudaSuccess){printf("%s in %s at line %d\n",cudaGetErrorString(err),__FILE__,__LINE__);}

When I get the profile back from nvprof, it appears that all streams are being launched serially, with no overlap between streams.

When I use the --concurrent-kernels-off flag, the output of the same profile looks more like I would expect, with the streams with different IDs overlapping as I would expect.

So essentially I have two main questions related to this issue:

  1. Is there anything inherently incorrect with the above code sample if I want to get concurrency across independent streams?

  2. From the Windows command line, what are the appropriate nvprof flags to verify whether the work is all being launched serially?

In this case I have 14 streams, and I tested with CUDA_DEVICE_MAX_CONNECTIONS set to 8, 14, and 32. 14 did seem a bit faster, but they all finish in about the same time, ±10%.
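One detail worth noting: the variable is only read when the CUDA context is created, so it has to be set before the first CUDA runtime call. A hedged sketch of both routes (myapp.exe is a placeholder, and _putenv_s is the Windows CRT call; on Linux it would be setenv):

```cuda
// Hedged sketch: CUDA_DEVICE_MAX_CONNECTIONS is read at context creation,
// so it must be set before the first CUDA runtime call takes effect.
//
// From the Windows command prompt, before launching the profiler:
//   set CUDA_DEVICE_MAX_CONNECTIONS=14
//   nvprof --print-gpu-trace myapp.exe
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    // Programmatic alternative (Windows CRT; use setenv() on Linux):
    _putenv_s("CUDA_DEVICE_MAX_CONNECTIONS", "14");
    cudaFree(0); // forces context creation after the variable is set
    // ... rest of the application ...
    return 0;
}
```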

I was able to use nvvp to profile, and the above streams are not being launched concurrently. When I reduce the number of cuBLAS calls within each unique stream ID, I begin to see more concurrency.

For five cuBLAS calls per stream ID there is no overlap; for three there is a modest amount of overlap; for one there is a good amount of overlap.

So is it the size of the cuBLAS call(s) that affects the degree of concurrency, the number of calls within each stream ID, or both?

In this case I am performing a dense matrix-vector multiplication with matrix dimensions (1638, 1920); is that considered large?

Up front: I am not an expert on the use of concurrent kernels, my hands-on experience in that regard is very limited.

My understanding is that the support for concurrent kernels is designed to allow the GPU to be filled more effectively with small, independent kernels that by themselves cannot fill the GPU and would cause much of the GPU to idle if executed by themselves.

It follows that there is not much benefit from concurrent kernels if each kernel is large enough to keep the GPU completely or mostly occupied by itself; there is probably some incremental benefit from residual overlap during the ramp-up and ramp-down phases of larger kernels. Note that kernels in the same stream have an implied dependency so are not independent of each other.

I think this second point in particular explains most of your observation that with only one kernel launched per stream you see good overlap, and that the effective overlap decreases as the number of dependent kernels per stream increases. It is difficult to tell how much the CUBLAS calls for matrices of the stated dimension are able to fill the GPU, especially without knowing the specific GPU used. What occupancy is reported for these kernels?
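As a concrete illustration of that point (a toy example of mine, not the poster's code): many deliberately tiny, independent kernels in separate streams are the textbook case where concurrency helps, because each kernel by itself leaves most of the GPU idle:

```cuda
// Toy sketch: 14 tiny, mutually independent kernels, one per stream.
// Each launch uses a single 256-thread block, so on a Kepler-class GPU
// many of them can be resident simultaneously and should overlap in nvvp.
#include <cuda_runtime.h>

__global__ void small_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sqrtf(x[i]) + 1.0f;
}

int main(void)
{
    const int num_streams = 14, n = 256;
    cudaStream_t streams[14];
    float *d[14];
    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d[i], n * sizeof(float));
        small_kernel<<<1, n, 0, streams[i]>>>(d[i], n); // one small block each
    }
    cudaDeviceSynchronize(); // all streams join here
    for (int i = 0; i < num_streams; i++) {
        cudaFree(d[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

A cuBLAS Sgemv on a 1638x1920 matrix launches far more than one block, which is why the same overlap is harder to observe there.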


I was able to find a better way to get concurrency by splitting the work into multiple loops, like this:

float t_c;
	int i;
	for(i=0;i<num_streams;i++){
		cublasSetStream_v2(handle, streams[i]);
		cublasSgemv_v2(handle,CUBLAS_OP_T,sub_n,sub_m,&alpha,D_subA,sub_n,D_q+i*sub_n,1,&beta,D_temp+i*sub_m,1);//A*q into tempvecR
	}
	for(i=0;i<num_streams;i++){
		cublasSetStream_v2(handle, streams[i]);
		cublasSgemv_v2(handle,CUBLAS_OP_T,N,N,&alpha,D_IUL,N,D_temp+i*sub_m,1,&beta,D_temp2+i*sub_m,1);//(inv(U)*inv(L))* tempvecR into D_temp2
		cublasSgemv_v2(handle,CUBLAS_OP_N,sub_n,sub_m,&alpha,D_subA,sub_n,D_temp2+i*sub_m,1,&beta,D_xresult+i*sub_n,1);//A'*temp 'L' into d_xresults
	}
	cublasSetStream_v2(handle, 0);//back to the default stream
	t_c= -(1.0f/(_rho*_rho));
	cublasSscal_v2(handle,bigN,&t_c,D_xresult,1);//scale D_xresult by -1/rho^2
	t_c= 1.0f/_rho;
	cublasSaxpy_v2(handle,bigN,&t_c,D_q,1,D_xresult,1);//D_xresult = q/rho + D_xresult

The Sscal and Saxpy did not need to be broken up, so I was able to take those out of the stream loops.

The output from nvvp shows that the second loop was able to launch concurrent streams, with decent overlap between the streams of the two inner Sgemv calls. This method was about 35% faster than my first attempt, so trial and error paid off in this case.

The occupancy was not that high, but I will look into that further.


Here is a screenshot with most metrics, and I must say it ‘sounds’ bad.

This test executable only has cuBLAS calls, with no custom kernels. Given that, I have no idea how to improve those metrics.

link to screenshot:

I guess it may make sense to “zoom out” a little. Are the CUBLAS calls collectively implementing some custom processing step, or a well-known higher-level operation, such as a decomposition of sorts? I am eyeballing the code but have not been able to determine what operation it performs.

That code was a sub-section of a MATLAB-callable version of the ADMM group lasso (using dense single-precision matrices).

This will be one of the topics I will discuss at GTC with UCSF/UCSD, and this implementation has the additional feature of testing multiple lambda values in parallel.

My job was to translate it from MATLAB script to CUDA, and since the solver differs for ‘skinny’ and ‘fat’ matrices, I have been optimizing this line for the ‘fat’ version:

x = q/rho - (A'*(U \ ( L \ (A*q) )))/rho^2;

Too bad there is no backslash operator in CUDA, but it does currently work, and it is 11-21 times faster than the multi-core 3.9 GHz CPU MATLAB implementation.
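For what it's worth, the two backslashes can be expressed directly as cuBLAS triangular solves rather than a multiply by a precomputed inv(U)*inv(L). A hedged sketch, where D_L, D_U, and D_y are my placeholder names for the dense column-major N x N factors and the right-hand side (not identifiers from the code above):

```cuda
// Hedged sketch: z = U \ (L \ y) via two in-place triangular solves,
// replacing the multiply by the precomputed inv(U)*inv(L) matrix D_IUL.
#include <cublas_v2.h>

static void lu_backslash(cublasHandle_t handle, int N,
                         const float *D_L, const float *D_U, float *D_y)
{
    // t = L \ y : forward substitution on the lower triangular factor
    cublasStrsv_v2(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                   CUBLAS_DIAG_NON_UNIT, N, D_L, N, D_y, 1);
    // z = U \ t : back substitution on the upper triangular factor
    cublasStrsv_v2(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                   CUBLAS_DIAG_NON_UNIT, N, D_U, N, D_y, 1);
}
```

Whether this beats the single Sgemv against the precomputed inverse depends on N; trsv is a sequential-ish kernel and may overlap differently across streams.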

The multiple-lambda aspect was an idea from Tim Mullen at UCSD who will also be speaking at GTC.

Since I still have a month until GTC, I am profiling like a madman, and I will also be pushing a sparse version which operates on a block-diagonal matrix.

The biggest issue I have had is the MATLAB overhead of the mex call. The first ‘cold’ call of the solver adds 250-500 ms of overhead, but subsequent calls only add about 10 ms. There is also a great deal of casting, because MATLAB uses doubles for everything and column-major storage, while I used row-major (if anyone is going to use it, examine the README and the example GLtestMultiLambda.m file).

If I may ask, what drove the decision to use row-major storage? CUBLAS was designed for column-major storage for ease of interfacing to existing Fortran host codes, and popular software like MATLAB.

Since my background is in discrete math, that is how I visualize matrix operations (I am used to working with adjacency matrices for graphs, which are usually stored in row-major order).

cuBLAS allows the user to read the matrices in transposed form, and I made all the required adjustments to the BLAS parameters of the calls.

I will probably rewrite it, but for now it works correctly; the MATLAB mex caller just has to transpose the matrix, and internally I swap [m,n] to be correct.