Concurrent kernels on Kepler

While this question is related to nvprof, that forum tends to be inactive, so I am posting the question here. (Mods: feel free to move it, but I think this is a more general question.)

While profiling a batch of cuBLAS kernels, each preceded by cublasSetStream, with nvprof (--print-gpu-trace), I am unsure how to interpret the profiling results.

I have been able to correctly test the simple Hyper-Q example and that ran correctly on my machine, but when I profile other applications based on the same idea the output looks different.

Essentially I have each stream execute a series of 5 cuBLAS calls, and the serial order within each stream ID must be maintained. It looks like this:

float t_c;
	for(int i=0;i<num_streams;i++){

		cur = cublasSetStream_v2(handle, streams[i]);
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSetStream returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		cur = cublasSgemv_v2(handle,CUBLAS_OP_T,sub_n,sub_m,&alpha,D_subA,sub_n,D_q+i*sub_n,1,&beta,D_temp+i*sub_m,1);//A*q into tempvecR
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSgemv returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		cur = cublasSgemv_v2(handle,CUBLAS_OP_T,N,N,&alpha,D_IUL,N,D_temp+i*sub_m,1,&beta,D_temp2+i*sub_m,1);//(inv(U)*inv(L))*tempvecR
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSgemv returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		cur = cublasSgemv_v2(handle,CUBLAS_OP_N,sub_n,sub_m,&alpha,D_subA,sub_n,D_temp2+i*sub_m,1,&beta,D_xresult+i*sub_n,1);//A'*temp into this stream's slice of D_xresult
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSgemv returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		t_c= -(1.0f/(_rho*_rho));
		cur = cublasSscal_v2(handle,sub_n,&t_c,D_xresult+i*sub_n,1);//scale this stream's slice by -1/rho^2
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSscal returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		t_c= 1.0f/_rho;
		cur = cublasSaxpy_v2(handle,sub_n,&t_c,D_q+i*sub_n,1,D_xresult+i*sub_n,1);//D_xresult = q/rho + D_xresult
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasSaxpy returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}
	}

	err = cudaDeviceSynchronize();//wait for all streams to finish
	if(err!=cudaSuccess){printf("%s in %s at line %d\n",cudaGetErrorString(err),__FILE__,__LINE__);}

When I get the profile back from nvprof, it appears that all streams are being launched serially, with no overlap between streams.

When I use the --concurrent-kernels-off flag, the output of the same profile looks more like I would expect, with the streams with different IDs overlapping as I would expect.

So essentially I have two main questions related to this issue:

  1. Is there anything inherently incorrect with the above code sample if I want to get concurrency across independent streams?

  2. From the Windows command line, what are the appropriate nvprof flags to verify whether the work is all being launched serially?

In this case I have 14 streams, and I tested with CUDA_DEVICE_MAX_CONNECTIONS set to 8, 14, and 32. 14 did seem a bit faster, but they all finish in about the same time, ±10%.
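One detail worth noting: the variable is only read when the CUDA context is created, so it has to be set before the first CUDA runtime call. A hedged sketch of both routes (myapp.exe is a placeholder, and _putenv_s is the Windows CRT call; on Linux it would be setenv):

```cuda
// Hedged sketch: CUDA_DEVICE_MAX_CONNECTIONS is read at context creation,
// so it must be set before the first CUDA runtime call takes effect.
//
// From the Windows command prompt, before launching the profiler:
//   set CUDA_DEVICE_MAX_CONNECTIONS=14
//   nvprof --print-gpu-trace myapp.exe
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    // Programmatic alternative (Windows CRT; use setenv() on Linux):
    _putenv_s("CUDA_DEVICE_MAX_CONNECTIONS", "14");
    cudaFree(0); // forces context creation after the variable is set
    // ... rest of the application ...
    return 0;
}
```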

I was able to use nvvp to profile, and the above streams are not being launched concurrently. When I reduce the number of cuBLAS calls within each unique stream ID, I begin to see more concurrency.

For five cuBLAS calls per stream ID there is no overlap; for three there is a modest amount of overlap; for one there is a good amount of overlap.

So is it the size of the cuBLAS call(s) that affects the degree of concurrency, the number of calls within each stream ID, or both?

In this case I am performing a dense matrix-vector multiplication with matrix dimensions (1638, 1920); is that considered large?

Up front: I am not an expert on the use of concurrent kernels, my hands-on experience in that regard is very limited.

My understanding is that the support for concurrent kernels is designed to allow the GPU to be filled more effectively with small, independent kernels that by themselves cannot fill the GPU and would cause much of the GPU to idle if executed by themselves.

It follows that there is not much benefit from concurrent kernels if each kernel is large enough to keep the GPU completely or mostly occupied by itself; there is probably some incremental benefit from residual overlap during the ramp-up and ramp-down phases of larger kernels. Note that kernels in the same stream have an implied dependency so are not independent of each other.

I think this second point in particular explains most of your observation that with only one kernel launched per stream you see good overlap, and that the effective overlap decreases as the number of dependent kernels per stream increases. It is difficult to tell how much the CUBLAS calls for matrices of the stated dimension are able to fill the GPU, especially without knowing the specific GPU used. What occupancy is reported for these kernels?
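As a concrete illustration of that point (a toy example of mine, not the poster's code): many deliberately tiny, independent kernels in separate streams are the textbook case where concurrency helps, because each kernel by itself leaves most of the GPU idle:

```cuda
// Toy sketch: 14 tiny, mutually independent kernels, one per stream.
// Each launch uses a single 256-thread block, so on a Kepler-class GPU
// many of them can be resident simultaneously and should overlap in nvvp.
#include <cuda_runtime.h>

__global__ void small_kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = sqrtf(x[i]) + 1.0f;
}

int main(void)
{
    const int num_streams = 14, n = 256;
    cudaStream_t streams[14];
    float *d[14];
    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d[i], n * sizeof(float));
        small_kernel<<<1, n, 0, streams[i]>>>(d[i], n); // one small block each
    }
    cudaDeviceSynchronize(); // all streams join here
    for (int i = 0; i < num_streams; i++) {
        cudaFree(d[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}
```

A cuBLAS Sgemv on a 1638x1920 matrix launches far more than one block, which is why the same overlap is harder to observe there.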


I was able to find a better way to get concurrency by splitting the work into multiple loops, like this:

float t_c;
	int i;
	for(i=0;i<num_streams;i++){
		cublasSetStream_v2(handle, streams[i]);
		cublasSgemv_v2(handle,CUBLAS_OP_T,sub_n,sub_m,&alpha,D_subA,sub_n,D_q+i*sub_n,1,&beta,D_temp+i*sub_m,1);//A*q into tempvecR
	}
	for(i=0;i<num_streams;i++){
		cublasSetStream_v2(handle, streams[i]);
		cublasSgemv_v2(handle,CUBLAS_OP_T,N,N,&alpha,D_IUL,N,D_temp+i*sub_m,1,&beta,D_temp2+i*sub_m,1);//(inv(U)*inv(L))* tempvecR into D_temp2
		cublasSgemv_v2(handle,CUBLAS_OP_N,sub_n,sub_m,&alpha,D_subA,sub_n,D_temp2+i*sub_m,1,&beta,D_xresult+i*sub_n,1);//A'*temp 'L' into d_xresults
	}
	cublasSetStream_v2(handle, 0);//back to the default stream
	t_c= -(1.0f/(_rho*_rho));
	cublasSscal_v2(handle,bigN,&t_c,D_xresult,1);//scale D_xresult by -1/rho^2
	t_c= 1.0f/_rho;
	cublasSaxpy_v2(handle,bigN,&t_c,D_q,1,D_xresult,1);//D_xresult = q/rho + D_xresult

The Sscal and Saxpy did not need to be broken up, so I was able to take those out of the stream loops.

The output from nvvp shows that the second loop was able to launch concurrent streams, with decent overlap between the streams of the two inner Sgemv calls. This method was about 35% faster than my first attempt, so trial and error paid off in this case.

The occupancy was not that high, but I will look into that further.


Here is a screenshot with most metrics, and I must say it ‘sounds’ bad.

This test executable only has cuBLAS calls, with no custom kernels. Given that, I have no idea how to improve those metrics.

link to screenshot:

I guess it may make sense to “zoom out” a little. Are the CUBLAS calls collectively implementing some custom processing step, or a well-known higher-level operation, such as a decomposition of sorts? I am eyeballing the code but have not been able to determine what operation it performs.

That code was a sub-section of a MATLAB-callable version of the ADMM group lasso (using dense single-precision matrices).

This will be one of the topics I will discuss at GTC with UCSF/UCSD, and this implementation has the additional feature of testing multiple lambda values in parallel.

My job was to translate it from MATLAB script to CUDA, and since the solver differs for ‘skinny’ and ‘fat’ matrices, I have been optimizing this line for the ‘fat’ version:

x = q/rho - (A'*(U \ ( L \ (A*q) )))/rho^2;

Too bad there is no backslash operator in CUDA, but it does currently work, and it is 11-21 times faster than the multi-core 3.9 GHz CPU MATLAB implementation.
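For what it's worth, the two backslashes can be expressed directly as cuBLAS triangular solves rather than a multiply by a precomputed inv(U)*inv(L). A hedged sketch, where D_L, D_U, and D_y are my placeholder names for the dense column-major N x N factors and the right-hand side (not identifiers from the code above):

```cuda
// Hedged sketch: z = U \ (L \ y) via two in-place triangular solves,
// replacing the multiply by the precomputed inv(U)*inv(L) matrix D_IUL.
#include <cublas_v2.h>

static void lu_backslash(cublasHandle_t handle, int N,
                         const float *D_L, const float *D_U, float *D_y)
{
    // t = L \ y : forward substitution on the lower triangular factor
    cublasStrsv_v2(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                   CUBLAS_DIAG_NON_UNIT, N, D_L, N, D_y, 1);
    // z = U \ t : back substitution on the upper triangular factor
    cublasStrsv_v2(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                   CUBLAS_DIAG_NON_UNIT, N, D_U, N, D_y, 1);
}
```

Whether this beats the single Sgemv against the precomputed inverse depends on N; trsv is a sequential-ish kernel and may overlap differently across streams.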

The multiple-lambda aspect was an idea from Tim Mullen at UCSD who will also be speaking at GTC.

Since I still have a month until GTC, I am profiling like a madman, and I will also be pushing a sparse version which operates on a block-diagonal matrix.

The biggest issue I have had is the MATLAB overhead of the mex call. The first ‘cold’ call of the solver adds 250-500 ms of overhead, but subsequent calls only add about 10 ms. There is also a great deal of casting, because MATLAB uses doubles for everything and column-major storage, while I used row-major (if anyone is going to use it, examine the README and the example GLtestMultiLambda.m file).

If I may ask, what drove the decision to use row-major storage? CUBLAS was designed for column-major storage for ease of interfacing to existing Fortran host codes, and popular software like MATLAB.

Since my background is in discrete math, that is how I visualize matrix operations (I am used to working with adjacency matrices for graphs, which are usually stored in row-major order).

cuBLAS allows the user to read the matrices in transposed form, and I made all the required adjustments to the BLAS parameters of the calls.

I will probably rewrite it, but for now it works correctly; the MATLAB mex caller just has to transpose the matrix, and internally I swap [m,n] to be correct.