Concurrent kernels on Kepler

CudaaduC · February 23, 2014, 3:30am

While this question is related to nvprof, that forum tends to be not active, so I am posting this question here(Mods you can move it if you want, but I think this is a more general question).

While using nvprof (–print-gpu-trace) while profiling a batch of cublas kernels prefaced by cublasSetStream, I am unsure how to interpret the profiling results.

I have been able to correctly test the simple Hyper-Q example and that ran correctly on my machine, but when I profile other applications based on the same idea the output looks different.

Essentially I have each stream execute a series of 5 cuBLAS calls, with the serial nature within that stream ID needing to be maintained. It looks like this:

float t_c;
	for(int i=0;i<num_streams;i++){

		cur=cublasSetStream(handle,streams[i]);
		if(cur != CUBLAS_STATUS_SUCCESS){printf("error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}
	
		cur=cublasSgemv_v2(handle,CUBLAS_OP_T,sub_n,sub_m,&alpha,D_subA,sub_n,D_q+i*sub_n,1,&beta,D_temp+i*sub_m,1);
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasCreate returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}	

		cur=cublasSgemv_v2(handle,CUBLAS_OP_T,N,N,&alpha,D_IUL,N,D_temp+i*sub_m,1,&beta,D_temp2+i*sub_m,1);
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasCreate returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}	

		cur=cublasSgemv_v2(handle,CUBLAS_OP_N,sub_n,sub_m,&alpha,D_subA,sub_n,D_temp2+i*sub_m,1,&beta,D_xresult+i*sub_n,1);
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasCreate returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		t_c= -(1.0f/(_rho*_rho));
		cur=cublasSscal_v2(handle,sub_n,&t_c,D_xresult+i*sub_n,1);
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasCreate returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}

		t_c=(1.0f/_rho);
		cur=cublasSaxpy_v2(handle,sub_n,&t_c,D_q+i*sub_n,1,D_xresult+i*sub_n,1);
		if(cur != CUBLAS_STATUS_SUCCESS){printf("cublasCreate returned error code %d, line(%d)\n", cur, __LINE__);exit(EXIT_FAILURE);}		

	}

	err=cudaMemcpy(x_result,D_xresult,bigN*sizeof(float),_DTH);
	if(err!=cudaSuccess){printf("%s in %s at line %d\n",cudaGetErrorString(err),__FILE__,__LINE__);}

When I get back the profile from nvprof it appears that all streams are being launched in serial, with no overlapping of streams.

When I use the --concurrent-kernels-off flag the output of the same profile looks more like I would expect, with the streams with different IDs overlapping like I would expect.

So essentially I have two main questions related to this issue:

Is there anything inherently incorrect with the above code sample if I want to get concurrency in not-dependent streams?
From the Windows command line, what are the appropriate flags for nvprof to verify that all the work is not being launched serially?

In this case I have 14 streams, and tested with CUDA_DEVICE_MAX_CONNECTIONS set to 8,14 and 32. It did seem that 14 was a bit faster, but they all seem to finish in about the same time ± 10%

CudaaduC · February 23, 2014, 4:17am

Was able to use nvvp to profile, and the above streams are not being launched concurrently. When I reduce the number of cublas calls within the unique stream ID, then I begin to see more concurrency.

For five cublas calls per stream ID, there is no overlap, for three there is a modest amount of overlap, for 1 there is a good amount of overlap.

So it is the size of the cublas call(s) that affect the degree of concurrency, the number of calls withing each stream ID, or both?

In this case I am performing a dense matrix-vector multiplication with the matrix dimensions (1638,1920), is that considered large?

njuffa · February 23, 2014, 4:57am

Up front: I am not an expert on the use of concurrent kernels, my hands-on experience in that regard is very limited.

My understanding is that the support for concurrent kernels is designed to allow the GPU to be filled more effectively with small, independent kernels that by themselves cannot fill the GPU and would cause much of the GPU to idle if executed by themselves.

It follows that there is not much benefit from concurrent kernels if each kernel is large enough to keep the GPU completely or mostly occupied by itself; there is probably some incremental benefit from residual overlap during the ramp-up and ramp-down phases of larger kernels. Note that kernels in the same stream have an implied dependency so are not independent of each other.

I think this second point in particular explains most of your observation that with only one kernel launched per stream, you are observing good overlap, and that the effective overlap descreases as the number of dependent kernels per stream increases. It is difficult to tell how much the CUBLAS calls for matrices of the stated dimension are able to fill the GPU, especially without knowing the specific GPU used. What occupancy is reported for these kernels?

CudaaduC · February 23, 2014, 5:23am

njuffa,

I was able to find a better way to get concurrency by having multiple loops like this:

float t_c;
	int i;
	
	for(i=0;i<num_streams;i++){
		cublasSetStream(handle,streams[i]);
		cublasSgemv_v2(handle,CUBLAS_OP_T,sub_n,sub_m,&alpha,D_subA,sub_n,D_q+i*sub_n,1,&beta,D_temp+i*sub_m,1);//A*q into tempvecR
	}
	
	for(i=0;i<num_streams;i++){
		cublasSetStream(handle,streams[i]);
		cublasSgemv_v2(handle,CUBLAS_OP_T,N,N,&alpha,D_IUL,N,D_temp+i*sub_m,1,&beta,D_temp2+i*sub_m,1);//(inv(U)*inv(L))* tempvecR into D_xresult
		cublasSgemv_v2(handle,CUBLAS_OP_N,sub_n,sub_m,&alpha,D_subA,sub_n,D_temp2+i*sub_m,1,&beta,D_xresult+i*sub_n,1);//A'*temp 'L' into d_xresults
	}
	t_c= -(1.0f/(_rho*_rho));
	cublasSscal_v2(handle,bigN,&t_c,D_xresult,1);//scale D_xresult by -1/rho^2
	t_c=(1.0f/_rho);	
	cublasSaxpy_v2(handle,bigN,&t_c,D_q,1,D_xresult,1);// D_xresult=q/rho+ D_xresult*/o^2

The Sscal and Saxpy did not need to been broken up, so I was able to take those out of the stream loops.

The output from nvvp shows that the second loop(starting on line 9) was able to launch concurrent streams with decent overlap of between the streams of the two inner Sgemv blocks. This method was about 35% faster than my first attempt, so trial and error paid off in this case.

The occupancy was not that high, but will look into that further.

Thanks.

CudaaduC · February 23, 2014, 6:15am

Here is a screenshot with most metrics, and I must say it ‘sounds’ bad.

This test executable only has cuBLAS calls, with no custom kernels. Given that, I have no idea how to improve those metrics.

link to screenshot:

[url]http://i.imgur.com/7b8CASo.png[/url]

njuffa · February 23, 2014, 7:59am

I guess it may make sense to “zoom out” a little. Are the CUBLAS calls collectively implementing some custom processing step, or a well-known higher level operation, such as a decomposition of sorts? I am eyeballing the code but haven’t been able to determine what operation the code performs.

CudaaduC · February 23, 2014, 9:06pm

That code was a sub-section of a MATLAB callable version of the ADMM group lasso(using dense matrices single precision).

This will be one of the topics I will discuss at GTC with UCSF/UCSD, and this implementation has the additional feature of testing multiple lambda values in parallel;

https://github.com/OlegKonings/GLwithmex

My job was to translate it from MATLAB script to CUDA, and since the solver is different for ‘skinny’ or ‘fat’ matrices, I have been optimizing this line for the ‘fat’ version;

x = q/rho - (A'*(U \ ( L \ (A*q) )))/rho^2;

Too bad there is no backslash in CUDA, but it does currently work, and is 11-21 times faster than the multi-core CPU 3.9 Ghz MATLAB implementation.

The multiple-lambda aspect was an idea from Tim Mullen at UCSD who will also be speaking at GTC.

Since I still have a month until GTC, I am profiling like a madman, and also will be pushing a sparse version which operates on a block-diagonal Matrix.

The biggest issue I have had was the MATLAB overhead of the mex call. The first ‘cold’ call of the solver adds 250-500 ms of overhead, but the subsequent calls only add about 10ms. Also there is a great deal of casting because MATLAB uses doubles for everything and also uses column-major, while I used row-major(if anyone is going to use examine the README and the example GLtestMultiLambda.m file).

njuffa · February 23, 2014, 10:50pm

If I may ask, what drove the decision to use row-major storage? CUBLAS was designed for column-major storage for ease of interfacing to existing Fortran host codes, and popular software like MATLAB.

CudaaduC · February 23, 2014, 10:58pm

Since my background is in discrete math, that is how I visualize matrix operations(used to working with adjacency matrices with graphs, which are usually in row-major).

cuBLAS allows the user to read the matrices in transpose , and I made all the required adjustments to the blas parameters for the calls.

Probably will re-write, but for now it works correctly and the MATLAB mex caller just has to transpose the Matrix and I internally swap [m,n] to be correct.

Topic		Replies	Views
cuBLAS kernels always run serially despite streams and AsyncMemCpy?!? CUDA Programming and Performance	17	6020	September 30, 2015
How to concurrent cublas-sgemm by stream? CUDA Programming and Performance	3	6420	February 24, 2017
Kernel launch concurrency CUDA Programming and Performance	10	1950	December 11, 2014
Newbie question about cublas CUDA Programming and Performance	10	3471	December 2, 2010
Concurrent Kernels with CUSPARSE Library CUDA Programming and Performance	4	921	October 22, 2015
No performance improvement using CUDA stream DRIVE AGX Xavier General driveos-cuda	11	1752	March 22, 2022
Using cudaStream, but nvprof shows sequential launch CUDA Programming and Performance	5	1103	June 18, 2016
Profiling CUDA Programming and Performance	2	883	August 17, 2015
Cuda Streams for Concurrent Kernel Calls CUDA Programming and Performance	1	2319	October 26, 2016
Cannot see concurrent kenrel execution by stream CUDA Programming and Performance	2	590	November 16, 2017

Concurrent kernels on Kepler

Related topics