Concurrent DAXPY Operations using cuBLAS


I am currently working on a project using the cublasDaxpy function. In my project I have many separate vectors that each need a DAXPY operation applied, and I would like to run as many of those DAXPY calls concurrently as possible. I originally tried an approach where the host-side code set a different stream on the handle before each call. However, I am not achieving any concurrency.

// Host side

      for (int i = 0; i < VectorCount; i++)
      {
          cudaStream_t stream_id;
          cudaStreamCreate(&stream_id);          // the stream must actually be created before use

          cublasSetStream(handle, stream_id);    // route the next cuBLAS call to this stream
          cublasSafeCall(cublasDaxpy(handle, ...));
      }
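For reference, here is a minimal self-contained sketch of what I'm attempting (the names `d_x`, `d_y`, `alphas`, and `VectorCount` are placeholders for illustration, not my actual code); each stream is created up front and destroyed after a final synchronize:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Sketch only: d_x[i]/d_y[i] are device pointers to the i-th pair of
// vectors, alphas[i] is the host-side scalar for that pair.
void multiStreamDaxpy(cublasHandle_t handle,
                      double **d_x, double **d_y, const double *alphas,
                      int n, int VectorCount)
{
    std::vector<cudaStream_t> streams(VectorCount);
    for (int i = 0; i < VectorCount; i++)
        cudaStreamCreate(&streams[i]);           // create streams before use

    for (int i = 0; i < VectorCount; i++) {
        cublasSetStream(handle, streams[i]);     // subsequent call runs on stream i
        cublasDaxpy(handle, n, &alphas[i], d_x[i], 1, d_y[i], 1);
    }

    cudaDeviceSynchronize();                     // wait for all streams to finish
    for (int i = 0; i < VectorCount; i++)
        cudaStreamDestroy(streams[i]);
}
```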

I was wondering if I should choose a different design direction or implementation to achieve higher levels of concurrency or frankly if anyone could point me in the right direction for ideas.

How long are the vectors you are passing to DAXPY? You won’t see concurrent execution of multiple DAXPYs as long as each DAXPY is able to fully utilize the GPU resources by itself.

In the throughput-oriented processing model of the GPU, it doesn’t make sense to run kernels concurrently if the individual kernels are able to “fill” the GPU. The throughput would not go up, and it could even decrease due to the need to constantly re-schedule resources between kernels.

The vectors I’m using DAXPY on are relatively small: hundreds of elements to tens of thousands of elements. I think 10 thousand is a safe upper bound.

I would expect (not verified) a DAXPY called with vectors of ten thousand elements to be able to completely fill the GPU.

My own experience with CUBLAS is that concurrency with streams is often hard to witness.

Some of the reasons for this are discussed elsewhere, and revolve around the usual difficulties of concurrent kernels:

  • kernels that have low enough resource utilization (e.g. data set size) that theoretically could be candidates for concurrency often have short enough runtimes that concurrency is not practical due to things like kernel launch overheads.

  • kernels that are long enough to work past the latency of kernel launch generally have other resource utilization (eg. threads per SM) that prevent concurrency from first principles.

A very small matrix multiply kernel might be long enough to avoid the first issue but not so large as to run into the second issue. But something like axpy is pretty much a worst case when trying to navigate these competing constraints.

If you have a high level of motivation to pursue this, the most productive path might be to write your own batch axpy kernel. I normally don’t suggest that folks attempt to write their own versions of library functions, especially things like sort, prefix sum, or matrix multiply, but an axpy kernel is not hard even for novice programmers to get “correct” and with good efficiency.

Thanks, Robert

I will pursue the batch axpy idea. One point of clarification though, when you say batched axpy do you mean the following?

Create a daxpy kernel and call it from a different stream on the host side for each vector.


Create a kernel that launches sub-kernels, each of which is a daxpy kernel.

or am I going in the completely wrong direction :)

Thanks in advance!

I mean something similar to batched cublas operations.

Create a kernel that takes 2 arrays of pointers to x and y vectors, plus an array of constants a.

That kernel should then process that data in an expeditious, efficient fashion. I’m not sure I would use the terminology “sub kernels”.

I guess my approach would be to use a grid-stride loop. Each thread figures out which element it corresponds to, then figures out the appropriate x and y pointers to use, then calculates its offset within that vector, then does the vector add accordingly. To make this really efficient you might need to start out with prefix sum.
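Roughly what I have in mind is something like the following untested sketch. Here `offsets` is assumed to be a prefix sum of the vector lengths (`offsets[0] = 0`, `offsets[k]` = combined length of vectors 0..k-1, `offsets[numVectors]` = total element count), and each thread uses a binary search over it to find which vector its global index falls in:

```cuda
#include <cuda_runtime.h>

// Untested sketch of a batched daxpy using a grid-stride loop.
// x[v] and y[v] are device pointers to vector v; a[v] is its scalar.
__global__ void batchedDaxpy(double * const *x, double * const *y,
                             const double *a, const int *offsets,
                             int numVectors)
{
    int total = offsets[numVectors];                  // total elements across all vectors
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < total;
         i += gridDim.x * blockDim.x)                 // grid-stride loop
    {
        // binary search: largest v such that offsets[v] <= i
        int lo = 0, hi = numVectors - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (offsets[mid] <= i) lo = mid; else hi = mid - 1;
        }
        int v   = lo;                                 // which vector this element belongs to
        int off = i - offsets[v];                     // offset within that vector
        y[v][off] += a[v] * x[v][off];
    }
}
```

The launch geometry can then be sized once for the whole batch rather than per vector, which is the point of the exercise.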

I probably should have also said that I would not go down this path unless the profiler indicated this activity is the bottleneck in your application. Don’t do this on a hunch or intuition.

Thanks, I managed to achieve some concurrency in the profiler! My Daxpy kernel is fairly lightweight; there are probably some improvements I can still make.

// Device code
__global__ void Daxpy(double *VectorX, double *VectorY, double a, int VecLength)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < VecLength)
        VectorY[idx] += VectorX[idx] * a;
}
// Host code
for (int i = 0; i < Number_of_Rows; i++)
{
    // the stream is the 4th launch parameter; the 3rd is dynamic shared memory
    Daxpy<<<blocks_per_grid, threads_per_block, 0, impl_->streams[i].get()>>>(...);
}

This was a bottleneck in my application, so any concurrency is a nice gain :)