Problem with cublasDger() function

cublasDger() is performing slower than the sequential code.
Below is the code for the CUDA version:

static void outer(const real_t *aBegin, const real_t *aEnd, real_t *M, const real_t *b, const real_t *bEnd)
{
        const real_t *mStart = M;
        const int s_a = distance(aBegin, aEnd);
        const int s_b = distance(b, bEnd);

        real_t *d_m, *d_a, *d_b;
        const int size = sizeof(real_t);
        cudaError_t error;

        error = cudaMalloc((void **) &d_m, size*s_a*s_b);
        error = cudaMalloc((void **) &d_a, size*s_a);
        error = cudaMalloc((void **) &d_b, size*s_b);

        cublasSetMatrix(s_a, s_b, size, M, s_a, d_m, s_a);
        cublasSetVector(s_a, size, aBegin, 1, d_a, 1);
        cublasSetVector(s_b, size, b, 1, d_b, 1);

        cublasHandle_t handle;
        cublasStatus_t ret;
        const double alf = 1.0;
        const double *alpha = &alf;
        ret = cublasCreate(&handle);
        ret = cublasDger(handle, s_a, s_b, alpha, d_a, 1, d_b, 1, d_m, s_a);

        cublasGetMatrix(s_a, s_b, size, d_m, s_a, M, s_a);

        cublasDestroy(handle);
        cudaFree(d_b);
        cudaFree(d_a);
        cudaFree(d_m);

        matrixOps += M - mStart;
}

We are not able to understand why the code is running slower than the sequential version. The function is called many times in our code. Is there any way to find which part of the code is creating the problem?


cublasDger has low compute intensity per matrix element, and is largely a memory-bound operation.

The structure of your posted code suggests that this is the only operation you are doing on the GPU: you transfer the data to the GPU, perform the Dger, then transfer the data back to the host.

If that is the case, this operational sequence (transfer data to GPU, perform a single, low-compute-intensity memory bound operation, transfer data back to host) is unlikely to witness much benefit from GPU acceleration.

You’re more likely to witness a benefit from the GPU if you can transfer more of your overall algorithm to the GPU, and avoid the data transfers back and forth at every step.

Furthermore, the cudaMalloc and cudaFree operations are “expensive”, and they should be avoided in time-critical code/code loops. Perform those once, at the beginning of your application, and re-use the allocations as much as possible.
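As a sketch of what hoisting those operations looks like (the names outer_init, outer_shutdown, max_a, and max_b are hypothetical, and error checking is omitted for brevity):

```cuda
// Sketch only: create the cuBLAS handle and allocate device buffers once,
// sized for the largest problem you expect, then reuse them on every call
// instead of allocating/freeing inside outer().
static cublasHandle_t handle;
static real_t *d_m, *d_a, *d_b;

static void outer_init(int max_a, int max_b)
{
        cublasCreate(&handle);
        cudaMalloc((void **) &d_m, sizeof(real_t)*max_a*max_b);
        cudaMalloc((void **) &d_a, sizeof(real_t)*max_a);
        cudaMalloc((void **) &d_b, sizeof(real_t)*max_b);
}

static void outer_shutdown(void)
{
        cudaFree(d_b);
        cudaFree(d_a);
        cudaFree(d_m);
        cublasDestroy(handle);
}
```

The cublasCreate call in particular can be quite expensive the first time it is invoked, so it definitely does not belong inside a function that is called many times.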

Thanks for the reply. Actually, that is the only function we are running on the GPU.

Can we use unified memory for better performance? If so, how much speedup can we usually expect?

You’re not likely to witness much benefit with unified memory. It may help somewhat in the area of allocation of memory, but it won’t help at all with the data transfers - the data must still be moved to the GPU even with UM, and the results must still be moved back to the host.

Presumably your entire application does something beyond just rank-1 updates. The best suggestion is to try to figure out how to move more work to the GPU.
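For example, if the surrounding algorithm accumulates many rank-1 updates into the same matrix, one option worth profiling is to batch them: k rank-1 updates sum to a single matrix product, M += A*B^T with A of size s_a x k and B of size s_b x k, so one cublasDgemm call can replace k cublasDger calls and amortize the transfers. A sketch, with d_A, d_B, and k purely illustrative:

```cuda
// Sketch: replace k consecutive rank-1 updates with one GEMM.
// d_A holds the k "a" vectors as columns (s_a x k, column-major),
// d_B holds the k "b" vectors as columns (s_b x k), and d_m is the
// s_a x s_b matrix on the device. Error checking omitted.
const double one = 1.0;
cublasDgemm(handle,
            CUBLAS_OP_N, CUBLAS_OP_T,   // M += A * B^T
            s_a, s_b, k,
            &one,
            d_A, s_a,                   // A: s_a x k
            d_B, s_b,                   // B: s_b x k, transposed in the product
            &one,                       // beta = 1 accumulates into M
            d_m, s_a);
```

GEMM has much higher arithmetic intensity than Dger, so this kind of restructuring is where GPUs typically start to pay off.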