Givens rotation with cuBLAS

Hello,

I am new to CUDA and I tried to implement a Givens rotation with the cuBLAS library.

I used

cublasStatus_t cublasSrotg(cublasHandle_t handle,
float *a, float *b,
float *c, float *s)

for calculating c and s and

cublasStatus_t cublasSrot (cublasHandle_t handle, int n,
float *x, int incx,
float *y, int incy,
const float *c, const float *s)

for the rotation of a 1000 x 1000 matrix. Because cublasSrotg() overwrites its parameters a and b with r and z, I cannot pass it pointers to the matrix elements in device memory directly. Instead I copy the elements to the host, but that copy takes about 90 percent of each loop iteration. An implementation in plain C is up to three times faster.

Am I using it wrong? What would be a better alternative for Givens rotations on CUDA devices?