cublasSasum_v2 fails when the output is in device memory

The problem I have been stuck on for several days is this: cublasSasum_v2 works fine when the output (dresults) is in host memory (see the commented-out line “float *dresults = new float[C];” in the code), but it throws an error when the output is in device memory (allocated with “float *dresults = nullptr; CHECK(cudaMalloc((void **) &dresults, C * sizeof(float)));”). Does anybody have an idea how to fix this?

    int size = H * W, i;
    cublasStatus_t status;
    cublasHandle_t handle[C];
    for (i = 0; i < C; ++i) {
        status = cublasCreate_v2(&handle[i]);
        if (status != CUBLAS_STATUS_SUCCESS) {
            std::cout << "#" << i << ", CUBLAS initialization error:" << status << std::endl;
            abort();
        }
        cublasSetStream_v2(handle[i], stream);
    }
//    float *dresults = new float[C];
    float *dresults = nullptr;
    CHECK(cudaMalloc((void **) &dresults, C * sizeof(float)));
    for (i = 0; i < C; ++i) {
        status = cublasSasum_v2(handle[i], size, ((float *) gpu_buffer_in) + size * i, 1, &dresults[i]);
        if (status != CUBLAS_STATUS_SUCCESS) {
            std::cout << "num:" << i << ", Cublas failure: " << status << std::endl;
            abort();
        }
    }
    CHECK(cudaFree(dresults));
    for (i = 0; i < C; ++i) {
        cublasDestroy_v2(handle[i]);
    }

Hi,

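Please check the documentation of cublasSasum_v2 for details.

If a function expects a CPU (host) buffer, please don’t pass a GPU buffer pointer to it. In the default pointer mode (CUBLAS_POINTER_MODE_HOST), cublasSasum_v2 expects the result pointer to be in host memory.
A segmentation fault will occur if the CPU tries to read or write GPU memory directly.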

Thanks.

Thank you for your reply, Aastall.
I have solved this problem by setting the pointer mode to device.
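
For anyone who finds this thread later, the pointer-mode fix can be sketched roughly as below. This is a minimal sketch, not the exact code from my project: the helper name asum_to_device and its parameters are made up for illustration, and it assumes d_in holds channels contiguous slices of size floats each. The calls cublasGetPointerMode, cublasSetPointerMode and cublasSasum are the standard ones from cublas_v2.h.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper: for each of `channels` contiguous slices of length
// `size` in d_in, writes the sum of absolute values into d_results[i].
// Both d_in and d_results must point to device memory.
cublasStatus_t asum_to_device(cublasHandle_t handle, const float *d_in,
                              int size, int channels, float *d_results) {
    // Remember the caller's pointer mode so it can be restored afterwards.
    cublasPointerMode_t old_mode;
    cublasStatus_t status = cublasGetPointerMode(handle, &old_mode);
    if (status != CUBLAS_STATUS_SUCCESS) return status;

    // Tell cuBLAS that scalar results are passed as device pointers.
    status = cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    if (status != CUBLAS_STATUS_SUCCESS) return status;

    for (int i = 0; i < channels && status == CUBLAS_STATUS_SUCCESS; ++i) {
        status = cublasSasum(handle, size, d_in + (size_t)size * i, 1,
                             d_results + i);
    }

    // Restore the previous mode even if one of the calls failed.
    cublasSetPointerMode(handle, old_mode);
    return status;
}
```

Note that a single handle is enough for all C calls here; creating one handle per channel, as in my first post, should not be necessary.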
But I ran into another problem. There seems to be no function that computes the signed sum of the elements in a vector; cublasSasum_v2 only computes the sum of the absolute values of the elements.

Hi,

You can check whether a suitable function is available in cuBLAS in the documentation here:
cuBLAS :: CUDA Toolkit Documentation
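
One option that stays within cuBLAS is to take the dot product of the vector with a vector of ones, since sum(x) = dot(x, ones). The sketch below illustrates the idea; the helper name signed_sum is hypothetical, and it assumes d_x points to n floats in device memory while the result goes to a host pointer.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: writes the signed sum of the n floats at d_x
// (device memory) into *h_sum (host memory), using sum(x) = dot(x, ones).
cublasStatus_t signed_sum(cublasHandle_t handle, const float *d_x, int n,
                          float *h_sum) {
    // Build a device vector of n ones.
    std::vector<float> h_ones(n, 1.0f);
    float *d_ones = nullptr;
    if (cudaMalloc((void **)&d_ones, n * sizeof(float)) != cudaSuccess)
        return CUBLAS_STATUS_ALLOC_FAILED;
    if (cudaMemcpy(d_ones, h_ones.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(d_ones);
        return CUBLAS_STATUS_INTERNAL_ERROR;
    }

    // The result is a host pointer, so make sure host pointer mode is
    // active for this call, then restore whatever the caller had set.
    cublasPointerMode_t old_mode;
    cublasStatus_t status = cublasGetPointerMode(handle, &old_mode);
    if (status == CUBLAS_STATUS_SUCCESS)
        status = cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
    if (status == CUBLAS_STATUS_SUCCESS) {
        status = cublasSdot(handle, n, d_x, 1, d_ones, 1, h_sum);
        cublasSetPointerMode(handle, old_mode);
    }

    cudaFree(d_ones);
    return status;
}
```

In real code the ones vector would be allocated once and reused rather than rebuilt on every call.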

Alternatively, you can implement it yourself with CUDA.
Some of the CUDA samples deal with complex numbers and may give you some hints.
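
As a rough illustration of the hand-written route, a per-channel signed sum could look like the sketch below. It assumes the same layout as in the original post (channels contiguous slices of size floats each); the names channel_sum_kernel and channel_sums are made up for this example, and the kernel is a plain shared-memory tree reduction, not a tuned implementation.

```cuda
#include <cuda_runtime.h>

// One block per channel: each block sums `size` contiguous floats and
// writes the signed result to out[blockIdx.x]. BLOCK must be a power of two.
template <int BLOCK>
__global__ void channel_sum_kernel(const float *in, int size, float *out) {
    __shared__ float partial[BLOCK];
    const float *x = in + (size_t)blockIdx.x * size;

    // Each thread accumulates a strided subset of its channel.
    float acc = 0.0f;
    for (int i = threadIdx.x; i < size; i += BLOCK) acc += x[i];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = partial[0];
}

// Hypothetical launcher: d_in and d_out are device pointers; d_out must
// have room for `channels` floats. Runs asynchronously on `stream`.
cudaError_t channel_sums(const float *d_in, int size, int channels,
                         float *d_out, cudaStream_t stream) {
    if (channels <= 0) return cudaSuccess;  // nothing to launch
    constexpr int kBlock = 256;
    channel_sum_kernel<kBlock><<<channels, kBlock, 0, stream>>>(d_in, size, d_out);
    return cudaGetLastError();
}
```

Because the results stay in device memory, no pointer-mode setting is involved, and all channels are handled by a single kernel launch.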

Thanks.