I want to use cuBLAS's cublasSdot to compute the dot product of two float buffers (CUDA 9.0).

The documentation gives this signature for cublasSdot:

    cublasStatus_t cublasSdot(cublasHandle_t handle, int n, const float *x, int incx, const float *y, int incy, float *result)

As I understand it, this means result can be either a host float* or a device float*.

This is my code:

    cublasHandle_t handle_cosw;
    float *out_p;
    float *cos_w;
    float *result_cos;
    cudaMalloc((void**)&out_p, batch_update * inSize * sizeof(float));
    cudaMalloc((void**)&cos_w, inSize * sizeof(float));
    cudaMalloc((void**)&result_cos, batch_update * inSize * sizeof(float));
    //result_cos = (float*)malloc(batch_update * inSize * sizeof(float));
    cudaMemcpy(out_p, (float*)OutFeatures0, batch_update * inSize * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(cos_w, (float*)data_cos, inSize * sizeof(float), cudaMemcpyHostToDevice);
    for (int i = 0; i < batch_update; i++)
        cublasSdot(handle_cosw, inSize, out_p, 1, cos_w, 1, result_cos);

With this code, the program crashes with a segmentation fault when I run it.

But when I use the commented-out line instead (allocating result_cos on the host with malloc rather than with cudaMalloc), there is no error.

Am I misunderstanding the documentation, or is this a bug in the cuBLAS library itself?