Can you call a cuBLAS function repeatedly and accumulate the results in a loop?

Hello Mat. As shown in the code below, I need to do a number of matrix multiplications and accumulate their results:


#pragma acc parallel loop reduction(+:output[0:BF_num * NFFT])
    for (i = 0; i < size; i++)
    {
        //Matrix generation
    #pragma acc loop independent present(Part_w,Input_w)
        for (j = 0; j < Block_BF_Array; j++)
            Part_w[i] = Input_w[i * Block_BF_Array + j];

        //Matrix multiplication
    #pragma acc host_data use_device(Part_w,Input_data,tmp)
        {
            cublasZgemm(Handle_gem, CUBLAS_OP_N, CUBLAS_OP_N, NFFT, BF_num, M_Array, &alpha, Input_data, NFFT, Part_w, M_Array, &beta, tmp, NFFT); //180*NFFT
        }
        //Sum and add
        for (j = 0; j < BF_num * NFFT; j++)
        {
            output[j] = output[j] + pow(my_abs(tmp[i]), 2);
        }
    }

I have only implemented a rough version based on my own understanding, and I have some questions.

Can I work directly with parts of variables that are already on the GPU? Specifically:

1. Input_w is a large array. After I copy it to the GPU, each iteration I only want to use part of it in the calculation. The only way I can think of is to copy that part into another variable with OpenACC. Is there a better way?

2. I call the cuBLAS function repeatedly in the loop but always use the same handle. Is that OK?

I see two big issues here.

First, you can’t call cuBLAS routines from device code. Second, “host_data” defines a region where a variable’s device address is used on the host, so “host_data” can’t be used in device code.

To make this work, you’ll need to remove the “parallel loop” from the “i” loop so it’s run from the host, and then offload the two “j” loops. The host_data section can remain the same. Something like:

    for (i = 0; i < size; i++)
    {
        //Matrix generation
    #pragma acc parallel loop independent present(Part_w,Input_w)
        for (j = 0; j < Block_BF_Array; j++)
            Part_w[i] = Input_w[i * Block_BF_Array + j];

        //Matrix multiplication
    #pragma acc host_data use_device(Part_w,Input_data,tmp)
        {
            cublasZgemm(Handle_gem, CUBLAS_OP_N, CUBLAS_OP_N, NFFT, BF_num, M_Array, &alpha, Input_data, NFFT, Part_w, M_Array, &beta, tmp, NFFT); //180*NFFT
        }
        //Sum and add
    #pragma acc parallel loop
        for (j = 0; j < BF_num * NFFT; j++)
        {
            output[j] = output[j] + pow(my_abs(tmp[i]), 2);
        }
    }

1. Input_w is a large array. After I copy it to the GPU, each iteration I only want to use part of it in the calculation. The only way I can think of is to copy that part into another variable with OpenACC. Is there a better way?

I’m not 100% clear on what you’re asking, but if you copy Input_w to the GPU once in an outer data region and then reuse it, even if it’s different parts in different compute regions, then there should be no reason to partition it.

Not to say that there aren’t cases where you’d want to partition, but it’s not what’s generally done.
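
For example, here’s a minimal sketch of using an offset into the device copy of Input_w directly in the cuBLAS call, so no Part_w copy is needed. It assumes the same variables and handle as in your code, and that the i-th block starts at Input_w + i * Block_BF_Array:

    #pragma acc host_data use_device(Input_w, Input_data, tmp)
    {
        // Inside host_data, Input_w resolves to its device address, so
        // ordinary pointer arithmetic selects the i-th block without a copy.
        cublasZgemm(Handle_gem, CUBLAS_OP_N, CUBLAS_OP_N,
                    NFFT, BF_num, M_Array,
                    &alpha, Input_data, NFFT,
                    Input_w + i * Block_BF_Array, M_Array,
                    &beta, tmp, NFFT);
    }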

2. I call the cuBLAS function repeatedly in the loop but always use the same handle. Is that OK?

Sure, you can use the same handle with repeated calls. Generally a handle is created once and then used multiple times.
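
A minimal sketch of the usual pattern (hypothetical setup code, not taken from your program):

    cublasHandle_t Handle_gem;
    cublasCreate(&Handle_gem);      // create the handle once, before the loop

    for (i = 0; i < size; i++)
    {
        // ... same cublasZgemm call each iteration, reusing Handle_gem ...
    }

    cublasDestroy(Handle_gem);      // destroy it once, after the loop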

Thank you very much for your answer, but I still have a few questions:

1. I want to sum the output vectors in the loop and finally produce a single output vector, so I want to parallelize the outer “i” loop, but from what you said that seems impossible.

2. Can I parallelize the outer “i” loop and still use cuBLAS routines at the same time?

I changed some of my code as follows, but I haven’t found a good way to do the summation:

#pragma acc data create(Part_w[0:BF_num * M_Array],tmp[0:BF_num * NFFT],tmp2[0:BF_num * NFFT]) present(output[0:BF_num*NFFT],Input_data[0:M_Array*NFFT],Input_w[0:M_Array*size*BF_num])
    for (i = 0; i < size; i++)
    {
    #pragma acc host_data use_device(Input_w,Part_w,Input_data,tmp)
        {
            cublasZcopy(Handle_copy, Block_BF_Array, Input_w + i * Block_BF_Array, 1, Part_w, 1);
            cublasZgemm(Handle_gem, CUBLAS_OP_N, CUBLAS_OP_N, NFFT, BF_num, M_Array, &alpha, Input_data, NFFT, Part_w, M_Array, &beta, tmp, NFFT); //180*NFFT
        }
    #pragma acc host_data use_device(tmp2,tmp,output)
        {
            //cublasDznrm2(Handle_nrm2, BF_num * NFFT, tmp, 1, tmp2);
            //cublasDaxpy(Handle_sum, BF_num * NFFT, &alpha2, tmp2, 1, output, 1);
        }
        /*
        #pragma acc parallel loop present(output,tmp)
        for (j = 0; j < BF_num * NFFT; j++)
            output[j] = output[j] + pow(my_abs(tmp[i]), 2);
        */
    }

Again, cuBLAS dropped support for device-callable routines some time ago, so no, you cannot offload the “i” loop to the device.

You could split the “i” loop so that the cuBLAS calls are all done first, and then offload a second pair of “i” and “j” loops, but this would require adding an extra dimension to “tmp” to hold all the results. Something like the sketch below.
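
A minimal sketch of that split, assuming tmp is enlarged to size * BF_num * NFFT elements (iteration i’s result lives at offset i * BF_num * NFFT) and that my_abs is declared as an acc routine:

    // First pass: all cuBLAS calls, issued from the host.
    for (i = 0; i < size; i++)
    {
    #pragma acc host_data use_device(Input_w, Input_data, tmp)
        {
            cublasZgemm(Handle_gem, CUBLAS_OP_N, CUBLAS_OP_N, NFFT, BF_num, M_Array,
                        &alpha, Input_data, NFFT,
                        Input_w + i * Block_BF_Array, M_Array,
                        &beta, tmp + (size_t)i * BF_num * NFFT, NFFT);
        }
    }
    // Second pass: offload the accumulation; the sum over "i" is a
    // sequential inner loop per output element, so no race on output[j].
    #pragma acc parallel loop present(output, tmp)
    for (j = 0; j < BF_num * NFFT; j++)
    {
        double sum = output[j];
        for (int k = 0; k < size; k++)
            sum += pow(my_abs(tmp[(size_t)k * BF_num * NFFT + j]), 2);
        output[j] = sum;
    }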

There are also batched versions of the cuBLAS routines, but again you’d need a bigger scratch array to store the results. A sketch follows.
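
For example, a sketch using cublasZgemmStridedBatched, assuming the blocks of Input_w are contiguous with stride Block_BF_Array and tmp is enlarged as above:

    // Sketch only: one strided-batched GEMM replaces the whole "i" loop.
    // A is reused for every batch entry (strideA = 0), B advances through
    // Input_w, and C advances through the enlarged tmp.
    #pragma acc host_data use_device(Input_w, Input_data, tmp)
    {
        cublasZgemmStridedBatched(Handle_gem, CUBLAS_OP_N, CUBLAS_OP_N,
                                  NFFT, BF_num, M_Array,
                                  &alpha,
                                  Input_data, NFFT, 0,                 // A: same matrix each batch
                                  Input_w, M_Array, Block_BF_Array,    // B: i-th block of Input_w
                                  &beta,
                                  tmp, NFFT, (long long)BF_num * NFFT, // C: i-th result slab
                                  size);                               // batch count
    }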

    for (j = 0; j < BF_num * NFFT; j++)
        output[j] = output[j] + pow(my_abs(tmp[i]), 2);

Is this code correct in that “tmp[i]” is being used, i.e. only a single element of tmp is being summed? Should this be “tmp[j]”?

Thank you very much. I have studied the cuBLAS functions myself these past few days and found that the batched GEMM (cublasZgemmBatched) can be used to solve this problem.