Hello Mat. As shown in the code below, I need to perform a number of matrix multiplications and sum their results:
#pragma acc parallel loop reduction(+:output[0:BF_num * NFFT])
for (i = 0; i < size; i++)
{
//Matrix generation
#pragma acc loop independent present(Part_w,Input_w)
for (j = 0; j < Block_BF_Array; j++)
Part_w[j] = Input_w[i * Block_BF_Array + j];
//Matrix multiplication
#pragma acc host_data use_device(Part_w,Input_data,tmp)
{
cublasZgemm(Handle_gem, CUBLAS_OP_N, CUBLAS_OP_N, NFFT, BF_num, M_Array, &alpha, Input_data, NFFT, Part_w, M_Array, &beta, tmp, NFFT);//180*NFFT
}
//Sum and add
for (j = 0; j < BF_num * NFFT; j++)
{
output[j] = output[j] + pow(my_abs(tmp[j]), 2);
}
}
This code only sketches the general idea based on my own understanding, and I have some questions.
1. Can I directly use part of a variable that is already on the GPU? Input_w is a large array. After I copy it to the GPU, each calculation only needs part of it. The only approach I can think of is to copy that part into another variable with OpenACC. Is there a better way?
2. I call the cuBLAS function repeatedly in the loop, but I always use the same handle. Is that OK?
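To make question 1 concrete, here is a minimal plain-C sketch of what I would like to do: address each block of Input_w by pointer offset instead of copying it into Part_w first. It assumes that Block_BF_Array equals M_Array * BF_num and that the blocks are stored back to back in column-major order. The names ref_zgemm and sum_block_powers are hypothetical, and ref_zgemm is only a CPU stand-in for cublasZgemm with alpha = 1 and beta = 0, so this shows the indexing, not the GPU code.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical CPU stand-in for cublasZgemm (column-major, alpha = 1, beta = 0):
   C (m x n) = A (m x k) * B (k x n). */
static void ref_zgemm(int m, int n, int k,
                      const double complex *A, int lda,
                      const double complex *B, int ldb,
                      double complex *C, int ldc)
{
    for (int col = 0; col < n; col++) {
        for (int row = 0; row < m; row++) {
            double complex acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += A[row + (size_t)p * lda] * B[p + (size_t)col * ldb];
            C[row + (size_t)col * ldc] = acc;
        }
    }
}

/* For every block W_i of input_w, compute tmp = input_data * W_i and add
   |tmp[j]|^2 to output[j]. The caller must zero output beforehand. */
void sum_block_powers(int size, int nfft, int bf_num, int m_array,
                      const double complex *input_data,
                      const double complex *input_w,
                      double complex *tmp, double *output)
{
    /* Assumption: one block holds m_array * bf_num elements (Block_BF_Array). */
    size_t block = (size_t)m_array * bf_num;

    for (int i = 0; i < size; i++) {
        /* Point at block i inside the big array; no copy into Part_w. */
        const double complex *part_w = input_w + (size_t)i * block;

        ref_zgemm(nfft, bf_num, m_array,
                  input_data, nfft, part_w, m_array, tmp, nfft);

        for (int j = 0; j < bf_num * nfft; j++) {
            double re = creal(tmp[j]);
            double im = cimag(tmp[j]);
            output[j] += re * re + im * im; /* squared magnitude */
        }
    }
}
```

In the OpenACC version, my guess is that the same offset (Input_w + i * Block_BF_Array) could be passed to cublasZgemm inside the host_data use_device region, with the loop over i running on the host, but I have not verified this.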