How to Make Vector Point (Element-Wise) Multiplication Faster?

Hello, Mat. I have two very large vectors and I need to multiply them element by element (a pointwise product). I have tried some parallel solutions, but the results are not ideal: it still takes several tens of milliseconds. Is there a faster approach?

./dot
NN=120000000
CUPTI ERROR: cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL) returned: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES, at ../../src-cupti/prof_cuda_cupti.c:338.
t_12=51750
t_23=51890
CC=-17.000000  16.000000

The code I am running is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <openacc.h>
#include <cuComplex.h>   // cuDoubleComplex and cuCmul (also pulled in by cublas_v2.h)
#include "device_launch_parameters.h"

#define NN (200*200*3000)

// Timing helpers: gettime() reads a wall-clock timestamp and
// usec(t1,t2) returns the elapsed time in microseconds.
#if defined(_WIN32) || defined(_WIN64)
#include <sys/timeb.h>
#define gettime(a) _ftime(a)
#define usec(t1,t2) ((((t2).time-(t1).time)*1000+((t2).millitm-(t1).millitm))*1000)
typedef struct _timeb timestruct;
#else
#include <sys/time.h>
#define gettime(a) gettimeofday(a,NULL)
#define usec(t1,t2) (((t2).tv_sec-(t1).tv_sec)*1000000+((t2).tv_usec-(t1).tv_usec))
typedef struct timeval timestruct;
#endif

// Element-wise complex multiply, callable from device code
#pragma acc routine seq
cuDoubleComplex operator*(const cuDoubleComplex &a1, const cuDoubleComplex &a2)
{
    cuDoubleComplex res;
    res.x = a1.x*a2.x - a1.y*a2.y;
    res.y = a1.x*a2.y + a1.y*a2.x;
    return res;
}


int main()
{
    cudaError_t cudaStatus;
    cublasStatus_t cublasStatus;
    cublasHandle_t handle;

    cuDoubleComplex *AA = (cuDoubleComplex*)malloc(NN*sizeof(cuDoubleComplex));
    cuDoubleComplex *BB = (cuDoubleComplex*)malloc(NN*sizeof(cuDoubleComplex));
    cuDoubleComplex *CC = (cuDoubleComplex*)malloc(NN*sizeof(cuDoubleComplex));

    cublasStatus = cublasCreate(&handle);
    printf("NN=%d\n", NN);

    // Initialize host data
    for (int i = 0; i < NN; i++)
    {
        AA[i] = {1, 2};
        BB[i] = {3, 10};
        CC[i].x = 0;
        CC[i].y = 0;
    }
#pragma acc enter data copyin(AA[0:NN],BB[0:NN],CC[0:NN])
    timestruct t1, t2, t3;
    long long t_12, t_23;

    gettime(&t1);
    // Version 1: kernels region calling cuCmul
    #pragma acc kernels present(AA[0:NN],BB[0:NN],CC[0:NN])
    {
        for (int i = 0; i < NN; i++)
            CC[i] = cuCmul(AA[i], BB[i]);
    }
    gettime(&t2);

    // Version 2: parallel region using the overloaded operator*
    #pragma acc parallel present(AA[0:NN],BB[0:NN],CC[0:NN])
    {
        #pragma acc loop independent
        for (int i = 0; i < NN; i++)
            CC[i] = AA[i]*BB[i];
    }
    gettime(&t3);

    t_12 = usec(t1, t2);
    t_23 = usec(t2, t3);
    printf("t_12=%lld\n", t_12);
    printf("t_23=%lld\n", t_23);
#pragma acc exit data copyout(CC[0:NN])
    printf("CC=%f  %f\n", CC[2].x, CC[2].y);

    return 0;
}

This bit of code is going to run serially on the device as shown in the compiler feedback messages (-Minfo=accel):

     59, Complex loop carried dependence of AA->x,BB->x,AA->y,BB->y prevents parallelization
         Loop carried dependence of CC->x prevents parallelization
         Loop carried backward dependence of CC->x prevents vectorization
         Accelerator serial kernel generated

Due to pointer aliasing, the compiler can't tell whether these pointers point to the same object or not. For example, if CC overlapped AA, iteration i could read an element that an earlier iteration had already overwritten. Hence the dependency analysis fails and the loop is run serially.

You can either set the flag "-Msafeptr", which asserts that there is no pointer aliasing anywhere in the file, or add "loop independent" to assert to the compiler that this particular loop has no dependencies:

        #pragma acc kernels present(AA[0:NN],BB[0:NN],CC[0:NN])
        {
        #pragma acc loop independent
        for(int i=0;i<NN;i++)
        CC[i]=cuCmul(AA[i],BB[i]);
        }
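
A third option (a minimal sketch, assuming nvc++'s __restrict extension) is to qualify the pointers themselves, which scopes the no-aliasing assertion to these three pointers rather than to the whole file or a single loop:

        // __restrict promises the compiler these three pointers never alias
        cuDoubleComplex *__restrict pa = AA;
        cuDoubleComplex *__restrict pb = BB;
        cuDoubleComplex *__restrict pc = CC;
        #pragma acc kernels present(pa[0:NN],pb[0:NN],pc[0:NN])
        {
        for(int i=0;i<NN;i++)
            pc[i]=cuCmul(pa[i],pb[i]);
        }

Since present() looks pointers up by host address, pa/pb/pc resolve to the same device copies created by the earlier enter data region.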

-Mat

I tried the method you suggested, but the computing speed has not improved. My device platform is an NVIDIA Jetson Orin 32G.

nvc++ -o dot -acc -fast  -gpu=cc87 -cuda -cudalib -Minfo=accel -Msafeptr  dot.c


 ./dot
NN=120000000
CUPTI ERROR: cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL) returned: CUPTI_ERROR_INSUFFICIENT_PRIVILEGES,  at ../../src-cupti/prof_cuda_cupti.c:338.
t_12=51311
t_23=51667
CC=-17.000000  16.000000

Not sure there’s anything more that you can do.

Here's an Nsight Compute profile from an H100. Occupancy is near 100%, but compute throughput is only 6.5%, while memory throughput is at 92% and L2 throughput at 81%.

Basically, your code is memory bound, so your performance is limited by the memory speed, which is already essentially at peak.
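
To put rough numbers on that (a back-of-the-envelope estimate, assuming a PCIe H100 at ~2.0 TB/s DRAM bandwidth and the Orin 32G's 204.8 GB/s LPDDR5 peak): each kernel reads AA and BB and writes CC, so

    bytes moved = 3 * NN * sizeof(cuDoubleComplex) = 3 * 120e6 * 16 B ≈ 5.76 GB
    H100 floor  ≈ 5.76 GB / ~2.0 TB/s  ≈ 2.9 ms    (measured: 3.06 ms)
    Orin floor  ≈ 5.76 GB / 204.8 GB/s ≈ 28 ms     (measured: ~52 ms; LPDDR rarely sustains peak)

Once you're at that floor, the only way to go faster is to move fewer bytes, e.g. single precision (cuFloatComplex) or fusing this product into the kernels that produce or consume the data.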

I'm having issues getting a profile on Orin systems, but I assume it would look similar.

==PROF== Disconnected from process 391157
[391157] a.out@127.0.0.1
  main_60 (65535, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 9.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         1.59
    SM Frequency            cycle/nsecond         1.09
    Elapsed Cycles                  cycle    3,355,032
    Memory Throughput                   %        92.01
    DRAM Throughput                     %        92.01
    Duration                      msecond         3.06
    L1/TEX Cache Throughput             %        19.77
    L2 Cache Throughput                 %        80.97
    SM Active Cycles                cycle 3,339,382.25
    Compute (SM) Throughput             %         6.58
    ----------------------- ------------- ------------

    INF   The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
          further improve performance, work will likely need to be shifted from the most utilized to another unit.
          Start by analyzing DRAM in the Memory Workload Analysis section.

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                   128
    Cluster Scheduling Policy                           PolicySpread
    Cluster Size                                                   0
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                 65,535
    Registers Per Thread             register/thread              22
    Shared Memory Configuration Size           Kbyte           32.77
    Driver Shared Memory Per Block       Kbyte/block            1.02
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    Threads                                   thread       8,388,480
    Waves Per SM                                               35.93
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Max Active Clusters                 cluster            0
    Max Cluster Size                      block            8
    Overall GPU Occupancy                     %            0
    Cluster Occupancy                         %            0
    Block Limit SM                        block           32
    Block Limit Registers                 block           21
    Block Limit Shared Mem                block           32
    Block Limit Warps                     block           16
    Theoretical Active Warps per SM        warp           64
    Theoretical Occupancy                     %          100
    Achieved Occupancy                        %        94.54
    Achieved Active Warps Per SM           warp        60.51
    ------------------------------- ----------- ------------

    INF   This kernel's theoretical occupancy is not impacted by any block limit.

  main_65 (65535, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 9.0
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ------------
    Metric Name               Metric Unit Metric Value
    ----------------------- ------------- ------------
    DRAM Frequency          cycle/nsecond         1.59
    SM Frequency            cycle/nsecond         1.09
    Elapsed Cycles                  cycle    3,356,298
    Memory Throughput                   %        92.00
    DRAM Throughput                     %        92.00
    Duration                      msecond         3.07
    L1/TEX Cache Throughput             %        19.77
    L2 Cache Throughput                 %        80.62
    SM Active Cycles                cycle 3,340,528.54
    Compute (SM) Throughput             %         4.13
    ----------------------- ------------- ------------

    INF   The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
          further improve performance, work will likely need to be shifted from the most utilized to another unit.
          Start by analyzing DRAM in the Memory Workload Analysis section.

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                   128
    Cluster Scheduling Policy                           PolicySpread
    Cluster Size                                                   0
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                 65,535
    Registers Per Thread             register/thread              22
    Shared Memory Configuration Size           Kbyte           32.77
    Driver Shared Memory Per Block       Kbyte/block            1.02
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    Threads                                   thread       8,388,480
    Waves Per SM                                               35.93
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Max Active Clusters                 cluster            0
    Max Cluster Size                      block            8
    Overall GPU Occupancy                     %            0
    Cluster Occupancy                         %            0
    Block Limit SM                        block           32
    Block Limit Registers                 block           21
    Block Limit Shared Mem                block           32
    Block Limit Warps                     block           16
    Theoretical Active Warps per SM        warp           64
    Theoretical Occupancy                     %          100
    Achieved Occupancy                        %        94.50
    Achieved Active Warps Per SM           warp        60.48
    ------------------------------- ----------- ------------

    INF   This kernel's theoretical occupancy is not impacted by any block limit.

Thank you very much for your answer. This issue has now been resolved by calling a CUDA kernel function directly.
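
(The final kernel wasn't posted; for readers, a minimal sketch of an equivalent hand-written CUDA kernel might look like the following, where dAA, dBB, and dCC are hypothetical device pointers, e.g. from acc_deviceptr() or cudaMalloc.)

#include <cuComplex.h>

// Sketch only -- the poster's actual kernel was not shared.
// A grid-stride loop lets any launch configuration cover all n elements.
__global__ void zmul(const cuDoubleComplex *a, const cuDoubleComplex *b,
                     cuDoubleComplex *c, size_t n)
{
    for (size_t i = blockIdx.x*(size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x*blockDim.x)
        c[i] = cuCmul(a[i], b[i]);   // same element-wise complex product
}

// Example launch: zmul<<<65535, 128>>>(dAA, dBB, dCC, (size_t)NN);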
