I have a simple program that I wrote to verify my GPU's real performance, but its result is not what I expected. I don't know how to explain it or how to optimize the program, so I hope NVIDIA's experts can help me.
The details of my GPU are as follows:
```
Device 0: "NVIDIA RTX A4000"
CUDA Driver Version / Runtime Version          11.6 / 11.3
CUDA Capability Major/Minor version number:    8.6
Total amount of global memory:                 16109 MBytes (16891379712 bytes)
(48) Multiprocessors, (128) CUDA Cores/MP:     6144 CUDA Cores
GPU Max Clock rate:                            1560 MHz (1.56 GHz)
Memory Clock rate:                             7001 Mhz
Memory Bus Width:                              256-bit
L2 Cache Size:                                 4194304 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total shared memory per multiprocessor:        102400 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
```
OK, let me introduce my program. It is very simple: it just executes FFMA instructions. Details as follows:
- each thread processes 4 floats (a `float4`);
- there are 256 threads per block;
- there are 2 blocks per SM;
- the formula in each thread is trivial: `C += A * B;`
- each thread reads 4 floats for A and 4 floats for B from global memory, and writes 4 floats for C back to global memory;
- each thread repeats the formula for 2048 rounds, like this: `for (int i = 0; i < 2048; i++) C += A * B;` I implement the loop body in PTX so that nvcc cannot optimize my code into `C = 2048 * (A * B)`;
- I define 8 `float4` accumulators in every thread to break the dependency chain on C, like this:
```cpp
float4 A = read_f4(ptr_A), B = read_f4(ptr_B);
float4 C0 = {0}, C1 = {0}, C2 = {0}, C3 = {0},
       C4 = {0}, C5 = {0}, C6 = {0}, C7 = {0};

int loop = 2048 >> 5;   // 64 iterations x 32 FFMA lines = 2048 rounds
for (int i = 0; i < loop; i++) {
    C0 += A * B;  C1 += A * B;  C2 += A * B;  C3 += A * B;
    C4 += A * B;  C5 += A * B;  C6 += A * B;  C7 += A * B;

    C0 += A * B;  C1 += A * B;  C2 += A * B;  C3 += A * B;
    C4 += A * B;  C5 += A * B;  C6 += A * B;  C7 += A * B;

    C0 += A * B;  C1 += A * B;  C2 += A * B;  C3 += A * B;
    C4 += A * B;  C5 += A * B;  C6 += A * B;  C7 += A * B;

    C0 += A * B;  C1 += A * B;  C2 += A * B;  C3 += A * B;
    C4 += A * B;  C5 += A * B;  C6 += A * B;  C7 += A * B;
}

// reduce the 8 accumulators into C0
C0 += C1;  C2 += C3;  C4 += C5;  C6 += C7;
C0 += C2;  C4 += C6;
C0 += C4;
store_f4(ptr_C, C0);
```
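For reference, the inline-PTX helper behind each `C += A * B` line looks roughly like this (a simplified sketch; the real code applies it to each of C0..C7, and the variable names here are placeholders):

```cuda
// One component-wise FFMA per float4 lane, pinned with inline PTX.
// The "+f" constraint keeps each accumulator component live as both
// input and output, so nvcc cannot fold the loop into C = 2048 * (A * B).
__device__ __forceinline__ void ffma4(float4 &c, const float4 &a, const float4 &b) {
    asm volatile("fma.rn.f32 %0, %1, %2, %0;" : "+f"(c.x) : "f"(a.x), "f"(b.x));
    asm volatile("fma.rn.f32 %0, %1, %2, %0;" : "+f"(c.y) : "f"(a.y), "f"(b.y));
    asm volatile("fma.rn.f32 %0, %1, %2, %0;" : "+f"(c.z) : "f"(a.z), "f"(b.z));
    asm volatile("fma.rn.f32 %0, %1, %2, %0;" : "+f"(c.w) : "f"(a.w), "f"(b.w));
}
```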
The performance should be ~10 T FFMA/s (x2 = ~20 TFLOPS), but my program only reaches about 6.5 T, i.e. roughly 65% of peak performance.
I modified the blocksPerSM to 2, 4, and 8, and the threadsPerBlock to 128/256/512. Unfortunately, these results are all very similar, about 60% - 65%.
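To double-check what each configuration does to occupancy, one can ask the runtime how many blocks actually become resident per SM (a minimal sketch; `ffma_kernel` stands in for my real kernel):

```cuda
#include <cstdio>

__global__ void ffma_kernel(const float4 *A, const float4 *B, float4 *C) {
    // ... the FFMA loop shown above ...
}

int main() {
    int blocksPerSM = 0;
    // How many blocks of 256 threads can be resident on one SM?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, ffma_kernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("resident blocks/SM: %d (%d of max 1536 threads)\n",
           blocksPerSM, blocksPerSM * 256);
    return 0;
}
```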
Then I profiled my program with NCU. In the "Roofline Analysis" section it tells me: "The ratio of peak float (fp32) to double (fp64) performance on this device is 64:1. The kernel achieved 61% of this device's fp32 peak performance and 0% of its fp64 peak performance."
I think my program avoids extra memory accesses and register dependencies, so I don't understand why it only reaches 61% of peak.
I tried to read the profiling information in NCU, but I still cannot find the reason for the poor performance.
I've uploaded my program and the profile file from NCU:
base_mac.tar.gz (3.7 KB)
repoprt.ncu-rep (12.6 MB)
Would anyone be willing to walk me through this? I think the key point is reading the profile file, but I cannot understand it. Any help is appreciated.