Hi.
I’m surprised by the “low” performance of my very simple test code.
__global__ void kernel_test(unsigned long long *iters) {
    atomicAdd(iters, 1ULL);  // every thread increments the same global counter
}

unsigned long long *dev_iters = 0;
cudaMallocManaged(&dev_iters, sizeof(unsigned long long));
*dev_iters = 0;  // managed memory is not guaranteed to be zero-initialized
kernel_test<<<16*1024*1024, 1024>>>(dev_iters);  // 16Mi blocks x 1024 threads each
cudaDeviceSynchronize();
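For completeness, here is a self-contained version of the repro I’m running, with error checking and cudaEvent-based timing added so the throughput number can be reproduced (the timing and printf code is my addition, not part of the minimal snippet above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_test(unsigned long long *iters) {
    atomicAdd(iters, 1ULL);  // every thread increments the same global counter
}

int main() {
    unsigned long long *dev_iters = 0;
    cudaMallocManaged(&dev_iters, sizeof(unsigned long long));
    *dev_iters = 0;  // managed memory is not guaranteed to be zero-initialized

    const unsigned long long blocks  = 16ULL * 1024 * 1024;
    const unsigned long long threads = 1024;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel_test<<<blocks, threads>>>(dev_iters);
    cudaEventRecord(stop);
    cudaDeviceSynchronize();

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed kernel time in milliseconds

    printf("counter    = %llu (expected %llu)\n", *dev_iters, blocks * threads);
    printf("throughput = %.0f million threads/s\n",
           (double)(blocks * threads) / ms / 1000.0);  // count/ms/1000 = millions/s

    cudaFree(dev_iters);
    return 0;
}
```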
This is the code reduced to a minimum. The launch is 16Mi blocks × 1024 threads ≈ 17.2 billion threads, which takes about 8.9 s, i.e. roughly 1900 million thread executions per second. I’ve seen really complicated kernel code sustain about 40000 million per second.
Doing something other than the atomic add doesn’t change things very much.
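As an example of what I mean by “something else”, one variant I tried simply replaces the atomic with a plain global store (every thread writes the same constant, so the result is still well-defined); the throughput stays in the same ballpark:

```cuda
// Hypothetical variant kernel: same launch configuration, no atomic.
// All threads store the same value, so there is no race on the result.
__global__ void kernel_test_store(unsigned long long *out) {
    *out = 1ULL;  // plain global store instead of atomicAdd
}
```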
# /usr/local/cuda/bin/nvprof ./test_cuda
==67986== NVPROF is profiling process 67986, command: ./test_cuda
==67986== Profiling application: ./test_cuda
==67986== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  8.92801s         1  8.92801s  8.92801s  8.92801s  kernel_test(__int64*)
                    0.01%  724.78us         1  724.78us  724.78us  724.78us  setup_kernel(curandStateXORWOW*, int, unsigned long)
      API calls:   98.16%  8.92901s         2  4.46450s  795.42us  8.92821s  cudaDeviceSynchronize
                    1.58%  143.99ms         1  143.99ms  143.99ms  143.99ms  cudaSetDeviceFlags
                    0.22%  20.307ms         4  5.0768ms  6.9430us  20.281ms  cudaMallocManaged
                    0.03%  2.3663ms         2  1.1831ms  11.782us  2.3545ms  cudaLaunchKernel
                    0.00%  347.55us         1  347.55us  347.55us  347.55us  cuDeviceTotalMem
                    0.00%  230.50us       101  2.2820us     240ns  98.123us  cuDeviceGetAttribute
                    0.00%  195.70us         4  48.923us  10.039us  123.94us  cudaFree
                    0.00%  55.964us         1  55.964us  55.964us  55.964us  cuDeviceGetName
                    0.00%  4.9190us         1  4.9190us  4.9190us  4.9190us  cuDeviceGetPCIBusId
                    0.00%  2.4850us         3     828ns     341ns  1.5530us  cuDeviceGetCount
                    0.00%  1.1020us         2     551ns     271ns     831ns  cuDeviceGet
                    0.00%     401ns         1     401ns     401ns     401ns  cuDeviceGetUuid
==67986== Unified Memory profiling result:
Device "GeForce RTX 2080 SUPER (0)"
Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    3  21.333KB  4.0000KB  48.000KB  64.00000KB  12.35200us  Device To Host
    1         -         -         -           -  364.0640us  Gpu page fault groups
Total CPU Page faults: 1
Is there something wrong with my test code?