Low or normal performance?

Hi.

I’m surprised by the “low” performance of my very simple test code.

__global__ void kernel_test(unsigned long long *iters) {
    atomicAdd(iters, 1);   // every thread increments the same global counter
}

int main() {
    unsigned long long *dev_iters = 0;
    cudaMallocManaged(&dev_iters, sizeof(unsigned long long));
    kernel_test <<< 16*1024*1024, 1024 >>> (dev_iters);
    cudaDeviceSynchronize();
    cudaFree(dev_iters);
    return 0;
}

This is the code brought to a minimum. I get about 1900 million kernel calls/second with this minimalistic code. I’ve seen really complicated kernel code running at about 40000 million kernel calls/second.

Doing something other than an atomic operation doesn’t change very much.

# /usr/local/cuda/bin/nvprof ./test_cuda
==67986== NVPROF is profiling process 67986, command: ./test_cuda
==67986== Profiling application: ./test_cuda
==67986== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  8.92801s         1  8.92801s  8.92801s  8.92801s  kernel_test(__int64*)
                    0.01%  724.78us         1  724.78us  724.78us  724.78us  setup_kernel(curandStateXORWOW*, int, unsigned long)
      API calls:   98.16%  8.92901s         2  4.46450s  795.42us  8.92821s  cudaDeviceSynchronize
                    1.58%  143.99ms         1  143.99ms  143.99ms  143.99ms  cudaSetDeviceFlags
                    0.22%  20.307ms         4  5.0768ms  6.9430us  20.281ms  cudaMallocManaged
                    0.03%  2.3663ms         2  1.1831ms  11.782us  2.3545ms  cudaLaunchKernel
                    0.00%  347.55us         1  347.55us  347.55us  347.55us  cuDeviceTotalMem
                    0.00%  230.50us       101  2.2820us     240ns  98.123us  cuDeviceGetAttribute
                    0.00%  195.70us         4  48.923us  10.039us  123.94us  cudaFree
                    0.00%  55.964us         1  55.964us  55.964us  55.964us  cuDeviceGetName
                    0.00%  4.9190us         1  4.9190us  4.9190us  4.9190us  cuDeviceGetPCIBusId
                    0.00%  2.4850us         3     828ns     341ns  1.5530us  cuDeviceGetCount
                    0.00%  1.1020us         2     551ns     271ns     831ns  cuDeviceGet
                    0.00%     401ns         1     401ns     401ns     401ns  cuDeviceGetUuid

==67986== Unified Memory profiling result:
Device "GeForce RTX 2080 SUPER (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
       3  21.333KB  4.0000KB  48.000KB  64.00000KB  12.35200us  Device To Host
       1         -         -         -           -  364.0640us  Gpu page fault groups
Total CPU Page faults: 1

Is there something wrong with my test code?

There doesn’t seem to be anything wrong with your code.
The code you run certainly matters. For me, at least, when I change this:

atomicAdd(iters, 1);

to this:

*iters++;

your code runs about 20x faster. I’m not suggesting that’s sensible code, but I don’t consider your code to be very sensible either. Nevertheless, that is roughly the ratio of the two perf numbers you have mentioned. (I personally wouldn’t call this “kernel calls per second”, but that’s beside the point.)

And atomics will certainly have a throughput limit.
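
If the global atomic itself is the bottleneck, a common pattern is to reduce within a block first and issue a single global atomicAdd per block. A rough sketch (the kernel name is made up, and in this trivial counter case the per-block total is of course just blockDim.x):

__global__ void kernel_test_blockwise(unsigned long long *iters) {
    __shared__ unsigned int block_count;                 // per-block partial count
    if (threadIdx.x == 0) block_count = 0;
    __syncthreads();

    atomicAdd(&block_count, 1u);                         // cheap shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)                                // one global atomic per block
        atomicAdd(iters, (unsigned long long)block_count);
}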

If you use “kernel” in the traditional CUDA sense, referring to a __global__ function, you will find that about 200,000 kernel launches per second is the speed of light, because a kernel launch takes about 5 microseconds at a minimum (unless something changed with Ampere). So I am not sure what these numbers refer to or what they mean.

The numbers happen to line up with the total number of threads executed divided by the execution time:

16 * 1024 * 1024 * 1024 threads / 8.92801 s ≈ 1.9e9, i.e. “1900 million”

I’m not suggesting it’s common parlance or that I would use that sort of characterization, just pointing out an observation.

Shouldn’t this be (*iters)++ ?
With that one, I get about 220,000 million kernel calls/second.
But that’s still not what I expect … see below.

Ok, kernel calls/second might not be the usual parlance (sorry, I’m new to CUDA :), but how shall I phrase my “kernel calls/second” differently/correctly?

Have a look at this here (an example I found):
https://forums.developer.nvidia.com/t/bitslice-des-optimization/38896/48

How is it possible that this software does 23,750 MH/s while my simple atomicAdd() code only comes up with 1900 (whatever I should call it)? The above thread talks about DES, which is far more complicated than one line of atomicAdd() or (*iters)++.

I must be missing something!

Yes, I should have done (*iters)++, and that does indeed slow it down (although it’s still faster than just the atomic).
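
For anyone following along, the difference is just operator precedence: *iters++ increments the (local copy of the) pointer, while (*iters)++ increments the value it points to. Side by side (non-atomic, and with many threads it is a data race — purely a toy; the kernel names are made up):

__global__ void bump_pointer(unsigned long long *iters) {
    *iters++;     // parsed as *(iters++): bumps the local pointer copy, result discarded
}

__global__ void bump_value(unsigned long long *iters) {
    (*iters)++;   // non-atomic read-modify-write of the pointed-to value (racy across threads)
}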

I don’t know anything about the performance claims. However, a hash is something that can be done entirely “on-chip” on the GPU. Your code (and mine, if properly written) is going to do basically one read and one write per memory location, and (almost) nothing else.

Suppose that for the code in question, we could read one value, potentially write one or zero values, and for that one value read, perform dozens or thousands of hash operations - perhaps testing each one for a particular condition. The on-chip computational performance of the GPU may “run rings around” the read/write performance (which is the only thing your code or my code is testing).

In short, such a performance claim might be possible.
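
To make that concrete, here is the general shape such a compute-bound kernel has. The mix() step below is made up (it is not a real hash, and the names are hypothetical); the point is one read, a pile of on-chip arithmetic, one write:

__device__ unsigned long long mix(unsigned long long x) {
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;   // arbitrary odd constant, just to keep the ALUs busy
    x ^= x >> 29;
    return x;
}

__global__ void hash_many(const unsigned long long *in,
                          unsigned long long *out, int n, int rounds) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    unsigned long long v = in[tid];        // one global read
    for (int r = 0; r < rounds; ++r)       // e.g. 1024 rounds of on-chip work per element
        v = mix(v);
    out[tid] = v;                          // one global write
}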

NVIDIA has a DLI course in development which, while not expressly focused on hashing, uses a hashing workload (in a kernel) to teach concepts related to copy/compute overlap in CUDA codes (i.e. concurrency).

The particular workload involves a kernel operating on 2^26 64-bit integers, and calling a hashing function 1024 times on each, as part of a cipher operation. Midway through the DLI course, on a V100, the participant gets this workload down to ~70ms on the GPU, including copying those 67M 64-bit integers to the GPU and copying the resultant 67M 64-bit integers back.

If I compute a MH/s number for that code, using the hash definition given above, it achieves 67M*1024 ≈ 69B hashes in ~70ms, or roughly 980,000 MH/s. The point is not to suggest that this is an apples-to-apples comparison with some DES case quoted somewhere, but to suggest that we cannot simply discount a claim of 20,000 MH/s based on your kernel that does approximately one operation per thread and only achieves 1900 Mops/s.

I am not sure where DES came into the thread, but will point out (25 years after last dealing with DES) that it is very amenable to bit-slice approaches that should perform (extremely) well on GPUs.
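
For readers who haven’t seen the term: bit-slicing stores one bit from each of N independent inputs in an N-bit word, so a single bitwise instruction evaluates the same logic gate for all N inputs at once. A toy illustration (not actual DES; names are made up):

__global__ void bitslice_gate(const unsigned long long *a,
                              const unsigned long long *b,
                              const unsigned long long *c,
                              unsigned long long *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = (a[tid] & b[tid]) ^ c[tid];   // the same gate for 64 "lanes" at once
}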

That is what I was referring to.

I see that now, thanks for pointing it out. Do DES implementations for GPUs use atomicAdd, though? I briefly looked at (one or two) CUDA-based DES implementations years ago and do not recall atomicAdd (which doesn’t mean much, as my memory typically isn’t that good or detailed). Unless DES implementations on GPUs use atomicAdd, I do not see how DES performance is necessarily relevant to OP’s original question. Which I guess is your point as well …

I checked DLI. They are asking real $$$ for that. Ok … I’ll have to think about it.

Ok … let me explain. No, DES doesn’t use atomicAdd(), but I used this function to initially test the performance of the GPU in general. atomicAdd() isn’t really the best example (1,900 Mops/s) since it wastes cycles on synchronized reads/writes, but (*iters)++ (220,000 Mops/s) already does a better job when it comes to getting close to the maximum “kernel calls/second”.

When searching this forum for performance values (for reference purposes), I discovered the DES example (link above), which showed 23,750 MH/s. That’s how I went from atomicAdd() to DES.

Since you looked at CUDA-based DES code, I did the same. I get about 330 MH/s with my implementation, while hashcat talks about 35,100 MH/s. So my DES code is about 100 times slower.

I must be missing something.

The DLI course I’m referring to isn’t available yet anyway. It’s under development.

hashcat is open source, if I am not mistaken? If so, you can look at the code to figure out how it is achieving high performance. I would be surprised if it is not using a bit-sliced DES implementation. In any event, you should not be discouraged when you are a relative beginner in the field. It usually takes a lot of domain knowledge and software optimization expertise to craft high-performance implementations of anything.

Yes, it’s open source and I’ll dig myself through the source code.

Besides this … while working with nvprof, I get …

nvprof log: /root/nvvp_workspace/.metadata/.plugins/com.nvidia.viper/launch/0/nvprof_8748.log
==8755== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139

I know that I can avoid this error by using --unified-memory-profiling off, but then I lose nvprof functionality.

Why do I get above error and what’s the remedy?

One common reason for this is essentially the one that is indicated in the warning message itself. Did you try that?

When your application exits, the CUDA runtime (effectively the operating system for the GPU) needs an opportunity to “finalize” things. A sudden app exit right after a kernel launch, for example, will definitely result in this issue. Managed memory can exacerbate this.

Make sure at a minimum you are calling cudaDeviceSynchronize() in your app before exit. You can also use cudaProfilerStop() to do something similar as indicated in the warning message.
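
In other words, something along these lines at the end of the program (a minimal sketch based on your earlier code; cudaProfilerStart()/cudaProfilerStop() come from cuda_profiler_api.h):

#include <cuda_profiler_api.h>

__global__ void kernel_test(unsigned long long *iters) { atomicAdd(iters, 1); }

int main() {
    unsigned long long *dev_iters = 0;
    cudaMallocManaged(&dev_iters, sizeof(unsigned long long));

    cudaProfilerStart();                 // optional: narrow the profiled region
    kernel_test <<< 1024, 1024 >>> (dev_iters);
    cudaDeviceSynchronize();             // make sure all GPU work has finished
    cudaProfilerStop();                  // flush profiling data before the app exits

    cudaFree(dev_iters);
    return 0;
}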

This is just a guess, of course. I believe there are multiple issues that can lead to this. I don’t have a list of them all. Another possible reason may be profiling scope. The profilers often have trouble with very large profiler scopes, so if you are having trouble with a profiler you may wish to try to reduce the scope of your application (duration, number of kernel launches, number of blocks/threads launched, etc) to see if it helps.

I don’t wish to get into arguments about whether this should happen or not with the profiler. Feel free to file bugs. If you file a bug, at some point you will likely be asked for a complete step-by-step repro case and instructions. I’m simply sharing my experience based on my own observations here.

Indeed, I forgot to mention that I tried cudaProfilerStop(), both before and after the cudaFree() calls … no change.

cudaDeviceSynchronize() is used after every kernel launch.

I also reduced the grid/block to 1,1 for testing … no change.

Here is the code:

#include <curand_kernel.h>
#include <ctime>

__global__ void setup_kernel(curandState_t *state, int states_n, unsigned long seed) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = tid % states_n;
    curand_init(seed, idx, 0, &state[idx]);   // one RNG state per thread
}

int main(int argc, char **argv) {
    ...
    curandState_t *d_states;
    int            d_states_n = 1024;

    cudaMallocManaged((void **) &d_states, d_states_n * sizeof(curandState_t));
    setup_kernel <<< 1,d_states_n >>> (d_states, d_states_n, time(NULL));
    cudaDeviceSynchronize();
    ...
}

The problem is similar (not identical) to this here:

cuda-memcheck doesn’t report any problem with the code.

I found out that with 2 changes the error disappears:

  1. I change cudaMallocManaged() to cudaMalloc() (sketch below).
  2. I comment out curand_init().
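
For reference, change 1 is just this (a sketch, error checking omitted); note that with cudaMalloc() the host can no longer touch d_states directly:

curandState_t *d_states;
int            d_states_n = 1024;

cudaMalloc((void **) &d_states, d_states_n * sizeof(curandState_t));   // device-only allocation
setup_kernel <<< 1,d_states_n >>> (d_states, d_states_n, time(NULL));
cudaDeviceSynchronize();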

It looks to me like nvprof doesn’t properly handle managed memory, but that’s hardly possible ?!

root@ctm:/home/geohei/devel# nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.1.243 (21)

I’d like to start profiling my code, but with this error I don’t see how I can do it, since nvprof is used by virtually every profiling tool.

I can profile your code with no difficulty on V100 and CUDA 11 using nvprof.

nvprof and nvvp really aren’t the profilers you should be using for GPUs from Turing onward. You should be using Nsight Compute / Nsight Systems. There are some “getting started” blogs that may help. Here is the first:
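
In the meantime, a typical first pass from the command line looks something like this (exact flags can vary between versions):

nsys profile --stats=true -o test_cuda_report ./test_cuda     (whole-application timeline plus summary tables)
ncu ./test_cuda                                               (per-kernel metrics with Nsight Compute)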

Ok, I followed your advice and dropped my efforts on nvvp and the NVIDIA Visual Profiler.

I started with NVIDIA Nsight Systems and got it running on my devel system (Ubuntu 20.04). On Windows, accessing the Ubuntu host over ssh, Nsight Systems crashed reproducibly (just for info - no worries - I’ll stick to Ubuntu).

I managed to run my code through Nsight Systems, but what do I do now?
I started here …
https://docs.nvidia.com/nsight-systems/index.html
… digging through tons of pages. All very interesting, and I learned a lot, but I didn’t find a single document/video that guided me step by step through the profiling/optimization process. Basically I was looking for a HOW-TO, but didn’t find anything like that.

There is a video about the (deprecated) NVIDIA Visual Profiler, and that one was pretty cool, but it doesn’t really carry over to Nsight Systems.

Any good starting guide / comprehensive guide on how to start profiling?

This is the one I know of right now:

We’ll have another one up in a month or 2.

Thanks for the link. Sounded promising … but I encountered problems right from the start.

  1. The current Eclipse version differs from the one in the video - no major issue.
  2. That difference requires some adaptations - also managed that one - step 1 finished!

Now I get this one here for step 2:

Error with command: ${cuda_tk_bin:/step_1}/cuda-gdb --version
Cannot run program "${cuda_tk_bin:/step_1}/cuda-gdb": Unknown reason

Replacing ${cuda_tk_bin:/step_1}/cuda-gdb with /usr/bin/cuda-gdb did the job.
Where is ${cuda_tk_bin:/step_1} defined (so I can change it)?

At the moment, I’m spending more time getting the devel environment ready than coding, optimizing, profiling … :)

… later …

After the manual fix in the Eclipse Debug Configuration section, .cuda-gdbinit wasn’t found … ;(
I found out that deleting the .cuda-gdbinit entry did the job.
Why is it not present in the filesystem, or why is it entered in the Debug Configuration if it isn’t required?

… later …

After I got the first debug run going, the variables didn’t show up in the right pane.

I’m under the impression that this tutorial was done from a heavily preconfigured Eclipse/CUDA IDE.