CUDA Profiler

Okay, so I’m trying to profile a kernel which really should be quite blazing fast… and having profiled it, the profiler agrees.

I see a GPU Time of 57-58 microseconds (very consistent), but a CPU time of 3 milliseconds… roughly 50x the GPU time…

So I’m quite concerned as to why it’s taking 3 milliseconds for the CPU to launch this kernel (not execute, remember - kernel launches are asynchronous).

This is running on a Fermi card, using 512 threads (a single block) for the kernel and 2 KB of shared memory… nothing special at all. In fact it uses fewer resources than chunkier kernels which have more GPU time but less CPU time (~300 microseconds).

I’m really confused here, and I need to fix this problem asap - as it’s bringing down the performance of our app dramatically (the 3ms “CPU Time” is almost the entire budget of our app each iteration - and this one simple kernel is using it all up) (!)

Probably the fastest way would be to start with an empty kernel (or one that does only a gmem write) and gradually re-enable the original code lines in the kernel, trying to identify what might cause this (mind the dead-code optimizer in this process).
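A minimal sketch of this bisection approach — the probe kernel, predicate, and timing scaffold below are placeholders of mine, not the original code, and I've used the runtime API for brevity even though the original uses the driver API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stripped-down probe kernel: start nearly empty, then re-enable the real
// code one chunk at a time (e.g. behind #if PROBE_LEVEL >= n guards).
__global__ void probe(int *out)
{
    // Keep one observable gmem write so the dead-code optimizer
    // can't delete the whole kernel.
    if (threadIdx.x == 0)
        out[0] = blockIdx.x;

    // --- original kernel body goes here, re-enabled incrementally ---
}

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    probe<<<1, 512, 2048>>>(d_out);   // same config as the real kernel:
    cudaEventRecord(stop);            // 1 block, 512 threads, 2 KB smem
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU time: %.3f ms\n", ms);

    cudaFree(d_out);
    return 0;
}
```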

hope it helps…

eyal

When running the profiler, your kernel launches are synchronous as far as I remember, and I don't think anything has changed in this respect.

Well I can confirm the 3ms CPU time outside of the profiler as well… using events and our own internal sub-millisecond-resolution timers. So it’s definitely a problem, both in and outside the profiler.

cuLaunchGridAsync(kernel, 1, 1, stream); // This line alone takes 3ms
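For reference, a sketch of how one might time just the launch call itself on the CPU side, separately from kernel execution (error checks and module setup elided; `gettimeofday` stands in for the internal sub-millisecond timer, and the function/stream are assumed already set up via `cuModuleGetFunction`, `cuFuncSetBlockShape`, etc.):

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda.h>

// Wall-clock time in milliseconds, with microsecond resolution.
static double now_ms()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1e3 + tv.tv_usec * 1e-3;
}

void time_launch(CUfunction kernel, CUstream stream)
{
    double t0 = now_ms();
    cuLaunchGridAsync(kernel, 1, 1, stream);  // launch only: an async launch
    double t1 = now_ms();                     // should return almost immediately

    cuCtxSynchronize();                       // kernel execution waited on
                                              // separately, after the launch
    printf("launch call took %.3f ms\n", t1 - t0);
}
```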

=========================================================

I seem to have nailed down the performance issue to the shared memory atomics (atomicAdd(&iterator, 1)).

My kernel has at most 512 atomic adds to the same shared memory integer (in practice it’s about 100-150), at most one per thread just before writing to gmem (uncoalesced).
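A hedged reconstruction of the pattern described above, for clarity — the names and the validity predicate are mine, not the original code:

```cuda
__device__ bool is_valid(float x) { return x >= 0.0f; }  // placeholder predicate

// Each participating thread grabs an output slot from a shared counter,
// then writes its element to global memory (uncoalesced).
__global__ void compact_with_atomics(const float *in, float *out, int *out_count)
{
    __shared__ int iterator;
    if (threadIdx.x == 0)
        iterator = 0;
    __syncthreads();

    float v = in[threadIdx.x];
    if (is_valid(v)) {
        int slot = atomicAdd(&iterator, 1);  // the suspected hot spot:
        out[slot] = v;                       // note the output order is
    }                                        // arrival order, not the
                                             // original element order
    __syncthreads();
    if (threadIdx.x == 0)
        *out_count = iterator;
}
```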

If I simply write to gmem uncoalesced (the results are useless, but for testing purposes) I get closer to 200us CPU time… So it seems each smem atomicAdd is taking 18us+, which is kind of scary (I thought they were a fair bit faster than that, but apparently not).

Guess I’m going to have to find a way to implement this function without atomics :\ (need to compact a list, where some elements are potentially ‘invalid’, such that the list is the subset of the original without invalid elements, in the same order they appear in the original)
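One atomic-free way to do this kind of compaction is a prefix sum: each thread flags its element 0 or 1, and an exclusive scan over the flags gives every valid element its output slot in the original order. A sketch for a single 512-thread block, using a naive Hillis–Steele scan for clarity (the names and validity predicate are placeholders, not the original code):

```cuda
#define BLOCK 512

__device__ bool is_valid(float x) { return x >= 0.0f; }  // placeholder predicate

__global__ void compact_with_scan(const float *in, float *out, int *out_count)
{
    __shared__ int scan[BLOCK];

    int tid  = threadIdx.x;
    float v  = in[tid];
    int flag = is_valid(v) ? 1 : 0;

    // Inclusive Hillis-Steele scan over the flags (no atomics).
    scan[tid] = flag;
    __syncthreads();
    for (int offset = 1; offset < BLOCK; offset <<= 1) {
        int add = (tid >= offset) ? scan[tid - offset] : 0;
        __syncthreads();
        scan[tid] += add;
        __syncthreads();
    }

    // Exclusive slot = inclusive sum minus own flag; because the scan is
    // monotone in tid, valid elements land in their original order.
    if (flag)
        out[scan[tid] - flag] = v;

    if (tid == BLOCK - 1)
        *out_count = scan[tid];   // total number of valid elements
}
```

This replaces ~150 serialized same-address atomics with O(log BLOCK) scan steps; for production code a work-efficient scan (or a library like Thrust/CUB) would be the more idiomatic choice.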

BUT far more importantly, why aren’t my kernels asynchronous (outside of the profiler)? it should NOT take 3ms to launch a kernel into a stream… surely?

Unless all of the above assumptions are invalid, and for some reason using atomics simply increases the launch overhead of kernels (astronomically)?
