CUDA Profiler

Okay, so I’m trying to profile a kernel which really should be quite blazing fast… and having profiled it, the profiler agrees.

I see a GPU Time of 57-58 microseconds (very consistent), but a CPU time of 3 milliseconds… roughly 50x the GPU time…

So I’m quite concerned as to why it’s taking 3 milliseconds for the CPU to launch this kernel (not execute, remember - kernel launches are asynchronous).

This is running on a Fermi card, using 512 threads (a single block) for the kernel and 2 KB of shared memory… nothing special at all. In fact it uses fewer resources than chunkier kernels which have more GPU time but less CPU time (~300 microseconds).

I’m really confused here, and I need to fix this problem asap - as it’s bringing down the performance of our app dramatically (the 3ms “CPU Time” is almost the entire budget of our app each iteration - and this one simple kernel is using it all up) (!)

Probably the fastest way would be to start with an empty kernel (or one that does only a gmem write) and gradually re-enable the original code lines in the kernel, trying to identify what might cause this (mind the dead-code optimizer in this process).
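A minimal sketch of this bisection approach — the probe kernel, predicate, and timing scaffold below are placeholders of mine, not the original code, and I've used the runtime API for brevity even though the original uses the driver API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stripped-down probe kernel: start nearly empty, then re-enable the real
// code one chunk at a time (e.g. behind #if PROBE_LEVEL >= n guards).
__global__ void probe(int *out)
{
    // Keep one observable gmem write so the dead-code optimizer
    // can't delete the whole kernel.
    if (threadIdx.x == 0)
        out[0] = blockIdx.x;

    // --- original kernel body goes here, re-enabled incrementally ---
}

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    probe<<<1, 512, 2048>>>(d_out);   // same config as the real kernel:
    cudaEventRecord(stop);            // 1 block, 512 threads, 2 KB smem
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU time: %.3f ms\n", ms);

    cudaFree(d_out);
    return 0;
}
```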

hope it helps…

eyal

When running the profiler, your kernel launches are synchronous as far as I remember, and I don't think anything has changed in this respect.

Well I can confirm the 3ms CPU time outside of the profiler as well… using events and our own internal sub-millisecond-resolution timers. So it’s definitely a problem, both in and outside the profiler.

cuLaunchGridAsync(kernel, 1, 1, stream); // This line alone takes 3ms
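For reference, a sketch of how one might time just the launch call itself on the CPU side, separately from kernel execution (error checks and module setup elided; `gettimeofday` stands in for the internal sub-millisecond timer, and the function/stream are assumed already set up via `cuModuleGetFunction`, `cuFuncSetBlockShape`, etc.):

```cuda
#include <cstdio>
#include <sys/time.h>
#include <cuda.h>

// Wall-clock time in milliseconds, with microsecond resolution.
static double now_ms()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1e3 + tv.tv_usec * 1e-3;
}

void time_launch(CUfunction kernel, CUstream stream)
{
    double t0 = now_ms();
    cuLaunchGridAsync(kernel, 1, 1, stream);  // launch only: an async launch
    double t1 = now_ms();                     // should return almost immediately

    cuCtxSynchronize();                       // kernel execution waited on
                                              // separately, after the launch
    printf("launch call took %.3f ms\n", t1 - t0);
}
```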

=========================================================

I seem to have nailed down the performance issue to the shared memory atomics (atomicAdd(&iterator, 1)).

My kernel has at most 512 atomic adds to the same shared memory integer (in practice it’s about 100-150), at most one per thread just before writing to gmem (uncoalesced).
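A hedged reconstruction of the pattern described above, for clarity — the names and the validity predicate are mine, not the original code:

```cuda
__device__ bool is_valid(float x) { return x >= 0.0f; }  // placeholder predicate

// Each participating thread grabs an output slot from a shared counter,
// then writes its element to global memory (uncoalesced).
__global__ void compact_with_atomics(const float *in, float *out, int *out_count)
{
    __shared__ int iterator;
    if (threadIdx.x == 0)
        iterator = 0;
    __syncthreads();

    float v = in[threadIdx.x];
    if (is_valid(v)) {
        int slot = atomicAdd(&iterator, 1);  // the suspected hot spot:
        out[slot] = v;                       // note the output order is
    }                                        // arrival order, not the
                                             // original element order
    __syncthreads();
    if (threadIdx.x == 0)
        *out_count = iterator;
}
```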

If I simply write to gmem uncoalesced (the results are useless, but for testing purposes) I get closer to 200us CPU time… So it seems each smem atomicAdd is taking 18us+, which is kind of scary (I thought they were a fair bit faster than that, but apparently not).

Guess I’m going to have to find a way to implement this function without atomics :\ (need to compact a list, where some elements are potentially ‘invalid’, such that the list is the subset of the original without invalid elements, in the same order they appear in the original)
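One atomic-free way to do this kind of compaction is a prefix sum: each thread flags its element 0 or 1, and an exclusive scan over the flags gives every valid element its output slot in the original order. A sketch for a single 512-thread block, using a naive Hillis–Steele scan for clarity (the names and validity predicate are placeholders, not the original code):

```cuda
#define BLOCK 512

__device__ bool is_valid(float x) { return x >= 0.0f; }  // placeholder predicate

__global__ void compact_with_scan(const float *in, float *out, int *out_count)
{
    __shared__ int scan[BLOCK];

    int tid  = threadIdx.x;
    float v  = in[tid];
    int flag = is_valid(v) ? 1 : 0;

    // Inclusive Hillis-Steele scan over the flags (no atomics).
    scan[tid] = flag;
    __syncthreads();
    for (int offset = 1; offset < BLOCK; offset <<= 1) {
        int add = (tid >= offset) ? scan[tid - offset] : 0;
        __syncthreads();
        scan[tid] += add;
        __syncthreads();
    }

    // Exclusive slot = inclusive sum minus own flag; because the scan is
    // monotone in tid, valid elements land in their original order.
    if (flag)
        out[scan[tid] - flag] = v;

    if (tid == BLOCK - 1)
        *out_count = scan[tid];   // total number of valid elements
}
```

This replaces ~150 serialized same-address atomics with O(log BLOCK) scan steps; for production code a work-efficient scan (or a library like Thrust/CUB) would be the more idiomatic choice.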

BUT far more importantly, why aren’t my kernels asynchronous (outside of the profiler)? it should NOT take 3ms to launch a kernel into a stream… surely?

Unless all of the above assumptions are invalid, and for some reason using atomics simply increases the launch overhead of kernels (astronomically)?
