Kernel bound by instruction and memory latency.


I’m still new to this, so please forgive my ignorance…
I have modified one of the samples quite a lot (separable convolution) and am profiling my first CUDA application.

The profiler tells me my kernel performance is bound by instruction and memory latency. The compute graph shows about 8% (almost all arithmetic operations, a tiny bit memory operations), and the memory (L2 Cache) shows slightly less at about 7% utilization.

Hence I performed latency analysis as recommended by the profiling guide. It then says that occupancy is not limiting kernel performance. But isn't raising occupancy supposed to be a good way of improving instruction and memory latency?

It may be helpful to note the following:
I am using Jetson TX1 (so Maxwell I think), grid size [80,15,1] (1200 blocks) block size [16,8,1] (128 threads).
Occupancy per SM:
Active blocks 16 out of max 32.
Active warps 45.54 out of max 64.
Active threads max of 2048.
Occupancy: 71.2%
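
As a quick sanity check on these figures, here is a small host-side sketch. It assumes 32 threads per warp, and the helper names are made up for illustration. With 128 threads per block, 16 resident blocks would fill all 64 warp slots, so the 71.2% appears to be the achieved value, 45.54 / 64, rather than a theoretical limit.

```cpp
#include <cstdio>

// Assumption: 32 threads per warp.
constexpr int kWarpSize = 32;

constexpr int threads_per_block(int bx, int by, int bz) { return bx * by * bz; }

// Warps needed for one block, rounding up for partially filled warps.
constexpr int warps_per_block(int threads) { return (threads + kWarpSize - 1) / kWarpSize; }

// Fraction of an SM's warp slots that are occupied.
constexpr double occupancy(double active_warps, int max_warps) { return active_warps / max_warps; }

constexpr int kBlocks  = 80 * 15 * 1;                 // grid [80,15,1]
constexpr int kThreads = threads_per_block(16, 8, 1); // block [16,8,1]
constexpr int kWarps   = warps_per_block(kThreads);   // warps per block
constexpr int kTheoreticalWarps = 16 * kWarps;        // 16 active blocks per SM

static_assert(kBlocks == 1200, "grid size matches the quoted 1200 blocks");
static_assert(kThreads == 128, "block size matches the quoted 128 threads");
static_assert(kTheoreticalWarps == 64, "16 blocks of 4 warps fill all 64 warp slots");

// Prints achieved occupancy from the profiler's average active warp count.
inline void print_occupancy() {
    std::printf("achieved occupancy: %.1f%%\n", 100.0 * occupancy(45.54, 64));
}
```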

Any guidance would be much appreciated. Many thanks!

Your grid size, block size, and occupancy all look good to me, i.e. conducive to good performance. From the description provided, I can’t tell what your code is doing or what bounds its performance. The number of threads running concurrently should be sufficient to cover basic memory latencies.

If the profiler claims that is not the case, my follow-up questions would be: Is there thread divergence that leads to many threads being inactive? Is the memory access pattern “random”, leading to many non-coalesced memory accesses?

You wrote that the compute graph shows about 8%. What does this mean? 8% of what? Does “almost all arithmetic operations” mean your code is compute bound? What kind of arithmetic operations are these? What is the ratio of floating-point operations versus bytes consumed per unit time?
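
For a convolution pass, that ratio can be estimated by hand as operations per byte moved. A minimal sketch, assuming one multiply and one add per filter tap and that each input byte is fetched from memory only once; the function name and the radius of 8 used in the check are placeholders, not values taken from the actual kernel:

```cpp
// Rough arithmetic-intensity estimate for one 1-D convolution pass.
// radius    : filter radius (hypothetical value, the real one is not stated)
// bytes_in  : bytes read from memory per output element (ideal reuse assumed)
// bytes_out : bytes written to memory per output element
constexpr double ops_per_byte(int radius, int bytes_in, int bytes_out) {
    // (2 * radius + 1) taps, each costing one multiply and one add.
    return 2.0 * (2 * radius + 1) / (bytes_in + bytes_out);
}

// Example: radius 8, 1-byte input and 1-byte output gives 34 ops per 2 bytes.
static_assert(ops_per_byte(8, 1, 1) == 17.0, "34 operations over 2 bytes");
```

A low value here would point toward a memory-bound kernel, a high value toward a compute-bound one.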

So I’ve basically modified the separable convolution example to work with char inputs rather than float inputs. I then widen them to shorts internally and do two passes of convolution (row then column), running one kernel and then another for each pass (as I want to do two convolutions overall). I then convert back to chars and output.
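
A CPU-side reference sketch of the pipeline described above (char in, short internally, row pass, column pass, char out) may make the data flow easier to follow. It is not the actual kernel: the 3-tap filter, the clamped borders, the final division by 16, and all function names are assumptions chosen only so the example is complete.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical integer filter taps (sum = 4); the real taps are not given.
constexpr int kRadius = 1;
constexpr short kTaps[2 * kRadius + 1] = {1, 2, 1};

// One 1-D pass over a w*h image of shorts, along rows or along columns.
// Out-of-range neighbours are clamped to the nearest valid pixel (an assumption).
std::vector<short> pass(const std::vector<short>& src, int w, int h, bool along_rows) {
    std::vector<short> dst(src.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int sum = 0;  // products of shorts are computed as int anyway
            for (int k = -kRadius; k <= kRadius; ++k) {
                int xx = along_rows ? std::min(w - 1, std::max(0, x + k)) : x;
                int yy = along_rows ? y : std::min(h - 1, std::max(0, y + k));
                sum += kTaps[k + kRadius] * src[yy * w + xx];
            }
            dst[y * w + x] = static_cast<short>(sum);
        }
    }
    return dst;
}

// char -> short, row pass, column pass, normalise, short -> char.
std::vector<std::uint8_t> convolve_separable(const std::vector<std::uint8_t>& in, int w, int h) {
    std::vector<short> wide(in.begin(), in.end());
    std::vector<short> rows = pass(wide, w, h, true);
    std::vector<short> cols = pass(rows, w, h, false);
    std::vector<std::uint8_t> out(in.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        int v = cols[i] / 16;  // two passes, each with tap sum 4
        out[i] = static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
    }
    return out;
}
```

With these placeholder taps, the largest intermediate value is 255 * 16 = 4080, which still fits in a short.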

I’ve worked my way through the guided profile, and it shows a bar chart of utilization for compute and memory (shared). I would include a screenshot but it won’t let me here. That’s where I got the figures I quoted (both about 8%). Above the bar chart it says:
“This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of the “Nvidia Tegra X1”. These utilisation levels indicate that the performance of the kernel is most likely limited by the latency of arithmetic or memory operations. Achieved compute throughput and/or memory bandwidth below 60% of peak typically indicates latency issues.”

Where I said “almost all arithmetic operations”, the bar chart has a coloured key and in the compute column, it is comprised of about 7% arithmetic operations, and about 1% memory operations (making the 8% I quoted).

It recommends I perform latency analysis, so I did (by clicking the button provided), and that is where it gives me the other info I gave regarding grid size, block size and occupancy. I’m a bit confused about why the utilization is so low if the occupancy seems good enough.

The arithmetic operations I perform in the code are just multiplications and additions, as it’s convolution. There is some casting between types, as hinted at above. I’m not sure about the ratio of floating-point operations to bytes that you mention. Where would I find that? I’m not actually doing any floating-point computations anyway…

Thanks for your help.

Accessing memory in chunks smaller than four bytes is pretty much never a good idea (it causes performance issues). Try processing the data four bytes at a time, reading it as uchar4. By the type promotion rules of C++, when integer data smaller than ‘int’ enters into an expression, it is first widened to ‘int’, so processing the data as ‘short’ may likewise be a bad idea, potentially causing even more conversions.
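
Both points can be illustrated with a small host-side sketch. The `Pixel4` struct is a hypothetical stand-in for CUDA's built-in `uchar4` so that the example compiles without CUDA headers, and `weighted_sum4` and its weights are made up for illustration. In device code the equivalent would be a single `uchar4` load from a 4-byte-aligned address.

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// C++ integer promotion: operands narrower than int are widened to int
// before the arithmetic happens, so the result type is int in both cases.
static_assert(std::is_same<decltype(std::uint8_t{1} + std::uint8_t{1}), int>::value,
              "unsigned char + unsigned char yields int");
static_assert(std::is_same<decltype(short{1} * short{1}), int>::value,
              "short * short yields int");

// Hypothetical host-side stand-in for CUDA's uchar4 (four bytes, 4-byte aligned).
struct alignas(4) Pixel4 {
    std::uint8_t x, y, z, w;
};

// Weighted sum of four neighbouring 8-bit pixels.
// The four bytes are fetched together as one 4-byte chunk rather than as
// four separate 1-byte accesses; `src` must point to at least four bytes.
inline int weighted_sum4(const std::uint8_t* src, const int weights[4]) {
    Pixel4 p;
    std::memcpy(&p, src, sizeof p);  // one 4-byte read
    // All arithmetic is carried out in int, with no intermediate shorts.
    return weights[0] * p.x + weights[1] * p.y + weights[2] * p.z + weights[3] * p.w;
}
```

Keeping the accumulator as `int` avoids the extra narrowing conversions that storing intermediate results in `short` would introduce.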

I do not know what to make of the utilization numbers; I agree it seems odd that the utilization is low while the occupancy is good.