Kernel bound by instruction and memory latency.

So I’ve basically modified the separable convolution example to work with char inputs rather than float inputs. I promote them to shorts internally, do two passes of convolution (a row kernel, then a column kernel), and run that pair of kernels twice, because I want to apply two convolutions overall. I then convert back to chars and output.
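I can’t post the full code, but roughly the row pass looks like this (a simplified sketch, not my actual kernel - KERNEL_RADIUS and d_Kernel are just placeholders standing in for my real filter size and coefficients):

```cuda
#define KERNEL_RADIUS 8                                   // placeholder radius, not my real value
__constant__ short d_Kernel[2 * KERNEL_RADIUS + 1];       // placeholder filter coefficients

__global__ void convolutionRowsChar(short *dst, const unsigned char *src,
                                    int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int sum = 0;                                          // integer accumulator, no floats
    for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
        int xs = min(max(x + k, 0), width - 1);           // clamp at the image border
        sum += (short)src[y * width + xs] * d_Kernel[k + KERNEL_RADIUS];
    }
    dst[y * width + x] = (short)sum;                      // keep the promoted short intermediate
}
```

The point of writing the intermediate out as shorts is just to keep the extra precision between the row pass and the column pass.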

I’ve worked my way through the guided profile, and it shows a bar chart of utilization for compute and for memory (shared). I’d include a screenshot but it won’t let me post one here. That’s where I got the figures I quoted (both about 8%). Above the bar chart it says:
“This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of the ‘Nvidia Tegra X1’. These utilization levels indicate that the performance of the kernel is most likely limited by the latency of arithmetic or memory operations. Achieved compute throughput and/or memory bandwidth below 60% of peak typically indicates latency issues.”

Where I said “almost all arithmetic operations”: the bar chart has a coloured key, and in the compute column it’s made up of about 7% arithmetic operations and about 1% memory operations (which is the 8% I quoted).

It recommends I perform latency analysis, so I did (by clicking the button provided), and that’s where it gave me the other info I mentioned about grid size, block size and occupancy. I’m a bit confused: if the occupancy seems good enough, why is the utilization so low?

The arithmetic operations I perform in the code are just multiplications and additions, as it’s convolution. There is some casting between types, as hinted at above. I’m not sure about the ratio of floating-point operations etc. that you mention - where would I find that? I’m not actually doing any floating-point computations anyway…
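To make that concrete, the column pass and the conversion back to char look roughly like this (again a simplified sketch, reusing the placeholder KERNEL_RADIUS and d_Kernel from the snippet above; the shift-by-8 renormalisation is just illustrative, not my real scaling):

```cuda
__global__ void convolutionColumnsChar(unsigned char *dst, const short *src,
                                       int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int sum = 0;                                          // again just integer multiply-adds
    for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
        int ys = min(max(y + k, 0), height - 1);          // clamp at the image border
        sum += (int)src[ys * width + x] * d_Kernel[k + KERNEL_RADIUS];
    }
    sum >>= 8;                                            // illustrative fixed-point renormalisation
    dst[y * width + x] = (unsigned char)min(max(sum, 0), 255);   // narrow back to char
}
```

So it really is just integer multiply-adds plus the casts - there isn’t a single float anywhere in the kernels.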

Thanks for your help.