I am trying to figure out at what point, as I increase my simulation size, global memory access bandwidth becomes the limiting factor. The problem is that the CUDA profiler freezes with increasing likelihood the larger I make the simulation. For run times longer than about a minute, there is virtually no chance the profiler will finish without freezing. Is this a known problem? It is extremely frustrating, and I need this information as soon as possible.
Bumpety bump bump.
My approach to measuring performance is to use the CUDA API calls that report the start and end times of a kernel. By running the kernel several times and dumping the timing values from the host program to the console via printf or cout, I get data describing how the kernel performs as I vary the input size or some other parameter.
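For anyone who wants to reproduce this, here is a minimal sketch of that kind of timing loop using CUDA events. The scale() kernel and its launch configuration are just placeholders (my own kernel and sizes are not shown here); swap in whatever you are measuring.

```cpp
// Minimal sketch: time NUM_RUNS launches of a placeholder kernel with CUDA
// events and print one CSV row (best / average / worst runtime, bandwidth).
// Compile with: nvcc -O2 timing.cu
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n)          // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                    // one read + one write per element
}

int main()
{
    const int width     = 2048;                    // width in pixels
    const int n         = width * width;
    const size_t nBytes = n * sizeof(float);
    const int NUM_RUNS  = 10;

    float* d_data = nullptr;
    cudaMalloc(&d_data, nBytes);
    cudaMemset(d_data, 0, nBytes);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float best = FLT_MAX, worst = 0.0f, total = 0.0f;
    for (int i = 0; i < NUM_RUNS; ++i) {
        cudaEventRecord(start);
        scale<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);                // wait until the kernel is done

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);    // GPU-side elapsed time in ms
        best  = ms < best  ? ms : best;
        worst = ms > worst ? ms : worst;
        total += ms;
    }

    float avgMs = total / NUM_RUNS;
    double bytesPerSec = (double)nBytes / (avgMs * 1e-3);

    printf("%s, %zu, %d, %.3f, %.3f, %.3f, %.2E\n",
           prop.name, nBytes, width, best, avgMs, worst, bytesPerSec);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```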
Here is an example of the resulting output:
Device Name, Size (bytes), Width (pixels), Best runtime (ms), Average runtime (ms), Worst runtime (ms), Average Perf (bytes/s)
Tesla C2050, 262144, 256, 0.111, 0.112, 0.116, 2.34E+09
Tesla C2050, 414736, 322, 0.128, 0.135, 0.141, 3.07E+09
Tesla C2050, 659344, 406, 0.232, 0.232, 0.233, 2.84E+09
Tesla C2050, 1048576, 512, 0.308, 0.308, 0.308, 3.40E+09
Tesla C2050, 1664100, 645, 0.393, 0.393, 0.395, 4.23E+09
Tesla C2050, 2637376, 812, 0.867, 0.868, 0.870, 3.04E+09
Tesla C2050, 4194304, 1024, 1.181, 1.181, 1.182, 3.55E+09
Tesla C2050, 6656400, 1290, 1.771, 1.773, 1.778, 3.75E+09
Tesla C2050, 10562500, 1625, 2.467, 2.470, 2.472, 4.28E+09
Tesla C2050, 16777216, 2048, 3.021, 3.022, 3.024, 5.55E+09
Tesla C2050, 26625600, 2580, 5.116, 5.118, 5.119, 5.20E+09
Tesla C2050, 42250000, 3250, 7.647, 7.648, 7.650, 5.52E+09
Tesla C2050, 67108864, 4096, 9.925, 9.930, 9.933, 6.76E+09
Tesla C2050, 106502400, 5160, 16.941, 16.944, 16.946, 6.29E+09
Tesla C2050, 169052004, 6501, 22.806, 22.916, 22.987, 7.38E+09
Tesla C2050, 268435456, 8192, 39.718, 39.757, 39.826, 6.75E+09
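The Average Perf column is simply the size divided by the average runtime; for example, the first row works out to 262144 bytes / 0.112 ms ≈ 2.34E+09 bytes/s.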
Graphing the resulting performance as runtime vs. size yields this graph:
[Graph: runtime (ms) vs. data size (bytes)]
Graphing the resulting performance as bandwidth (GB/s) vs. size yields this graph:
[Graph: average bandwidth (GB/s) vs. data size (bytes)]
Yes, this isn’t exactly a profiler, but it helps in exploring how a kernel behaves at specific data sizes and in pinpointing issues there, and watching how the numbers trend shows how well a given algorithm scales on a given GPU.
And even if the profiler doesn’t work on your problem, as long as the problem itself still runs without the profiler, you can always fall back on the CPU’s timer when the GPU timing code isn’t working. Just run the code multiple times and compare the best and worst times to gauge the variance and judge how accurate and precise your measurements are. (Obviously, CPU wall-clock time, unlike GPU kernel time, also measures other overhead in the system, but that overhead isn’t optional, so it’s useful to include it in your performance measurements.)
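A rough sketch of that CPU-side fallback, assuming a stand-in kernel and using std::chrono around a synchronized launch (the measured time deliberately includes launch and synchronization overhead):

```cpp
// Wall-clock timing of a kernel launch from the host side.
// Compile with: nvcc -O2 cpu_timing.cu
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* data, int n)   // stand-in for the real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 22;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    const int NUM_RUNS = 10;
    double best = 1e30, worst = 0.0;

    for (int i = 0; i < NUM_RUNS; ++i) {
        cudaDeviceSynchronize();                  // make sure the GPU is idle first
        auto t0 = std::chrono::high_resolution_clock::now();

        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaDeviceSynchronize();                  // include kernel + launch overhead

        auto t1 = std::chrono::high_resolution_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        best  = ms < best  ? ms : best;
        worst = ms > worst ? ms : worst;
    }

    // A large spread between best and worst means the measurement is noisy.
    printf("best %.3f ms, worst %.3f ms, spread %.3f ms\n",
           best, worst, worst - best);

    cudaFree(d_data);
    return 0;
}
```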
I hope this information is useful.
-Mike