Compute Profiler - Confused by Work Group Information

I have just started using the Compute Profiler for OpenCL, and I’m a little bit baffled by the numbers I’m seeing in the Kernel Table.

nd range size [9 9]
work group size [8 8 1]
local work group size 1

There are a few things I don’t understand:

  1. My clEnqueueNDRangeKernel() call supplies a 3D range: {9, 9, 64}. I realise that CUDA doesn’t support 3D ranges, but the NVIDIA OpenCL implementation seems to allow it. Am I mistaken?
  2. What is the “local” work group size, and why is it 1 for all of my kernels?

I find it hard to believe that nobody else has encountered these problems.

I’m most concerned by what “local work group size” means.

/usr/local/cuda/doc/Compute_Profiler.txt says:

The keywords of the low-level profiler and the column titles in the visual profiler are not fully identical, but I believe they coincide with the CSV export titles.

PS: If you are not sure what this means, look in the OpenCL spec at the parameters to enqueueNDRangeKernel…

PPS: What starts with RT and ends with FM?
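For readers following the pointer to the spec: clEnqueueNDRangeKernel() takes a global_work_size array (the full range, such as the {9, 9, 64} in the question) and an optional local_work_size array (the work-group size, which may be NULL to let the implementation choose). In OpenCL 1.x, each global dimension must be evenly divisible by the matching local dimension. Below is a minimal sketch of that arithmetic in plain C, with no OpenCL headers required. The helper names round_up_global and num_work_groups are hypothetical, and the sketch makes no claim about how the profiler derives its own columns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pad a requested global size up to the next
 * multiple of the local (work-group) size, so that the pair meets the
 * OpenCL 1.x requirement that global % local == 0. */
static size_t round_up_global(size_t requested, size_t local)
{
    assert(local > 0);
    return ((requested + local - 1) / local) * local;
}

/* Hypothetical helper: number of work-groups along one dimension,
 * given a global size that is already a multiple of the local size. */
static size_t num_work_groups(size_t global, size_t local)
{
    assert(local > 0);
    assert(global % local == 0);
    return global / local;
}
```

As an illustration, if a local size of {8, 8, 1} were passed together with the {9, 9, 64} range, the first two dimensions would have to be padded to 16 (the kernel would then need its own bounds check), giving 2 x 2 x 64 work-groups.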

The Compute Visual Profiler’s help file only has this to say on the matter:

“The kernel option ‘localworkgroupsize’ is valid only for OpenCL. If this option is selected for a CUDA program a column ‘localblocksize’ is added to the profiler table, but this column is hidden by default.”

In my past experience with the profiler, the explanations of profiler counters provided by this help file have been adequate, and I wasn’t aware that there was extra or different documentation available in the /doc directory.

I don’t appreciate your tone, but thank you for bringing the /doc directory to my attention.