Commandline Profiling: Trying to use the profiler via the command line


Crossposting the same topic from the OpenCL forum - hoping to get more eyeballs on the topic and hopefully a solution.

I am trying to use the commandline profiler for OpenCL. If you want to use the profiler via the command line for CUDA, just replace OPENCL with CUDA in the environment variables given in the following link (i.e. OPENCL_PROFILE=1 becomes CUDA_PROFILE=1, etc.).

I am trying to use the commandline profiler to profile my OpenCL code. I am using it as suggested in the following link:

I used the latest version of the Visual Profiler and dumped the data into a CSV file to see which profiler data can be obtained. This is the list of all the parameters:

I then put the above list in the OPENCL_PROFILE_CONFIG file and ran the executable multiple times, profiling different parameters in different runs.
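For anyone wanting to script this step, here is a minimal Python sketch of the batch-and-rerun procedure described above. It rests on several assumptions: the config file takes one option per line, OPENCL_PROFILE_LOG and OPENCL_PROFILE_CSV are the variables that select the log file and CSV output, and 4 options per run is only a placeholder limit. The function names and file names are made up for illustration.

```python
import os
import subprocess
from pathlib import Path


def batches(options, per_run):
    """Split the option list into chunks of at most per_run entries."""
    return [options[i:i + per_run] for i in range(0, len(options), per_run)]


def profiler_env(config_path, log_path, prefix="OPENCL"):
    """Build the environment for one profiled run.

    prefix="CUDA" gives the CUDA variant (CUDA_PROFILE=1 and so on).
    """
    env = dict(os.environ)
    env[f"{prefix}_PROFILE"] = "1"
    env[f"{prefix}_PROFILE_CONFIG"] = str(config_path)
    # Assumption: these two variables choose the log file and CSV format.
    env[f"{prefix}_PROFILE_LOG"] = str(log_path)
    env[f"{prefix}_PROFILE_CSV"] = "1"
    return env


def profile_in_batches(command, options, per_run=4, workdir=".", prefix="OPENCL"):
    """Run the command once per batch of options; return the log paths."""
    workdir = Path(workdir)
    logs = []
    for n, batch in enumerate(batches(options, per_run)):
        config = workdir / f"profile_config_{n}.txt"  # hypothetical file name
        log = workdir / f"profile_log_{n}.csv"        # hypothetical file name
        # Assumption: the config file lists one option per line.
        config.write_text("\n".join(batch) + "\n")
        subprocess.run(command, env=profiler_env(config, log, prefix), check=True)
        logs.append(log)
    return logs
```

A call such as profile_in_batches(["./my_opencl_app"], options), where ./my_opencl_app stands in for your own executable, would then leave one log file per batch in the working directory.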

The problem I have is that I cannot get all the parameters. The profiler simply reports "invalid profiler option" for some of them.
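To sort the accepted options from the rejected ones automatically, something like the following Python sketch could scan the profiler output. It assumes the rejection message appears on a line containing the word "invalid" together with the option name; the exact wording and layout of that message is a guess and may need adjusting. The function name is made up for illustration.

```python
import re


def rejected_options(log_text, options, marker="invalid"):
    """Return the options named on log lines that contain the marker.

    Assumption: the profiler prints the rejected option name on the same
    line as its "invalid ..." message.
    """
    wanted = set(options)
    rejected = []
    for line in log_text.splitlines():
        if marker not in line.lower():
            continue
        # Match whole identifiers so a short option name does not
        # accidentally match inside a longer one.
        for token in re.findall(r"[A-Za-z0-9_]+", line):
            if token in wanted and token not in rejected:
                rejected.append(token)
    return rejected
```

Running one option per config file and feeding each log through this function would give a list of the options the installed profiler does not accept.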

These are the ones for which I cannot obtain any data at all:

I don’t want to use the Visual Profiler because I want to automate the whole procedure.

Does anyone have any ideas on how to obtain this data? Specifically, what should be written in the config file for the profiler to collect it?

Are you using the NVIDIA runtime?
I had the same problem before. You can try using the CUDA keywords instead of the OpenCL keywords; you can find them in Compute_Profiler.txt.
For example, “gridsize” and “threadblocksize” will be replaced with “ndrangesizeX, ndrangesizeY, ndrangesizeZ, workgroupsizeX, workgroupsizeY, workgroupsizeZ” or so.
Hope this helps.

Thanks. I had not actually looked into the doc folder of the Visual Profiler. All the information I needed was in the PDF. Your suggestion prompted me to look into the doc folder.

There is still one unresolved issue.

When I try to profile tex1_cache_sector_queries / tex1_cache_sector_misses, I get an invalid config option error. I am using a Tesla C2050 (Fermi architecture, compute capability 2.0) with CUDA 4.0 and the latest OpenCL version. Is this an architecture limitation or something else?