I’m using CUDA 2.1 under Fedora 10 with a GTX295. I have some questions about cudaprof 1.1.
First, in the “Session Settings” window, the checkboxes for “gld uncoalesced” and “gst uncoalesced” are disabled, and I can’t find any way to enable them. Is this a feature that hasn’t been implemented yet, or is there some way I can get it to count uncoalesced reads and writes?
Second, what is the meaning of the “cta launched” value? The manual gives the completely unhelpful definition, “Number of CTAs launched on the PM TPC.” Since the CUDA programming guide makes no mention of what a “CTA” or “TPC” is, I have no idea what this means.
Those counters are not active on G200-based GPUs. NVIDIA still hasn’t updated them to work. I would guess that the profiler has detected that you are on a G200 GPU and has deactivated them, though I’m not positive on that one.
This is just different lingo that the graphics and really low-level hardware guys use.
CTA = block
TPC = texture processing cluster (group of two or three multiprocessors)
PM = performance measured??? I don’t know on this one
Basically, ignoring the lingo, only one multiprocessor has the performance counters. So if you launch 1000 blocks and your calculation load is pretty even, cta_launched should average out to around 1000/30 (30 being the number of multiprocessors on the GPU). It really isn’t that useful a counter unless you want to calculate a standard deviation or something and see how poorly your load is balanced among blocks.
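To put numbers on that, here is a rough Python sketch of the arithmetic: the expected cta_launched value for an evenly balanced launch, plus the mean and standard deviation of a set of observed values. The function names and the sample values are made up for illustration. They are not profiler output.

```python
import math

def expected_cta_launched(total_blocks, num_multiprocessors):
    """Blocks expected on the one instrumented multiprocessor,
    assuming the blocks are spread evenly across all of them."""
    return total_blocks / num_multiprocessors

def mean_and_stddev(samples):
    """Population mean and standard deviation of observed counter values."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, math.sqrt(variance)

if __name__ == "__main__":
    # 1000 blocks over 30 multiprocessors, as in the example above.
    print("expected:", expected_cta_launched(1000, 30))

    # Hypothetical cta_launched values from several launches of one kernel.
    observed = [31, 35, 33, 36, 32]
    mean, stddev = mean_and_stddev(observed)
    print("mean:", mean, "stddev:", stddev)
```

A standard deviation that is large relative to the mean would suggest the blocks are not finishing at an even rate, which is about the only thing this counter can tell you.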
Thanks! I also have access to another computer with an 8800M running Windows XP, so I can do my profiling on that computer if necessary.
So I just installed the Cuda Profiler on that computer and tried it out. Sure enough, those checkboxes are enabled. But when I actually run my program, the only columns that appear in the output are Method, GPU Time, CPU Time, and Occupancy. That’s it. I’ve checked every check box there is to check in the Session Settings window, but none of the other values are actually getting reported. What am I missing?
Odd. You should at least be getting something. Check the status window at the bottom. For any column that is reported as all 0’s, the profiler removes it from the main window display and prints something like “column ‘local load’ having all zero values is hidden.”
Unless your kernel reads/writes nothing from/to global mem, there has to at least be something non-zero in some of the counters.
Open the session settings and double check that all the boxes are checked before clicking play. I just opened the profiler to test on my system and loaded a previous project. The session settings had gone back to the default of just GPU time, CPU time, and occupancy.
Open session settings and check all the checkboxes.
Click OK.
Bring up session settings again and verify that all the checkboxes are still checked.
Click Start. It runs and produces only the four columns I mentioned.
Bring up session settings yet again. None of the checkboxes are checked anymore.
Curiously, if I click OK and then start it from the main window, rather than clicking Start in Session Settings, the checkboxes do not get cleared. But it still only produces those columns.
Well, I’m about out of ideas. The only thing I’ve got left is: your app isn’t multi-GPU perchance? The v1.0 profiler would get data parsing errors on applications that opened up more than one GPU context. Maybe they “fixed” the problem in 1.1 by turning off all the counters if this is detected.
No, just one GPU in that machine and the app only uses one. And profiling the same application under Linux works fine (except that I don’t get counts for uncoalesced loads and stores).
Oh well, thanks for your help. It’s becoming clear to me that CUDA still isn’t as mature as one might like. :(
Actually the profiler is one of those things that is working quite well.
Does your program run more than once or only one time?
When selecting all counters your program should run 3 times to profile all of them. It might be that you have your maximum runtime too low and the profiler quits after that maximum time. Then it will only have data from that first run.
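As a rough illustration of why several runs are needed, here is a Python sketch that splits a list of counters into batches, one batch per run. It rests on two assumptions that are not stated in this thread: that the hardware can record at most four counters in a single run, and that the full set is the eleven names listed below (only some of them are mentioned here). Treat both as illustrative.

```python
# Assumed counter list; names are illustrative, not copied from the profiler.
COUNTERS = [
    "gld_uncoalesced", "gld_coalesced",
    "gst_uncoalesced", "gst_coalesced",
    "local_load", "local_store",
    "branch", "divergent_branch",
    "instructions", "warp_serialize",
    "cta_launched",
]

# Assumption: at most four counters can be recorded per run.
MAX_COUNTERS_PER_RUN = 4

def plan_runs(counters, per_run=MAX_COUNTERS_PER_RUN):
    """Split the requested counters into consecutive batches, one per run."""
    return [counters[i:i + per_run] for i in range(0, len(counters), per_run)]

if __name__ == "__main__":
    for run_number, batch in enumerate(plan_runs(COUNTERS), start=1):
        print("run", run_number, ":", ", ".join(batch))
```

Under those assumptions, eleven counters split into three batches, which would match the three runs. If the profiler stopped after the first run, only the first batch would have data.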
It runs three times. Each run takes only about five seconds, and all of them complete successfully. I have the maximum execution time set at 30 seconds.
It really seems like the other statistics just aren’t being generated. Maybe it’s just a coincidence, but it does seem striking that the only four columns that appear are the ones that you get in a profile log by setting CUDA_PROFILE=1. Are there different mechanisms used to generate different statistics? It seems like only one of those mechanisms is working.
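One way to check which fields a CUDA_PROFILE=1 log actually contains is to parse it directly. The Python sketch below assumes each log line is a series of `name=[ value ]` pairs. That format and the sample lines are recalled from memory and made up for illustration, so verify them against a real log before relying on this.

```python
import re

# Assumed log format: whitespace-separated "name=[ value ]" pairs per line.
FIELD = re.compile(r"(\w+)=\[\s*(.*?)\s*\]")

def parse_line(line):
    """Return a dict of field name -> raw string value for one log line."""
    return dict(FIELD.findall(line))

def fields_seen(lines):
    """List every field name that appears in the log, in first-seen order."""
    seen = []
    for line in lines:
        for name in parse_line(line):
            if name not in seen:
                seen.append(name)
    return seen

if __name__ == "__main__":
    # Hypothetical sample lines, not real profiler output.
    sample = [
        "method=[ memcopy ] gputime=[ 3.1 ] cputime=[ 9.0 ]",
        "method=[ my_kernel ] gputime=[ 12.5 ] cputime=[ 20.0 ] occupancy=[ 0.667 ]",
    ]
    print(fields_seen(sample))
```

If the counters were being collected, their names would be expected to show up here alongside the timing fields. Seeing only the timing fields and occupancy would point at the counters never being recorded rather than being hidden by the GUI.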