Why is the profiler screwy?

Played with the profiler today. The uncoalesced global loads counter is not working right (on my 8600). When I enable a single line in my large kernel, the profiler jumps from very few uncoalesced loads to 100k (vs. 300k coalesced). The profiler also starts reporting lots of warp serializations and divergent branches. Not only does this result not make sense given the source code, but the total runtime of the kernel changes by only a percent or two as a result of uncommenting this line. What’s going on with the profiler? To what extent can it be trusted?

Can it be due to the fact that only 1 multiprocessor is profiled? There are some remarks about this in the release notes. I believe there was a bit about a minimum number of blocks needed. Also, I think it is mentioned that you should take the profiler output as a guide to whether you are going in the right or wrong direction with your optimizations, not as a hard measure.

You put me on the right track.

When I feed in uniform data, I have zero divergent branches, 320k coalesced reads, and 10k uncoalesced. When I feed in real, varied data, I have 1300 divergent branches, 260k coalesced reads, and 950k uncoalesced. And 30% worse performance.

OK, I can infer a few things: uncoalesced reads should be divided by 16 when comparing them to coalesced reads, and both should be divided by 32 to get a source-code-level value.
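To make that arithmetic concrete, here is a rough sketch in plain host-side C++. The struct and function names are made up for illustration, and the factors 16 and 32 are only my inference from the counters, not anything documented.

```cpp
#include <cassert>

// Hypothetical container for the two raw profiler counters.
struct LoadCounts {
    double coalesced;
    double uncoalesced;
};

// Assumption (inferred, not documented): one uncoalesced load is counted
// 16 times, once per thread of a half-warp.
constexpr double kCountsPerUncoalescedLoad = 16.0;
// Assumption (inferred, not documented): dividing by the warp size gives a
// source-code-level count.
constexpr double kThreadsPerWarp = 32.0;

// Put both counters into the same unit so they can be added and compared.
double comparableLoads(const LoadCounts& raw) {
    assert(raw.coalesced >= 0.0 && raw.uncoalesced >= 0.0);
    return raw.coalesced + raw.uncoalesced / kCountsPerUncoalescedLoad;
}

// Scale the comparable total down to a per-source-line estimate.
double sourceLevelLoads(const LoadCounts& raw) {
    return comparableLoads(raw) / kThreadsPerWarp;
}
```

With the numbers from the two runs, `comparableLoads(LoadCounts{320000.0, 10000.0})` gives 320625 and `comparableLoads(LoadCounts{260000.0, 950000.0})` gives 319375. The totals nearly match, which is why the factor of 16 looks plausible.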

BUT I think something is still amiss: are these really uncoalesced reads?? My performance doesn’t drop by enough, and in fact it’d be trivial for the GPU to actually send a coalesced read in case of divergence and throw out the unneeded numbers. I’ll bet G200 does this, but it’s quite possible G80 does too. Remember, the profiler has no knowledge of any coalescing hardware. On G200 it just turns itself off. If G80 is capable of any auto-coalescing, the profiler will probably not work right either.

P.S. The stuff about “using it as a guide” was just in the section of “we’re not gonna tell you the architectural details of what this all means, so just squint your eyes and try to use it anyway.” But in this case it seems the profiler is completely off and can’t be used for judging the “right direction” too well either.

Or I dunno, maybe the uncoalesced reads are there. The instruction count increases only 6%. That’s a direct measure of the impact of the divergent branches, right? So that’s not much. However, I’m only running 64 threads per MP and really not hiding my latency. If uncoalesced reads impact bandwidth more than latency, then the 30% is accounted for. I should see big jumps with more threads, which I’ll look for.

Now, any ideas to fix the non-coalescence? Is it possible to globally force predication instead of branching?
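I don't know of a switch that forces predication everywhere, so the only thing I can think of is doing it by hand: make the load unconditional and pick the result afterwards. A minimal sketch in plain C++ (the helper name and parameters are hypothetical; in a kernel this would be a `__device__` function, and whether the compiler really emits a predicated select instead of a branch is an assumption I haven't checked):

```cpp
#include <cstddef>

// Hypothetical helper: every thread performs the load, and threads that do
// not need the value discard it afterwards. The caller must guarantee that
// `index` is valid for all threads, including the ones that discard.
inline float loadThenSelect(const float* data, std::size_t index,
                            bool wanted, float fallback) {
    const float value = data[index];   // unconditional load, same pattern for all threads
    return wanted ? value : fallback;  // simple select with no side effects
}
```

The point of this shape is that the memory access no longer sits inside the divergent region, so the access pattern across the half-warp stays the same no matter which way the condition goes.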

From the programming guide:

This is true for CM1.0.

I was right: coalesced accesses in diverged threads are still coalesced, and the profiler doesn’t know the difference!
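Here is a small host-side C++ model of the rule as I understand it, for anyone who wants to check an access pattern by hand. It assumes 4-byte words, a half-warp of 16 threads, and that thread k must hit word k of a segment aligned to 16 words, with inactive threads simply ignored. The function name and the `-1` marker for an inactive thread are my own; this is a sketch of my reading of the guide, not a description of what the hardware or the profiler actually does.

```cpp
#include <array>
#include <cstdint>

constexpr int kHalfWarp = 16;
constexpr std::int64_t kInactive = -1;  // hypothetical marker: thread does not access memory

// wordIndex[lane] is the index of the 4-byte word that lane reads,
// or kInactive if the lane is masked off by divergence.
bool isCoalesced(const std::array<std::int64_t, kHalfWarp>& wordIndex) {
    std::int64_t base = kInactive;
    for (int lane = 0; lane < kHalfWarp; ++lane) {
        if (wordIndex[lane] == kInactive) continue;  // inactive lanes do not matter
        const std::int64_t candidate = wordIndex[lane] - lane;
        // The segment start must be non-negative and aligned to 16 words.
        if (candidate < 0 || candidate % kHalfWarp != 0) return false;
        // All active lanes must agree on the same segment.
        if (base != kInactive && candidate != base) return false;
        base = candidate;
    }
    return true;  // also true when no lane is active
}
```

Under this model, a half-warp reading words 32 to 47 in order is coalesced, and it stays coalesced if a few lanes are switched off. It only breaks if an active lane reads out of sequence or the segment is misaligned.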

Well, PM someone like mfatica with a bug report :)