Code Performance Issues

I have a weird problem with my code's performance. I have not used shared memory yet. I am comparing two versions of the SAME code, one with 9 kernels and one with 2 kernels (with increased memory usage), but performance seems to have dropped in the latter. Here are my profiler results for the two: CUDA Profiles. Can you please compare them and let me know why the second one would give me less performance gain than the first? I know there are a lot of global memory accesses in the code, but that is the same for both (net, the same number of variables needs to be accessed) since neither uses shared memory. For both, though, I have placed some data in constant memory, and maxrregcount is set to 20. Net branching for the 9-kernel version is higher than for the 2-kernel version.

I am baffled. Can someone please suggest something?

Thanks in advance.

Regards,

Aditi

Please try this link if the above doesn’t work: CUDA Profiles

More code per kernel means more register use. You are limiting the register use, so registers spill to local memory, which is SLOW. I would suggest trying without the --maxrregcount option.
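To check whether a kernel really spills, you can ask the runtime how many registers and how much local memory it uses per thread. Here is a minimal sketch, assuming a CUDA toolkit that provides cudaFuncGetAttributes; fusedKernel is a hypothetical stand-in for one of your own kernels, not anything from your code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the fused kernels (assumption, not the poster's code).
__global__ void fusedKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    // Query the compiled kernel's resource usage.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, fusedKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // A nonzero local memory size means the kernel uses local memory,
    // either from register spills or from per-thread arrays.
    printf("registers per thread: %d\n", attr.numRegs);
    printf("local memory per thread (bytes): %zu\n", attr.localSizeBytes);
    return 0;
}
```

If the local memory size is 0 with --maxrregcount in place, spilling is probably not the cause. Compiling with nvcc -Xptxas -v prints similar per-kernel register and local memory figures at build time.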

I have no idea about the increased branching in your profiler results though.

Christian

Thanks for your reply Christian.

I thought the register limit was linked to occupancy/concurrency. The register requirement is the same for the two codes, and the occupancy is also the same (67%). --maxrregcount was set to 20 (the maximum required) so that there are no local memory spills. Do you still think that local memory access could be the reason?

But I think the net number of instructions and the net number of branches are smaller in the 2-kernel case than in the 9-kernel case.

There is also a cost associated with kernel invocation. Depending on what you are doing, some things cross my mind:
1-The kernel's job doesn't require all threads of all blocks to finish their work, so you do more work per thread with fewer kernel invocations, and the GPU is not left waiting for more work.
2-The two kernels have fewer memory accesses to the GPU in total than the 9 kernels (again, not sure, it depends on your code), which may degrade your performance in the 9-kernel case.
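To get a feel for the invocation cost mentioned above, you could time a batch of empty kernel launches with CUDA events. This is only a rough sketch: emptyKernel and the launch count are made up for illustration, and it measures back-to-back launches of a kernel that does nothing, so the numbers for real kernels will differ.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel that does no work, so the measured time is dominated by launch cost.
__global__ void emptyKernel() {}

int main()
{
    const int launches = 1000;  // arbitrary count chosen for illustration

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time initialization is not included in the timing.
    emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all launches have completed

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("average cost per launch: %.3f us\n", ms * 1000.0f / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Multiplying the per-launch figure by the difference in launch count (9 versus 2 per iteration) gives a rough idea of how much of the timing gap launch overhead alone could explain.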