Code Performance Issues

Aditi · June 10, 2009, 4:51pm

I have a weird problem with my code's performance. I have not used shared memory yet. I am trying to compare the performance of my code split into 9 kernels vs. 2 kernels (with increased memory usage) for the SAME code, but the performance seems to have dropped in the latter. Here are my profiler results for the two: CUDA Profiles. Can you please compare the two and let me know why the second one would give me less performance gain than the first one? I know there are a lot of global memory accesses in the code, but it's the same for both (net, the same number of variables needs to be accessed) since neither uses shared memory. For both I have placed some data in constant memory, though, and maxrregcount is set to 20. Net branching for the one with 9 kernels is higher than for the one with 2.

I am baffled. Can someone please suggest something?

Thanks in advance.

Regards,

Aditi

Aditi · June 10, 2009, 4:58pm

Please try this link if the above doesn’t work: CUDA Profiles

cbuchner1 · June 10, 2009, 5:36pm

More code per kernel means more register use. You are limiting the register use, so the registers spill to local memory, which is SLOW. I would suggest trying without the --maxrregcount option.
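One way to check whether a kernel actually ends up using local memory is to ask the runtime for its attributes. The following is only a minimal sketch: mergedKernel is a hypothetical stand-in for one of the merged kernels, not code from this thread. Compiling with nvcc --ptxas-options=-v also reports register and local memory usage per kernel at build time. Note that a non-zero local memory size can come from register spills, but also from per-thread arrays.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one of the merged kernels (assumption, not the poster's code).
__global__ void mergedKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = x * x + 2.0f * x + 1.0f;
    }
}

int main()
{
    // Query the compiled kernel's resource usage from the runtime.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, mergedKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // numRegs: registers per thread; localSizeBytes: local memory per thread.
    printf("registers per thread: %d\n", attr.numRegs);
    printf("local memory per thread (bytes): %zu\n", attr.localSizeBytes);

    if (attr.localSizeBytes > 0) {
        printf("kernel uses local memory (possible register spills)\n");
    }
    return 0;
}
```

Building the same file once with and once without --maxrregcount and comparing the two reports would show whether the register cap is what pushes data into local memory.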

I have no idea about the increased branching in your profiler results though.

Christian

Aditi · June 11, 2009, 4:41pm

Thanks for your reply Christian.

I thought the limit on registers was linked to occupancy/concurrency. The register requirement is the same for the two codes, and the occupancy is also the same (67%). --maxrregcount was set to 20 (the max required) so that there are no local memory spills. Would you still think that local memory access could be a reason?

But I think the net number of instructions and net number of branches is smaller in the 2-kernel case than for the 9-kernel case.

nosoul · June 14, 2009, 3:22am

There is also an associated cost with kernel invocation. Depending on what you are doing, a few things come to mind:
1- The kernels' job doesn't require all threads of all blocks to finish their work, so you do more work per thread with fewer kernel invocations, without making the GPU wait for more work.
2- The two kernels make fewer memory accesses on the GPU in total than the 9 kernels (again, not sure; it depends on your code), which may degrade your performance in the 9-kernel case.
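The two points above can be illustrated with a small sketch that times a split pipeline against a fused kernel. Everything here is an assumption made for illustration: the kernel names (scaleKernel, offsetKernel, fusedKernel), the arithmetic, and the problem size are hypothetical and unrelated to the code discussed in this thread. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Split version, step 1: writes an intermediate result to global memory.
__global__ void scaleKernel(const float *in, float *tmp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = 2.0f * in[i];
}

// Split version, step 2: reads the intermediate back from global memory.
__global__ void offsetKernel(const float *tmp, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + 1.0f;
}

// Fused version: the intermediate value stays in a register.
__global__ void fusedKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = 2.0f * in[i];
        out[i] = t + 1.0f;
    }
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *in, *tmp, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&tmp, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msSplit = 0.0f, msFused = 0.0f;

    // Warm-up launch so one-time initialization is not included in the timings.
    fusedKernel<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    // Two launches, one extra global write and one extra global read per element.
    cudaEventRecord(start);
    scaleKernel<<<blocks, threads>>>(in, tmp, n);
    offsetKernel<<<blocks, threads>>>(tmp, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msSplit, start, stop);

    // One launch, no intermediate traffic through global memory.
    cudaEventRecord(start);
    fusedKernel<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msFused, start, stop);

    printf("split (2 launches): %f ms\n", msSplit);
    printf("fused (1 launch):   %f ms\n", msFused);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(tmp);
    cudaFree(out);
    return 0;
}
```

Whether fusing helps in a real code depends on what the merged kernel costs in registers and local memory, so a measurement like this is a starting point rather than a prediction.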
