I have a weird problem with my code's performance. I have not used shared memory yet. I am comparing two versions of the SAME code, one split into 9 kernels and one into 2 kernels (with increased memory usage), but performance seems to have dropped in the 2-kernel version. Here are my profiler results for the two: CUDA Profiles. Can you please compare them and let me know why the second version would give me less performance gain than the first? I know there are a lot of global memory accesses in the code, but that is the same for both (in total, the same number of variables needs to be accessed), since neither uses shared memory. For both versions I have placed some data in constant memory, and maxrregcount is set to 20. Total branching in the 9-kernel version is higher than in the 2-kernel version.
I am baffled. Can someone please suggest something?
Thanks in advance.