Reasons why splitting a large kernel into smaller ones lowers performance

I am trying to understand what might cause the overhead in my kernel launches.

Each kernel consists of 1024 threads and does a lot of work; for example, each thread processes 1000 data sets over approximately 100 seconds.

After breaking the kernel into smaller pieces (10x smaller), instead of launching one kernel I launch ten kernels one after another.
The total overhead (for 10 kernel launches and memory copies) is small, much less than one second (I confirmed this by testing).
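For reference, this is roughly how I measured the overhead: CUDA events around the launch loop. This is only a minimal sketch; the kernel body and data sizes here are placeholders standing in for the real work.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: stands in for the real per-thread workload.
__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int k = 0; k < 10; ++k)        // ten back-to-back launches
        work<<<1, 1024>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("10 launches took %.3f ms\n", ms);

    cudaFree(d);
    return 0;
}
```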

The total processing time should still be around 100 seconds,
but it is slower by almost 10%.

The code base is large and the kernel footprint is large as well, but that still should not have such an effect on the total time.

I am trying to find the problem using divide and conquer, but without success so far; I think it is not confined to a single line/function.

Any ideas what it might be, or how I should attack this problem?

It is impossible to diagnose this based on the scant information provided. The reason behind your observation could be anything from occupancy issues to details of code generation to efficiency of data movement, or even just your timing methodology.

I would strongly suggest familiarizing yourself with the CUDA profiler and letting it guide your optimization efforts. It is not clear how you determined that breaking the original kernel into ten pieces might be a good alternative. Maybe some other partitioning would be more advantageous, or the monolithic kernel approach might be best after all.

If you mean 1024 threads (as opposed to 1024 threads per block) then this is a tiny little kernel. A key aspect of GPU acceleration is latency hiding. It would be quite rare for a kernel of 1024 threads to be able to do a good job of latency hiding, so I would expect latency (perhaps in several forms including arithmetic as well as memory latency) to be an issue for such a kernel. Subdividing such a kernel would not be the correct approach to improve performance.

As njuffa said, starting with a profiler is a good idea. There are good CUDA-specific presentations on analysis-driven optimization. For example, google this: “paulius analysis optimization gtc” and you will quickly find a few.

In a nutshell, you could start out with nvprof, and start to get a feel for some of the metrics. At a high level, you want to assess whether a kernel is compute-bound, memory-bound, or latency-bound. In particular, you could start with the gld_efficiency and sm_efficiency metrics as very high-level steering indicators to decide on the next steps.
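For example, collecting just those two metrics looks something like this (the application name is a placeholder for your own executable):

>nvprof --metrics gld_efficiency,sm_efficiency myapp.exe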

The results of this analysis can/should suggest optimization strategies. For example, if the gld_efficiency metric is low (significantly less than 50%), you would want to analyze global memory access patterns for efficiency (e.g. coalescing). If you do your analysis on a modern (sm_52) architecture, you can even let the profiler trace efficiency down to a particular line of source code. But even without that, hopefully you have enough understanding of the code to figure out where the global load traffic is originating.
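To illustrate what a coalescing analysis is looking for, here is a hedged sketch of two access patterns over the same data (not your code). In the first, adjacent threads read adjacent elements, so each warp's loads combine into a few memory transactions; in the second, each thread walks its own contiguous chunk, so a warp touches 32 widely separated cache lines and gld_efficiency drops.

```cuda
// Coalesced: thread i touches element i; a warp reads one contiguous span.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Strided: thread t owns elements [t*chunk, (t+1)*chunk); a warp's 32
// simultaneous loads hit 32 widely separated addresses.
__global__ void strided(const float *in, float *out, int n, int chunk) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int j = 0; j < chunk; ++j) {
        int i = t * chunk + j;
        if (i < n) out[i] = in[i] * 2.0f;
    }
}
```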

Likewise, if the sm_efficiency metric is low (significantly less than 90%), then to some degree your performance is bound by latency. The usual recipe here is to try to “expose more parallelism” possibly via increase in ILP, work per thread, occupancy, and/or simply increasing the number of threads in your grid. As a simple example related to your description, you mention that each thread processes 1000 data sets. Can this be further parallelized? Is there any inherent sequential dependency in the processing of these 1000 data sets per thread? If not, try increasing your total thread count (in the grid) by a factor of 10 and have each thread process only 100 data sets instead of 1000.
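That last suggestion could be sketched like this, assuming the 1000 data sets per thread really are independent; process_one() is a placeholder for whatever one data set's processing involves.

```cuda
// Placeholder for the real per-data-set work.
__device__ void process_one(int thread_id, int set_id) { /* ... */ }

// Before: 1024 threads, each looping over all 1000 data sets.
__global__ void monolithic() {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int s = 0; s < 1000; ++s)
        process_one(t, s);
}

// After: 10x the threads, each looping over only 100 data sets.
__global__ void widened() {
    int t    = blockIdx.x * blockDim.x + threadIdx.x;
    int base = (t % 10) * 100;           // which slice of the 1000 sets
    for (int s = 0; s < 100; ++s)
        process_one(t / 10, base + s);   // t/10 recovers the original thread id
}
```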

If your kernel is actually only 1024 threads, I’d be very surprised if it had a good sm_efficiency number, if you are running on anything but the smallest of GPUs.

My rule of thumb that I suggest to others is that problem sizes of less than 10,000 threads will often be latency-bound, and that number will increase as you move to larger GPUs and newer architectures.

As an additional comment, if the reason you have limited your kernel to 1024 threads is that you want to be able to do a kernel launch like this:

kernel<<<1, 1024>>>(...);

then you are seriously underutilizing almost any CUDA GPU available. Kernel launches like this:

kernel<<<1, N>>>(...);    // a single block

are serious performance limiters in CUDA. If you ever sit down to write such code where you care about performance, you should immediately stop and rethink things.
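A common alternative is to size the grid to the problem, or use a grid-stride loop so that any reasonable grid size covers all elements. A hedged sketch (kernel and variable names are illustrative only):

```cuda
// Grid-stride loop: every thread strides through the data, so the grid
// can be sized to fill the GPU regardless of n.
__global__ void work(float *data, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= 2.0f;
}

// Launch with enough blocks to occupy the machine, e.g.:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   work<<<blocks, threads>>>(d_data, n);
```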

First of all, thank you all for your support, especially njuffa and txbob; you have always given insights (in previous topics as well), and I didn’t thank you enough before.

Here is some more info regarding your answers.

I am familiar with the CUDA profiler and I’ve already made a lot of improvements to the code. Currently the bottleneck is memory coalescing; because the code is very complex and large, moving the data to a coalescing-friendly design is a work in progress (each change brings a performance improvement). I have been working on it for some time, and in some projects I have already reached high utilization (over 80% for both computation and memory).

The 1024 threads is just an example, as I can run multiple streams with 1024 threads each, but the lower performance occurs in both scenarios: when running 10 streams with 1024 threads each, or just a single stream with 1024 threads.
I can merge the 10 streams into one large launch with 10k threads, but it does not change anything. (We can run much more than 10; it is just an issue of available memory, because each thread requires some memory. Currently we have some open bugs regarding this with NVIDIA.)
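For concreteness, the two configurations I am comparing look roughly like this (kernel and buffer names are placeholders, not our real code):

```cuda
// Ten streams, each launching one 1024-thread kernel on its own slice:
cudaStream_t streams[10];
for (int i = 0; i < 10; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < 10; ++i)
    work<<<1, 1024, 0, streams[i]>>>(d_data + i * 1024, 1024);

// vs. one launch covering the same work with 10 * 1024 threads:
work<<<10, 1024>>>(d_data, 10 * 1024);
```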

That is a nice idea; I’ll try to get a card with sm_52 and do some testing. It might save me some time in finding the hot spots.

No, because each data set requires the results of all previous data sets; each thread must calculate all its data in order. What you suggest would allow utilizing a higher percentage of the GPU’s capabilities, but not improve single-thread performance.

Currently each thread has more than enough work to do, but we still have a limit on the number of threads we can launch
(each thread requires a fixed amount of data to work on, and because of some bugs (I think in the compiler) there is not enough free memory to utilize the full performance of the GPU; this will eventually be fixed).
So for now I’ll try to improve the performance of a single thread, or rather a pack of 1024 threads.

We still have some issues/bugs, I think in the compiler, that have some effect. Hopefully, after some legal paperwork, I will be allowed to send a test case (based on portions of our code) regarding some issues I have posted here in recent months. However, these issues do not affect the performance.

While writing this reply I tried to simplify things by analyzing the same test with one thread, and I got exactly the same slowdown when breaking the kernel into 10 pieces or into 100 pieces.
Setting aside utilization of the card’s full potential, since that is not what is being tested here, I still cannot understand what can cause the performance to become slower when running 10 kernels that each process 100 data sets. Perhaps this performance issue will be gone when I finish with the memory coalescing, but it does not make sense.

If you’re using 10 streams then you might want to try altering the CUDA_DEVICE_MAX_CONNECTIONS environment variable.

It’s described here and here.

Caveat: I’ve never seen this environment variable impact performance. Maybe someone else can explain when it actually does improve performance.

Later… tracing the concurrentKernels CUDA sample app reveals that setting CUDA_DEVICE_MAX_CONNECTIONS can alter the number of concurrent kernels being launched but doesn’t appear to be a useful knob in this context.

Here are some observations on a Win10/x64 + GTX 980 (sm_52) workstation:

>nvprof --print-gpu-trace concurrentKernels.exe --device=0 --nkernels=64
  • set CUDA_DEVICE_MAX_CONNECTIONS=1 -- 3 sequential grid cohorts
  • set CUDA_DEVICE_MAX_CONNECTIONS=2 -- 4 sequential grid cohorts
  • set CUDA_DEVICE_MAX_CONNECTIONS=4 -- 2 sequential grid cohorts (32 concurrent kernels)
  • set CUDA_DEVICE_MAX_CONNECTIONS=8 -- 2 sequential grid cohorts (32 concurrent kernels)
  • set CUDA_DEVICE_MAX_CONNECTIONS=16 -- 2 sequential grid cohorts (32 concurrent kernels)
  • set CUDA_DEVICE_MAX_CONNECTIONS=32 -- 2 sequential grid cohorts (32 concurrent kernels)

Summary: Don’t bother setting CUDA_DEVICE_MAX_CONNECTIONS.

Other observations: a K620 and GTX 750 Ti are sm_50 devices and will launch up to 16 concurrent kernels.

So is CUDA_DEVICE_MAX_CONNECTIONS now a vestigial/cargo-cult setting?