Launching the same kernel N times


I am profiling an application that launches the same kernel N times (the kernel launch is inside a for loop). I am using the range option with ncu, and I have added cudaProfilerStart and cudaProfilerStop before and after the for loop. I am surprised that the system memory bandwidth used declines when I increase N.
Could someone explain this behavior?

The timing is also multiplied by 100,000. Shouldn’t I get the same throughput in that case?
The throughput I am getting is 7 GB/s. How does this represent 63% of the peak throughput, which is 236 GB/s on a Jetson Xavier?

Thank you
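
The mismatch behind the percentage question above can be checked with a little arithmetic. This is only a sketch using the numbers quoted in the post (7 GB/s measured, 63% reported, 236 GB/s assumed peak); it does not say which peak value the tool actually uses, which the thread does not establish.

```shell
# Numbers quoted in the post above; nothing here is read from the ncu reports.
measured_gbps=7        # throughput shown by the profiler
claimed_peak_gbps=236  # peak bandwidth the poster assumes for the device
reported_pct=63        # percentage shown in the "peak" cell

# What 7 GB/s would be as a percentage of the assumed 236 GB/s peak.
pct_of_claimed=$(awk -v m="$measured_gbps" -v p="$claimed_peak_gbps" \
  'BEGIN { printf "%.1f", 100 * m / p }')

# What peak would make 7 GB/s equal to 63%.
implied_peak_gbps=$(awk -v m="$measured_gbps" -v r="$reported_pct" \
  'BEGIN { printf "%.1f", 100 * m / r }')

echo "7 GB/s is ${pct_of_claimed}% of ${claimed_peak_gbps} GB/s"
echo "63% at 7 GB/s implies a reference peak of about ${implied_peak_gbps} GB/s"
```

So the 63% figure cannot be relative to 236 GB/s; it has to be measured against some much smaller reference value.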

Are you able to share the 2 ncu report files? What is the value for N in each screenshot? It’s hard to say what’s going on without the full picture.

Yes, sure. The value of N is 100,000. You can find the reports attached.
Thanks a lot for your support.
prof_4M_S4_r0_2.ncu-rep (115.7 KB)
prof_4M_S4_r0_loop.ncu-rep (53.3 KB)

Hello again,

Do you have any feedback about it? Actually, I didn’t understand the value in the peak cell. What does it represent? If it represents the bandwidth usage of the main memory, then with 7 GB/s the 63% cannot be the percentage of the maximum memory bandwidth, which is 236 GB/s.

If I’m interpreting the results correctly, it looks like your range takes 2.9 s for 100,000 kernels and 290 µs for the single-kernel result. This is 10,000 times longer but 100,000 times more work. It could be that profiling only the first kernel isn’t representative of the activity. Can you try running the single-kernel profile but, instead of collecting the first kernel, use a “Launch Skip Count” setting of 10 or 100 in the filter options? This way the system is hopefully in more of a steady state. Also, select the “Cache Control” option of “Flush None”, because in the range use case the cache isn’t flushed between kernels. Let’s see if that makes the results more comparable. Then we can look at the throughput number more closely.
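
The "10,000 times longer but 100,000 times more work" comparison can be spelled out as follows. This is a rough sketch using the approximate durations quoted in the reply, not values re-read from the report files.

```shell
# Approximate figures quoted in the reply above.
range_seconds=2.9   # duration of the range covering all launches
single_usec=290     # duration of the single profiled kernel
launches=100000     # N, the number of launches inside the range

# Average time per kernel inside the range, in microseconds.
per_kernel_usec=$(awk -v t="$range_seconds" -v n="$launches" \
  'BEGIN { printf "%.0f", t * 1e6 / n }')

# How many times slower the isolated kernel is than the in-range average.
slowdown=$(awk -v s="$single_usec" -v p="$per_kernel_usec" \
  'BEGIN { printf "%.0f", s / p }')

echo "average kernel time inside the range: ${per_kernel_usec} us"
echo "isolated first kernel is about ${slowdown}x slower than that average"
```

A tenfold gap between the isolated first launch and the in-loop average is why the first kernel may not be representative of steady-state behavior.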

Okay, thank you. I will try it tomorrow.

I am trying to profile with the “--launch-skip” and “--cache-control” options, but I get an error saying that no kernel was profiled. After several tests, I found that the kernel is not profiled whenever I pass the “--launch-skip” option. For reference, here is the command line I use to launch ncu:

sudo ncu -o prof_2M_S4_r0_ls10_%i --set full --import-source yes  --launch-skip 100 --cache-control none    ./stress_opt

and here is the result obtained:

I also tried with --target-processes all.

Thank you

Were you running from the CLI for the first collections as well? Sometimes using sudo can change the environment. Am I understanding correctly that if you run the same command but omit the launch-skip option it runs fine? Can you try with skip count 1, does that have the same result? This one should be without the range, right?

Yes, I always run from the CLI.
If I run without sudo, I get this error:

Insufficient privileges to launch app for profiling. Launch app with root privileges

Yes, that is correct.

I tried launching it with a skip count of 1 and got the same error. And yes, that was without using range.

Thank you for your support.

When you launch without skip count, do you get one kernel profiled or thousands? The range profile and your description seem to indicate that thousands of kernels are launched when you run this application, so I’m trying to figure out why skipping just one kernel would result in no kernels being profiled.


Sorry for the misunderstanding. Actually, I have 2 applications:

  1. One that launches the kernel 1 time.
  2. One that launches the kernel 100,000 times.

I tried to apply the skip count to the application that launches the kernel 1 time.

Thank you

Can you try with the skip count on the version that launches thousands?


Sorry for the late feedback.

I tried it with the version that launches the kernel 100,000 times and it worked, since I am getting
==PROF== Profiling "memoryKernelSingleSM" - 189: 0%....50%....100% - 76 passes

However, profiling all 100,000 launches this way would take maybe a day.

Update: I tried with a skip count of 99999 and it profiled one launch. With that option, I get the same peak % throughput as for a single kernel; however, I am not sure this means that running the kernel N times will give the same peak %.

Thank you for your support.

You can use the “--launch-count 1” flag to profile only one kernel, or 2 or 3, etc., if you want to compare a few kernels.

Sorry, I didn’t understand. In my case when I skip 99999 I am profiling the last kernel but in which case? Is it profiled after all the other 99999 have run? What about the caches? In the case of skipping the first 99999 kernels, does ncu flush all the caches before the first pass?

Thank you again.

I am profiling the last kernel but in which case?

I don’t know what you mean by “case” here. You should run the version that launches the kernel 100,000 times. You probably shouldn’t profile the last kernel, just one in the middle after the system reaches a steady state. For example, you could skip 100 and then profile 3 with “--launch-skip 100 --launch-count 3”. The application will still run all the kernels, but only launches 101, 102, and 103 will be profiled. Keep cache flushing turned off so the profile matches the behavior of the non-profiled run.
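
Putting the suggestion above together with the earlier command line, the invocation might look like the sketch below. The flags are the ones already named in this thread; the output name prof_steady_%i is a hypothetical placeholder, and the command is only printed here so it can be reviewed before running it (with sudo, as needed earlier in the thread).

```shell
# Values from the reply above; adjust them for your own run.
skip=100   # launches that run unprofiled first (--launch-skip)
count=3    # launches profiled after that (--launch-count)

# Which launches end up profiled (1-based), given the skip and count.
first_profiled=$((skip + 1))
last_profiled=$((skip + count))

# prof_steady_%i is a hypothetical output name; ./stress_opt is the
# application from the earlier command line in this thread.
ncu_cmd="ncu -o prof_steady_%i --set full --launch-skip $skip --launch-count $count --cache-control none ./stress_opt"

echo "profiles launches ${first_profiled} through ${last_profiled}"
echo "$ncu_cmd"
# To actually profile, run the printed command, e.g.: sudo $ncu_cmd
```

Skipping more launches than the application performs leaves nothing to profile, which matches the "no kernel was profiled" error seen with the single-launch application.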
