Device-to-L2 memory traffic less than L2-to-L1 traffic

Looking at the attached figure, I can see that the kernel uses 3.14GB of data stored in device memory (VRAM), which must traverse upward through the memory hierarchy (device memory to L2, then to L1). However, what confuses me is that 278GB of data goes from L2 to L1. I am wondering where that data comes from. I am profiling a kernel that I did not develop, so I do not have the exact implementation details. I just want to know how we can get 278GB of data starting at L2 without it having to traverse through device memory.

A plausible hypothesis that would explain these numbers is that the hit rate of the L1 cache is low, while the hit rate of the L2 cache is high. Since the discrepancy is large, one might further hypothesize that the L1 hit rate is really low, just a bit over 1%. Check the detailed cache data from the profiler to validate the above hypotheses.

If the L1 hit rate is very low and you have the source code for this application, use the profiler to pinpoint which accesses are responsible for L1 cache thrashing and try to improve the locality of access, or try to reduce the data volume, for example through a simple form of “data compression”: using the narrowest datatype possible.
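
As a minimal sketch of that idea (the kernel and names are hypothetical, and it assumes the values are known to fit in 16 bits): storing the data as unsigned short instead of int halves the bytes each load has to move through L2 and L1.

// Hypothetical: values fit in 16 bits, so they are stored as unsigned short
// (or __half for low-precision floats) rather than int/float.
__global__ void accumulate(const unsigned short *vals, int n, unsigned long long *sum)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, (unsigned long long)vals[i]);   // half the load traffic of an int array
}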

If the L2 hit rate is really high, I would not take any drastic measures to improve the L1 hit rate. If occupancy is high, most of the latency tolerance comes from running numerous threads, and the higher latency from accessing the L2 cache is easily covered. From a power draw perspective it would be best to improve L1 hit rate, though.

Data reuse. It’s the name of the game for a cache.

The profiler prints the hit rates right on the diagram.

L1: 2.45%
L2: 103.24% (LoL)

The profiler also color codes the various paths to show percent of peak utilization of a particular path. Since they are all black, I get the impression it is a compute-bound kernel.

On my screen, none of the text in the above image is readable: I see ant droppings.

Sure. Just pointing it out. If you click on the image it expands (at least, in my chrome browser it does).

~2B global instructions producing ~9B global space requests, triggering 280GB of data reads, so about 32 bytes/request. Each instruction retrieves roughly 128 bytes. Probably mostly int or float data. Nothing particularly noteworthy there. A compute-bound kernel working on int or float data.

An L1 hit rate of only 2.5% seems wasteful, but maybe that is something the GPU architects should ponder rather than CUDA programmers.

In light of the above, given this appears to be largely read only global access, loads via the __ldg mechanism may be beneficial to L1 hit rates.
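
A minimal sketch of that (the kernel is hypothetical; the data pointed to by in must truly be read-only for the lifetime of the kernel):

// Marking the input const __restrict__ and/or loading through __ldg routes the
// read through the read-only data path, which can help when data is re-read.
__global__ void scale(const float * __restrict__ in, float * __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * __ldg(&in[i]);   // explicit read-only load of in[i]
}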

Is there a way to tell what the data is? I am curious if it's basically the same data being read at different times. If that's the case, moving it to physical device memory would be a good starting point.

I don’t know how to tell what the data is from that chart beyond the inferences I have made already. The profiler can tell you about pipe utilization, which could possibly be used to infer whether it is int vs. float activity, for example. But I don’t know how to get the profiler to tell you that it is reading the same data over and over.

For sure it is reading the same 3GB of data over and over. That much you can get from the memory chart, as has been discussed already. I also don’t know what you mean by “physical device memory”. The profiler memory chart shows you that it is reading 3GB of data from physical device memory. That blue box is labelled device memory, and the arrow with the 3GB number attached to it indicates reads from device memory (traffic moving from device memory to L2 cache).

I am a little confused on how it can bring in 3GB of data from device memory but read 278GB from L2. The way I imagine the memory chart above coming about: the kernel requests data, let's say 8MB, for the first time. It's fetched from device memory and brought up to the kernel (traversing L2 and L1). The kernel then asks for different portions of this data, cannot find it in L1, but always finds it in L2. Therefore L2 sends it up. This is how I see such a high hit rate in L2, a low hit rate in L1, and a disproportionate data transfer from device memory vs. L2. Is this a correct way of thinking?

If an application iterates over a subset of its data, which fits entirely into the L2 cache (hit rate here: 100%), over and over and over again, the cumulative amount of the data read from L2 can grow arbitrarily large, as long as the code keeps re-using the data. This kind of usage pattern could occur (as just one example) in a long-running simulation, e.g. one in which all cells of a grid are updated for each time step, and the simulation runs for many time steps.
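
As a minimal sketch of such a pattern (kernel and variable names are hypothetical): the host launches the same update kernel once per time step, so if the data being touched fits in L2, the reads from device memory happen roughly once while the cumulative L2-to-L1 traffic grows with every additional step.

// d_cells holds the grid; update_cells is an assumed kernel that reads and
// updates every cell in place on each time step.
for (int step = 0; step < num_steps; ++step)
    update_cells<<<grid_dim, block_dim>>>(d_cells, num_cells);
cudaDeviceSynchronize();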

You know the app you are executing and should be able to identify which data is used in this fashion. Maybe you are surprised because the code is iterating more times than anticipated. If so, you can add some instrumentation to your code to see why that is the case. Maybe there is a process of numerical convergence that converges a lot slower than it should. Maybe it's a simulation with a time slice that is configured unusually small.

One thing I noticed from Nsight Systems is that for almost the entire kernel there is a cudaMemcpyAsync happening on the same stream the kernel is executing on. I am trying to figure out how that may give a clue as to why this kernel has such terrible performance. My thought is that the kernel is waiting for that 3GB of data and finally gets it after a while. I have this idea because the Warp State Statistics section is all stalls, with 85% being barrier stalls.

I don’t know that the kernel has “terrible performance”, but we can leave that aside. Nsight systems reports both API activity and device activity. It’s important to know whether you are looking at the cudaMemcpyAsync as reported in the API section or the device activity section.

A cudaMemcpyAsync running in the device activity section during the kernel execution would be a typical observation for a well-crafted overlap of copy and compute, and by itself does not necessarily indicate a problem. It will use some device memory bandwidth, typically not a large amount, because it is usually “throttled” by PCIe transfer rates. Since we’ve already gotten a clue that your kernel is not memory bound (if this is the same kernel as in the first observation in this thread), then for me, anyway, it would not be an obvious concern.

If the cudaMemcpyAsync observation you make is from the API section, then it's possible that it was issued prior to the kernel call (or right after the kernel call) and is “outstanding” because it has not completed yet. Again, not necessarily cause for concern. In fact, if the cudaMemcpyAsync involves pageable host memory, then it would be a more-or-less expected observation.

I see the cudaMemcpy in Nsight happening in the API section. However, I see some lasting 30+ seconds for ~3GB of memory (as pictured above). Does that not mean that the stalling is happening because the cudaMemcpy is not able to grab the data it needs? It would not make sense for it to take 30 seconds for so little data.

You’ll need to understand a few CUDA basics, as well as the difference between the API section and the activity section of the profiler output.

We already know your kernel reads ~300GB of data (from the L1 or L2) at a relatively low rate, based on the picture you posted at the beginning of this thread.

In CUDA a very basic processing sequence might normally have 3 steps:

  1. copy data to the device (perhaps via cudaMemcpy)
  2. process data on the device - via one or more kernel launches
  3. copy results back to the host (perhaps via cudaMemcpy)

In CUDA, kernel launches (step 2 above) are asynchronous. That means the launch returns control to the CPU thread, which proceeds to the next line of code even before the kernel has finished executing.

But a cudaMemcpy operation after a kernel launch does not have that behavior. It will not begin to transfer data until the previous kernel launch is complete, and furthermore it will block the CPU thread until the transfer is complete.
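
Put into code, the three steps look something like this (buffer names and sizes are just illustrative):

cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);    // step 1
my_kernel<<<grid_dim, block_dim>>>(d_buf, n);              // step 2: returns to the CPU thread immediately
cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);   // step 3: waits for the kernel to finish, then copies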

So from the perspective of the CPU thread, the kernel launch is generally pretty quick, but the time it spends in the cudaMemcpy call after it (step 3 above) may vary. It will be at least as long as the duration of the previous kernel(s). This time that the CPU thread spends waiting in the cudaMemcpy step corresponds to the timeline of cudaMemcpy in the API section of the profiler output.

When the kernel is complete, then the cudaMemcpy transfer actually begins. From this point, until the transfer is complete, the device activity timeline will show the cudaMemcpy operation.

Now let’s suppose our kernel duration is 30 seconds. This is perhaps plausible, since your kernel is using 300GB of data traffic at a relatively low rate on the device (again, from the picture - this is ground I have already covered.)

In the API section of the timeline, that cudaMemcpy operation after the kernel, as we have imagined it, will show up for about 30+ seconds. However, transferring 3GB of data, let’s say at Gen2 speeds (6GB/s), means that the duration of that same cudaMemcpy operation, in the device activity section, would be only about 0.5 seconds.

None of this should be construed to mean anything like the interpretation you offered above (that the kernel is stalling because the cudaMemcpy cannot grab the data it needs).

It is simply reflective of the way CUDA (and the profiler) behave. It doesn’t indicate any sort of problem, nor anything to optimize, per se (except that if you have a kernel that takes 30 seconds, it will be at the top of the Pareto list of things to look at; but you already seem to be in that place here, so nothing new there).

What I have covered here is by no means a complete tutorial on CUDA. But if you lack an understanding of these basics, such as kernel launches being asynchronous, and how a cudaMemcpy operation behaves, then I have no doubt that the profiler will be harder to understand.

For an orderly introduction to CUDA, I usually recommend this online training course.

OK, thank you very much; that has clarified many things. Based on your write-up on the blocking cudaMemcpy call I decided to look elsewhere. I looked at the Warp State Statistics and Instruction Statistics sections and noticed a few glaring issues which might be causing the problem.

What I am inferring is that there is a lot of thread divergence, and that there are probably only one or two threads typically active at once, doing many int computations (IMAD instructions). According to Warp Sampling, the LOP3 instruction is one of the main offenders. It is my understanding that LOP3 is used to look up a 3-tuple of some binary statement (in this case I am assuming the branch condition), and the reason we see almost the same number of BRA instructions is that the LOP3 instruction is not hitting on most threads in the warp, thus causing massive thread divergence.

If my understanding is correct, I am curious as to how I can use this information to identify where in the code this is happening. I want to identify the piece of code causing thread divergence to potentially rewrite that part of the kernel to remove this bottleneck.

I’m not sure I could reach those conclusions.

Yes, certainly the profiler is telling you there is a mismatch between the work done by various threads. Amongst the threads in a threadblock, some threads (or possibly whole warps) are taking much longer than others.

I don’t know what “offender” means. A LOP3 instruction takes a set of inputs and performs generally boolean work on them. It’s not otherwise remarkable. It’s not any more remarkable than a code having IMAD in it (which is a bigger “offender”, here).

I wouldn’t say that. Remember that in CUDA, when an instruction is issued, it is issued warp-wide. Period. So if one thread in a warp is doing a LOP3, they all are. Likewise, if one thread in a warp is doing a BRA, they all are. Regardless of how many/few of the threads take the branch (assuming it is predicated). So the correspondence between LOP3 and BRA to me is unremarkable.

I would focus on the expert system guidance the profiler has already given you, in the “Barrier” and “Thread Divergence” commentary. Seek to understand the work being done by each thread, and why there could be such a mismatch from one thread to the other.

It’s not realistic to imagine you are going to eliminate warp divergence. But extended divergence is what is killing your performance. I wouldn’t focus on the divergence so much as the reason for the huge mismatch. Yes, divergence is obviously part of that. But divergence by itself is not the enemy, it’s a natural part of coding.

Look for if statements, or other conditional constructs in your code (while, etc.), that don’t evaluate the same across threads. But imagining you are going to eliminate them entirely isn’t sensible.

Let’s take an example. Suppose you are going through a large array, and if a certain condition is detected at a particular element by a particular thread, that thread has a lot of “extra” work to do. But the condition happens only rarely (maybe less than 10% or even 3% of the time). You could write a kernel that does everything monolithically: test each data item, one per thread, perhaps in a grid-stride loop because this is a large array, and when you detect the condition (if …), the body of that if statement is a lot of work. But the other threads in the warp have “nothing to do”.
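
In code, that monolithic version might look roughly like this (needs_extra_work and heavy_processing are hypothetical device functions standing in for the condition test and the long if-body):

__global__ void monolithic(float *data, int n)
{
    // grid-stride loop over the large array, one element per thread per iteration
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        if (needs_extra_work(data[i]))     // rare condition
            heavy_processing(&data[i]);    // long body: the rest of the warp idles while this thread works
}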

That methodology could lead to less than optimal performance.

Instead you could do things in 2 steps (sketched in code after the list):

  1. The first kernel goes thru the array and identifies the indices of the rarely-occurring conditions that need a lot of extra work. The identification of the index and noting it in an array is a relatively small diversion for the thread if/when it happens.

  2. A second kernel has every thread effectively doing the body of the if-statement I mentioned for the monolithic kernel, but since you have pre-determined the needed spots in the array for this, every thread is doing the same amount of work (roughly). This could lead to much better average machine utilization, both for the first kernel and the second kernel, compared to the monolithic approach.
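
A rough sketch of that two-step partitioning, reusing the same hypothetical helpers as above (work_idx and work_count would be zero-initialized device buffers):

// Kernel 1: just record which indices need the expensive handling.
__global__ void find_work(const float *data, int n, int *work_idx, int *work_count)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        if (needs_extra_work(data[i]))
            work_idx[atomicAdd(work_count, 1)] = i;   // small diversion: note the index and move on
}

// Kernel 2: every thread does the same heavy work on one flagged element.
__global__ void do_work(float *data, const int *work_idx, int count)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < count)
        heavy_processing(&data[work_idx[t]]);
}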

I’m not suggesting this is exactly what is happening in your code. I don’t get all that made-up stuff just by looking at the profiler data you have provided. But that is the kind of thought process I would use:

  • identify the work to be done, per thread
  • identify what conditions lead to imbalance
  • see if I can come up with a work partitioning strategy to reduce the imbalance, on average

Based on what I see in the profiler, your code could be doing a lot of this:

START:
LD ...            // load
IMAD ...          // integer multiply-add
IMAD ...
IADD3 ...         // 3-input integer add
LOP3 ...          // 3-input logic op
ISETP P0, ...     // integer compare, set predicate P0
@P0 BRA START     // branch back to START if P0 is set
DETOUR:
...
...
BRA START         // unconditional branch back to the loop

That’s overall pretty unremarkable. But the (first) BRA is probably rarely not taken. It’s consistent with a loop going through an array, looking for a condition, with a rarely executed but lengthy detour. More-or-less consistent with my made-up example.

I am not too sure what you mean by extended divergence.
Edit: I believe you mean divergence at the block level.

From the example given (which presents one plausible scenario that could explain the profiler statistics), “extended” presumably refers to a high dynamic instruction count between the point of divergence and the subsequent re-convergence of control flow.

CUDA programmers are sometimes overly worried about divergence, trying to convert all branch-y code into non branch-y code. But brief divergence, across a few instructions, is usually not harmful to performance (plus the compiler automatically eliminates many potential “local divergence” scenarios, e.g. by if-conversion). What is harmful to performance is extended divergence, as in the example by Robert_Crovella: threads normally iterate collectively in a loop of, say, 10 instructions, but sometimes one thread needs to handle an exceptional case that requires it to execute 200 instructions by itself, while the other threads in the warp do nothing in the meantime, simply waiting at the convergence point.

Look in your code for places where such situations could arise: typically starting at an if-statement, one thread has to handle a lot of work while many other threads are doing nothing or very little.

Generally speaking, discussing profiler reports without access to the code that was profiled is going to lead to much speculation and will likely be not very productive.

So I found something that might be interesting. The kernel is reporting 99.9% branch efficiency. I find this weird because the kernel is mostly a while loop whose condition is a function of threadIdx.x. However, all instructions in the kernel are executed at 0.01% to 0.02%, whereas the small block accounts for more than half.


Is it possible that branch efficiency is so high because the branch at the bottom of the block is taken many times? If so, could this be the reason this kernel takes 1000x the total time of the other kernel executions?