Uint64_t result evaluation & storage eats up 25% of kernel performance

Thanks for the details. Next week I will dig through the blogs Robert Crovella linked. Trial and error no longer works here.

A process that works well for me is to switch back and forth between reading (documentation, published papers), and targeted, methodical experimentation until I have a solid understanding of the use case at hand.

Having practiced this for decades, I would say documentation is wrong or misleading occasionally, while there are serious flaws in peer-reviewed publications 50% of the time (“this process has quadratic convergence”: tried it, no it doesn’t; “this error bound holds across the entire input domain”: tried it, no it doesn’t; etc.).

Feedback on the present subject.

As previously suspected, the result evaluation presents a bottleneck.
This is now confirmed.

My initial code (CODE1):

#pragma unroll 1 // anything else slows down the code
for (i = 0; i < 2; i++) {
	#pragma unroll 32 // anything else slows down the code
	for (j = 0; j < 32; j++) {
		if (a[i] != b[j]) // a[] is hardcoded at compile time
			continue;
		int hitsc = atomicAdd(hitsn, 1);
		hits[hitsc].tid = tid;
	}
}

My present code (CODE2):

uint32_t htid[20]; // local array used to store potential hit indices
uint32_t htn = 0; // hit counter
#pragma unroll 1 // anything else slows down the code
for (i = 0; i < 2; i++) {
	#pragma unroll 32 // anything else slows down the code
	for (j = 0; j < 32; j++) {
		if (a[i] != b[j]) // a[] is hardcoded at compile time
			continue;
		htid[htn] = j;
		htn++;
	}
}
if (htn == 0) // return if no hit
	return;
for (i = 0; i < htn; i++) { // populate function argument (hits[]) to be passed to host
	int hitsc = atomicAdd(hitsn, 1);
	hits[hitsc].tid = tid;
	hits[hitsc].htid = htid[i];
}

CODE2 is 45% faster than CODE1.

Nsight Compute shows CODE2 with 2.6 million branch instructions, while CODE1 has 136.8 million. This is probably the reason why CODE1 is so slow. However, I have no idea why nvcc adds so many branches (I’m not a pro at analyzing SASS). It looks like the GPU executes the following code at all times (CODE1) …

int hitsc = atomicAdd(hitsn, 1);
hits[hitsc].tid = tid;

… even if that path is not taken (and it is virtually never taken). This seems to be much more expensive than (CODE2) …

htid[htn] = j;
htn++;

I have no clue why code that is virtually never executed is relevant to execution time, or why there are so many branch instructions in CODE1 compared to CODE2.

This post could be relevant, but I don’t know how I could alter my code to avoid this.

I must be missing something basic.

It’s possible that in one case the compiler is using predication to handle the atomic, whereas in the other case it is using a “regular” conditional jump, when needed. In the first case, the atomic might be predicated off for the entire warp, but might still show up as an issued instruction in Nsight Compute.

None of this can be confirmed without studying SASS. Even then, I don’t have enough intuition to comment on the performance implications of the two cases, but you seem to already have a measurement of that.

As an aside, if it were me doing this, and I discovered that I was not using the latest CUDA version, I would immediately switch to using the latest CUDA version. But that is just me, perhaps.

There will be problems like this where the (low-level) coding choices (made by the compiler) and their relative performance may be data dependent.

For example, suppose we are testing a condition across a warp, and if it is true for any thread, then we must do an “expensive” operation (let’s think of it as several instructions, or more) for that thread. Suppose the condition true/false characteristic is data dependent.

The compiler has the choice to implement it using predication (“inline”), or else via a conditional jump. Let’s assume the conditional jump takes longer to execute, if it is taken, than the predication method. Let’s also assume that the conditional jump version is less expensive than the predicated version, if the condition evaluates false.

Now, if any thread in the warp happens to find that the condition is true, then given those stipulations it is better for the compiler to handle it via predication.

In this case, we could say that if the warp is processing data, and we observe a higher than 1/32 chance that the data will cause the condition to evaluate such that the atomic needs to be done, then predication will be the preferred code realization. (The probability breakpoint is actually closer to 2% for random sampling, since 0.98^32 ≈ 0.5, but to do a proper modeling we would also want to account for the relative costs, to determine which realization would be statistically faster.)

The only way the conditional jump would be a better choice (code realization) is if the condition is true less often than about 3% (or 2%) of the time. Given that, it might seem to be the case that the compiler “prefers” predication.
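To make that breakpoint explicit (under the same random-sampling assumption): we want the per-thread probability p at which a 32-thread warp has a 50/50 chance of containing no thread that needs the atomic, i.e.

(1 - p)^32 = 0.5  ⟹  p = 1 - 0.5^(1/32) ≈ 0.021

which is consistent with the rough 2% figure above (0.98^32 ≈ 0.52).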

C++20 includes the [[likely]] and [[unlikely]] hints, so you can guide an “aware” compiler when you know such a thing statistically, a priori.

Although I imagine CUDA “supports” that, since it supports C++20, I don’t know to what level the compiler may use such information. It also seems to me that a simple “likely”/“unlikely” hint is not enough to capture the nuances of the case I outlined, where we might want to know if it is more or less than 98% likely that a branch would not be taken.

CUDA 12.2 supports the C++ branch attributes. From the Programming Guide:

17.5.18. [[likely]] / [[unlikely]] Standard Attributes

I would suggest giving this a try. Since predication results in code that is equivalent to executing on both sides of a branch, it is only an unambiguous win in terms of performance over branch-y code for very short instruction sequences, say half a dozen instructions combined for the if-branch and the else-branch.
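As a rough, hypothetical illustration of that trade-off (a sketch of the two code realizations, not actual compiler output for any code in this thread), consider a short guarded body like the one in the test case further below:

__global__ void guarded_add(float *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  // Short guarded body - a typical candidate for predication.
  if (data[i] < 4.0f)
    data[i] += 2.0f;
  // Predicated realization (roughly): ISETP ... P0 ; @P0 FADD ... ; @P0 STG ...
  //   Every thread issues the body's instructions; threads where P0 is false
  //   simply discard the results.
  // Branchy realization (roughly): ISETP ... P0 ; @!P0 BRA skip ; FADD ; STG ; skip:
  //   Cheaper when the whole warp evaluates the condition false (the body is
  //   jumped over entirely), more expensive when the branch cannot be skipped.
}

For a body this short, the predicated realization only costs a few extra issued instructions when the condition is false, which is why it tends to be the unambiguous win; for longer bodies the branchy realization becomes more attractive.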

In the absence of information about branch behavior the compiler needs to make assumptions (that is, use heuristics) about the distribution of branch outcomes. Best I know, the CUDA compiler does not currently implement profile-directed optimizations, so providing the C++ attributes is the best one can do in terms of supplying more information to the compiler.
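Applied to the inner comparison from CODE1 above, the annotation would look roughly like this (just a sketch; the surrounding kernel and variable declarations are omitted, as in the original snippets):

#pragma unroll 1
for (i = 0; i < 2; i++) {
	#pragma unroll 32
	for (j = 0; j < 32; j++) {
		if (a[i] != b[j]) [[likely]] // hint: a mismatch is by far the common case
			continue;
		int hitsc = atomicAdd(hitsn, 1);
		hits[hitsc].tid = tid;
	}
}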


A simple test case didn’t seem to turn up any difference in code generation. It’s possible my test case is not designed correctly to elicit any benefit:

# cat t50.cu
using mt = float;
#include <math.h>
__global__ void k(mt *data, const size_t ds){

  for (int  i = threadIdx.x; i < ds; i++) {
#ifdef USE_LIKELY
    if (data[i] < 4) [[likely]] {
#else
    if (data[i] < 4) [[unlikely]] {
#endif
#ifdef USE_SHORT
      data[i] += 2;
#else
      data[i] += normcdff(data[i]);
#endif
    }
  }
}

int main(){
  const size_t ds = 1048576;
  mt *data = NULL;
  k<<<1,1>>>(data, ds);
  cudaDeviceSynchronize();
}
# nvcc -o t50 t50.cu -arch=sm_89 -DUSE_LIKELY -DUSE_SHORT
# cuobjdump -sass ./t50 >LS.txt
# nvcc -o t50 t50.cu -arch=sm_89 -DUSE_SHORT
# cuobjdump -sass ./t50 >US.txt
# diff LS.txt US.txt
# nvcc -o t50 t50.cu -arch=sm_89 -DUSE_LIKELY
# cuobjdump -sass ./t50 >LL.txt
# nvcc -o t50 t50.cu -arch=sm_89
# cuobjdump -sass ./t50 >UL.txt
# diff LL.txt UL.txt
#

CUDA 12.2, nvcc V12.2.128


Better to try and fail than not to try at all …

I remember having discussions about likely/unlikely attributes with the CUDA compiler team in the 2012-2014 time frame and I recall them being skeptical about the efficacy of this approach and generally leaning towards rejecting such an approach. I am not familiar with the discussions the ISO-C++ standards committee had that led to the inclusion of these attributes.

I am not a compiler engineer, so I can neither confirm nor refute the earlier assessment by the CUDA compiler team. Such attributes might generally be of limited utility, or may be difficult to utilize within the specific framework (LLVM) adopted for the CUDA compiler, or may simply not be to the liking of the powers that be. I have no knowledge as to which applies, or whether there are additional aspects not on my list.

Both codes are using predication. I see this in the SASS as well as in the difference in binary size when I set j to 1 instead of 32. That predicated instructions show up in Nsight Compute as issued instructions is indeed something I didn’t think about. Also, I have the latest CUDA.

Interesting (again). The compare is true less than 0.001% of the time. All the code uses predication. Virtually no BRAs in the entire code.

17.5.18 - never heard of these attributes. I tested them - unfortunately, no change at all.

Thanks a lot for the answers.

Both codes are using predication. I see this in the SASS as well as in the difference in binary size when I set j to 1 instead of 32. That predicated instructions show up in Nsight Compute as issued instructions is indeed something I didn’t think about.

That’s it! And I tried a lot.

I have come to a point where the performance of the code is acceptable to me. However, I know from injecting a simple result evaluation (just a flag passed to the host) that it is possible to increase performance by another 15-20%. I still have a (mental) problem accepting that these 64 trivial a[i] != b[j] compare operations (even considering predicated true/false code) are so costly.
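For reference, the “just a flag” experiment looks roughly like this (a sketch of the idea only; hit_flag is a hypothetical device pointer standing in for the real output argument):

uint32_t htn = 0;
#pragma unroll 1
for (i = 0; i < 2; i++) {
	#pragma unroll 32
	for (j = 0; j < 32; j++) {
		if (a[i] != b[j])
			continue;
		htn++;
	}
}
if (htn != 0)
	*hit_flag = 1; // no atomics and no per-hit records - the host only learns that a hit occurred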

Thanks a lot for the answers - I learned!!!