Uint64_t result evaluation & storage eats up 25% of kernel performance

Thanks for the details. Next week I will dig through the blogs Robert Crovella linked. Trial and error no longer works here.

A process that works well for me is to switch back and forth between reading (documentation, published papers), and targeted, methodical experimentation until I have a solid understanding of the use case at hand.

Having practiced this for decades, I would say documentation is wrong or misleading occasionally, while there are serious flaws in peer-reviewed publications 50% of the time (“this process has quadratic convergence”: tried it, no it doesn’t; “this error bound holds across the entire input domain”: tried it, no it doesn’t; etc.).

Feedback on the present subject.

As previously suspected, the result evaluation presents a bottleneck.
This is now confirmed.

My initial code (CODE1):

#pragma unroll 1 // anything else slows down the code
for (i = 0; i < 2; i++) {
	#pragma unroll 32 // anything else slows down the code
	for (j = 0; j < 32; j++) {
		if (a[i] != b[j]) // a[] is hardcoded at compile time
			continue;
		int hitsc = atomicAdd(hitsn, 1);
		hits[hitsc].tid = tid;
	}
}

My present code (CODE2):

uint32_t htid[20]; // local array used to store potential hit indices
uint32_t htn = 0; // hit counter
#pragma unroll 1 // anything else slows down the code
for (i = 0; i < 2; i++) {
	#pragma unroll 32 // anything else slows down the code
	for (j = 0; j < 32; j++) {
		if (a[i] != b[j]) // a[] is hardcoded at compile time
			continue;
		htid[htn] = j;
		htn++;
	}
}
if (htn == 0) // return if no hit
	return;
for (i = 0; i < htn; i++) { // populate function argument (hits[]) to be passed to host
	int hitsc = atomicAdd(hitsn, 1);
	hits[hitsc].tid = tid;
	hits[hitsc].htid = htid[i];
}

CODE2 is 45% faster than CODE1.

Nsight Compute shows CODE2 with 2.6 million branch instructions, while CODE1 has 136.8 million. This is probably the reason why CODE1 is so slow. However, I have no idea why nvcc adds so many branches (I’m not a pro at analyzing SASS). It looks like the GPU executes the following code at all times (CODE1) …

int hitsc = atomicAdd(hitsn, 1);
hits[hitsc].tid = tid;

… even if that path is not taken (and it is virtually never taken). This seems to be much more expensive than (CODE2) …

htid[htn] = j;
htn++;

I have no clue why code that is virtually never executed is relevant to execution time, or why there are so many branch instructions in CODE1 compared to CODE2.

This post could be relevant, but I don’t know how I could alter my code to avoid this.

I must be missing something basic.

It’s possible that in one case the compiler is using predication to handle the atomic, whereas in the other case it is using a “regular” conditional jump, when needed. In the first case, the atomic might be predicated off for the entire warp, but might still show up as an issued instruction in Nsight Compute.

None of this can be confirmed without studying SASS. Even then, I don’t have enough intuition to comment on the performance implications of the two cases, but you seem to already have a measurement of that.

As an aside, if it were me doing this, and I discovered that I was not using the latest CUDA version, I would immediately switch to using the latest CUDA version. But that is just me, perhaps.

There will be problems like this where the (low-level) coding choices (made by the compiler) and their relative performance may be data dependent.

For example, suppose we are testing a condition across a warp, and if it is true for any thread, then we must do an “expensive” operation (let’s think of it as several instructions, or more) for that thread. Suppose the condition true/false characteristic is data dependent.

The compiler has the choice to implement it using predication (“inline”), or else via a conditional jump. Let’s assume the conditional jump takes longer to execute, if it is taken, than the predication method. Let’s also assume that the conditional jump version is less expensive than the predicated version, if the condition evaluates false.

Now, if any thread in the warp happens to find that the condition is true, then given those stipulations it is better for the compiler to handle it via predication.

In this case, we could say that if the warp is processing data, and we observe a higher than 1/32 chance that the data will cause the condition to evaluate such that the atomic needs to be done, then predication will be the preferred code realization. (The probability breakpoint is actually closer to 2% for random sampling, since 0.98^32 ≈ 0.5, but to do a proper modeling we would also want to account for the relative costs, to determine which realization would be statistically faster.)

The only way the conditional jump would be a better choice (code realization) is if the condition is true less often than about 3% (or 2%) of the time. Given that, it might seem to be the case that the compiler “prefers” predication.
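To make that breakpoint explicit (under the same random-sampling assumption): we want the per-thread probability p at which a 32-thread warp has a 50/50 chance of containing no thread that needs the atomic, i.e.

(1 - p)^32 = 0.5  ⟹  p = 1 - 0.5^(1/32) ≈ 0.021

which is consistent with the rough 2% figure above (0.98^32 ≈ 0.52).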

C++20 includes the [[likely]] and [[unlikely]] hints, so you can guide an “aware” compiler when you know such a thing statistically, a priori.

Although I imagine CUDA “supports” that, since it supports C++20, I don’t know to what level the compiler may use such information. It also seems to me that a simple “likely”/“unlikely” hint is not enough to capture the nuances of the case I outlined, where we might want to know if it is more or less than 98% likely that a branch would not be taken.

CUDA 12.2 supports the C++ branch attributes. From the Programming Guide:

17.5.18. [[likely]] / [[unlikely]] Standard Attributes

I would suggest giving this a try. Since predication results in code that is equivalent to executing on both sides of a branch, it is only an unambiguous win in terms of performance over branch-y code for very short instruction sequences, say half a dozen instructions combined for the if-branch and the else-branch.
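As a rough, hypothetical illustration of that trade-off (a sketch of the two code realizations, not actual compiler output for any code in this thread), consider a short guarded body like the one in the test case further below:

__global__ void guarded_add(float *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  // Short guarded body - a typical candidate for predication.
  if (data[i] < 4.0f)
    data[i] += 2.0f;
  // Predicated realization (roughly): ISETP ... P0 ; @P0 FADD ... ; @P0 STG ...
  //   Every thread issues the body's instructions; threads where P0 is false
  //   simply discard the results.
  // Branchy realization (roughly): ISETP ... P0 ; @!P0 BRA skip ; FADD ; STG ; skip:
  //   Cheaper when the whole warp evaluates the condition false (the body is
  //   jumped over entirely), more expensive when the branch cannot be skipped.
}

For a body this short, the predicated realization only costs a few extra issued instructions when the condition is false, which is why it tends to be the unambiguous win; for longer bodies the branchy realization becomes more attractive.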

In the absence of information about branch behavior the compiler needs to make assumptions (that is, use heuristics) about the distribution of branch outcomes. Best I know, the CUDA compiler does not currently implement profile-directed optimizations, so providing the C++ attributes is the best one can do in terms of supplying more information to the compiler.
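Applied to the inner comparison from CODE1 above, the annotation would look roughly like this (just a sketch; the surrounding kernel and variable declarations are omitted, as in the original snippets):

#pragma unroll 1
for (i = 0; i < 2; i++) {
	#pragma unroll 32
	for (j = 0; j < 32; j++) {
		if (a[i] != b[j]) [[likely]] // hint: a mismatch is by far the common case
			continue;
		int hitsc = atomicAdd(hitsn, 1);
		hits[hitsc].tid = tid;
	}
}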


A simple test case didn’t seem to turn up any difference in code generation. It’s possible my test case is not designed correctly to elicit any benefit:

# cat t50.cu
using mt = float;
#include <math.h>
__global__ void k(mt *data, const size_t ds){

  for (int  i = threadIdx.x; i < ds; i++) {
#ifdef USE_LIKELY
    if (data[i] < 4) [[likely]] {
#else
    if (data[i] < 4) [[unlikely]] {
#endif
#ifdef USE_SHORT
      data[i] += 2;
#else
      data[i] += normcdff(data[i]);
#endif
    }
  }
}

int main(){
  const size_t ds = 1048576;
  mt *data = NULL;
  k<<<1,1>>>(data, ds);
  cudaDeviceSynchronize();
}
# nvcc -o t50 t50.cu -arch=sm_89 -DUSE_LIKELY -DUSE_SHORT
# cuobjdump -sass ./t50 >LS.txt
# nvcc -o t50 t50.cu -arch=sm_89 -DUSE_SHORT
# cuobjdump -sass ./t50 >US.txt
# diff LS.txt US.txt
# nvcc -o t50 t50.cu -arch=sm_89 -DUSE_LIKELY
# cuobjdump -sass ./t50 >LL.txt
# nvcc -o t50 t50.cu -arch=sm_89
# cuobjdump -sass ./t50 >UL.txt
# diff LL.txt UL.txt
#

CUDA 12.2, nvcc V12.2.128


Better to try and fail than not to try at all …

I remember having discussions about likely/unlikely attributes with the CUDA compiler team in the 2012-2014 time frame and I recall them being skeptical about the efficacy of this approach and generally leaning towards rejecting such an approach. I am not familiar with the discussions the ISO-C++ standards committee had that led to the inclusion of these attributes.

I am not a compiler engineer, so I can neither confirm nor refute the earlier assessment by the CUDA compiler team. Such attributes might generally be of limited utility, or may be difficult to utilize within the specific framework (LLVM) adopted for the CUDA compiler, or may simply not be to the liking of the powers that be. I have no knowledge as to which applies, or whether there are additional aspects not on my list.

Both codes are using predication. I see this in the SASS as well as in the difference in binary size when I set j to 1 instead of 32. That predicated instructions show up in Nsight Compute as issued instructions is indeed something I didn’t think about. Also, I have the latest CUDA.

Interesting (again). The compare is true less than 0.001% of the time. All the code uses predication. Virtually no BRAs in the entire code.

17.5.18 - never heard of these attributes. I tested them - unfortunately, no change at all.

Thanks a lot for the answers.

Both codes are using predication. I see this in the SASS as well as in the difference in binary size when I set j to 1 instead of 32. That predicated instructions show up in Nsight Compute as issued instructions is indeed something I didn’t think about.

That’s it! And I tried a lot.

I have come to a point where the performance of the code is acceptable to me. However, I know from injecting a simple result evaluation (just a flag passed to the host) that it is possible to increase performance by another 15-20%. I still have a (mental) problem accepting that these 64 trivial a[i] != b[j] compare operations (even considering predicated true/false code) are so costly.
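For reference, the “just a flag” experiment looks roughly like this (a sketch of the idea only; hit_flag is a hypothetical device pointer standing in for the real output argument):

uint32_t htn = 0;
#pragma unroll 1
for (i = 0; i < 2; i++) {
	#pragma unroll 32
	for (j = 0; j < 32; j++) {
		if (a[i] != b[j])
			continue;
		htn++;
	}
}
if (htn != 0)
	*hit_flag = 1; // no atomics and no per-hit records - the host only learns that a hit occurred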

Thanks a lot for the answers - I learned!!!