Uint64_t result evaluation & storage eats up 25% of kernel performance

I have some performance questions/issues regarding the following kernel code. It basically calculates a value (b) that needs to match one of two hardcoded 64-bit tokens (a[]). The hits struct array is dimensioned to 20, so it can hold 20 hits per kernel call (which is sufficient).

I know that code which is never called/used is removed by nvcc at compile time.
I also know that branching should be avoided.
I know I should not use printf() in device code (it was for testing only).

There is, however, a lot of behavior here I can't explain.

struct Hits {
    uint64_t tid;
    int      bid;
    uint64_t b;
};

I show some variations of the code in order to better illustrate the questions.

kernel 1:

__global__ void mantas_kernel(
	struct Hits *hits,
	int         *hitsn,
	uint64_t    *test) {

	uint64_t tid = ...
	uint64_t a[] = {
		0x3e24b1546a138f46,
		0xeaa667f23491cf47
	};

	// calculate b[] (complex stuff)

	uint32_t i, j;
	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			hits[0].tid = tid;			
		}
	}
}

performance: 33.1 Gop/s
This is basically the fastest I get, but every hit is stored in hits[0], so later hits overwrite earlier ones and any run with more than one hit loses information. Hence not practical.

kernel 2:

	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			int hitsc = atomicAdd(hitsn, 1);
			hits[hitsc].tid = tid;			
		}
	}

performance: 25.5 Gop/s
hits can now hold more than one hit, but performance dropped. I have worked with atomics before, but never saw such a large performance loss. Why is that?

kernel 3:
This kernel has a race condition, as pointed out here by striker159.

	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			hits[*hitsn].tid = tid;			
			atomicAdd(hitsn, 1);
		}
	}

performance: 27.2 Gop/s
The return value of the atomic operation is no longer used (which is what introduces the race). Performance is slightly better. I also don't see why the additional int (hitsc in the previous example) slowed that code down by 5%.

kernel 4:

	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			printf("hit!\n");
		}
	}

performance: 28.3 Gop/s
For testing only. However, I have never before seen a 10% performance loss due to printf().

I can't believe that the complex calculation of b is fast, yet a "simple" 64-bit result compare and the subsequent store into the struct for host evaluation eats up to 25% of the performance.

Also … there are kernel calls that don't produce any hit, so the code inside the if() block is never executed. I understand that the compiler doesn't know that at compile time (so the code must be present), but why does it slow down the entire kernel?

All on RTX 2080 SUPER.

That’s how I measure the performance:

		HANDLE_ERROR( cudaEventRecord(start) );
		mantas_kernel <<< grids,threads >>> (hits, hitsn, test);
		HANDLE_ERROR( cudaDeviceSynchronize() );
		HANDLE_ERROR( cudaEventRecord(stop) );
		HANDLE_ERROR( cudaEventSynchronize(stop) );
		float ms;
		HANDLE_ERROR( cudaEventElapsedTime(&ms, start, stop) );

What am I missing?

Let me guess. You have been evaluating performance by commenting-out code.

What kind of operations do you measure (Gop/s)?
What GPU are you using?
Please provide a fully working example code that others could use to benchmark.

Some thoughts:

The second atomic kernel has a race condition with *hitsn. Multiple threads could see the same value.

To me, this looks like it could be implemented with Thrust, using a combination of transform iterator + copy_if

Correct.

Correct. I missed that one (in kernel 3 above). Thanks.

compute-sanitizer doesn’t give any error on any of the 4 kernels.

certainly the point of my bringing up that old article was not to suggest using compute-sanitizer. You may want to revisit the first 2 sentences:

commenting out code can be quite a confusing strategy for either performance or debug when using an aggressively optimizing compiler.

I read and re-read what you said, but I don't get how a commented-out line can affect the compiler output (= the final binary).

Just to be sure - I removed all commented lines, but this didn’t impact performance.

Suppose I have code like this:

__global__ void k(int *a, int *b){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int r = b[idx]*b[idx] + a[idx]*a[idx];
  a[idx] = r;
}

Let’s say that code takes 1 ms to run for a very large data set. Now let’s say I comment out the last line: a[idx] = r; and rerun the timing. Now I observe that the kernel takes 10 microseconds to run. I might conclude “99% of the runtime cost of that kernel was in the final store operation” (not unlike the title of this thread).

That would be an incorrect conclusion. The reason the runtime got so dramatically shorter is that the compiler observed that without that final line, there was no change to observable state as a result of that kernel code, so all the kernel code can be deleted, and the final result will be the same.

An optimizing compiler does this. So commenting out code to estimate performance is fraught with peril.

This might or might not be happening here. You haven’t provided a complete code or a well-defined test case, so I can’t be sure of anything. If you’re convinced it’s irrelevant, just ignore my comments and I will certainly stop responding.

Thanks for your answer.

I know that code which is never called/used is removed by nvcc at compile time, since it is useless: a lot of calculation without any result evaluation will be detected by the compiler and discarded. In my case I'm 99% sure that's not happening, since at kernel end everything boils down to the a[] != b[] compare operation, in all shown kernel scenarios. The performance differences must come from the code inside the if() == true block. That's what I tried to show with my code excerpts.

Besides that, all ideas/suggestions from you or anybody else (1) trigger review/re-design of existing code and (2) possibly spark new ideas/approaches which might lead to the solution. So every comment is useful, even if it doesn't provide an immediate solution. Please keep them coming …

The complete functioning code is too extensive. I will try to produce a fully functioning reduced version if I don't solve the issue soon, but it will still be complicated and lengthy. Meanwhile I'll post specific examples of (strange) code behavior, hoping this rings a bell as to what the culprit could be.

Look at the following observations:

	uint32_t i, j;
	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			hits[0].tid = tid;     // (ex.1) 33.3 Gops/s
//			hits[0].tid = tid + 1; // (ex.2) 33.3 Gops/s
//			hits[0].tid = tid + i; // (ex.3) 32.4 Gops/s
//			hits[0].tid = tid + j; // (ex.4) 30.9 Gops/s
//			hits[0].tid = 1;       // (ex.5) 32.8 Gops/s
//			hits[0].tid = i;       // (ex.6) 32.2 Gops/s
//			hits[0].tid = j;       // (ex.7) 30.7 Gops/s
		}
	}

ex.1 vs ex.5: why is storing a constant slower than storing a variable that is already in a register?
ex.6 vs ex.7: i and j are both uint32_t and both live in registers. Why is one faster than the other?

I believe it’s all about a memory access problem.

I would suggest using the CUDA profiler to pin-point the bottleneck(s).

:)

I have no clue where to start, considering my “issue”.

A general, clean-slate approach for using the profiler is part of an analysis-driven optimization approach. These blogs 1 2 may shed some light on that. I understand you may not want to start with a clean slate. However, I’m not able to offer any more direction than that. I don’t understand what your “issue” is, and the dialog so far has not been enlightening for me personally. Nothing comes to mind (other than beginning a structured approach to performance analysis).

I’m in progress of checking the blogs 1 and 2 above. Will take some time to digest.

Meanwhile, I'm trying to understand why code that is virtually never executed (the if() condition becomes true only about once a day) reduces performance by approx. 25%. I understand that with branch divergence the paths separate, which reduces performance, but in my case the code inside the branch is (1) hardly ever executed and (2) very short (an atomicAdd operation and a parameter store into a struct).

The articles below are only partially applicable, but the answers/comments are very interesting for understanding branching and how the compiler deals with registers. However, nothing helped to increase performance, so at least you know what I tested and excluded.

I checked the SASS code for the if() statement below, suspecting global memory access. But I believe (please confirm) that the variables a and b are stored in registers?!

if (a[i] != b[j])
    continue;

5628	00000013 0204b5b0	     BMOV.32.CLEAR RZ, B0 
5629	00000013 0204b5c0	     BSSY B0, 0x130204b760 
5630	00000013 0204b5d0	     ISETP.NE.U32.AND P6, PT, R6, R131, PT 
5631	00000013 0204b5e0	     ISETP.NE.U32.AND P5, PT, R6, R123, PT 
5632	00000013 0204b5f0	     ISETP.NE.AND.EX P6, PT, R7, R129, PT, P6 
5633	00000013 0204b600	     ISETP.NE.U32.AND P4, PT, R6, R119, PT 
5634	00000013 0204b610	     ISETP.NE.U32.AND P3, PT, R6, R101, PT 
5635	00000013 0204b620	     ISETP.NE.U32.AND P2, PT, R6, R127, PT 
5636	00000013 0204b630	     ISETP.NE.U32.AND P1, PT, R6, R113, PT 
5637	00000013 0204b640	     ISETP.NE.U32.AND P0, PT, R6, R103, PT 
5638	00000013 0204b650	@P6  BRA 0x130204b750 

Besides that, I managed to increase performance by 5% by using #pragma unroll 1 on the first loop. Perhaps this is a hint about what happens behind the scenes.

#pragma unroll 1
	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			int hitsc = atomicAdd(hitsn, 1);
			hits[hitsc].tid = tid;
		}
	}

BTW … this is what the compiler tells me:

1>ptxas info    : Function properties for _Z16mantas_kerneliP4HitsPjPy
1>    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 168 registers, 352 bytes cmem[0], 20 bytes cmem[2]

I wonder if you could get a speed-up by limiting the kernel to 128 registers. You would see some spill loads and stores, but your occupancy may rise.

Good idea, but performance dropped by 36%.

1>ptxas info    : Function properties for _Z16mantas_kerneliP4HitsPjPy
1>    192 bytes stack frame, 284 bytes spill stores, 300 bytes spill loads
1>ptxas info    : Used 128 registers, 352 bytes cmem[0], 36 bytes cmem[2]

Generally speaking, the heuristics used by the CUDA compiler result in good choices when trading off occupancy against register usage. In my experience, manually squeezing the register count below the limit chosen by the compiler by more than 2 or 3 registers will lead to a decrease in performance most of the time. In other words, reducing from 131 to 128 registers may prove beneficial for performance, but reducing from 168 to 128 registers is exceedingly unlikely to do so.

The 168 registers (peak) are used only prior to the if() condition (this is confirmed by Nsight Compute), which is at the very end. The performance drop resides inside the result-evaluation code, which doesn't do very much (see above). Therefore I believe it's the memory access of b[] that slows down the code. a[] is hardcoded by nvcc, as you can see in the SASS excerpt. But b[] I see in registers (see SASS above), which should be fast.

I also tried to avoid the if() condition completely, but this slows down the code even further.

Strange also that the #pragma unroll 1 directive on first loop increased performance by 5%. I would have thought it should be the other way around.

The point of maximum register usage dictates how many registers need to be allocated for the entire kernel, independent of whether this point is in a frequently-used part of the code or not. In fact, it is not unusual to have a kernel with a “slim” fast path requiring relatively few registers, and that is used the vast majority of the time, and an infrequently exercised “fat” slow path that requires more registers than the fast path because it is responsible for the complex handling of gnarly special cases.

It is not advisable to make such assumptions. Look at the generated machine code (SASS) and use the CUDA profiler to explore. One scenario where a fully rolled loop could be faster than a partially-unrolled loop is when the size of the latter exceeds the size of the rather small instruction cache (keep in mind that every machine instruction occupies 8 bytes and is coupled with another 8 bytes of operation-steering data; a single instruction thus requires 16 bytes).