Uint64_t result evaluation & storage eats up 25% of kernel performance

I have some performance questions/issues regarding the following kernel code. It basically calculates a value (b) that needs to match one of two hardcoded 64-bit tokens (a[]). The hits struct array is dimensioned to 20, so it can hold 20 hits per kernel call (which is sufficient).

I know that code which is never called/used is removed by nvcc at compile time.
I also know that branching should be avoided.
I know I should not use printf() in device code (it was for testing only).

There is, however, a lot of behavior here I can't explain.

struct Hits {
    uint64_t tid;
    int      bid;
    uint64_t b;
};

I show some variations of the code in order to better illustrate the questions.

kernel 1:

__global__ void mantas_kernel(
	struct Hits *hits,
	int         *hitsn,
	uint64_t    *test) {

	uint64_t tid = ...
	uint64_t a[] = {
		0x3e24b1546a138f46,
		0xeaa667f23491cf47
	};

	// calculate b[] (complex stuff)

	uint32_t i, j;
	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			hits[0].tid = tid;			
		}
	}
}

performance: 33.1 Gop/s
This is basically the fastest I get, but every hit is stored in hits[0], so later hits overwrite earlier ones and any run with more than one hit loses information. Hence not practical.

kernel 2:

	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			int hitsc = atomicAdd(hitsn, 1);
			hits[hitsc].tid = tid;			
		}
	}

performance: 25.5 Gop/s
hits can now hold more than one hit, but performance dropped. I have worked with atomics before, but never saw such a large performance loss. Why is that?

kernel 3:
This kernel has a race condition, as pointed out here by striker159.

	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			hits[*hitsn].tid = tid;			
			atomicAdd(hitsn, 1);
		}
	}

performance: 27.2 Gop/s
The return value of the atomic operation is no longer used (which is what introduces the race). Performance is slightly better. I also don't see why the additional int (hitsc in the previous example) slowed that code down by 5%.

kernel 4:

	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			printf("hit!\n");
		}
	}

performance: 28.3 Gop/s
For testing only. However, I have never before seen a 10% performance loss due to printf().

I can't believe that the complex calculation of b is fast, yet a "simple" 64-bit result compare and the subsequent store into the struct for host evaluation eats up to 25% of the performance.

Also … there are kernel calls that don't produce any hit, so the code inside the if() block is never executed. I understand that the compiler doesn't know that at compile time (so the code must be present), but why does it slow down the entire kernel?

All on RTX 2080 SUPER.

That’s how I measure the performance:

		HANDLE_ERROR( cudaEventRecord(start) );
		mantas_kernel <<< grids,threads >>> (hits, hitsn, test);
		HANDLE_ERROR( cudaDeviceSynchronize() );
		HANDLE_ERROR( cudaEventRecord(stop) );
		HANDLE_ERROR( cudaEventSynchronize(stop) );
		float ms;
		HANDLE_ERROR( cudaEventElapsedTime(&ms, start, stop) );

What am I missing?

Let me guess. You have been evaluating performance by commenting-out code.

What kind of operations do you measure (Gop/s)?
What GPU are you using?
Please provide a fully working example code that others could use to benchmark.

Some thoughts:

The second atomic kernel has a race condition with *hitsn. Multiple threads could see the same value.

To me, this looks like it could be implemented with Thrust, using a combination of transform iterator + copy_if

Correct.

Correct. I missed that one (in kernel 3 above). Thanks.

compute-sanitizer doesn’t give any error on any of the 4 kernels.

certainly the point of my bringing up that old article was not to suggest using compute-sanitizer. You may want to revisit the first 2 sentences:

commenting out code can be quite a confusing strategy for either performance or debug when using an aggressively optimizing compiler.

I read and re-read what you said, but I don't get how a commented-out line can affect the compiler output (= the final binary).

Just to be sure - I removed all commented lines, but this didn’t impact performance.

Suppose I have code like this:

__global__ void k(int *a, int *b){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int r = b[idx]*b[idx] + a[idx]*a[idx];
  a[idx] = r;
}

Let’s say that code takes 1 ms to run for a very large data set. Now let’s say I comment out the last line: a[idx] = r; and rerun the timing. Now I observe that the kernel takes 10 microseconds to run. I might conclude “99% of the runtime cost of that kernel was in the final store operation” (not unlike the title of this thread).

That would be an incorrect conclusion. The reason the runtime got so dramatically shorter is that the compiler observed that without that final line, there was no change to observable state as a result of that kernel code, so all the kernel code can be deleted, and the final result will be the same.

An optimizing compiler does this. So commenting out code to estimate performance is fraught with peril.

This might or might not be happening here. You haven’t provided a complete code or a well-defined test case, so I can’t be sure of anything. If you’re convinced it’s irrelevant, just ignore my comments and I will certainly stop responding.

Thanks for your answer.

I know that code which is never called/used is removed by nvcc at compile time, since it is useless: a lot of calculation without any result evaluation will be detected by the compiler and discarded. In my case I'm 99% sure that's not happening, since at kernel end everything boils down to the a[] != b[] compare operation, in all shown kernel scenarios. The performance differences must come from the code inside the if() == true block. That's what I tried to show with my code excerpts.

Besides that, all ideas/suggestions from you or anybody else (1) trigger review/re-design of existing code and (2) possibly spark new ideas/approaches which might lead to the solution. So every comment is useful, even if it doesn't provide an immediate solution. Please keep them coming …

The complete functioning code is too extensive. I will try to produce a fully functioning reduced version if I don't solve the issue soon, but it will still be complicated and lengthy. Meanwhile I'll post specific examples of (strange) code behavior, hoping this rings a bell as to what the culprit could be.

Look at the following observations:

	uint32_t i, j;
	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			hits[0].tid = tid;     // (ex.1) 33.3 Gops/s
//			hits[0].tid = tid + 1; // (ex.2) 33.3 Gops/s
//			hits[0].tid = tid + i; // (ex.3) 32.4 Gops/s
//			hits[0].tid = tid + j; // (ex.4) 30.9 Gops/s
//			hits[0].tid = 1;       // (ex.5) 32.8 Gops/s
//			hits[0].tid = i;       // (ex.6) 32.2 Gops/s
//			hits[0].tid = j;       // (ex.7) 30.7 Gops/s
		}
	}

ex.1 vs ex.5: why is storing a constant slower than storing a variable that is already in a register?
ex.6 vs ex.7: i and j are both uint32_t and both live in registers. Why is one faster than the other?

I believe it’s all about a memory access problem.

I would suggest using the CUDA profiler to pin-point the bottleneck(s).

:)

I have no clue where to start, considering my “issue”.

A general, clean-slate approach for using the profiler is part of an analysis-driven optimization approach. These blogs 1 2 may shed some light on that. I understand you may not want to start with a clean slate. However, I’m not able to offer any more direction than that. I don’t understand what your “issue” is, and the dialog so far has not been enlightening for me personally. Nothing comes to mind (other than beginning a structured approach to performance analysis).

I’m in progress of checking the blogs 1 and 2 above. Will take some time to digest.

Meanwhile, I'm trying to understand why code that is virtually never executed (the if() condition becomes true only about once a day) reduces performance by approx. 25%. I understand that with branch divergence the paths separate, which reduces performance, but in my case the code inside the branch is (1) hardly ever executed and (2) very short (an atomicAdd operation and a parameter store into a struct).

The articles below are only partially applicable, but the answers/comments are very interesting for understanding branching and how the compiler deals with registers. However, nothing helped to increase performance, so at least you know what I tested and excluded.

I checked the SASS code for the if() statement below, suspecting global memory access. But I believe (please confirm) that the variables a and b are stored in registers?!

if (a[i] != b[j])
    continue;

5628	00000013 0204b5b0	     BMOV.32.CLEAR RZ, B0 
5629	00000013 0204b5c0	     BSSY B0, 0x130204b760 
5630	00000013 0204b5d0	     ISETP.NE.U32.AND P6, PT, R6, R131, PT 
5631	00000013 0204b5e0	     ISETP.NE.U32.AND P5, PT, R6, R123, PT 
5632	00000013 0204b5f0	     ISETP.NE.AND.EX P6, PT, R7, R129, PT, P6 
5633	00000013 0204b600	     ISETP.NE.U32.AND P4, PT, R6, R119, PT 
5634	00000013 0204b610	     ISETP.NE.U32.AND P3, PT, R6, R101, PT 
5635	00000013 0204b620	     ISETP.NE.U32.AND P2, PT, R6, R127, PT 
5636	00000013 0204b630	     ISETP.NE.U32.AND P1, PT, R6, R113, PT 
5637	00000013 0204b640	     ISETP.NE.U32.AND P0, PT, R6, R103, PT 
5638	00000013 0204b650	@P6  BRA 0x130204b750 

Besides that, I managed to increase performance by 5% by using #pragma unroll 1 on the first loop. Perhaps this is a hint about what happens behind the scenes.

#pragma unroll 1
	for (i = 0; i < 2; i++) {
		for (j = 0; j < 32; j++) {
			if (a[i] != b[j])
				continue;
			int hitsc = atomicAdd(hitsn, 1);
			hits[hitsc].tid = tid;
		}
	}

BTW … this is what the compiler tells me:

1>ptxas info    : Function properties for _Z16mantas_kerneliP4HitsPjPy
1>    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 168 registers, 352 bytes cmem[0], 20 bytes cmem[2]

I wonder if you could get a speed-up by limiting the kernel to 128 registers. You would see some spill loads and stores, but your occupancy may rise.

Good idea, but performance dropped by 36%.

1>ptxas info    : Function properties for _Z16mantas_kerneliP4HitsPjPy
1>    192 bytes stack frame, 284 bytes spill stores, 300 bytes spill loads
1>ptxas info    : Used 128 registers, 352 bytes cmem[0], 36 bytes cmem[2]

Generally speaking, the heuristics used by the CUDA compiler result in good choices when trading off occupancy against register usage. In my experience, manually squeezing the register count below the limit chosen by the compiler by more than 2 or 3 registers will lead to a decrease in performance most of the time. In other words, reducing from 131 to 128 registers may prove beneficial for performance, but reducing from 168 to 128 registers is exceedingly unlikely to do so.

The 168 registers (peak) are used only prior to the if() condition (this is confirmed by Nsight Compute), which is at the very end. The performance drop resides inside the result-evaluation code, which doesn't do very much (see above). Therefore I believe it's the memory access of b[] that slows down the code. a[] is hardcoded by nvcc, as you can see in the SASS excerpt. But b[] I see in registers (see SASS above), which should be fast.

I also tried to avoid the if() condition completely, but this slows down the code even further.

Strange also that the #pragma unroll 1 directive on first loop increased performance by 5%. I would have thought it should be the other way around.

The point of maximum register usage dictates how many registers need to be allocated for the entire kernel, independent of whether this point is in a frequently-used part of the code or not. In fact, it is not unusual to have a kernel with a “slim” fast path requiring relatively few registers, and that is used the vast majority of the time, and an infrequently exercised “fat” slow path that requires more registers than the fast path because it is responsible for the complex handling of gnarly special cases.

It is not advisable to make such assumptions. Look at the generated machine code (SASS) and use the CUDA profiler to explore. One scenario where a fully rolled loop could be faster than a partially-unrolled loop is when the size of the latter exceeds the size of the rather small instruction cache (keep in mind that every machine instruction occupies 8 bytes and is coupled with another 8 bytes of operation-steering data; a single instruction thus requires 16 bytes).