I have some performance questions/issues regarding the following kernel code. It basically calculates a value (b), which needs to match one of two hardcoded 64-bit tokens (a[]). The hits struct is dimensioned to 20, so it can hold 20 hits per kernel call (which is sufficient). I know that code, which is never …

Uint64_t result evaluation & storage eats up 25% of kernel performance


striker159 September 12, 2023, 3:31pm 3

What kind of operations are you measuring (Gop/s)?
What GPU are you using?
Please provide a fully working example code that others could use to benchmark.

Some thoughts:

The second atomic kernel has a race condition on *hitsn. Multiple threads could read the same value and then write their hits to the same slot.
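The original kernel is not shown here, so the following is only a minimal sketch of one common way to avoid that kind of race: each thread takes its slot from the return value of atomicAdd instead of reading *hitsn separately. The names Hit, compute_b, find_hits and run_find_hits are hypothetical, and compute_b is a placeholder for whatever the real kernel computes.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

constexpr unsigned int MAX_HITS = 20;  // capacity mentioned in the original post

// Hypothetical hit record; the real struct layout is not shown in the thread.
struct Hit {
    uint64_t index;
    uint64_t value;
};

// Placeholder for the real computation of b (assumption, not the poster's code).
__host__ __device__ inline uint64_t compute_b(uint64_t i) {
    return i * 0x9E3779B97F4A7C15ull;
}

__global__ void find_hits(uint64_t n, uint64_t a0, uint64_t a1,
                          Hit* hits, unsigned int* hitsn) {
    uint64_t tid = blockIdx.x * static_cast<uint64_t>(blockDim.x) + threadIdx.x;
    if (tid >= n) return;

    uint64_t b = compute_b(tid);
    if (b == a0 || b == a1) {
        // atomicAdd returns the value held before the increment, so every
        // thread that finds a hit gets a distinct slot number.
        unsigned int slot = atomicAdd(hitsn, 1u);
        if (slot < MAX_HITS) {  // drop hits beyond the buffer capacity
            hits[slot].index = tid;
            hits[slot].value = b;
        }
    }
}

// Host helper: runs the kernel and copies the stored hits back.
// out_hits must have room for MAX_HITS entries.
// Error checking is omitted to keep the sketch short.
unsigned int run_find_hits(uint64_t n, uint64_t a0, uint64_t a1, Hit* out_hits) {
    Hit* d_hits = nullptr;
    unsigned int* d_hitsn = nullptr;
    cudaMalloc(&d_hits, MAX_HITS * sizeof(Hit));
    cudaMalloc(&d_hitsn, sizeof(unsigned int));
    cudaMemset(d_hitsn, 0, sizeof(unsigned int));

    const unsigned int threads = 256;
    const unsigned int blocks = static_cast<unsigned int>((n + threads - 1) / threads);
    find_hits<<<blocks, threads>>>(n, a0, a1, d_hits, d_hitsn);

    unsigned int count = 0;
    cudaMemcpy(&count, d_hitsn, sizeof(count), cudaMemcpyDeviceToHost);
    if (count > MAX_HITS) count = MAX_HITS;  // counter may exceed what was stored
    cudaMemcpy(out_hits, d_hits, count * sizeof(Hit), cudaMemcpyDeviceToHost);

    cudaFree(d_hits);
    cudaFree(d_hitsn);
    return count;
}
```

Calling run_find_hits(1u << 20, compute_b(123), compute_b(4567), buffer) from main with a buffer of MAX_HITS entries should report both matching indices, in no particular order, since the order in which threads reach the atomicAdd is not defined.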

To me, this looks like it could be implemented with Thrust, using a combination of a transform iterator and copy_if.
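A rough sketch of what that Thrust approach could look like, assuming the candidates are simply the indices 0..n-1. ComputeB, MatchesToken and find_hit_indices are hypothetical names, and ComputeB stands in for the real calculation of b. The transform iterator produces b lazily and serves as the stencil for copy_if, which then keeps only the indices whose b matches a token.

```cuda
#include <cstddef>
#include <cstdint>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Placeholder for the real computation of b (assumption, not the poster's code).
struct ComputeB {
    __host__ __device__ uint64_t operator()(uint64_t i) const {
        return i * 0x9E3779B97F4A7C15ull;
    }
};

// Predicate: does b match one of the two tokens?
struct MatchesToken {
    uint64_t a0, a1;
    __host__ __device__ bool operator()(uint64_t b) const {
        return b == a0 || b == a1;
    }
};

// Returns the indices in [0, n) whose computed b equals a0 or a1.
// copy_if writes every match, so capacity must be at least the number of
// possible hits; the caller is responsible for choosing it large enough.
thrust::host_vector<uint64_t> find_hit_indices(uint64_t n, uint64_t a0,
                                               uint64_t a1, std::size_t capacity) {
    thrust::counting_iterator<uint64_t> idx(0);
    // b is computed on the fly for each index; nothing is stored for it.
    auto b_values = thrust::make_transform_iterator(idx, ComputeB{});

    thrust::device_vector<uint64_t> out(capacity);
    // Stencil form of copy_if: copy idx[i] where pred(b_values[i]) is true.
    auto out_end = thrust::copy_if(idx, idx + n, b_values, out.begin(),
                                   MatchesToken{a0, a1});

    // Copy only the filled part of the output back to the host.
    return thrust::host_vector<uint64_t>(out.begin(), out_end);
}
```

Because copy_if is stable, the returned indices come back in increasing order, and there is no hand-written atomic counter to get wrong. Whether this is faster than a custom kernel would have to be measured.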
