[Solved] Performance drop when writing immediates to pinned memory

[Edit]
Short summary, original below.
Writing an immediate value directly as part of a vectorized store seems to produce inefficient code, which performs exceptionally badly when the destination is pinned host memory.

outval.x = 255;    // <<<  PROBLEM
outval.y = (uchar)(in_val.x * 255.0f);  //no problem
....
output[(y + j)*width + x] = outval;

Workaround: Replace the immediate with a kernel argument or a __constant__ variable.
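
A minimal sketch of the workaround, with illustrative names (either the kernel argument or the __constant__ variable works):

__constant__ unsigned char c_alpha;   // alternative: fill once via cudaMemcpyToSymbol(c_alpha, &v, 1)

__global__ void store_kernel(uchar4 *output, float3 const * __restrict__ in,
                             int width, unsigned char alpha)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    float3 v = in[y * width + x];

    uchar4 outval;
    outval.x = alpha;                          // kernel argument (or: outval.x = c_alpha;) instead of 255
    outval.y = (unsigned char)(v.x * 255.0f);
    outval.z = (unsigned char)(v.y * 255.0f);
    outval.w = (unsigned char)(v.z * 255.0f);
    output[y * width + x] = outval;            // stays a single vectorized store
}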

[Original]
Hello,

I have a question about some strange behavior I encountered recently when using pinned memory and immediate values.

My small kernel was supposed to do some calculations in a streaming-like manner and write the results to pinned memory. I struggled with some serious performance issues until I found out they had nothing to do with the calculations, the memory loads, the indexing (which is contiguous and aligned), or anything else (in fact, commenting the code out completely makes it even worse).

The symptom was:

  • Not the greatest performance in the Debug build (as expected)
  • Extremely bad performance in the Release build: two orders of magnitude worse than the Debug build; a small µC would have outperformed the RTX this is currently running on (not as expected)
  • No obvious information from Nsight Compute, except that warp cycles per executed instruction shot through the roof (and everything else drops to zero, of course). In the Debug build the profiling results were comparable to a simple copy kernel (my kernel is memory-bandwidth bound, too).

The kernel looks, in short, somewhat like this (a self-contained example is in an answer below):

__global__ void
test_kernel(uchar4       *output,                 // <- pinned memory
            float3 const * __restrict__ data)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

#pragma unroll
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {

        // Load data, sample textures, do some calculations
        // ....

        uchar4 outval;
        outval.x = 255;    // <<<  PROBLEM
        outval.y = a;
        outval.z = b;
        outval.w = c;

        output[(y + j)*width + x] = outval;
    }
}

The reason for these symptoms was the immediate value in the initialization of outval, in combination with pinned memory. If the output is device memory instead, or the immediate is replaced by a fetch from constant memory, some other variable, or a kernel parameter, then the performance is fine (limited by PCIe bandwidth).

This is not the only immediate in my code; however, this particular one is the only one causing problems.

Looking at the SASS instructions, I noticed that the instruction for the immediate is actually missing completely in the Release build. In the Debug build it is not optimized out and the program runs fine.

So my questions are:

  • Where is the immediate gone?
  • Why does this little immediate cause such abysmal performance after optimization?

Additional info:
Test devices: RTX 2060 and Jetson TX2
Compute capability: 7.5 and 6.2
CUDA Toolkit: 10.2, Builder: 10.1

Thanks in advance!
Patrick

Consider expanding your code snippet into a minimal self-contained program that reproduces the issue and that others can compile and run.

Here’s a small example (see below for the source file). I’m building and running with Nsight Eclipse 10.2 (Release build active) / CUDA Toolkit 10.2, using the following compiler/linker options:

-lineinfo -O3
--cudart static --relocatable-device-code=true -gencode arch=compute_75,code=compute_75 -gencode arch=compute_75,code=sm_75
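
(On a plain command line this corresponds roughly to an nvcc invocation along these lines; the output name is made up.)

nvcc -lineinfo -O3 --cudart static --relocatable-device-code=true \
     -gencode arch=compute_75,code=compute_75 -gencode arch=compute_75,code=sm_75 \
     -o test_immediate main.cu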

The program has two preprocessor defines which can be used to toggle the different behaviors:
#define PINNED_MEMORY
#define USE_IMMEDIATE

I get the following output for different configurations:

  1. deactivated: PINNED_MEMORY, irrelevant: USE_IMMEDIATE

    Device : GeForce RTX 2060
    Matrix size: 4096 3072, Block size: 32 8, Tile size: 32 32
    dimGrid: 128 96 1. dimBlock: 32 8 1

    Benchmark: 0.892637[ms] 1120.28[cycles per second]

  2. activated: PINNED_MEMORY, deactivated: USE_IMMEDIATE (-> PCIE performance)

    Benchmark: 4.54064[ms] 220.233[cycles per second]

  3. activated: PINNED_MEMORY, activated: USE_IMMEDIATE (!! should be PCIE performance)

    Benchmark: 234.881[ms] 4.25748[cycles per second]

main.cu (3.8 KB)
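
(For readers without the attachment: main.cu is, in outline, something like the sketch below. Names, the data handling, and the timing are illustrative rather than the literal contents of the file.)

// Outline of the repro (illustrative; error checking and the benchmark loop omitted)
#include <cstdio>
#include <cuda_runtime.h>

#define TILE_DIM    32
#define BLOCK_ROWS  8
#define PINNED_MEMORY      // toggle: pinned host memory vs. device memory
#define USE_IMMEDIATE      // toggle: immediate 255 vs. value passed as kernel argument

__global__ void test_kernel(uchar4 *output, float3 const * __restrict__ data, unsigned char arg)
{
    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    int width = gridDim.x * TILE_DIM;

#pragma unroll
    for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
        float3 v = data[(y + j) * width + x];
        uchar4 outval;
#ifdef USE_IMMEDIATE
        outval.x = 255;                       // immediate -> slow path with pinned memory
#else
        outval.x = arg;                       // kernel argument -> fast path
#endif
        outval.y = (unsigned char)(v.x * 255.0f);
        outval.z = (unsigned char)(v.y * 255.0f);
        outval.w = (unsigned char)(v.z * 255.0f);
        output[(y + j) * width + x] = outval;
    }
}

int main()
{
    const int width = 4096, height = 3072;
    const size_t n = (size_t)width * height;

    float3 *data;
    cudaMalloc((void **)&data, n * sizeof(float3));   // input left uninitialized; only the store path matters

    uchar4 *output;
#ifdef PINNED_MEMORY
    cudaHostAlloc((void **)&output, n * sizeof(uchar4), cudaHostAllocMapped);  // with UVA the same pointer is usable in the kernel
#else
    cudaMalloc((void **)&output, n * sizeof(uchar4));
#endif

    dim3 dimBlock(TILE_DIM, BLOCK_ROWS);
    dim3 dimGrid(width / TILE_DIM, height / TILE_DIM);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    test_kernel<<<dimGrid, dimBlock>>>(output, data, 255);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Benchmark: %f [ms]\n", ms);
    return 0;
}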

I have three different GPUs here. With the first one, I see identical performance for all three build configurations. Let me try the other two.

[Later:]

On my Quadro RTX 4000 I see behavior which essentially matches OP’s observations.

[Later:]

Using CUDA 11.1 I am able to reproduce the reported behavior with Pascal (sm_61) and Turing (sm_75) GPUs. I wasn’t able to repro on a Kepler-family GPU, but I am using a different host system and CUDA version for that.

Quadro P2000     PINNED_MEMORY  USE_IMMEDIATE  milliseconds
                       0              0            1.77
                       0              1            1.88
                       1              0            4.21
                       1              1           87.33

Quadro RTX 4000  PINNED_MEMORY  USE_IMMEDIATE  milliseconds
                       0              0            0.57
                       0              1            0.57
                       1              0            4.20
                       1              1           49.64

Will look at the code now to see what it’s measuring.


I’m able to witness the behavior on a V100. I get a benchmark time of ~4 ms in case 2 and ~91 ms in case 3. A quick look at the SASS code suggests to me that the slow case writes out the uchar4 quantity with 4 independent STG.E.U8.SYS instructions, whereas the fast case writes the uchar4 with a single STG.E.SYS instruction. Given that the target here is PCIe, I think this could have a substantial effect on performance. I’m on CUDA 11.1.

No workarounds (other than what you have already) immediately suggest themselves to me. Your code looks reasonable and I can’t explain the compiler behavior. My suggestion would be:

  1. Re-verify the observation on the latest CUDA version available.
  2. If the observation is still evident on the latest CUDA version, file a bug using the sticky post instructions at the top of this sub-forum.

I had the opportunity to try it on a very old GT 750M now, also with CUDA 10.2, and it seems to have equal runtime for all configurations, as njuffa reported. The time measurements don’t seem to be right, though.

Interestingly, looking at the SASS, I get a single STG.E.SYS + 3 independent STG.E.U8.SYS instructions in case 3 on the RTX 2060 (the instruction for the immediate seems to be missing).

[Edit]
Apparently the old system running the 750 [sm_35] wasn’t working correctly. After a restart it showed the reported behavior as well.

Yes, I mis-spoke/mis-observed. That is what I see also.

If you look at the disassembly closely, you can see that the initial 32-bit STG.E writes 0x000000FF, then the next three STG.E.U8 write out the upper three bytes individually.

Even weirder: The 0x000000FF is actually retrieved from constant memory c[0x2][0x0], gets sent through a BFI instruction to extract the least significant byte (resulting in 0x000000FF yet again) before it gets written out with the STG.E.
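
In C terms, the store sequence described above effectively amounts to something like this (a paraphrase of the disassembly, not actual compiler output; idx, a, b, c stand for the index and the computed byte values):

unsigned int ff = 0x000000FF;                  // in the SASS this value comes from constant bank c[0x2][0x0]
*(unsigned int *)&output[idx] = ff;            // 32-bit STG.E writes 0x000000FF (clobbering bytes 1..3)
((unsigned char *)&output[idx])[1] = a;        // three STG.E.U8 then rewrite the upper three bytes
((unsigned char *)&output[idx])[2] = b;
((unsigned char *)&output[idx])[3] = c;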

Somehow the compiler seems very confused here. It looks like a struggle between constant propagation and vectorized loads/stores (wild speculation).

No matter what target architecture I specify (I tried from sm_35 up), the observed behavior stays the same when running with sm_75 hardware. This would suggest the root cause of this issue is in the compiler backend ptxas, which compiles PTX into machine code (SASS).

The reason this code generation issue is visible as a performance anomaly only when using pinned memory is that the GPU memory subsystem generally merges small writes into wide buffers which then get written out to GPU memory in large chunks. Based on my numbers above, on Pascal the hardware cannot completely neutralize the inefficiency of the software behavior, as there is a very small but reproducible increase in execution time with USE_IMMEDIATE.


If you look at the disassembly closely, you can see that the initial 32-bit STG.E writes 0x000000FF, then the next three STG.E.U8 write out the upper three bytes individually.

Looking at my SASS, I see the c[0x2][0x0] load. However, it’s not doing a BFI; instead there is a whole lot of PRMT, LOP3, etc., as if it’s trying to construct the number out of nowhere.

Out of interest I changed uchar4 to uint4. Now it simply writes the immediate 0xff to memory. However, it does STG.E.128.SYS, STG.E.64.SYS and STG.E.SYS one after another.
Changing to uchar and writing every byte individually (so no vector store anymore) results in four separate STG.E.U8 instructions, but again it does weird things with the immediate (two constant memory fetches and a PRMT). I’m not too familiar with SASS; however, it does not look right to me.
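
Roughly, the uint4 variant looks like this (out_u4 is a hypothetical uint4 output pointer; a, b, c as in the original kernel):

uint4 outval;                              // uchar4 replaced by uint4
outval.x = 255;                            // the immediate now shows up as a plain 0xff in the SASS
outval.y = a;
outval.z = b;
outval.w = c;
out_u4[(y + j) * width + x] = outval;      // still gets split into STG.E.128 + STG.E.64 + STG.E stores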

The reason this code generation issue is visible as a performance anomaly only when using pinned memory is that the GPU memory subsystem generally merges small writes into wide buffers which then get written out to GPU memory in large chunks. Based on my numbers above, on Pascal the hardware cannot completely neutralize the inefficiency of the software behavior, as there is a very small but reproducible increase in execution time with USE_IMMEDIATE.

I guess that’s true. The generated code indeed does not change depending on the memory type, which makes sense, I guess.

Turns out I was describing the sm_61 machine code (too many windows open). And of course BFI is bit-field insert, not extract. PRMT operates at the byte level, so it can also be used to do byte-wise inserts and may result in better performance on some architectures.

It is good that you already identified a workaround, and filing a bug should not take long given that you already have a great repro app.

I have now tested it with CUDA 11.2; same problem again. A bug report is filed.
Thanks for the great help!

@tunxten You are welcome.

Unfortunately I goofed when I wrote that the problem seems to be in the compiler backend ptxas. The fact that the problem occurs regardless of target architecture obviously points at a root cause in an architecture-independent part of the compiler, so not ptxas. In fact, the problem is readily visible in the PTX output produced by the nvvm part of the compiler:

st.global.v4.u8 [%rd12], {%rs1, %rs5, %rs6, %rs7};
st.global.u8 [%rd12+1], %r15;
st.global.u8 [%rd12+2], %r16;
st.global.u8 [%rd12+3], %r17;

Yes, I noticed that too when I looked at the PTX to add an example to the bug report. However, in the PTX the immediate is simply hardcoded, moved to %rs1 and written to memory directly, whereas in the SASS it is loaded from constant memory. Apparently this depends on the implementation. When using uint4 as output, the immediate is also hardcoded in the SASS, as I wrote earlier. I assume the immediate should always be hardcoded, as this is the most efficient way. Hence I think something might be odd with the backend, too; or am I misunderstanding something there?
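
For illustration, the hardcoded-immediate PTX described here boils down to something like the following (the mov line is a reconstruction, not verbatim compiler output):

mov.u16         %rs1, 255;                          // immediate placed straight into %rs1
st.global.v4.u8 [%rd12], {%rs1, %rs5, %rs6, %rs7};  // and written out by the vector store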

Note that NVIDIA’s bug database is set up for confidentiality, so one cannot share bug reports. Only the filer and relevant NVIDIA personnel have access. This protects sensitive information that could be revealed through code appended to bug reports. And even the simple knowledge that someone is using GPUs could be sensitive information in some circumstances.

For what it is worth, this vector-store code generation bug does not exist in PTX produced by CUDA 9.2, which is why I was unable to reproduce it on my older system with the Kepler-based GPU. Since you observe it with CUDA 10.2, it looks like it was first introduced in CUDA 10. Maybe (speculation) NVIDIA performed a re-base of nvvm to a newer version of llvm at that time ? At least this is how some other errors have crept into nvvm historically.

As for the storing of literal constants: As long as access is uniform across the warp, there is usually no performance difference between sticking them into a constant memory bank or the immediate field on an instruction. This assumes that relevant constants fit into the constant cache, which is something like 4KB or 8KB.

However, depending on architecture, the immediate field of instructions may not be able to hold a full-sized value (e.g. some single-precision constant fields on pre-Volta architectures can only hold the 20 leading bits), and there may not always be an immediate field for some operands of an instruction. On the other hand, constant bank storage is limited, and the constant cache that allows constant bank data to be accessed at the speed of an immediate operand is even smaller, so ptxas presumably first tries to stick constants into an immediate field, and where that isn’t possible, puts the constant into a constant bank.
