[Edit]
Short summary, original below.
Writing immediates directly to memory in a vector store operation seems to produce inefficient code, which runs exceptionally bad when the destination is a pinned host memory.
outval.x = 255; // <<< PROBLEM
outval.y = (uchar)(in_val.x * 255.0f); //no problem
....
output[(y + j)*width + x] = outval;
Workaround: Replace the immediate with a kernel argument or __constant__
variable
[Original]
Hello,
I have a question about some strange behavior I encountered recently when using pinned memory and immediate values.
My small kernel was supposed to do some calculations in a streaming like manner and write the results to pinned memory. I had to struggle with some serious performance issues until I found out it has nothing to do with the calculations, memory loading, indexing (is contiguous and aligned) or else (actually commenting all the code completely out makes it even worse).
The symptom was:
- Not the greatest performance in Debug build (as expected)
- Extremely bad performance in Release build (two orders of magnitude worse than Debug build. A small µC would have outperformed the RTX this is currently running on - not as expected)
- No obvious information from Nsight Compute, except that warp cycles per executed instruction shooted to the moon (and everything else goes to zero of cause). In Debug build the profiling results were comparable with a simple copy kernel (my kernel is memory-bandwidth bound, too).
The kernel looks, in short, somewhat like this (self contained example in answer below):
__global__ void
test_kernel(uchar4 *output, // <- pinned memory
float3 const * __restrict__ data )
{
int x = blockIdx.x * TILE_DIM + threadIdx.x;
int y = blockIdx.y * TILE_DIM + threadIdx.y;
int width = gridDim.x * TILE_DIM;
#pragma unroll
for(int j = 0; j < TILE_DIM; j += BLOCK_ROWS) {
// Load data, sample textures, do some calculations
// ....
uchar4 outval;
outval.x = 255; // <<< PROBLEM
outval.y = a;
outval.z = b;
outval.w = c;
output[(y + j)*width + x] = outval;
}
}
The reason for these symptoms was the immediate value in the initialization of outval in combination with pinned memory. If either outval was device memory or the immediate is replaced for a fetch from constant memory, some other variable or kernel parameter, then the performance is fine (limited by PCIE bandwidth).
This is not the only immediate in my code, however this particular one is the only one causing problems.
Looking at the SASS instructions I noticed the instruction for the immediate is actually missing completely in Release build. In Debug build its not optimized out and the program runs fine.
So my questions are:
- Where is the immediate gone?
- Why does this little immediate cause such an abysmal performance after optimization?
Additional infos:
Test Devices: RTX2060 and Jetson TX2
Compute Cap: 7.5 and 6.2
Cuda Toolkit: 10.2, Builder: 10.1
Thanks in advance!
Patrick