_Tom
1
Hi,
I made this really simple kernel to emulate a deferred lighting shader:
__global__ void deferredLightingPassKernel
(
    Normal* screenNormals,
    Pixel* lightOut,
    float intensity,
    Color32 lightColor,
    Normal lightDirection
)
{
    uint i = threadIdx.x + blockIdx.x * blockDim.x;

    // one global load per thread, consecutive indices
    Normal normal = screenNormals[i];

    // simple N.L diffuse term for a single directional light
    float light = saturate(dot( normal, -lightDirection ));

    // read-modify-write: one global load plus one global store
    lightOut[i] += make_pixel(lightColor) * light * intensity;
}
The strange thing is that the global stores are NEVER coalesced, which is a huge hit on performance…
This shouldn’t happen: Pixel is explicitly aligned(16), and there are no branches and no scatter… sequential, aligned accesses are exactly what the GPU is designed for.
Maybe it’s because screenNormals and lightOut actually point to the same memory?
Difficult to say without knowing the fields of “Normal” and “lightOut”.
_Tom
3
Normal is a typedef of float4; Pixel contains 4 floats, so it’s identical in size and alignment to a float4… they should both be 16-byte aligned.
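Concretely, the declarations look roughly like this (a sketch; the field names are illustrative):
// Sketch of the type declarations described above (field names illustrative)
typedef float4 Normal;

struct __align__(16) Pixel
{
    float r, g, b, a;   // 4 floats -> same size and alignment as float4
};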
It may be worth trying with two different arrays and running it through the profiler…
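Something like this, for example (a sketch — numPixels and blockSize are placeholders, the other arguments are from your original signature):
// Sketch: two completely separate device allocations
Normal* dNormals = 0;
Pixel*  dLight   = 0;
cudaMalloc( (void**)&dNormals, numPixels * sizeof(Normal) );
cudaMalloc( (void**)&dLight,   numPixels * sizeof(Pixel) );

deferredLightingPassKernel<<< numPixels / blockSize, blockSize >>>(
    dNormals, dLight, intensity, lightColor, lightDirection );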
_Tom
5
I finally tried with two completely distinct buffers, and the stores are still not coalesced.
Here’s the updated code:
__global__ void deferredLightingPassKernel
(
    Normal* screenNormals,
    Pixel* lightOut
)
{
    uint i = threadIdx.x + blockIdx.x * blockDim.x;

    // one global load per thread, consecutive indices
    Normal normal = screenNormals[i];
    Pixel final = make_pixel(0,0,0,0);

    // do light computation for each directional light (always affecting)
    #pragma unroll 2
    for( uint l = 0; l < directLightNb; ++l )
    {
        final += directLight[l].diffuse *
                 saturate(dot( normal, -directLight[l].direction )) *
                 directLight[l].intensity;
    }

    __syncthreads();

    // one global store per thread, consecutive indices
    lightOut[i] = final;
}
Since there’s only one gld and one gst, both using the “default” index, I think this HAS to be coalesced.
The profiler also reports ~14000 branches, even though there are always the same 2 lights and the loop should be unrolled.
Maybe it’s my 8600GT that can’t do these optimizations?
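If the compiler really isn’t honoring the #pragma, I guess I could unroll the 2-light case by hand, something like this (a sketch, assuming directLightNb is always 2):
// Sketch: manual unroll of the loop body for exactly 2 directional lights
Pixel final = make_pixel(0,0,0,0);
final += directLight[0].diffuse *
         saturate(dot( normal, -directLight[0].direction )) *
         directLight[0].intensity;
final += directLight[1].diffuse *
         saturate(dot( normal, -directLight[1].direction )) *
         directLight[1].intensity;
lightOut[i] = final;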