Hi,
I'm trying to get the best performance out of a DX11 compute shader Dispatch.
I have 100x100 mesh data and I'm dispatching 4x4 thread groups
Dispatch(4,4,1);
with 32x32 threads per group.
I do not use groupshared memory.
How can I ensure that I get coalesced memory access? This is a very simple compute kernel.
At the moment I'm only guessing at how to keep memory access aligned and coalesced.
All the GPU tools I found (Intel, Nvidia, AMD) show me a GPU time in microseconds. That's nice, but it tells me nothing about whether the kernel has aligned and coalesced memory access.
The AMD ShaderAnalyzer shows a throughput value in GB/s, but no matter how I access the memory, the reported throughput stays the same.
I know your CUDA and OpenCL examples, where each kernel gets tested separately and the command-line tool reports things like bandwidth and coalesced access.
Is there something similar for DX11?
Do I need
GroupMemoryBarrierWithGroupSync();
before each store, as in the snippet below, to make all threads access the memory at the same time and get coalesced memory access, or is it only required to sync groupshared memory?
GroupMemoryBarrierWithGroupSync();
BufferOut.Store4(
offsetout * 4, // destination offset in bytes
write4);
GroupMemoryBarrierWithGroupSync();
// index 4
BufferOut.Store2(
(offsetout+4) * 4, // destination offset in bytes
write2);
Here is the complete shader:
#define groupthreads_x 32
#define groupthreads_y 32
#define groupthreads_z 1
static const int3 groupthreads = { groupthreads_x,groupthreads_y,groupthreads_z };
[numthreads(groupthreads_x, groupthreads_y, 1)]
void main( uint3 Gid : SV_GroupID,
uint3 Gtid : SV_GroupThreadID,
uint3 dtid : SV_DispatchThreadID,
uint GI : SV_GroupIndex )
{
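// xy within the group, recovered from the flattened group index
uint2 CurrentXY = uint2(GI % groupthreads_x,
GI / groupthreads_x);
// global xy (same value as dtid.xy)
uint2 tid = Gid.xy * uint2(groupthreads_x, groupthreads_y) + CurrentXY;
// one quad per thread: a W x H vertex grid has (W-1) x (H-1) quads,
// so reject threads in the last column/row and in the dispatch padding
if(tid.x < g_nMeshWidth-1 && tid.y < g_nMeshHeight-1)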
{
uint nFace = tid.x + tid.y * (g_nMeshWidth-1);
uint xi = tid.x + tid.y * g_nMeshWidth;
uint xmi = xi + g_nMeshWidth;
uint offsetout = g_nOffsetOutput+nFace*6;
// write 6 uints (two triangles) by writing 4 + 2
// index 0
uint4 write4 = uint4(
xi,
xi+1,
xmi,
xmi
);
uint2 write2 = uint2(
xi+1,
xmi+1
);
BufferOut.Store4(
offsetout * 4, // destination offset in bytes
write4);
// index 4
BufferOut.Store2(
(offsetout+4) * 4, // destination offset in bytes
write2);
}
}