CUDA Texturing Performance vs DX11

Hi again,

I’ve compared CUDA texturing performance against DX11 texturing performance:

CUDA : 14 fps (RGBA32F, 3200 texture lookups per pixel per frame, 200 writes per frame, 512x512 pixels)

DX11 : 30 fps (RGBA32F, 3200 texture lookups per pixel per frame, 200 writes per frame, 512x512 pixels)

Note: each thread/shader reads (almost) the same pixel location 16 times. Exes attached. Source attached for CUDA (modified simpleD3D11Texture sample).
Note2 : I’ve tried cudaFuncSetCacheConfig(), but it made no difference.
Note3 : CUDA is also slower with R32F, but the difference is smaller.
Note4 : Same results when the texture is allocated with DX11 and shared (through a cudaArray).
Note5 : GTX 570, Win7 x64, CUDA 4.0.13, driver 270.61
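
For scale, the figures above can be turned into a texture lookup rate. This is only a minimal host-side C++ sketch of the arithmetic; the function names are made up for illustration and are not part of the attached sources.

```cpp
#include <cassert>
#include <cstdint>

// Texture lookups issued per frame for a width x height target,
// given how many lookups each pixel performs per frame.
std::uint64_t lookups_per_frame(std::uint64_t width, std::uint64_t height,
                                std::uint64_t lookups_per_pixel) {
    return width * height * lookups_per_pixel;
}

// Sustained lookup rate (lookups per second) at a measured frame rate.
double lookups_per_second(std::uint64_t width, std::uint64_t height,
                          std::uint64_t lookups_per_pixel, double fps) {
    assert(fps > 0.0);  // a frame rate of zero or less is not a valid measurement
    return static_cast<double>(
               lookups_per_frame(width, height, lookups_per_pixel)) * fps;
}
```

With 512x512 pixels and 3200 lookups per pixel per frame, lookups_per_frame gives 838,860,800, so 14 fps works out to roughly 11.7 billion lookups per second and 30 fps to roughly 25.2 billion.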

Is this normal?

Octavian
DX11TexturingTest.rar (154 KB)
simpleD3D11Texture.rar (235 KB)

UPDATE : My test was biased as I was only using the first component in the DX version.

float val;
for (int i = 0; i < 16; i++)
    val += t0.SampleLevel(s0lr, tc + float2(i * .00001f, 0), 0) / 16.f;
return val;

This had to be replaced with:

float4 val = float4(0, 0, 0, 0);
for (int i = 0; i < 16; i++)
    val += t0.SampleLevel(s0lr, tc + float2(i * .00001f, 0), 0) / 16.f;
return val;

With that change, I get similar frame rates (14-15 fps) in both DX11 and CUDA. My bad :P

But my actual problem is ping-pong performance using CUDA textures. I thought I had narrowed it down to texture lookups, but it seems not!

Will post code soon!

Ping-pong test:

512x512 RGBA32F ctexa0, ctexa1;

for (int i = 0; i < 200; i++)
{
    ctexa1 = ctexa0 * .9998f;
    ctexa0 = ctexa1 * .9998f;
}

CUDA : 58 fps
DX11 : 80 fps
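
Before comparing frame rates, it may be worth checking that both versions produce the same values. Below is a minimal CPU reference for the loop above, written in plain C++. It is not the CUDA or DX11 code from the attachments, and it assumes each texture can be treated as a flat array of floats; all names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One pass: dst = src * scale, element by element.
void scale_pass(const std::vector<float>& src, std::vector<float>& dst,
                float scale) {
    for (std::size_t i = 0; i < src.size(); ++i) dst[i] = src[i] * scale;
}

// Runs `iterations` ping-pong iterations (two passes each) and returns
// the final contents of the first buffer.
std::vector<float> ping_pong(std::vector<float> tex0, int iterations,
                             float scale) {
    std::vector<float> tex1(tex0.size());
    for (int i = 0; i < iterations; ++i) {
        scale_pass(tex0, tex1, scale);  // ctexa1 = ctexa0 * scale
        scale_pass(tex1, tex0, scale);  // ctexa0 = ctexa1 * scale
    }
    return tex0;
}

// Closed-form factor every element is multiplied by: scale^(2 * iterations).
double expected_factor(int iterations, double scale) {
    return std::pow(scale, 2.0 * iterations);
}
```

Reading ctexa0 back after one frame and comparing it with ping_pong(initial, 200, 0.9998f) within a small tolerance would confirm that the CUDA and DX11 paths do the same amount of work per frame.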

PS : I’m working on a Navier-Stokes fluid sim (2D, 3D) and I’m wondering whether CUDA will give me the best (fastest) solution!

Octavian
DX11PingPong.rar (327 KB)
simpleD3D11Texture_PingPongCuda.rar (474 KB)