Why is DirectCompute 2x faster than CUDA for my kernel?

I have a CUDA kernel that should be limited only by the texture fill rate of the device. Running on the GTX 460 it is only half the speed it should be. I’ve now ported the kernel to a DirectX Compute Shader 5.0 shader and it runs twice as fast (i.e. as it should do). I don’t really want to rewrite my whole application (although it is tempting just for the multi-vendor support). What can I do to find the problem in the CUDA version?

Some possibilities:

  1. Code bug :)

  2. Different execution configuration (block size and grid size).

  3. Different fetch method, i.e. tex1Dfetch vs tex1D

  4. Different code generation

Without a code example it is almost impossible to say anything.

I could get 146 GB/s, or 9.1 GTexel/s, with 128-bit tex1Dfetch. That is very close to the results posted here: http://www.hardware.fr/articles/795-4/dossier-nvidia-geforce-gtx-460.html
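
For reference, a minimal 128-bit tex1Dfetch throughput test of this kind looks roughly like the sketch below. This is not the benchmark behind the numbers above; the buffer size and launch configuration are assumptions, and it uses the legacy texture reference API of that era.

```
#include <cstdio>
#include <cuda_runtime.h>

// Legacy texture reference bound to linear memory; one float4 (128 bits) per texel.
texture<float4, 1, cudaReadModeElementType> texRef;

__global__ void copyThroughTex(float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);   // one 128-bit texture fetch per thread, coalesced store
}

int main()
{
    const int n = 1 << 22;                // 4M float4 texels (64 MB) -- assumed size
    float4 *src, *dst;
    cudaMalloc(&src, n * sizeof(float4));
    cudaMalloc(&dst, n * sizeof(float4));
    cudaMemset(src, 0, n * sizeof(float4));
    cudaBindTexture(0, texRef, src, n * sizeof(float4));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    copyThroughTex<<<(n + 255) / 256, 256>>>(dst, n);   // assumed 256-thread blocks
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.2f GTexel/s\n", n / (ms * 1e6));   // texels fetched divided by elapsed time
    cudaUnbindTexture(texRef);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

The GTexel/s figure is just texels fetched divided by kernel time, so it is directly comparable with the numbers quoted in this thread.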

Thanks for the response. It might well be a code bug but I don’t think it’s in my code :-).

My common code looks like this:


My CUDA code looks like this:


My DirectCompute code looks like this:


I get about 18 GTexel/s on the CUDA version and about 36 GTexel/s on the DirectCompute version.

Do you use a CUDA array or linear memory?

I’m using CUDA arrays, although it actually doesn’t matter: the textures are small enough that all four of them fit into the texture cache simultaneously, so I am not testing cache efficiency at all.
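
For anyone comparing the two options, the setup differs roughly as in the sketch below (names, sizes and the 2D layout are made up for illustration; this is not the code from this thread). The CUDA array path is fetched with tex2D, the linear memory path with tex1Dfetch.

```
#include <cuda_runtime.h>

// Legacy texture references (file scope), one per path -- sketch only.
texture<float4, 2, cudaReadModeElementType> texFromArray;  // CUDA array, fetched with tex2D()
texture<float4, 1, cudaReadModeElementType> texFromLinear; // linear memory, fetched with tex1Dfetch()

void bindBoth(const float4 *host, int width, int height)
{
    // CUDA array path: opaque, texture-friendly layout managed by the driver.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    cudaArray *arr;
    cudaMallocArray(&arr, &desc, width, height);
    cudaMemcpy2DToArray(arr, 0, 0, host, width * sizeof(float4),
                        width * sizeof(float4), height, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(texFromArray, arr, desc);
    // Kernel side: float4 v = tex2D(texFromArray, x + 0.5f, y + 0.5f);

    // Linear memory path: an ordinary device buffer bound to a 1D texture reference.
    float4 *buf;
    cudaMalloc(&buf, width * height * sizeof(float4));
    cudaMemcpy(buf, host, width * height * sizeof(float4), cudaMemcpyHostToDevice);
    cudaBindTexture(0, texFromLinear, buf, width * height * sizeof(float4));
    // Kernel side: float4 v = tex1Dfetch(texFromLinear, y * width + x);
}
```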

I just had the chance to run the CUDA version on a GTX 470. Interestingly, I get about 34 GTexel/s, which is what the specification says I should get. So this is definitely a GF104 problem, which is annoying because it means the chances of nVidia actually fixing it are probably about zero.

Your code seems to be broken. If different threads write to the same memory location (via the ptr pointer), you must use atomic functions to avoid race conditions. “volatile” doesn’t help here; it only inhibits compiler optimizations.

Different threads don’t write to the same memory location. Each thread reads/writes its own set of memory locations, and in fact these accesses are perfectly coalesced. The purpose of the “volatile” keyword is only to prevent caching of this memory, because I know in advance it would only cause thrashing.
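
To make the access pattern concrete, it has essentially the shape sketched below (a simplified stand-in, not the actual kernel, which presumably also does the texture fetches before writing back):

```
__global__ void updateOwnSlot(volatile float *ptr, int n)
{
    // Thread i touches ptr[i] and nothing else: consecutive threads in a warp hit
    // consecutive addresses (fully coalesced) and no address is ever shared, so
    // there is no race and no need for atomics.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float acc = ptr[i];   // volatile load: the compiler must issue a real memory read
                          // each time rather than reusing a register copy

    // ... combine acc with texture fetches here (omitted) ...

    ptr[i] = acc;         // volatile store back to the same, thread-private location
}
```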

Looks like I’m pretty much stuck with porting the whole application to DirectCompute.

That was weird. Yesterday, I got an offer from “njuffa” at nVidia to look over a couple of standalone test applications. Today that offer has suddenly been withdrawn with the following message:

“I jumped the gun with my message. In looking over the posted code with a compiler engineer, we determined that the issue is not one of code generation. I am afraid I am unable to pursue the issue further than that, as this is outside of my area of responsibility.”

I take that to mean that they know exactly what the problem is, that it’s at their end, and that they won’t discuss it.

I’ll be ordering my new AMD Radeon 6870 shortly.
