GTX 465 performance

I’m also testing my code on different devices, and I found that compiling my kernel with sm_12 on my GTX 465 gives code that runs more than 2.2 times faster than when I compile it with sm_20.
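
For anyone who wants to reproduce the comparison, the two builds would look roughly like this (kernel.cu is just a placeholder for my source file):

    # build for compute capability 1.2 (sm_12 cubin plus compute_12 PTX)
    nvcc -O3 -arch=sm_12 -o app_sm12 kernel.cu
    # build natively for Fermi
    nvcc -O3 -arch=sm_20 -o app_sm20 kernel.cu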

That’s really interesting, I’ll try that as well. Initially I was thinking that maybe it has fewer SMs, and that if the code diverges it takes the SM longer to issue instructions. But I have a piece of code that has no divergent branches and it’s still slower than on the GTX 260.
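
Just to illustrate what I mean by divergence, a toy kernel (not my actual code): when threads of the same warp take different sides of a branch, the SM has to issue both paths one after the other.

    __global__ void divergent(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // odd and even threads of the same warp take different branches,
        // so the warp executes both paths serially
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }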

As a reference, I started playing with the samples from the SDK. On your 465, Magorath, what results do you get for the samples I mentioned?

For this reason I am sticking with sm_11/12 for now. I find that my binaries compiled with CUDA SDK 2.3 still run great on Fermi, so I won’t upgrade my production environment. I have only set up a single installation with CUDA SDK 3.1 for evaluation purposes.
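
For what it’s worth, a build along these lines (just a sketch of the flags, not my actual build script) keeps cubins for the older cards while still embedding the PTX that the driver can JIT for Fermi:

    nvcc -gencode=arch=compute_11,code=sm_11 \
         -gencode=arch=compute_12,code=sm_12 \
         -gencode=arch=compute_12,code=compute_12 \
         -o app kernel.cu
    # sm_11/sm_12 cards use their cubins; Fermi JIT-compiles the embedded compute_12 PTX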

Have I missed something? I thought Fermi could only run sm_20/sm_21 code and that if it wasn’t present (either fully compiled or as PTX) then it just wouldn’t work?

I’m compiling with -arch=sm_12.
I guess, however, that some of the Fermi hardware improvements come into play, as my code is using the L1 cache (according to the CUDA profiler).
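
If it really is the L1 that makes the difference, one thing I might try (a sketch, assuming the Fermi cache-config call in the CUDA 3.x runtime; myKernel is a placeholder) is asking for the larger L1 split:

    // Fermi splits 64 KB per SM between L1 and shared memory;
    // this hint requests the 48 KB L1 / 16 KB shared configuration for this kernel
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);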

The driver should JIT-recompile/translate code compiled for older architectures when it is run on Fermi.
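
As a rough sketch of how to poke at that (assuming a driver recent enough to support these environment variables; app_sm12 is just the binary from the earlier example):

    # ignore any embedded cubins and force JIT compilation of the embedded PTX
    export CUDA_FORCE_PTX_JIT=1
    # optionally redirect the JIT compile cache so the generated code is easy to find
    export CUDA_CACHE_PATH=/tmp/my_jit_cache
    ./app_sm12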

By default, CUDA SDK 2.3 embeds the PTX in the binaries. Earlier SDKs like 2.0 didn’t do this yet, so those older binaries won’t run on Fermi.
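
One way to check what a given executable actually carries (assuming you have cuobjdump from a recent toolkit; myapp is just an example name):

    # dump the PTX embedded in the fat binary, if any
    cuobjdump -ptx ./myapp
    # disassemble the precompiled cubins (the output headers show the target architectures)
    cuobjdump -sass ./myapp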

So you are using -arch=sm_12, which is equivalent to -gencode=arch=compute_12,code=sm_12 -gencode=arch=compute_12,code=compute_12.

In other words, your program contains an sm_12 cubin and sm_12 PTX, and when you run it on Fermi the sm_12 PTX gets JIT-compiled to an sm_20 cubin (ELF).

It would be a bit like -gencode=arch=compute_12,code=compute_20, if that is even a legal command line. I wonder if there is any way to get at the resulting cubin (ELF) to compare them (they are cached somewhere, aren’t they?). I know decuda doesn’t work on ELF, but I think I saw mention of an alternative (I don’t know how well it works).
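
One way to do the comparison offline, I think, is to run ptxas by hand, since that is essentially what the driver’s JIT does (a sketch; the file names are made up, and I’m assuming cuobjdump from a recent toolkit, which can disassemble the ELF cubins and may be the decuda alternative you saw mentioned):

    # keep the compute_12 PTX that -arch=sm_12 embeds
    nvcc -arch=sm_12 -ptx kernel.cu -o kernel.compute_12.ptx

    # do offline what the driver does at load time: compile that PTX for Fermi
    ptxas -arch=sm_20 kernel.compute_12.ptx -o kernel.jit.cubin

    # build a native sm_20 cubin for comparison
    nvcc -arch=sm_20 -cubin kernel.cu -o kernel.sm_20.cubin

    # disassemble both and diff the SASS
    cuobjdump -sass kernel.jit.cubin > jit.sass
    cuobjdump -sass kernel.sm_20.cubin > native.sass
    diff jit.sass native.sass

As for the cache: on Linux the driver’s JIT results end up under ~/.nv/ComputeCache by default (or wherever CUDA_CACHE_PATH points), though I don’t know how readable those blobs are.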
