GTX 465 performance

I’m also testing my code on different devices, and I found that compiling my kernel with sm_12 on my GTX 465 gives code that runs more than 2.2 times faster than when I compile it with sm_20.
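
For anyone who wants to reproduce the comparison, the two builds would look roughly like this (kernel.cu is just a placeholder for my source file):

    # build for compute capability 1.2 (sm_12 cubin plus compute_12 PTX)
    nvcc -O3 -arch=sm_12 -o app_sm12 kernel.cu
    # build natively for Fermi
    nvcc -O3 -arch=sm_20 -o app_sm20 kernel.cu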

That’s really interesting, I’ll try that as well. Initially I was thinking that maybe it has fewer SMs, and that if the code diverges it takes the SM longer to issue instructions. But I have a piece of code that has no divergent branches and it’s still slower than on the GTX 260.
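
Just to illustrate what I mean by divergence, a toy kernel (not my actual code): when threads of the same warp take different sides of a branch, the SM has to issue both paths one after the other.

    __global__ void divergent(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // odd and even threads of the same warp take different branches,
        // so the warp executes both paths serially
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;
    }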

As a reference, I started playing with the samples from the SDK. On your 465, Magorath, what results do you get for the samples I mentioned?

For this reason I am sticking with sm_11/12 for now. I find that my binaries compiled with CUDA SDK 2.3 still run great on Fermi, so I won’t upgrade my production environment. I have only set up a single installation with CUDA SDK 3.1 for evaluation purposes.
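
For what it’s worth, a build along these lines (just a sketch of the flags, not my actual build script) keeps cubins for the older cards while still embedding the PTX that the driver can JIT for Fermi:

    nvcc -gencode=arch=compute_11,code=sm_11 \
         -gencode=arch=compute_12,code=sm_12 \
         -gencode=arch=compute_12,code=compute_12 \
         -o app kernel.cu
    # sm_11/sm_12 cards use their cubins; Fermi JIT-compiles the embedded compute_12 PTX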

Have I missed something? I thought Fermi could only run sm_20/sm_21 code and that if it wasn’t present (either fully compiled or as PTX) then it just wouldn’t work?

I’m compiling with -arch=sm_12.
I guess, however, that some of the Fermi hardware improvements come into play, as my code is using the L1 cache (according to the CUDA profiler).
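
If it really is the L1 that makes the difference, one thing I might try (a sketch, assuming the Fermi cache-config call in the CUDA 3.x runtime; myKernel is a placeholder) is asking for the larger L1 split:

    // Fermi splits 64 KB per SM between L1 and shared memory;
    // this hint requests the 48 KB L1 / 16 KB shared configuration for this kernel
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);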

The driver should JIT-recompile/translate code compiled for older architectures when it is run on Fermi.
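
As a rough sketch of how to poke at that (assuming a driver recent enough to support these environment variables; app_sm12 is just the binary from the earlier example):

    # ignore any embedded cubins and force JIT compilation of the embedded PTX
    export CUDA_FORCE_PTX_JIT=1
    # optionally redirect the JIT compile cache so the generated code is easy to find
    export CUDA_CACHE_PATH=/tmp/my_jit_cache
    ./app_sm12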

By default, CUDA SDK 2.3 embeds the PTX in the binaries. Earlier SDKs like 2.0 didn’t do this yet, so those older binaries won’t run on Fermi.
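
One way to check what a given executable actually carries (assuming you have cuobjdump from a recent toolkit; myapp is just an example name):

    # dump the PTX embedded in the fat binary, if any
    cuobjdump -ptx ./myapp
    # disassemble the precompiled cubins (the output headers show the target architectures)
    cuobjdump -sass ./myapp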

So you are using -arch=sm_12, which is equivalent to -gencode=arch=compute_12,code=sm_12 -gencode=arch=compute_12,code=compute_12.

In other words, your program contains an sm_12 cubin and sm_12 PTX, and when you run it on Fermi the sm_12 PTX gets JIT-compiled to an sm_20 cubin (ELF).

It would be a bit like -gencode=arch=compute_12,code=compute_20, if that is even a legal command line. I wonder if there is any way to get at the resulting cubin (ELF) to compare them (they are cached somewhere, aren’t they?). I know decuda doesn’t work on ELF, but I think I saw mention of an alternative (I don’t know how well it works).
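
One way to do the comparison offline, I think, is to run ptxas by hand, since that is essentially what the driver’s JIT does (a sketch; the file names are made up, and I’m assuming cuobjdump from a recent toolkit, which can disassemble the ELF cubins and may be the decuda alternative you saw mentioned):

    # keep the compute_12 PTX that -arch=sm_12 embeds
    nvcc -arch=sm_12 -ptx kernel.cu -o kernel.compute_12.ptx

    # do offline what the driver does at load time: compile that PTX for Fermi
    ptxas -arch=sm_20 kernel.compute_12.ptx -o kernel.jit.cubin

    # build a native sm_20 cubin for comparison
    nvcc -arch=sm_20 -cubin kernel.cu -o kernel.sm_20.cubin

    # disassemble both and diff the SASS
    cuobjdump -sass kernel.jit.cubin > jit.sass
    cuobjdump -sass kernel.sm_20.cubin > native.sass
    diff jit.sass native.sass

As for the cache: on Linux the driver’s JIT results end up under ~/.nv/ComputeCache by default (or wherever CUDA_CACHE_PATH points), though I don’t know how readable those blobs are.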
