Code works on GTX285, but not on GTS250

Hi All,

I have developed a graph cuts program using CUDA 3.0 on a GTX285 with compute capability 1.3, and it works well.

However, when I tried to run the identical program on another computer with a GTS250, it failed. I haven’t run it in debug mode, but it looks to me as if the kernel didn’t execute at all (my kernel usually takes 10 minutes to finish on the 285, but on the 250 it returns all zeros immediately).
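A kernel that returns immediately with all-zero output often means the launch itself failed, and the runtime can say why. Below is a minimal sketch of such a check, assuming the CUDA runtime API. `reportStatus` and `checkLaunch` are hypothetical helper names, not part of the original program. On the CUDA 3.0 toolkit used here, `cudaThreadSynchronize()` takes the place of `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a readable message when a runtime call
// failed and returns true only on success.
bool reportStatus(const char *label, cudaError_t err) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper: call right after a kernel launch.
// cudaGetLastError() catches errors raised by the launch itself
// (for example an invalid configuration); the synchronize call
// catches errors raised while the kernel was running.
bool checkLaunch(const char *label) {
    if (!reportStatus(label, cudaGetLastError())) {
        return false;
    }
    return reportStatus(label, cudaDeviceSynchronize());
}
```

Calling `checkLaunch("myKernel")` directly after `myKernel<<<grid, block>>>(...)` would print the error string instead of silently leaving the output buffer at zero. When a block asks for more resources than a multiprocessor has, the runtime typically reports "too many resources requested for launch".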

It seems that some hardware difference between the two cards is causing the problem. As far as I know, both cards support atomic operations, and the compute capabilities of the GTS250 and GTX285 are 1.1 and 1.3 respectively.

Can anyone shed some light on this problem?

By the way, I used atomic operations and didn’t use the double type, just float and int.

Thanks in advance!

Hello,

[EDIT: this is a mistake, don’t consider this message] :)
1.1 and 1.3 cards have different amounts of shared memory. You might check the third argument value in your kernel calls <<< X, Y, shared_amount >>>

No, they don’t - it is 16 KB per multiprocessor in each case. It might, however, be shared memory atomic support that is the problem. Shared memory atomic intrinsics are only available on devices of compute capability 1.2 or higher.
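The compute-capability requirement above can be checked at run time before launching anything. This is a minimal sketch, assuming the runtime API's `cudaGetDeviceProperties`. The function names are hypothetical, and the thresholds encode the rule that 32-bit global-memory atomics need compute capability 1.1 while shared-memory atomics need 1.2.

```cuda
#include <cuda_runtime.h>

// 32-bit atomics on global memory: compute capability 1.1 and higher.
bool supportsGlobalAtomics(int major, int minor) {
    return major > 1 || (major == 1 && minor >= 1);
}

// Atomics on shared memory: compute capability 1.2 and higher.
bool supportsSharedAtomics(int major, int minor) {
    return major > 1 || (major == 1 && minor >= 2);
}

// Hypothetical helper: queries the given device and reports whether
// shared-memory atomics are available on it. Returns false if the
// device properties cannot be read.
bool deviceSupportsSharedAtomics(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return supportsSharedAtomics(prop.major, prop.minor);
}
```

With these rules a compute 1.1 card such as the GTS250 passes the global-atomics check but fails the shared-atomics check, while a compute 1.3 card such as the GTX285 passes both.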

No, they do not - both have 16 KB of shared memory per SM. But of these two cards, only the 1.3 device has shared memory atomics.

Eddie, did you compile your code for sm_11 devices?

Oops, I was thinking about registers… but it doesn’t affect execution. Sorry!

The register file on a compute 1.1 device is smaller (8192 vs. 16384 registers per multiprocessor). You might be trying to launch more threads than the available registers allow, in which case the kernel won’t start.

For example, launching 512 threads at 32 registers each (512 × 32 = 16384 registers) works on a compute 1.2 device or higher, but not on a compute 1.1 device.
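The register arithmetic above can be sketched in code. This assumes the runtime calls `cudaFuncGetAttributes` (registers per thread of a compiled kernel) and `cudaGetDeviceProperties` (registers available per block). `graphCutKernel` is a hypothetical stand-in for the real kernel, and the helper names are made up for illustration. The plain product ignores the hardware's allocation granularity, so treat it as a rough estimate rather than an exact rule.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real graph-cut kernel.
__global__ void graphCutKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

// Rough check: does one block's register demand fit in the budget?
bool fitsRegisterFile(int threadsPerBlock, int regsPerThread,
                      int regsPerBlockLimit) {
    return threadsPerBlock * regsPerThread <= regsPerBlockLimit;
}

// Hypothetical helper: combines the kernel's compiled register count
// with the device's register budget. Returns false if either query
// fails or the block does not fit.
bool kernelFitsOnDevice(int device, int threadsPerBlock) {
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, graphCutKernel) != cudaSuccess) {
        return false;
    }
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return fitsRegisterFile(threadsPerBlock, attr.numRegs,
                            prop.regsPerBlock);
}
```

With the numbers from the post, `fitsRegisterFile(512, 32, 8192)` is false and `fitsRegisterFile(512, 32, 16384)` is true. Halving the block to 256 threads brings the demand down to 8192 registers, which fits the smaller register file in this simplified model.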

I think cbuchner might be right. Currently each block contains 512 threads, which might be too many for sm_11 cards.

I’ll post my experiment results with fewer threads in a block later today.

Thanks a lot