I developed a graph cuts program using CUDA 3.0 on a GTX 285 (compute capability 1.3), and it works well.
However, when I ran the identical program on another computer with a GTS 250, it failed. I didn't run it in debug mode, but it looks as though the kernel didn't execute at all: the kernel usually takes 10 minutes to finish on the 285, but on the 250 it returns all zeros immediately.
It seems that some hardware difference between the two cards is causing the problem. As far as I know, both cards support atomic operations, and the compute capabilities of the GTS 250 and GTX 285 are 1.1 and 1.3 respectively.
Can anyone shed some light on this problem?
By the way, I used atomic operations and didn't use the double type, just float and int.
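A kernel that "returns immediately" with an untouched output buffer often means the launch itself was rejected, and the runtime only reports that if it is asked. Below is a minimal sketch of such a check. The kernel `dummy_kernel` and the helper `launch_succeeded` are hypothetical names for illustration, not part of the graph cuts program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real graph cuts kernel (hypothetical).
__global__ void dummy_kernel(int *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1;
}

// Returns true if the most recent launch was accepted and ran to completion.
// `label` is only used in the diagnostic message.
bool launch_succeeded(const char *label)
{
    // Launch-time failures (bad configuration, no kernel image for this GPU)
    // are reported by cudaGetLastError() right after the launch.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Failures while the kernel runs surface at the next synchronization.
        // (In CUDA 3.0 this call was named cudaThreadSynchronize().)
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int blocks = 4, threads = 64;
    int *d_out = 0;
    if (cudaMalloc((void **)&d_out, blocks * threads * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed (no usable CUDA device?)\n");
        return 1;
    }
    dummy_kernel<<<blocks, threads>>>(d_out);
    bool ok = launch_succeeded("dummy_kernel");
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

Calling the same helper after the real kernel launch on the GTS 250 should show whether the launch was refused and, if so, with which error string.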
[EDIT: this is a mistake, don't consider this message] :)
1.1 and 1.3 cards have different amounts of shared memory. You might check the value of the third argument in your kernel calls <<< X, Y, shared_amount >>>
No, they don't: it is 16 KB per multiprocessor in each case. The problem might, however, be shared memory atomic support. Shared memory atomic intrinsics are only available on devices of compute capability 1.2 or higher.
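To see which side of that line a given card falls on, the compute capability can be queried at run time. This is only a sketch: the helper names `supports_shared_atomics`, `supports_global_atomics` and `report_atomic_support` are made up for illustration, and the version thresholds encode the rule above plus the fact that 32-bit global memory atomics start at compute 1.1.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Shared-memory atomics need compute capability 1.2 or higher.
bool supports_shared_atomics(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 2);
}

// 32-bit global-memory atomics need compute capability 1.1 or higher.
bool supports_global_atomics(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 1);
}

// Prints the capability of `device`; returns false if the query fails.
bool report_atomic_support(int device)
{
    assert(device >= 0);
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return false;
    }
    printf("%s: compute %d.%d, global atomics: %s, shared atomics: %s\n",
           prop.name, prop.major, prop.minor,
           supports_global_atomics(prop.major, prop.minor) ? "yes" : "no",
           supports_shared_atomics(prop.major, prop.minor) ? "yes" : "no");
    return true;
}
```

Calling `report_atomic_support(0)` from the host on each machine would show whether the GTS 250 lacks the shared memory atomics the kernel relies on, assuming the kernel does use them on shared memory.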
The register file on a compute 1.1 device is smaller (8192 vs. 16384 registers per multiprocessor). You might be launching too many threads for the number of registers available, in which case the kernel won't start.
For example, launching 512 threads in a block at 32 registers each (512 × 32 = 16384 registers) works on a compute 1.2 device or higher, but not on a compute 1.1 device.
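The register arithmetic can be checked against the real numbers for a kernel and a device. The following is a rough sketch: `example_kernel`, `block_fits_registers` and `report_register_budget` are hypothetical names, and the simple multiplication ignores the hardware's register allocation granularity, so treat the result as an estimate.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Rough estimate: does one block fit in the per-block register limit?
// Ignores allocation granularity, so borderline cases may differ on real hardware.
bool block_fits_registers(int threads_per_block, int regs_per_thread, int regs_limit)
{
    assert(threads_per_block > 0 && regs_per_thread >= 0);
    return threads_per_block * regs_per_thread <= regs_limit;
}

// Placeholder kernel standing in for the real one (hypothetical).
__global__ void example_kernel(float *data)
{
    data[threadIdx.x] *= 2.0f;
}

// Prints the register budget for `example_kernel` on `device`.
// Returns false if a query fails or the block is estimated not to fit.
bool report_register_budget(int device, int threads_per_block)
{
    cudaDeviceProp prop;
    cudaFuncAttributes attr;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err == cudaSuccess) {
        err = cudaFuncGetAttributes(&attr, example_kernel);
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    bool fits = block_fits_registers(threads_per_block, attr.numRegs, prop.regsPerBlock);
    printf("%d threads x %d regs = %d of %d registers: %s\n",
           threads_per_block, attr.numRegs, threads_per_block * attr.numRegs,
           prop.regsPerBlock, fits ? "fits" : "too many");
    return fits;
}
```

The per-thread register count is also printed at compile time when building with `nvcc --ptxas-options=-v`. If the block does not fit, using fewer threads per block or capping registers with `-maxrregcount` are the usual ways out.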