Bitonic Sort in the SDK demo: what's the bottleneck?

The bitonic sort demo in the CUDA SDK can sort at most 512 int elements. Why?

512 ints take only 4 * 512 = 2048 bytes of memory, and there are 16384 bytes of shared memory.

A grid can have up to 65535 blocks per dimension, but the 512-element bitonic sort uses only 512 threads.

See the deviceQuery result:

Device 0: “GeForce 8800 GT”
Major revision number: 1
Minor revision number: 1
Total amount of global memory: 536150016 bytes
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1512000 kilohertz

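Because shared memory can only be accessed by the threads of a single block, and the maximum number of threads per block is 512. The demo sorts in shared memory with one thread per element, so one block can handle at most 512 elements.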
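To make the arithmetic concrete, here is a rough sketch of how that limit falls out of the device properties. The helper names `maxElementsPerBlock` and `printSortLimit` are made up for illustration (they are not part of the SDK), and the sketch assumes one thread per element with the whole tile held in shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the SDK): the largest tile one block can sort
// when each thread owns exactly one element and the tile lives in shared memory.
size_t maxElementsPerBlock(size_t maxThreads, size_t sharedBytes, size_t elemSize)
{
    size_t fitInShared = sharedBytes / elemSize;   // capacity of shared memory
    return maxThreads < fitInShared ? maxThreads : fitInShared;
}

// Hypothetical helper: query device 0 and print the resulting limit for int keys.
void printSortLimit()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("could not query device 0\n");
        return;
    }
    size_t limit = maxElementsPerBlock(prop.maxThreadsPerBlock,
                                       prop.sharedMemPerBlock,
                                       sizeof(int));
    printf("%s: at most %lu ints per block\n", prop.name, (unsigned long)limit);
}
```

With the numbers from the deviceQuery output above (512 threads, 16384 bytes), shared memory could hold 4096 ints, so the thread count is the tighter bound and the limit is 512.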

Hi, I'm learning CUDA and I'm interested in bitonic sort. How can I solve this problem? I want to modify the bitonic sort example to sort more than 512 elements. Thanks.

One simple (although not very efficient) method is to use multiple passes of the per-block bitonic sort, and offset the start of the blocks by half the block size on the even passes. The offset allows communication across the block boundaries. This is a kind of hybrid odd-even bitonic sort.
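A minimal sketch of that multi-pass idea, in case it helps. The names `bitonicSortTile` and `sortOnDevice` are my own, not the SDK's. It assumes the tile size is a power of two and the array length is a multiple of the tile size, and it simply runs a fixed number of passes instead of checking for convergence.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Sort one tile of blockDim.x ints in shared memory, one thread per element.
// 'offset' shifts where the tiles start in the global array.
__global__ void bitonicSortTile(int *data, int offset)
{
    extern __shared__ int tile[];
    const unsigned int tid  = threadIdx.x;
    const unsigned int base = offset + blockIdx.x * blockDim.x;

    tile[tid] = data[base + tid];
    __syncthreads();

    for (unsigned int k = 2; k <= blockDim.x; k *= 2) {       // bitonic sequence length
        for (unsigned int j = k / 2; j > 0; j /= 2) {         // compare distance
            const unsigned int partner = tid ^ j;
            if (partner > tid) {                              // lower index does the swap
                const bool ascending = (tid & k) == 0;
                const int a = tile[tid], b = tile[partner];
                if (ascending ? (a > b) : (a < b)) {
                    tile[tid]     = b;
                    tile[partner] = a;
                }
            }
            __syncthreads();
        }
    }
    data[base + tid] = tile[tid];
}

// Hypothetical host wrapper: alternate aligned passes and passes shifted by
// half a tile. Returns false if the input does not meet the assumptions.
bool sortOnDevice(std::vector<int> &values, int tileSize = 512)
{
    const int n = (int)values.size();
    if (n == 0 || (tileSize & (tileSize - 1)) != 0 || n % tileSize != 0)
        return false;

    const int    tiles = n / tileSize;
    const size_t bytes = n * sizeof(int);
    const size_t smem  = tileSize * sizeof(int);

    int *d = NULL;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, values.data(), bytes, cudaMemcpyHostToDevice);

    // The array has 2 * tiles half-tile chunks; odd-even transposition over
    // that many chunks needs at most that many passes.
    for (int pass = 0; pass < 2 * tiles; ++pass) {
        if (pass % 2 == 0)
            bitonicSortTile<<<tiles, tileSize, smem>>>(d, 0);
        else if (tiles > 1)
            bitonicSortTile<<<tiles - 1, tileSize, smem>>>(d, tileSize / 2);
    }

    cudaError_t err = cudaMemcpy(values.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess;
}
```

Calling `sortOnDevice(v)` on a host vector of, say, 4096 ints launches 16 passes. The pass count grows linearly with the array length, which is why this approach is simple but not very efficient.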

Anyway, I would recommend reading about parallel sorting networks.