Need someone with compute capability 1.2

I need to run two programs that use atomics, but my card only has compute capability 1.0.
Can anyone with compute capability 1.2 run them for me and post the results?
It would really help me!
Thanks

I have attached the files mp2-part1a.cu, mp2-part1b.cu, and mp2-util.h.
mp2-util.h (813 Bytes)
mp2-part1b.cu (12.4 KB)
mp2-part1a.cu (9.89 KB)

Sure, no prob. But, compute 1.2, really? I’m running it on GTX 480 (compute 2.0). Atomic ops in L2 cache run extremely fast on this hardware :)

$ nvcc -arch=sm_20 -o mp2-part1a mp2-part1a.cu 

$ ./mp2-part1a 

kernel took 15.6 ms

cpu binning took 1885.0 ms

Worked! CUDA and reference output match. 

$ nvcc -arch=sm_20 -o mp2-part1b mp2-part1b.cu 

$ ./mp2-part1b 

coarse_binning took 22.4 ms

cpu binning took 1904.9 ms

Worked! CUDA and reference output match.

By the way, I looked at your code. The binning code you have in part1a.cu is the same as what I use, and it is the fastest I’ve been able to come up with in 3 years of trying. It works great for all bin sizes, particle densities, etc… Well, at least it is the fastest on compute 2.0 hardware. The lack of L2 atomics makes it quite slow on sm_1x hardware, where I use a different technique.

Thanks a lot, DrAnderson42, I really needed that info :D

I tried to improve on part1a by using hierarchical atomics in part1b. Next I will try a scan operation in place of the second atomicAdd (the first one is on shared memory, the second is on global memory); it is supposed to be faster.
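For readers without the attachments, here is a minimal sketch of what "hierarchical atomics" usually means: each block first counts into a shared-memory histogram with atomicAdd, then merges its partial counts into global memory with a second atomicAdd. This is not the code from mp2-part1b.cu. The kernel name `hierarchical_bin_count`, the constant `NUM_BINS`, the precomputed `bin_index` array, and the launch configuration are all assumptions made for illustration, and the sketch only counts particles per bin rather than storing them. Shared-memory atomics need compute capability 1.2 or higher, which is presumably why a 1.0 card cannot run this kind of kernel.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NUM_BINS 64  // assumption: few enough bins to fit in shared memory

// Hypothetical kernel: per-block shared-memory counts, then one global merge.
__global__ void hierarchical_bin_count(const unsigned int *bin_index, // bin of each particle
                                       unsigned int *bin_counts,      // NUM_BINS counters, zeroed
                                       int n)
{
    __shared__ unsigned int local_counts[NUM_BINS];

    // Zero the block-local histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local_counts[b] = 0;
    __syncthreads();

    // First level: atomicAdd on shared memory (grid-stride loop over particles).
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_counts[bin_index[i]], 1u);
    __syncthreads();

    // Second level: at most one atomicAdd on global memory per bin per block.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        if (local_counts[b] > 0)
            atomicAdd(&bin_counts[b], local_counts[b]);
}

int main()
{
    const int n = 1 << 20;
    unsigned int *h_idx = (unsigned int *)malloc(n * sizeof(unsigned int));
    unsigned int h_ref[NUM_BINS] = {0};
    for (int i = 0; i < n; ++i) {          // random bins plus a CPU reference count
        h_idx[i] = rand() % NUM_BINS;
        ++h_ref[h_idx[i]];
    }

    unsigned int *d_idx, *d_counts;
    cudaMalloc(&d_idx, n * sizeof(unsigned int));
    cudaMalloc(&d_counts, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_idx, h_idx, n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemset(d_counts, 0, NUM_BINS * sizeof(unsigned int));

    hierarchical_bin_count<<<64, 256>>>(d_idx, d_counts, n);

    unsigned int h_counts[NUM_BINS];
    cudaMemcpy(h_counts, d_counts, sizeof(h_counts), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int b = 0; b < NUM_BINS; ++b)
        if (h_counts[b] != h_ref[b]) ok = false;
    printf("%s\n", ok ? "GPU and CPU counts match" : "MISMATCH");

    cudaFree(d_idx);
    cudaFree(d_counts);
    free(h_idx);
    return ok ? 0 : 1;
}
```

The point of the two levels is that contention on the global counters drops from one atomic per particle to at most one per bin per block.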

I find the global-memory radix sort and/or scan methods to be slow, but your mileage may vary. My implementation that supports old hardware performs a sort and scan on blocks in shared memory, then performs only as many atomic adds per block as are needed. This works for me because I spatially sort particles so that particles near each other in index are likely to be in the same bins, which greatly reduces the number of atomic adds per block. For completely random particles, this method is slow.
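A rough sketch of the idea described above, under stated assumptions: each block sorts a tile of bin indices in shared memory, and only the first thread of each run of equal indices issues a single global atomicAdd for the whole run. This is not the poster's implementation. The kernel name `sorted_tile_bin_count`, the tile size `TILE`, and the test data are hypothetical, a bitonic sort is swapped in for whatever sort is actually used, and a simple forward walk to measure run length stands in for the scan step.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 256       // assumption: threads per block, must be a power of two
#define NUM_BINS 1024  // assumption: number of bins in the test data

// Hypothetical kernel: sort a tile in shared memory, one atomicAdd per run.
__global__ void sorted_tile_bin_count(const unsigned int *bin_index,
                                      unsigned int *bin_counts, int n)
{
    __shared__ unsigned int keys[TILE];
    const unsigned int INVALID = 0xFFFFFFFFu;  // padding, sorts to the end

    unsigned int tid = threadIdx.x;
    int i = blockIdx.x * TILE + tid;
    keys[tid] = (i < n) ? bin_index[i] : INVALID;
    __syncthreads();

    // Bitonic sort of the tile, ascending.
    for (unsigned int k = 2; k <= TILE; k <<= 1) {
        for (unsigned int j = k >> 1; j > 0; j >>= 1) {
            unsigned int partner = tid ^ j;
            if (partner > tid) {
                bool ascending = ((tid & k) == 0);
                unsigned int a = keys[tid], b = keys[partner];
                if ((a > b) == ascending) { keys[tid] = b; keys[partner] = a; }
            }
            __syncthreads();
        }
    }

    // The first thread of each run of equal keys adds the run length once.
    unsigned int key = keys[tid];
    bool head = (key != INVALID) && (tid == 0 || keys[tid - 1] != key);
    if (head) {
        unsigned int run = 1;
        while (tid + run < TILE && keys[tid + run] == key) ++run;
        atomicAdd(&bin_counts[key], run);
    }
}

int main()
{
    const int n = 1 << 20;
    unsigned int *h_idx = (unsigned int *)malloc(n * sizeof(unsigned int));
    unsigned int *h_ref = (unsigned int *)calloc(NUM_BINS, sizeof(unsigned int));
    for (int i = 0; i < n; ++i) {
        // Mimic spatially sorted particles: neighbours in index share a bin.
        h_idx[i] = (i / 100) % NUM_BINS;
        ++h_ref[h_idx[i]];
    }

    unsigned int *d_idx, *d_counts;
    cudaMalloc(&d_idx, n * sizeof(unsigned int));
    cudaMalloc(&d_counts, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_idx, h_idx, n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemset(d_counts, 0, NUM_BINS * sizeof(unsigned int));

    sorted_tile_bin_count<<<(n + TILE - 1) / TILE, TILE>>>(d_idx, d_counts, n);

    unsigned int *h_counts = (unsigned int *)malloc(NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(h_counts, d_counts, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int b = 0; b < NUM_BINS; ++b)
        if (h_counts[b] != h_ref[b]) ok = false;
    printf("%s\n", ok ? "GPU and CPU counts match" : "MISMATCH");

    cudaFree(d_idx);
    cudaFree(d_counts);
    free(h_idx); free(h_ref); free(h_counts);
    return ok ? 0 : 1;
}
```

With coherent input like the test data here, a tile of 256 particles touches only a handful of bins, so each block issues a few atomics. With random input nearly every thread is a run head and the sort is pure overhead, which matches the caveat about completely random particles.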

Thanks! I really appreciate your taking the time to help.
