Can anyone who has used the Histogram 256 example verify whether increasing the number of blocks in the grid makes the program crash in histogram256Kernel?
I increased BLOCK_N from 64 to 1024 and the program died with a ULF (unspecified launch failure). I just want to make sure it's not only me: I'm using CUDA 2.0b on an 8800 GT under Windows.
Is there a reason BLOCK_N = 64 was chosen, and is there a reason BLOCK_N = 1024 would fail?
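One way to confirm that the crash really is a launch failure (and to see exactly which launch fails) is to check the error status right after the launch and again after synchronizing, because errors raised while a kernel runs are only reported at the next synchronizing call. This is a minimal sketch, not the SDK's code: dummyKernel, checkLaunch and runWithBlocks are hypothetical names, and it calls cudaDeviceSynchronize, which later toolkits use in place of the cudaThreadSynchronize call from the CUDA 2.0 era.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; in practice this would be the histogram launch.
__global__ void dummyKernel(unsigned int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Checks the most recent launch in two stages:
// cudaGetLastError() catches launch/configuration errors,
// cudaDeviceSynchronize() catches errors raised while the kernel ran (e.g. a ULF).
cudaError_t checkLaunch(const char *name) {
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "%s failed: %s\n", name, cudaGetErrorString(err));
    return err;
}

// Launches the placeholder kernel with the given grid size.
// Returns 0 on success, 1 on a launch/execution error, -1 if allocation fails.
int runWithBlocks(int blocks, int threads) {
    int n = blocks * threads;
    unsigned int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(unsigned int)) != cudaSuccess) return -1;
    dummyKernel<<<blocks, threads>>>(d, n);
    cudaError_t err = checkLaunch("dummyKernel");
    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

Wrapping the real histogram launch the same way would show whether the failure comes from histogram256Kernel itself or from a later call.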
It's been a while since I looked at the histogram_256 code (and I'm not at my CUDA machine right now), but if I remember correctly, you can turn on atomics (if they aren't on already). You will also have to add the -arch=sm_11 option to the nvcc compile command.
With atomics on, the code skips the second "merge" kernel.
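To illustrate why atomics remove the merge step, here is a minimal sketch of both strategies for 8-bit input data. It is not the SDK's histogram256 code: the kernel names (partialHistogram, mergeHistogram, atomicHistogram) and the host helper histogramsMatch are hypothetical, and the non-atomic path uses a deliberately simple one-thread-per-bin scheme rather than the SDK's faster method. Global-memory atomicAdd needs compute capability 1.1 or higher, which the 8800 GT meets; this is why the architecture flag is required.

```cuda
#include <cuda_runtime.h>

#define BIN_COUNT 256  // one bin per possible byte value

// Without atomics, step 1: each block counts its own slice of the input into its
// own row of d_partial (gridDim.x rows x BIN_COUNT columns). Launch with BIN_COUNT
// threads per block; thread t owns bin t, so no two threads write the same counter.
__global__ void partialHistogram(unsigned int *d_partial, const unsigned char *d_data, int n) {
    int chunk = (n + gridDim.x - 1) / gridDim.x;
    int begin = blockIdx.x * chunk;
    int end = min(begin + chunk, n);
    unsigned int count = 0;
    for (int i = begin; i < end; ++i)
        if (d_data[i] == threadIdx.x) ++count;
    d_partial[blockIdx.x * BIN_COUNT + threadIdx.x] = count;
}

// Without atomics, step 2: the "merge" kernel sums each bin across all block rows.
__global__ void mergeHistogram(unsigned int *d_hist, const unsigned int *d_partial, int numBlocks) {
    unsigned int sum = 0;
    for (int b = 0; b < numBlocks; ++b)
        sum += d_partial[b * BIN_COUNT + threadIdx.x];
    d_hist[threadIdx.x] = sum;
}

// With global atomics: a single kernel. d_hist must be zeroed first; every thread
// adds directly into the final bins, so no partial buffer or merge kernel is needed.
__global__ void atomicHistogram(unsigned int *d_hist, const unsigned char *d_data, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&d_hist[d_data[i]], 1u);
}

// Host helper: runs both paths on the same data and reports whether they agree.
bool histogramsMatch(const unsigned char *h_data, int n, int numBlocks) {
    unsigned char *d_data;
    unsigned int *d_partial, *d_two, *d_one;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_partial, numBlocks * BIN_COUNT * sizeof(unsigned int));  // grows with the grid
    cudaMalloc(&d_two, BIN_COUNT * sizeof(unsigned int));
    cudaMalloc(&d_one, BIN_COUNT * sizeof(unsigned int));
    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);
    cudaMemset(d_one, 0, BIN_COUNT * sizeof(unsigned int));

    partialHistogram<<<numBlocks, BIN_COUNT>>>(d_partial, d_data, n);  // two-kernel path
    mergeHistogram<<<1, BIN_COUNT>>>(d_two, d_partial, numBlocks);
    atomicHistogram<<<numBlocks, BIN_COUNT>>>(d_one, d_data, n);       // one-kernel path

    unsigned int two[BIN_COUNT], one[BIN_COUNT];
    cudaMemcpy(two, d_two, sizeof two, cudaMemcpyDeviceToHost);  // waits for the kernels
    cudaMemcpy(one, d_one, sizeof one, cudaMemcpyDeviceToHost);
    cudaFree(d_data); cudaFree(d_partial); cudaFree(d_two); cudaFree(d_one);
    for (int i = 0; i < BIN_COUNT; ++i)
        if (two[i] != one[i]) return false;
    return true;
}
```

In this sketch the two-kernel path keeps one 256-bin row per block, so its scratch buffer scales with the number of blocks, while the atomic path always writes into the same 256 final bins.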