I am trying to run a kernel with grid (500,1,1) and block (500,1,1). When run on hardware, the code completes, but I’m not getting the expected results when I copy them back from device memory. I switched to device emulation. Now only 8 threads get past a __syncthreads() barrier, and the rest are stuck at it. There is a piece of code before __syncthreads() that copies some data from global memory to shared memory. If I comment out the __syncthreads() statement, then only those same 8 threads go forward and produce results. I can be more specific about the code, but I don’t understand why only 8 threads complete with __syncthreads() commented out. How many threads will CUDA generate? Isn’t it 500*500? In gdb I can see all those threads being created, but only 8 of them exit.
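A minimal sketch of the pattern described above (500 blocks of 500 threads each, a global-to-shared copy followed by a barrier) might look like the following. This is not the actual code: the kernel name `stage_and_shift`, the buffers, and the neighbour read after the barrier are assumptions made up for illustration. It also assumes a current CUDA runtime (`cudaDeviceSynchronize`), and it checks for launch and execution errors before copying results back, since an unchecked failure can also show up as wrong results.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 500   // threads per block, as in block (500,1,1)
#define GRID  500   // blocks per grid, as in grid (500,1,1)

// Hypothetical kernel: each thread stages one element in shared memory,
// waits at the barrier, then reads the element staged by its neighbour.
__global__ void stage_and_shift(const float *in, float *out)
{
    __shared__ float tile[BLOCK];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[gid];
    // The barrier is outside any branch, so every thread of the block reaches it.
    __syncthreads();

    out[gid] = tile[(threadIdx.x + 1) % blockDim.x];
}

int main()
{
    const int n = GRID * BLOCK;              // 500 * 500 = 250000 threads in total
    const size_t bytes = n * sizeof(float);

    float *h_in = new float[n];
    float *h_out = new float[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    stage_and_shift<<<GRID, BLOCK>>>(d_in, d_out);

    // Check the launch itself, then wait for the kernel and check execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // Compare against the same shift computed on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        int block = i / BLOCK, t = i % BLOCK;
        if (h_out[i] != h_in[block * BLOCK + (t + 1) % BLOCK]) ++mismatches;
    }
    printf("threads launched: %d, mismatches: %d\n", n, mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    delete[] h_in;
    delete[] h_out;
    return mismatches != 0;
}
```

If the real kernel calls the barrier inside an `if` or a loop that not every thread of a block enters, that alone could leave threads waiting, so it may be worth comparing against a version shaped like this one.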
Thanks for any input.