512 is the maximum number of threads in a block. In practice, the usable number depends on the resource usage of your kernel. If your kernel uses more than 8192/512 = 16 registers per thread, you cannot use 512 threads per block, because a multiprocessor has only 8192 registers available.
A multiprocessor can actually support 768 threads, so if you make your thread blocks 256 threads each, you could run 3 blocks per multiprocessor (but then your kernel cannot use more than 10 registers per thread, since 8192/768 = 10.67).
These numbers are different for GT200: there you can have 1024 threads per MP and you have 16384 registers available.
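To make the arithmetic above easy to redo for other block sizes, here is a rough host-side sketch. The helper names max_regs_per_thread and resident_blocks are made up for this post, not part of CUDA. It only does the divisions quoted above and ignores other limits such as register allocation granularity, shared memory, and the cap on blocks per multiprocessor, so treat the results as upper bounds.

```cpp
// Hypothetical helpers (not a CUDA API): plain integer arithmetic on the
// per-multiprocessor limits quoted in this post.

// Largest register count per thread that still lets `threads_per_mp`
// threads be resident on one multiprocessor. Integer division rounds down.
constexpr int max_regs_per_thread(int regs_per_mp, int threads_per_mp) {
    return regs_per_mp / threads_per_mp;
}

// Number of whole blocks that fit in the thread limit of one multiprocessor.
// Assumption: threads are the only limiting resource.
constexpr int resident_blocks(int threads_per_mp, int threads_per_block) {
    return threads_per_mp / threads_per_block;
}

// Figures from the text: 8192 registers and 768 threads per multiprocessor.
static_assert(max_regs_per_thread(8192, 512) == 16, "512 threads: 16 registers");
static_assert(max_regs_per_thread(8192, 768) == 10, "768 threads: 10 registers");
static_assert(resident_blocks(768, 256) == 3, "three blocks of 256 threads");

// GT200 figures from the text: 16384 registers and 1024 threads.
static_assert(max_regs_per_thread(16384, 1024) == 16, "GT200: 16 registers");
```

The actual register count of a kernel is reported by nvcc when you compile with --ptxas-options=-v, which gives you the number to compare against these budgets.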
So for your problem it is probably wise to make thread blocks of 256 threads. Then you use 2 blocks and do something like the following in your kernel:
unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
if (index < num_particles)