How can I avoid creating queues between threads?

Hi people, I have a Tesla S2050 and I need to work with a large amount of data. What is the right number of threads per block, and of blocks per kernel launch, so that threads don't end up queuing to read? I have 448 cores, and the number of data elements I have to process is much larger than that. Thanks a lot!

Consider reading Chapter 4 of the CUDA C Programming Guide (to start) and the CUDA C Best Practices Guide.
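The main point those guides make is that you do not need one thread per core. Threads are scheduled in warps of 32, and each multiprocessor switches between its resident warps to hide memory latency. Launching far more threads than the 448 cores is therefore normal and usually desirable. The following is a minimal sketch of one common pattern, not the only correct launch configuration. It assumes CUDA 6.5 or later, because it uses the runtime occupancy helper `cudaOccupancyMaxPotentialBlockSize` to pick a block size. The kernel `scale` and the helper `scale_on_gpu` are hypothetical names used only for illustration. A grid-stride loop lets a fixed-size grid cover any amount of data.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Print the error and abort if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Hypothetical example kernel: multiplies n floats in place by `factor`.
// Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
// so the grid size does not have to match the data size.
__global__ void scale(float *data, float factor, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Hypothetical helper: copies `host` to the GPU, scales it, copies it back.
// Returns true if the kernel launched and completed without error.
bool scale_on_gpu(std::vector<float> &host, float factor)
{
    const size_t n = host.size();
    float *dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice));

    // Let the runtime suggest a block size that maximizes occupancy for this kernel.
    // minGridSize is the smallest grid that can reach that occupancy on every SM.
    int minGridSize = 0, blockSize = 0;
    CUDA_CHECK(cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0));

    scale<<<minGridSize, blockSize>>>(dev, factor, n);
    cudaError_t launchErr = cudaGetLastError();       // launch-configuration errors
    cudaError_t syncErr = cudaDeviceSynchronize();    // errors raised while running

    CUDA_CHECK(cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));
    return launchErr == cudaSuccess && syncErr == cudaSuccess;
}

int main()
{
    // Many more elements than the GPU has cores.
    std::vector<float> data(1 << 24, 1.0f);
    if (!scale_on_gpu(data, 2.0f)) {
        std::fprintf(stderr, "kernel failed\n");
        return EXIT_FAILURE;
    }
    for (float x : data) {
        if (x != 2.0f) {
            std::fprintf(stderr, "wrong result\n");
            return EXIT_FAILURE;
        }
    }
    std::printf("ok\n");
    return 0;
}
```

The design choice here is to size the grid for the hardware, using enough blocks to keep every multiprocessor busy, rather than for the data, and to let the loop absorb the remainder. With plain one-thread-per-element indexing you would instead launch `(n + blockSize - 1) / blockSize` blocks. That also works, because the hardware queues blocks that are not yet resident and runs them as earlier blocks finish. On older toolkits without the occupancy API, the usual fallback is to choose a block size that is a multiple of 32 (for example 128 or 256) and compare the options with a profiler.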