Thread concurrency understanding validation

Hi all,

As a newbie in the GPGPU world, I’d like to validate my understanding with you.
My goal is to maximise parallel computation.
I have a GeForce GTX 850M. Thanks to the deviceQuery source example, I have 5 multiprocessors with 128 CUDA cores each, so 640 CUDA cores in total.

My understanding:

For a given instruction, 640 threads can execute it at the same time.
Correct?

As the warp is the minimum group of threads, I can run (640/32) 20 blocks of 32 threads at the same time. Then I can run 20 streams of 32 threads at the same time.
Correct?

As streams can run concurrently, I can have 20 streams running 20 different kernels (1 kernel / stream) at the same time.
Correct?

Thank you all in advance.

Regards

All of these questions are answered in various places on these forums and around the web, and/or in the CUDA programming guide.

However, none of them is the right way to start out thinking about CUDA programming.

Do not worry about the parallel width of your machine, especially if you are a newcomer to CUDA. Write CUDA programs that consist of lots of threads, like 20,000 threads or more. Those codes will run nicely on your GTX 850M.
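To make that concrete, here is a minimal sketch of what "lots of threads" looks like in practice (the kernel name, sizes, and use of managed memory are my own choices for illustration, not from this thread): you launch one thread per data element, far more than the 640 cores, and the hardware scheduler keeps the SMs busy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element; ~1M threads on a 640-core GPU is normal.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;                 // ~1M elements -> ~1M threads
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);          // unified memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int block = 256;
    int grid  = (n + block - 1) / block;   // 4096 blocks of 256 threads
    vecAdd<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Note that the launch configuration is sized to the problem (n), not to the core count; the same code runs unchanged on a GPU with 640 cores or 6400.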

Don’t start out by using streams to attempt to fill your machine. Streams are used to arrange concurrency between operations, for example overlapping data transfers with kernel execution. Don’t use concurrent kernels as a way to fill the machine if you can avoid it; that is a last resort.
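For reference, the typical use of streams looks like this sketch (kernel, chunk sizes, and the two-stream ping-pong scheme are my own illustrative choices): copies for one chunk overlap the kernel working on another, rather than streams being used to multiply thread count.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int nChunks = 4, chunkN = 1 << 18;
    size_t chunkBytes = chunkN * sizeof(float);

    float *h;                                   // pinned host memory, needed for
    cudaMallocHost(&h, nChunks * chunkBytes);   // cudaMemcpyAsync to overlap
    for (int i = 0; i < nChunks * chunkN; ++i) h[i] = 1.0f;

    float *d[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&d[i], chunkBytes);
        cudaStreamCreate(&s[i]);
    }

    // Ping-pong between two streams so chunk k's copy can overlap
    // chunk k-1's kernel on the other stream.
    for (int k = 0; k < nChunks; ++k) {
        int b = k % 2;
        cudaMemcpyAsync(d[b], h + (size_t)k * chunkN, chunkBytes,
                        cudaMemcpyHostToDevice, s[b]);
        scale<<<(chunkN + 255) / 256, 256, 0, s[b]>>>(d[b], chunkN);
        cudaMemcpyAsync(h + (size_t)k * chunkN, d[b], chunkBytes,
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %f\n", h[0]);

    for (int i = 0; i < 2; ++i) { cudaFree(d[i]); cudaStreamDestroy(s[i]); }
    cudaFreeHost(h);
    return 0;
}
```

Each stream still launches enough threads to fill the machine on its own; the streams exist to hide transfer latency, not to reach 640 threads.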

Thanks!