Concurrent kernel execution using multiple streams

I have written a kernel and want to use streams to achieve concurrent kernel execution and to overlap D2H data transfers. I observe both concurrent kernel execution and D2H transfer overlap when I use 8 streams and launch the kernel in each stream with 16 thread blocks of 128 threads each. However, when I increase the number of streams from 8 to 16 and reduce the number of thread blocks per stream from 16 to 8, the level of concurrency drops. That is, the kernels in the different streams execute one after the other.

Ideally, increasing the number of streams and reducing the number of blocks per kernel should allow more kernels to execute concurrently. For example, with 16 streams and 8 thread blocks of 128 threads per stream, four kernels should execute concurrently. Why does increasing the number of streams reduce the level of concurrency in kernel execution?

I am using a GT 640 GPU. The kernel uses 25 registers per thread and 1024 bytes of shared memory per thread block.
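For reference, here is a minimal sketch of the launch pattern described above, so the setup can be reproduced and profiled. It is not the original code: `placeholderKernel`, `runStreams`, the spin duration, and the buffer sizes are hypothetical stand-ins, and the 1024 bytes of shared memory are requested as dynamic shared memory only to mimic the stated footprint. It assumes each stream writes to its own device buffer and copies it back into pinned host memory.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Returns -1 from the enclosing function on any CUDA error (no cleanup; sketch only).
#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e)); return -1; } } while (0)

// Hypothetical stand-in for the real kernel: spins for a while so that
// overlap is visible in a profiler, then writes one value per thread.
__global__ void placeholderKernel(float *out, int n, long long spinCycles) {
    long long start = clock64();
    while (clock64() - start < spinCycles) { }
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Launches one kernel per stream followed by an async D2H copy per stream.
// Returns the number of wrong output elements, or -1 on a CUDA error.
int runStreams(int numStreams, int blocksPerStream) {
    const int threadsPerBlock = 128;      // block size from the post
    const size_t sharedBytes = 1024;      // shared memory per block from the post
    const long long spinCycles = 1000000; // assumed value, tune as needed
    const int n = blocksPerStream * threadsPerBlock;
    const size_t bytes = n * sizeof(float);

    std::vector<cudaStream_t> streams(numStreams);
    std::vector<float *> dBuf(numStreams), hBuf(numStreams);
    for (int s = 0; s < numStreams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
        CHECK(cudaMalloc(&dBuf[s], bytes));
        // Pinned host memory is required for copies to overlap with kernels.
        CHECK(cudaMallocHost(&hBuf[s], bytes));
    }

    // Issue all kernels first, then all copies (breadth-first order).
    for (int s = 0; s < numStreams; ++s) {
        placeholderKernel<<<blocksPerStream, threadsPerBlock, sharedBytes,
                            streams[s]>>>(dBuf[s], n, spinCycles);
        CHECK(cudaGetLastError());
    }
    for (int s = 0; s < numStreams; ++s)
        CHECK(cudaMemcpyAsync(hBuf[s], dBuf[s], bytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    CHECK(cudaDeviceSynchronize());

    int wrong = 0;
    for (int s = 0; s < numStreams; ++s) {
        for (int i = 0; i < n; ++i)
            if (hBuf[s][i] != 2.0f * i) ++wrong;
        CHECK(cudaFreeHost(hBuf[s]));
        CHECK(cudaFree(dBuf[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }
    return wrong;
}

int main() {
    // The two configurations compared in the post.
    printf("8 streams x 16 blocks: %d wrong\n", runStreams(8, 16));
    printf("16 streams x 8 blocks: %d wrong\n", runStreams(16, 8));
    return 0;
}
```

The sketch issues all kernels before all copies. On devices that feed every stream through a single hardware work queue, the order in which work is issued can affect how much overlap is achieved, so it may be worth comparing this against a per-stream kernel-then-copy order in a profiler timeline.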