Finding performance bottlenecks

I was wondering if anyone has good strategies for finding bottlenecks in CUDA applications. I am trying to draw conclusions from our test results, but am having trouble interpreting them.

For our particular application, we access a read-only dataset of 24 MB stored as a 1D texture. Each thread block needs an additional 16 MB to store its results. Since one access pattern is mapped to one multiprocessor, memory reads are coalesced for only about 1/3 of all reads. The code uses 21 registers per thread, so we can run 384 threads on a multiprocessor at once, or 192 threads with 2 blocks per multiprocessor.

We get the highest performance when running 16 thread blocks with 384 threads each. Decreasing the thread count from 384 to 320 per multiprocessor yields only a very small performance loss, but performance drops 10% when using 32 blocks with 192 threads each. Interleaving two thread blocks on one multiprocessor apparently offers no advantage and only adds overhead. Can I conclude from this that memory bandwidth is not the bottleneck?

When running only one thread block on the whole GPU, performance per multiprocessor is two and a half times as high as with 16 blocks. As far as I can tell, the only resource shared between the multiprocessors is the memory bus. So if memory is not the bottleneck, how can running fewer thread blocks yield higher performance per block?

When we disable all reads and writes to and from memory in a way the compiler cannot see (so nothing is optimized away), we get only a very small performance increase (about 10%) compared to doing all memory reads and writes. Increasing the dataset size to 48 MB, so that 2/3 of reads are coalesced, yields no performance increase; it actually costs some performance, since we have to upload additional data to the GPU.

We are very happy with the performance gain we have already achieved over our CPU version, but would like some certainty that this is all we can get out of our CUDA port. Any tips and/or information would be sincerely appreciated.