How does the number of thread blocks affect GPU performance?

Hi all,
This is my first post, so please excuse any mistakes.

I ran a kernel on a large file (200 MB) with 256 threads per block. Each thread does some computation on 16 bytes of data. While experimenting with different file sizes, I came across an interesting result:

For file sizes greater than 256 MB, the overall execution time of the program is around 8 times longer than for smaller files. I cannot work out why.

Here is the information about my video card:
NVS300:
global memory: 512 MB
No. of MP: 2
Maximum No. of threads per block: 512
Maximum size of each dimension of a grid: 65535 * 65535 * 1

  Thanks in advance.

It’s hard to say from the little info available. One thing that comes to my mind would be TLB misses.

As you have noticed, 256 MB is 65536 * 256 * 16 bytes, so it is exactly where a one-dimensional grid no longer suffices. What grid size are you using? Can you try running less than 256 MB of data in a two-dimensional grid?
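
To make the arithmetic concrete, here is a minimal host-side sketch in plain C++ (no GPU needed) of how the block count relates to the 65535 limit. It assumes 1 MB means 2^20 bytes. The names `GridDims`, `blocksNeeded`, `chooseGrid` and `linearBlock` are made up for illustration and are not taken from your code. The 16 bytes per thread, 256 threads per block and the 65535 limit are the figures from your post.

```cpp
#include <cstdint>

// Figures quoted in the thread.
constexpr std::uint64_t kBytesPerThread  = 16;     // each thread handles 16 bytes
constexpr std::uint64_t kThreadsPerBlock = 256;    // block size used by the poster
constexpr std::uint64_t kMaxGridDim      = 65535;  // per-dimension grid limit of the card

// Hypothetical stand-in for the x and y fields of CUDA's dim3.
struct GridDims {
    std::uint64_t x;
    std::uint64_t y;
};

// Number of blocks needed to cover `bytes` of input, rounded up.
constexpr std::uint64_t blocksNeeded(std::uint64_t bytes) {
    const std::uint64_t bytesPerBlock = kBytesPerThread * kThreadsPerBlock;  // 4096
    return (bytes + bytesPerBlock - 1) / bytesPerBlock;
}

// Use a one-dimensional grid when it fits, otherwise spill into y.
// Assumption: `blocks` never needs more than kMaxGridDim rows, which
// holds for anything that fits in 512 MB of global memory.
constexpr GridDims chooseGrid(std::uint64_t blocks) {
    if (blocks <= kMaxGridDim) {
        return {blocks, 1};
    }
    const std::uint64_t rows = (blocks + kMaxGridDim - 1) / kMaxGridDim;
    return {kMaxGridDim, rows};
}

// Linear block index, as a kernel would compute it from
// blockIdx.x, blockIdx.y and gridDim.x.
constexpr std::uint64_t linearBlock(std::uint64_t bx, std::uint64_t by,
                                    std::uint64_t gridX) {
    return by * gridX + bx;
}

// 200 MB fits in a one-dimensional grid; 256 MB is one block over the limit.
static_assert(blocksNeeded(200ULL << 20) == 51200, "200 MB needs 51200 blocks");
static_assert(blocksNeeded(256ULL << 20) == 65536, "256 MB needs 65536 blocks");
```

With these numbers, `chooseGrid(65536)` gives a 65535 * 2 grid, which launches 131070 blocks for 65536 blocks of work. A kernel using such a grid would have to compare its linear block index against the real block count and return early in the surplus blocks.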