Strange behavior when overlapping transfer and kernel execution!

I am trying to overlap data transfer over PCIe with kernel execution by splitting the input and output data into small chunks, so that each kernel launch only processes one chunk (a simplified sketch of my overlap loop is at the end of this post). The kernel execution time is almost equal to the combined transfer time to and from host memory, so I expect that by cutting the input data into small enough chunks, the total time (including data transfer) can be cut roughly in half. However, I found that when the data are cut into chunks of 1MB or smaller, the total execution time increases dramatically:

chunk size(MB)------>total time(ms)
64------------------->101
32------------------->78
16------------------->66
8-------------------->61
4-------------------->59
2-------------------> 56
1------------------->105
0.5------------------>105

I have tried different total input sizes, and the jump in execution time always happens at 1MB, so I assume it has nothing to do with the number of chunks, only the chunk size. Does the driver for my hardware have problems overlapping kernels with transfers smaller than 1MB?
I am using a 9800GX2, CentOS 5.2, and the 2.3 driver. Can anyone tell me what's happening here?
P.S. I have posted a similar question in another section, but no one has answered it. If this violates the rules, please do not delete this post.
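For reference, this is roughly the pattern I am using. It is a simplified sketch, not my exact code: myKernel is a stand-in for my real kernel, and the chunk/stream bookkeeping is trimmed down. The host buffers are allocated with cudaMallocHost (pinned), otherwise cudaMemcpyAsync would not overlap with anything.

#include <cuda_runtime.h>

#define NSTREAMS 2

// Stand-in for the real per-chunk kernel.
__global__ void myKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// h_in/h_out are pinned host buffers (cudaMallocHost) of totalBytes each.
void run_chunked(float *h_in, float *h_out, size_t totalBytes, size_t chunkBytes)
{
    int nChunks = (int)(totalBytes / chunkBytes);
    size_t chunkElems = chunkBytes / sizeof(float);

    // One device buffer pair per stream, so chunk i+1 can be copied in
    // while chunk i is still being processed.
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  NSTREAMS * chunkBytes);
    cudaMalloc((void**)&d_out, NSTREAMS * chunkBytes);

    cudaStream_t stream[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamCreate(&stream[s]);

    for (int i = 0; i < nChunks; ++i) {
        int s = i % NSTREAMS;
        float *din  = d_in  + s * chunkElems;
        float *dout = d_out + s * chunkElems;

        cudaMemcpyAsync(din, h_in + (size_t)i * chunkElems, chunkBytes,
                        cudaMemcpyHostToDevice, stream[s]);
        myKernel<<<(int)((chunkElems + 255) / 256), 256, 0, stream[s]>>>(
            din, dout, (int)chunkElems);
        cudaMemcpyAsync(h_out + (size_t)i * chunkElems, dout, chunkBytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaThreadSynchronize();   // wait for all streams (CUDA 2.x API)

    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamDestroy(stream[s]);
    cudaFree(d_in);
    cudaFree(d_out);
}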

DMA transfers to and from the device incur some startup cost. If the amount of data being transferred is not large enough to amortize that startup cost, then transferring many small segments will introduce significant overhead. If I remember correctly, we did a few experiments with a 280GTX that showed you needed to transfer around 1MB before the startup cost dropped below 5% of the total transfer time.
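If you want to see where the startup cost stops mattering on your own card, you can time plain host-to-device copies of increasing size with CUDA events, something along these lines (the sizes and iteration count here are arbitrary choices for illustration, not the numbers from our experiments):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t maxBytes = (size_t)64 << 20;      // up to 64 MB
    float *h_buf;
    float *d_buf;
    cudaMallocHost((void**)&h_buf, maxBytes);      // pinned, same as the overlap case
    cudaMalloc((void**)&d_buf, maxBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep transfer sizes from 64 KB up to 64 MB, doubling each time.
    for (size_t bytes = (size_t)64 << 10; bytes <= maxBytes; bytes <<= 1) {
        const int iters = 20;
        cudaEventRecord(start, 0);
        for (int i = 0; i < iters; ++i)
            cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbPerSec = (double)bytes * iters / (ms * 1.0e6);
        printf("%7lu KB : %8.3f ms per copy, %6.2f GB/s\n",
               (unsigned long)(bytes >> 10), ms / iters, gbPerSec);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}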

I see your point, and I expected some of that. However, going from 56ms to 101ms is obviously not caused by startup overhead, or at least that is not the main reason. On my own machine, transferring 1MB achieves almost the same bandwidth as transferring 10MB or more.

I suspect there may be some scheduling overhead involved in streams. Also, when overlapping execution and transfer, the total time seems unstable, with a large deviation, especially with newer drivers. Can someone from NVIDIA explain?

Thanks for your reply anyway!

I solved the performance deviation problem. It has something to do with CPU-thread affinity; see http://forums.nvidia.com/index.php?showtopic=104243
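In case anyone runs into the same thing: what fixed the run-to-run deviation for me was pinning the host thread to a single core before making any CUDA calls (or simply launching the program under taskset -c 0). A minimal sketch of the pinning call on Linux; the core index 0 is an arbitrary choice:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one CPU core (core 0 here is arbitrary). */
int pin_to_core(int core)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {   /* 0 = this thread */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}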

I downloaded the 3.0 beta driver, and performance seems to have improved to some extent:

chunk size(MB)------>total time(ms)
64------------------->101
32------------------->78
16------------------->66
8-------------------->61
4-------------------->59
2-------------------->56
1-------------------->63
0.5------------------>79
64KB----------------->157

However, this is still not right. For 1MB and 0.5MB chunks the total time still increases too much. Here is the result when not overlapping (a sketch of the synchronous baseline loop is below the table):

chunk size(MB)------>total time(ms)
64------------------->101
32------------------->101
16------------------->101
8-------------------->101
4-------------------->101
2-------------------->102
1-------------------->103
0.5------------------>104
64KB----------------->129

When using 64KB chunks, overlapping is slower than not overlapping.
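To be clear, by "not overlapping" I mean the plain synchronous per-chunk loop, roughly like this (same placeholder kernel and variable names as the sketch in my first post; this reuses that myKernel stub):

// Synchronous baseline: one chunk at a time, no streams.
void run_chunked_sync(float *h_in, float *h_out, size_t totalBytes, size_t chunkBytes)
{
    int nChunks = (int)(totalBytes / chunkBytes);
    size_t chunkElems = chunkBytes / sizeof(float);

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  chunkBytes);
    cudaMalloc((void**)&d_out, chunkBytes);

    for (int i = 0; i < nChunks; ++i) {
        cudaMemcpy(d_in, h_in + (size_t)i * chunkElems, chunkBytes,
                   cudaMemcpyHostToDevice);
        myKernel<<<(int)((chunkElems + 255) / 256), 256>>>(d_in, d_out, (int)chunkElems);
        cudaMemcpy(h_out + (size_t)i * chunkElems, d_out, chunkBytes,
                   cudaMemcpyDeviceToHost);
    }

    cudaFree(d_in);
    cudaFree(d_out);
}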

Could someone from NVIDIA help? It's really important!