I am trying to overlap data transfers over PCIe with kernel execution by splitting the input and output data into small chunks, so that each kernel launch processes only one chunk. The kernel execution time is almost equal to the combined transfer time to and from host memory. I therefore expected that, by cutting the input into small enough chunks, the total time (including data transfer) could be cut roughly in half. However, I found that when the data is cut into chunks of 1 MB or smaller, the total execution time increases dramatically:
chunk size (MB) -> total time (ms)
64 -> 101
32 -> 78
16 -> 66
8 -> 61
4 -> 59
2 -> 56
1 -> 105
0.5 -> 105
I have tried several different total input sizes, and in every case the execution time jumps at a chunk size of 1 MB, so I assume the problem is related to the chunk size rather than the number of chunks. Does the driver for my hardware have trouble overlapping kernels with transfers of 1 MB or smaller?
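To make the expected halving concrete, here is a rough model of the pipeline in Python. It is only a sketch, not the actual CUDA code. It assumes a single copy engine that serializes host-to-device and device-to-host transfers, one compute engine, no per-call overhead, and transfer time split evenly between the two directions. The function name, its parameters, and the 50 ms figures are illustrative assumptions.

```python
def pipelined_time(total_copy_ms, total_kernel_ms, n_chunks, h2d_fraction=0.5):
    """Estimate total time for n_chunks pipelined over one copy engine
    and one compute engine (idealized: zero launch/driver overhead)."""
    h2d = total_copy_ms * h2d_fraction / n_chunks          # upload time per chunk
    d2h = total_copy_ms * (1.0 - h2d_fraction) / n_chunks  # download time per chunk
    kernel = total_kernel_ms / n_chunks                    # kernel time per chunk

    copy_free = 0.0    # time at which the copy engine becomes idle
    kernel_free = 0.0  # time at which the compute engine becomes idle
    kernel_done = []   # completion time of each chunk's kernel

    for i in range(n_chunks):
        # Upload chunk i as soon as the copy engine is free.
        copy_free += h2d
        # Kernel i needs its input uploaded and the compute engine idle.
        kernel_free = max(kernel_free, copy_free) + kernel
        kernel_done.append(kernel_free)
        # Download the previous chunk's result once its kernel has finished.
        if i >= 1:
            copy_free = max(copy_free, kernel_done[i - 1]) + d2h

    # Download the last chunk's result.
    return max(copy_free, kernel_done[-1]) + d2h


if __name__ == "__main__":
    # Illustrative numbers only: 50 ms of copies and 50 ms of kernel work.
    for n in (1, 2, 4, 8, 16, 32, 64, 128):
        print(n, "chunks:", round(pipelined_time(50.0, 50.0, n), 2), "ms")
```

With equal copy and kernel totals T, this model gives about T * (1 + 1/n) for n chunks: 100 ms with one chunk, 75 ms with two, and it keeps approaching 50 ms as n grows. Under these assumptions, smaller chunks should never make the total worse, which is why the jump at 1 MB is surprising.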
I am using a 9800 GX2 on CentOS 5.2 with driver 2.3. Can anyone tell me what is happening here?
P.S. I posted a similar question in another section, but no one has answered it. If this violates the rules, please do not delete this post.