Problems using streams to overlap data transfer and kernel execution

I wrote a program to test the GPU's ability to overlap data transfers with kernel execution. The program has 3 parts: copying data to the GPU, computing, and copying the results back. To overlap the transfers with the kernel, I split the data into multiple parts and process each part separately, instead of copying all the data, running the whole computation, and then copying everything back. I tuned the compute kernel so that kernel time = host->device time + device->host time. Theoretically, the more parts the input is split into, the shorter the total time should be. Here is the result:
parts ----------> total time (ms)
1 --------------> 101
2 --------------> 78
4 --------------> 66
8 --------------> 61
32 -------------> 56
The results look fine for fewer than 64 parts. However, when the number of parts reaches 64, the total time unexpectedly increases. Can anyone explain this?
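
To put a number on "theoretically", here is a minimal sketch of an idealized pipeline model. It is only a model, not the actual test program. The function name ideal_total_ms is hypothetical, and it assumes that host->device copies, kernels and device->host copies of different parts overlap perfectly (one copy engine per direction), that all parts are equal in size, and that there is no per-part launch or synchronization overhead.

```cpp
#include <algorithm>
#include <cassert>

// Idealized estimate of total time when the work is split into `parts`
// equal chunks that flow through three stages: H2D copy, kernel, D2H copy.
// h2d_ms, kernel_ms, d2h_ms are the stage times for the WHOLE data set.
// Assumptions (not measured): stages of different chunks overlap fully,
// and there is zero per-chunk overhead.
double ideal_total_ms(double h2d_ms, double kernel_ms, double d2h_ms, int parts) {
    assert(parts >= 1);
    const double sum = h2d_ms + kernel_ms + d2h_ms;
    const double slowest = std::max({h2d_ms, kernel_ms, d2h_ms});
    // The first chunk passes through all three stages (sum / parts).
    // Each of the remaining (parts - 1) chunks adds one slot of the
    // slowest stage (slowest / parts).
    return (sum + (parts - 1) * slowest) / parts;
}
```

Taking the 101 ms single-part time and kernel time = host->device time + device->host time, and additionally assuming the two copies take equal time (25.25 ms each, 50.5 ms for the kernel), ideal_total_ms(25.25, 50.5, 25.25, n) gives about 75.8, 63.1, 56.8 and 52.1 ms for n = 2, 4, 8 and 32. That is a few ms below the measured values but follows the same trend. For n = 64 the model gives about 51.3 ms, still decreasing towards 50.5 ms, so whatever causes the increase at 64 parts lies outside this model (for example, some per-part cost that it ignores).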