The performance information printed on the console:

**** FreshBuffer[140154480080640] costs : 1330 microseconds

**** reshape_score[140154480080640] costs : 214 microseconds, stream: 0x7f7818001300

**** cudaMemcpyAsync[140154480080640] costs : 60 microseconds

**** FreshBuffer[140154480080640] costs : 793 microseconds

**** reshape_score[140154480080640] costs : 21090 microseconds, stream: 0x7f7818001300

**** cudaMemcpyAsync[140154480080640] costs : 63 microseconds

**** FreshBuffer[140154480080640] costs : 866 microseconds

**** reshape_score[140154480080640] costs : 20703 microseconds, stream: 0x7f7818001300

**** cudaMemcpyAsync[140154480080640] costs : 56 microseconds

**** FreshBuffer[140154480080640] costs : 731 microseconds

**** reshape_score[140154480080640] costs : 20890 microseconds, stream: 0x7f7818001300

**** cudaMemcpyAsync[140154480080640] costs : 53 microseconds

**** FreshBuffer[140154480080640] costs : 786 microseconds

**** reshape_score[140154480080640] costs : 20774 microseconds, stream: 0x7f7818001300

**** cudaMemcpyAsync[140154480080640] costs : 54 microseconds

The kernel is called from a single host thread:

    reshape_score((float*)_src.data, (float*)_src.reshape,
                  _detector->GetBatch(),
                  _elSize,
                  _skipSize,
                  _unitSize,
                  _src.stream);
    cudaStreamSynchronize(_src.stream);

    … other logic …

_src.data and _src.reshape are CUDA memory pointers to buffers of the same size, and that size is fixed.

As the console output shows, the first call of the kernel takes 214 microseconds (after synchronization), but every later call of the same kernel takes more than 20 milliseconds (after synchronization).
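To cross-check the host-side numbers, the GPU-side time of each launch could also be measured with CUDA events recorded on the same stream. The following is only a minimal sketch of that idea, not my actual code: `dummy_copy` is a hypothetical stand-in for the real kernel (whose body is not shown here), and the buffer size and launch configuration are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real reshape_score body is not shown in this post.
__global__ void dummy_copy(const float* src, float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

int main() {
    const int n = 1 << 20;  // assumed element count, for illustration only
    float *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    cudaMemset(src, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < 5; ++iter) {
        // Both events are recorded on the same stream as the kernel,
        // so they bracket only the work queued between them.
        cudaEventRecord(start, stream);
        dummy_copy<<<(n + 255) / 256, 256, 0, stream>>>(src, dst, n);
        cudaEventRecord(stop, stream);

        // Block the host until the stop event (and thus the kernel) has completed.
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
        printf("iteration %d: %.1f microseconds on the GPU\n", iter, ms * 1000.0f);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

If the event-based time stays small while the host-side time grows to about 20 milliseconds, the extra time is probably spent waiting on other work in the stream or on the host rather than in the kernel itself.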

I am really confused and don't know why this happens or how to fix it.

I hope to see your suggestions. Thanks.