The performance information printed to the console:
**** FreshBuffer[140154480080640] costs : 1330 microseconds
**** reshape_score[140154480080640] costs : 214 microseconds, stream: 0x7f7818001300
**** cudaMemcpyAsync[140154480080640] costs : 60 microseconds
**** FreshBuffer[140154480080640] costs : 793 microseconds
**** reshape_score[140154480080640] costs : 21090 microseconds, stream: 0x7f7818001300
**** cudaMemcpyAsync[140154480080640] costs : 63 microseconds
**** FreshBuffer[140154480080640] costs : 866 microseconds
**** reshape_score[140154480080640] costs : 20703 microseconds, stream: 0x7f7818001300
**** cudaMemcpyAsync[140154480080640] costs : 56 microseconds
**** FreshBuffer[140154480080640] costs : 731 microseconds
**** reshape_score[140154480080640] costs : 20890 microseconds, stream: 0x7f7818001300
**** cudaMemcpyAsync[140154480080640] costs : 53 microseconds
**** FreshBuffer[140154480080640] costs : 786 microseconds
**** reshape_score[140154480080640] costs : 20774 microseconds, stream: 0x7f7818001300
**** cudaMemcpyAsync[140154480080640] costs : 54 microseconds
The kernel is called from a single host thread:
reshape_score((float*)_src.data, (float*)_src.reshape,
              _detector->GetBatch(),
              _elSize,
              _skipSize,
              _unitSize,
              _src.stream);
cudaStreamSynchronize(_src.stream);
// … other logic …
_src.data and _src.reshape are CUDA memory pointers to buffers of the same size, and both buffers have a fixed size.
As the console output shows:
the first call of the kernel takes 214 microseconds (measured after synchronization),
but every later call of the same kernel takes more than 20 milliseconds (also measured after synchronization).
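One way to check whether those 20 milliseconds are spent inside the kernel or in waiting for work queued earlier on the same stream is to time the launch with CUDA events instead of a host timer. A host timer around the launch plus cudaStreamSynchronize also counts anything still pending on the stream, while a pair of events recorded on the stream only brackets the work issued between them. The following is a minimal sketch under assumptions, not the code from this post: copy_kernel, the buffer size n, and the launch configuration are hypothetical stand-ins for reshape_score and its arguments.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real reshape kernel: a plain element copy.
__global__ void copy_kernel(const float* src, float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

int main() {
    const int n = 1 << 20;  // assumed buffer size, for illustration only
    float* src = nullptr;
    float* dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    cudaMemset(src, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < 5; ++iter) {
        // Both events go into the same stream as the kernel, so they
        // bracket only the GPU work issued between them.
        cudaEventRecord(start, stream);
        copy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(src, dst, n);
        cudaEventRecord(stop, stream);

        // Block the host until the stop event (and the kernel) has completed.
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // result is in milliseconds
        printf("iter %d: %.1f microseconds on the GPU\n", iter, ms * 1000.0f);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

If the event-based time stays small on every iteration while the host-side measurement still jumps to about 20 milliseconds, the extra time is probably coming from other work on the stream or from the host side rather than from the kernel itself.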
I am really confused and don't know why this happens or how to fix it.
I hope to see your suggestions, thanks.