Different performance of one kernel with the same memory

The timing information printed on the console:
**** FreshBuffer[140154480080640] costs : 1330 microseconds
**** reshape_score[140154480080640] costs : 214 microseconds, stream: 0x7f7818001300
**** cudaMemcpyAsync[140154480080640] costs : 60 microseconds
**** FreshBuffer[140154480080640] costs : 793 microseconds
**** reshape_score[140154480080640] costs : 21090 microseconds, stream: 0x7f7818001300
**** cudaMemcpyAsync[140154480080640] costs : 63 microseconds
**** FreshBuffer[140154480080640] costs : 866 microseconds
**** reshape_score[140154480080640] costs : 20703 microseconds, stream: 0x7f7818001300
**** cudaMemcpyAsync[140154480080640] costs : 56 microseconds
**** FreshBuffer[140154480080640] costs : 731 microseconds
**** reshape_score[140154480080640] costs : 20890 microseconds, stream: 0x7f7818001300
**** cudaMemcpyAsync[140154480080640] costs : 53 microseconds
**** FreshBuffer[140154480080640] costs : 786 microseconds
**** reshape_score[140154480080640] costs : 20774 microseconds, stream: 0x7f7818001300
**** cudaMemcpyAsync[140154480080640] costs : 54 microseconds

The kernel is called from a single host thread:

reshape_score((float*)_src.data, (float*)_src.reshape,
    _detector->GetBatch(),
    _elSize,
    _skipSize,
    _unitSize,
    _src.stream);
cudaStreamSynchronize(_src.stream);
... other logic ...

_src.data and _src.reshape are CUDA device pointers to buffers of the same, fixed size.

As the console output shows:
the first call of the kernel takes 214 microseconds (after synchronization);
but later calls of the same kernel take more than 20 milliseconds (after synchronization).

I am really confused and don't know why this happens or how to fix it.
I hope to see your suggestions, thanks.

The reshape_score implementation:

#define MAX_NUM_THREAD 1024

inline int GET_THREAD_X(const int N) {
    return (N + MAX_NUM_THREAD - 1) / MAX_NUM_THREAD;
}

__global__ void _reshape_score(
    float* src,
    float* dst,
    int batch_el_size,
    int batch_el_offset,
    int batch_el_el_size)
{
    int row = blockDim.x * blockIdx.y + threadIdx.x;
    if (row < batch_el_el_size)
    {
        // src data = [batch el index][batch el size][loop index]
        float* src_data = src + blockIdx.x * batch_el_size + batch_el_offset + row;
        // dst data
        float* dst_data = dst + blockIdx.x * batch_el_size + batch_el_offset + (row << 1);

        dst_data[0] = src_data[0];
        dst_data[1] = src_data[batch_el_el_size];
    }
}

void reshape_score(
    float* src,
    float* dst,
    int batch_el_count,
    int batch_el_size,
    int batch_el_offset,
    int batch_el_el_size,
    cudaStream_t stream)
{
    dim3 thd(MAX_NUM_THREAD);
    dim3 bld(batch_el_count, GET_THREAD_X(batch_el_el_size));

#ifdef CONSOLE_DEBUG_DEV_CUDA
    printf("reshape_score<<<(%d,%d,%d),(%d,%d,%d)>>>\n", bld.x, bld.y, bld.z, thd.x, thd.y, thd.z);
    printf("reshape_score(%p,%p,%d,%d,%d,%d)\n", src, dst, batch_el_count, batch_el_size, batch_el_offset, batch_el_el_size);
#endif

    _reshape_score<<<bld, thd, 0, stream>>>
        (src, dst, batch_el_size, batch_el_offset, batch_el_el_size);
}

Perhaps you are measuring things incorrectly.
Perhaps you have a long sequence of asynchronous activity and you have exceeded the launch queue depth.

Hi Robert_Crovella,
I really appreciate your suggestion.
But I don't think the launch queue depth is exceeded, because the kernel is launched in a single stream and only this kernel is launched in it.

my thread code logic:

... malloc memories and create stream...
while(_running)
{
    Work()
}
... free memories and destroy stream ...

void Work()
{
            typedef std::chrono::time_point<std::chrono::system_clock, std::chrono::microseconds> time_point_type;
            time_point_type tb, te;

            tb = std::chrono::time_point_cast<std::chrono::microseconds>(std::chrono::system_clock::now());

            _detector->FreshBuffer(_index, _src.data, _src.size);

            te = std::chrono::time_point_cast<std::chrono::microseconds>(std::chrono::system_clock::now());
            std::cout << "**** FreshBuffer[" << std::this_thread::get_id() << "] costs : " << (te.time_since_epoch().count() - tb.time_since_epoch().count()) << " microseconds" << std::endl;

            tb = std::chrono::time_point_cast<std::chrono::microseconds>(std::chrono::system_clock::now());
            reshape_score((float*)_src.data, (float*)_src.reshape,
                _detector->GetBatch(),
                _elSize,
                _skipSize,
                _unitSize,
                _src.stream);
            gpuErrchk(cudaPeekAtLastError());

            gpuErrchk(cudaStreamSynchronize(_src.stream));
            te = std::chrono::time_point_cast<std::chrono::microseconds>(std::chrono::system_clock::now());
            std::cout << "**** reshape_score[" << std::this_thread::get_id() << "] costs : " << (te.time_since_epoch().count() - tb.time_since_epoch().count()) << " microseconds, stream: " << _src.stream << std::endl;

            tb = std::chrono::time_point_cast<std::chrono::microseconds>(std::chrono::system_clock::now());
            gpuErrchk(cudaMemcpyAsync(_dst.reshape, _src.reshape, _src.size, cudaMemcpyDeviceToHost, _src.stream));
            gpuErrchk(cudaStreamSynchronize(_src.stream));
            te = std::chrono::time_point_cast<std::chrono::microseconds>(std::chrono::system_clock::now());
            std::cout << "**** cudaMemcpyAsync[" << std::this_thread::get_id() << "] costs : " << (te.time_since_epoch().count() - tb.time_since_epoch().count()) << " microseconds" << std::endl;

            //gpuErrchk(cudaStreamSynchronize(_src.stream));

            //te = std::chrono::time_point_cast<std::chrono::microseconds>(std::chrono::system_clock::now());
            //std::cout << "**** reshape_score[" << std::this_thread::get_id() <<"] costs : " << (te.time_since_epoch().count() - tb.time_since_epoch().count()) << " microseconds" << std::endl;
        }

I really cannot imagine that any other kernels were being launched, or that the kernel was launched repeatedly.
Any advice? Thanks again.

I also added code to time the launch separately from the synchronization:

tb = std::chrono::time_point_cast<std::chrono::microseconds>(std::chrono::system_clock::now());
            reshape_score((float*)_src.data, (float*)_src.reshape,
                _detector->GetBatch(),
                _elSize,
                _skipSize,
                _unitSize,
                _src.stream);
            gpuErrchk(cudaPeekAtLastError());
            te = std::chrono::time_point_cast<std::chrono::microseconds>(std::chrono::system_clock::now());
            std::cout << "**** reshape_score asyn[" << std::this_thread::get_id() << "] costs : " << (te.time_since_epoch().count() - tb.time_since_epoch().count()) << " microseconds, stream: " << _src.stream << std::endl;

            gpuErrchk(cudaStreamSynchronize(_src.stream));
            te = std::chrono::time_point_cast<std::chrono::microseconds>(std::chrono::system_clock::now());
            std::cout << "**** reshape_score syn[" << std::this_thread::get_id() << "] costs : " << (te.time_since_epoch().count() - tb.time_since_epoch().count()) << " microseconds, stream: " << _src.stream << std::endl;

The launch is fast, and the synchronization is slow:

**** FreshBuffer[140636053288704] costs : 612 microseconds
**** reshape_score asyn[140636053288704] costs : 83 microseconds, stream: 0x7fe840001300
**** reshape_score syn[140636053288704] costs : 101 microseconds, stream: 0x7fe840001300
**** cudaMemcpyAsync[140636053288704] costs : 53 microseconds
**** FreshBuffer[140636053288704] costs : 571 microseconds
**** reshape_score asyn[140636053288704] costs : 28 microseconds, stream: 0x7fe840001300
**** reshape_score syn[140636053288704] costs : 20145 microseconds, stream: 0x7fe840001300
**** cudaMemcpyAsync[140636053288704] costs : 51 microseconds
**** FreshBuffer[140636053288704] costs : 722 microseconds
**** reshape_score asyn[140636053288704] costs : 46 microseconds, stream: 0x7fe840001300
**** reshape_score syn[140636053288704] costs : 20069 microseconds, stream: 0x7fe840001300
**** cudaMemcpyAsync[140636053288704] costs : 55 microseconds
**** FreshBuffer[140636053288704] costs : 543 microseconds
**** reshape_score asyn[140636053288704] costs : 34 microseconds, stream: 0x7fe840001300
**** reshape_score syn[140636053288704] costs : 20226 microseconds, stream: 0x7fe840001300
**** cudaMemcpyAsync[140636053288704] costs : 51 microseconds

It's expected that the launch will be fast and the synchronize will be slow. The synchronize absorbs all the execution time of your stream up to that point.

Yes, but I mean the first run's performance should be the same (more or less) as the later ones, shouldn't it?
I also wrote a standalone main program to test the same logic with the same data size, and guess what? Every run is fast, around 200 microseconds.


Oh, one more detail: this kernel runs alongside the inference framework TVM.
Maybe the inference takes most of the compute and memory resources...
Could that be the cause?