Execution time of kernel functions sometimes becomes much longer on TX1

In my current project on the TX1, I implemented some kernel functions and also call some library functions such as cublasHgemm/cublasSgemm. They are called repeatedly, many times, while the program is running.

I found that the execution time of the kernels is sometimes longer than expected, falling outside the normal range periodically. I will take two functions as examples to show this issue. The first is a simple copy function; the second is cublasSgemm. The relevant parts of the code are as follows:

void copy_ongpu(float *data, int n, float a, int index)
{
    copy_kernel<<<cuda_gridsize(n), BLOCK>>>(data, n, a, index);
}

// Writes the scalar a to data[i*index] for each i < n.
__global__ void copy_kernel(float *data, int n, float a, int index)
{
    int i = (blockIdx.x + blockIdx.y*gridDim.x) * blockDim.x + threadIdx.x;
    if(i < n) data[i*index] = a;
}

void gemm(int TA, int TB, int M, int N, int K, float ALPHA,
        float *A, int lda,
        float *B, int ldb,
        float BETA,
        float *C, int ldc)
{
    cublasHandle_t handle = blas_handle();
    // cublasSgemm returns cublasStatus_t, not cudaError_t
    cublasStatus_t status = cublasSgemm(handle, (TB ? CUBLAS_OP_T : CUBLAS_OP_N),
            (TA ? CUBLAS_OP_T : CUBLAS_OP_N), N, M, K, &ALPHA, B, ldb, A, lda, &BETA, C, ldc);
}
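(Side note: the return status above is currently discarded. A minimal sketch of checking it, using a hypothetical helper that is not part of the original code:)

#include <stdio.h>
#include <cublas_v2.h>

// Hypothetical helper: reports non-success status codes returned by
// cuBLAS calls such as cublasSgemm, e.g. check_cublas(status, "cublasSgemm").
static void check_cublas(cublasStatus_t status, const char *what)
{
    if (status != CUBLAS_STATUS_SUCCESS) {
        fprintf(stderr, "cuBLAS error %d in %s\n", (int)status, what);
    }
}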

The measured execution times of the two functions are shown in the following list (1_ is the copy kernel, 2_ is cublasSgemm):
1_0.000245, 2_0.060202
1_0.012942, 2_0.075983
1_0.000337, 2_0.096875
1_0.000401, 2_0.089339
1_0.000090, 2_0.001132
1_0.005623, 2_0.086324
1_0.000158, 2_0.001125
1_0.000117, 2_0.001058
1_0.000161, 2_0.001382
1_0.000399, 2_0.072303
1_0.000259, 2_0.077122
1_0.000253, 2_0.022333
1_0.000409, 2_0.078325
1_0.000275, 2_0.040005
1_0.017100, 2_0.056319
1_0.000598, 2_0.077978
1_0.012182, 2_0.001941
1_0.005460, 2_0.001489
1_0.000273, 2_0.001107
1_0.000311, 2_0.073162
1_0.008051, 2_0.001236
1_0.000314, 2_0.100421
1_0.000292, 2_0.086024
1_0.005424, 2_0.080342
1_0.000250, 2_0.001184
1_0.000307, 2_0.001103
1_0.000280, 2_0.083244
1_0.021767, 2_0.057224
1_0.000283, 2_0.085701
1_0.000078, 2_0.001483
1_0.012397, 2_0.068785
1_0.012381, 2_0.066995
1_0.000170, 2_0.000868
1_0.085404, 2_0.001049
1_0.000061, 2_0.000422
1_0.005006, 2_0.080800

As several of the lines show, the first kernel sometimes takes 0.0X or 0.00X instead of the normal 0.000X. These longer times degrade the performance of the program critically.

I don't know what other kinds of operations could impact the execution time of a kernel function. Has anyone met such an issue?

I would really appreciate any help on this topic!

(1) It’s not clear what the timing methodology is; you may be measuring something other than what you think you are measuring. See the timing sketch after point (3).

(2) The TX1 uses the same physical memory for the CPU and the GPU. I don’t know how the chip prioritizes memory access by the two processors, but conceivably CPU work can reduce bandwidth available to the GPU, negatively impacting GPU performance.

(3) You are likely to receive better answers in the sub-forum dedicated to TX1: https://devtalk.nvidia.com/default/board/164/jetson-tx1/
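Regarding (1), here is a minimal sketch of timing the copy kernel with CUDA events, which measure elapsed time on the GPU itself rather than host wall-clock time around an asynchronous launch. The wrapper name is just a placeholder, and it assumes work is issued on the default stream:

#include <cuda_runtime.h>

// Hypothetical timing wrapper (not from the code above): times one call to
// copy_ongpu with CUDA events and returns the elapsed time in milliseconds.
float time_copy_ongpu(float *data, int n, float a, int index)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);          // record on the default stream
    copy_ongpu(data, n, a, index);      // launch wrapper from the question
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

If you keep host-side timers instead, call cudaDeviceSynchronize() before reading the timer, since kernel launches and cuBLAS calls return to the host asynchronously.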

Thanks for your information, njuffa. I’ll forward this post to the TX1 sub-forum.