Kernel execution time sometimes becomes longer than expected on TX1

In my current project on TX1, I implemented some kernel functions and also call some library functions such as cublasHgemm/cublasSgemm. They are called repeatedly many times while the program is running.

I found that the execution time of the kernels is sometimes longer than expected, periodically falling outside the proper range. I will take two kernel functions as examples to show this issue. The first is a simple copy function; the second is cublasSgemm. Part of the code is as follows:

    // Kernel: each thread writes the value a into one strided element of data.
    __global__ void copy_kernel(float *data, int n, float a, int index)
    {
        int i = (blockIdx.x + blockIdx.y*gridDim.x) * blockDim.x + threadIdx.x;
        if (i < n) data[i*index] = a;
    }

    // Host wrapper: launches one thread per element.
    void copy_ongpu(float *data, int n, float a, int index)
    {
        copy_kernel<<<cuda_gridsize(n), BLOCK>>>(data, n, a, index);
    }
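
Here, BLOCK and cuda_gridsize are helpers from the project rather than CUDA built-ins. A minimal sketch of such a grid-size helper (the block size of 512 is an assumption) might look like this:

    #include <math.h>   // ceilf, sqrtf

    #define BLOCK 512

    // Map n threads onto a (possibly 2-D) grid so that no grid
    // dimension exceeds the 65535 per-dimension launch limit.
    dim3 cuda_gridsize(int n)
    {
        unsigned int k = (n - 1) / BLOCK + 1;
        unsigned int x = k;
        unsigned int y = 1;
        if (x > 65535) {
            x = (unsigned int)ceilf(sqrtf((float)k));
            y = (n - 1) / (x * BLOCK) + 1;
        }
        dim3 d(x, y, 1);
        return d;
    }
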
    void gemm(int TA, int TB, int M, int N, int K, float ALPHA,
            float *A, int lda,
            float *B, int ldb,
            float BETA,
            float *C, int ldc)
    {
        cublasHandle_t handle = blas_handle();
        // Note: cublasSgemm returns a cublasStatus_t, not a cudaError_t.
        cublasStatus_t status = cublasSgemm(handle, (TB ? CUBLAS_OP_T : CUBLAS_OP_N),
                (TA ? CUBLAS_OP_T : CUBLAS_OP_N), N, M, K, &ALPHA, B, ldb, A, lda, &BETA, C, ldc);
        if (status != CUBLAS_STATUS_SUCCESS) {
            printf("cublasSgemm failed: %d\n", (int)status);
        }
    }
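
Likewise, blas_handle() is a project helper. A minimal sketch of such a cached-handle helper (assuming the handle is used from one thread, or access is otherwise synchronized) would be:

    #include <cublas_v2.h>

    // Create the cuBLAS handle once and reuse it on later calls,
    // since cublasCreate is relatively expensive.
    cublasHandle_t blas_handle()
    {
        static cublasHandle_t handle;
        static int init = 0;
        if (!init) {
            cublasCreate(&handle);
            init = 1;
        }
        return handle;
    }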

The execution times of the two kernels are shown in the following list (on each line, the value after "1_" is the copy kernel and the value after "2_" is cublasSgemm):
1_0.000245, 2_0.060202
1_0.012942, 2_0.075983
1_0.000337, 2_0.096875
1_0.000401, 2_0.089339
1_0.000090, 2_0.001132
1_0.005623, 2_0.086324
1_0.000158, 2_0.001125
1_0.000117, 2_0.001058
1_0.000161, 2_0.001382
1_0.000399, 2_0.072303
1_0.000259, 2_0.077122
1_0.000253, 2_0.022333
1_0.000409, 2_0.078325
1_0.000275, 2_0.040005
1_0.017100, 2_0.056319
1_0.000598, 2_0.077978
1_0.012182, 2_0.001941
1_0.005460, 2_0.001489
1_0.000273, 2_0.001107
1_0.000311, 2_0.073162
1_0.008051, 2_0.001236
1_0.000314, 2_0.100421
1_0.000292, 2_0.086024
1_0.005424, 2_0.080342
1_0.000250, 2_0.001184
1_0.000307, 2_0.001103
1_0.000280, 2_0.083244
1_0.021767, 2_0.057224
1_0.000283, 2_0.085701
1_0.000078, 2_0.001483
1_0.012397, 2_0.068785
1_0.012381, 2_0.066995
1_0.000170, 2_0.000868
1_0.085404, 2_0.001049
1_0.000061, 2_0.000422
1_0.005006, 2_0.080800

As the longer values in the list indicate, the first kernel sometimes takes 0.0X or 0.00X instead of the normal 0.000X. Such longer times degrade the program's performance critically. BTW, the timing function I used for the test is clock().
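
For reference, clock() measures host-side time, and kernel launches are asynchronous, so unless the code synchronizes around each launch, the measured interval can include queuing and other work rather than just the kernel itself. A minimal sketch of timing with CUDA events instead (using copy_ongpu as the example launch):

    // Time a kernel launch with device-side CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    copy_ongpu(data, n, a, index);      // the work being measured
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         // block until the kernel finishes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);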

I just don't know what other kinds of operations could impact the execution time of a kernel function. Has anyone met this issue?

I'd really appreciate any help on this topic!

YangHao,
If you could attach a binary, I could run it on my Jetson TX1 board.

Hi chijen,
Sorry for the late feedback.
I've found the cause of this issue. The kernel function (A) mentioned in my post is executed in a sub-thread, while another kernel function (B) is executed in the main thread. After I removed B, there were no more unexpectedly long execution times for A. So, considering thread-scheduling priority, I think all kernel functions should be launched from sub-threads.
BTW, if there is any other factor that may affect kernel execution time, please let me know.
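
One possible factor worth noting: kernel launches from different host threads that do not specify a stream all go to the same legacy default stream, so they serialize and one thread's launches can delay another's. A minimal sketch of giving each host thread its own stream (the variable names here are illustrative, not from the original project):

    // Each host thread creates and uses its own stream so its launches
    // don't serialize against other threads on the legacy default stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Launch the copy kernel on this thread's stream.
    copy_kernel<<<cuda_gridsize(n), BLOCK, 0, stream>>>(data, n, a, index);

    // Bind cuBLAS work from this thread to the same stream.
    cublasSetStream(handle, stream);

    cudaStreamSynchronize(stream);  // wait only for this thread's work
    cudaStreamDestroy(stream);

Alternatively, compiling with nvcc --default-stream per-thread gives every host thread its own default stream without code changes.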