nvcc on Jetson Nano


I am facing a strange issue with the nvcc compiler on one particular program. The compilation does not seem to complete at all, and one core sits at 100% usage. I waited a full minute. It's a small piece of code that should compile in about 4 s. Is it possible that an infinite loop has been triggered?

The problem is the #pragma unroll within a nested loop.

With the #pragma unroll, compilation takes around 4 minutes. Without the #pragma, it takes just 4 s! The compiler seems to be doing some kind of exhaustive search…

    rreddy78@jetson-nano:~$ time nvcc --resource-usage --gpu-architecture=sm_53 matrixVectorMultiplication2.cu -o matrixVectorMultiplication2
    ptxas info : 0 bytes gmem
    ptxas info : Compiling entry function '_Z26matrixVectorMultiplicationPKfS0_Pfii' for 'sm_53'
    ptxas info : Function properties for _Z26matrixVectorMultiplicationPKfS0_Pfii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info : Used 40 registers, 10000 bytes smem, 352 bytes cmem[0]

    real 3m58.486s
    user 3m52.844s
    sys 0m1.136s

There might be a performance problem with nvcc after all…

Are there cases where #pragma unroll can keep the compilation from completing at all?

Code: NVIDIA, please check whether this is reproducible on a Jetson Nano…

    const int TILE_SIZE = 10000;

    __global__ void matrixVectorMultiplication(const float *__restrict__ M, const float *__restrict__ V, float *R, const int M_Size, const int W_Size)
    {
        __shared__ float VS[TILE_SIZE];

        const int COL = blockIdx.x * blockDim.x + threadIdx.x;

        /*
         * Load the entire vector V into shared memory before processing
         */
        for (int j = threadIdx.x; j < TILE_SIZE / 4; j = j + blockDim.x)
        {
            const float4 valueA = reinterpret_cast<const float4 *>(V)[j];

            VS[4 * j + 0] = valueA.x;
            VS[4 * j + 1] = valueA.y;
            VS[4 * j + 2] = valueA.z;
            VS[4 * j + 3] = valueA.w;
        }

        // Wait until every thread has finished writing VS before anyone reads it
        __syncthreads();

        for (int col = COL; col < M_Size; col = col + blockDim.x)
        {
            float tmpSum = 0.0f;

            #pragma unroll
            for (int k = 0; k < TILE_SIZE; ++k)
            {
                // M is stored column-wise
                tmpSum += M[(k * M_Size) + col] * VS[k];
            }

            R[col] = tmpSum;
        }
    }
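
One thing that may be worth trying is a bounded unroll factor. A bare #pragma unroll on a loop whose trip count is known at compile time asks nvcc to unroll it completely, which here means TILE_SIZE = 10000 copies of the loop body, while `#pragma unroll N` limits the unrolling to a factor of N. The sketch below is only an illustration under assumptions: the factor 8, the kernel name `matVecBoundedUnroll`, the host helper `maxErrorBoundedUnroll`, and the test data are all hypothetical and not part of the original program, and no compile time has been measured for it on a Jetson Nano.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TILE_SIZE = 10000;  // same vector length as in the original kernel

// Same computation as the original kernel, with a bounded unroll factor.
// (Hypothetical name; W_Size is dropped because the original never uses it.)
__global__ void matVecBoundedUnroll(const float *__restrict__ M,
                                    const float *__restrict__ V,
                                    float *R, const int M_Size)
{
    __shared__ float VS[TILE_SIZE];

    // Stage V in shared memory (scalar loads keep the sketch short)
    for (int j = threadIdx.x; j < TILE_SIZE; j += blockDim.x)
        VS[j] = V[j];
    __syncthreads();

    for (int col = blockIdx.x * blockDim.x + threadIdx.x; col < M_Size;
         col += gridDim.x * blockDim.x)
    {
        float tmpSum = 0.0f;

        // Assumption: a factor of 8 instead of a full 10000-way unroll
        #pragma unroll 8
        for (int k = 0; k < TILE_SIZE; ++k)
            tmpSum += M[(k * M_Size) + col] * VS[k];  // M is stored column-wise

        R[col] = tmpSum;
    }
}

// Hypothetical host helper: runs the kernel on made-up data and returns the
// largest absolute difference from a CPU reference, or -1 on a CUDA error.
float maxErrorBoundedUnroll(int M_Size)
{
    std::vector<float> M(static_cast<size_t>(TILE_SIZE) * M_Size);
    std::vector<float> V(TILE_SIZE), R(M_Size);
    for (size_t i = 0; i < M.size(); ++i) M[i] = static_cast<float>(i % 7) * 0.125f;
    for (int k = 0; k < TILE_SIZE; ++k)   V[k] = static_cast<float>(k % 5) * 0.25f;

    float *dM = nullptr, *dV = nullptr, *dR = nullptr;
    if (cudaMalloc(&dM, M.size() * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&dV, V.size() * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&dR, R.size() * sizeof(float)) != cudaSuccess)
        return -1.0f;
    cudaMemcpy(dM, M.data(), M.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dV, V.data(), V.size() * sizeof(float), cudaMemcpyHostToDevice);

    matVecBoundedUnroll<<<1, 128>>>(dM, dV, dR, M_Size);
    cudaError_t err = cudaGetLastError();  // catches launch failures
    if (err == cudaSuccess)                // this copy also waits for the kernel
        err = cudaMemcpy(R.data(), dR, R.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dM); cudaFree(dV); cudaFree(dR);
    if (err != cudaSuccess) return -1.0f;

    float maxErr = 0.0f;
    for (int col = 0; col < M_Size; ++col) {
        float ref = 0.0f;
        for (int k = 0; k < TILE_SIZE; ++k)
            ref += M[static_cast<size_t>(k) * M_Size + col] * V[k];
        maxErr = std::fmax(maxErr, std::fabs(ref - R[col]));
    }
    return maxErr;
}
```

To try it, call `maxErrorBoundedUnroll(256)` from a small `main` and build with the same `time nvcc --resource-usage --gpu-architecture=sm_53 ...` command as above, then vary the factor and compare both the compile time and the reported register usage against the fully unrolled version.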