Hi,
I am facing a strange issue with the nvcc compiler on one particular program. nvcc does not complete at all and one core stays at 100% usage; I watched it for a full minute. It is a small piece of code that should compile in about 4 seconds. Is it possible that an infinite loop has been triggered in the compiler?
The problem is the #pragma unroll inside a nested loop.
With the #pragma unroll, compilation takes around 4 minutes; without it, it takes just 4 seconds! There seems to be some exhaustive search going on in the compiler…
rreddy78@jetson-nano:~$ time nvcc --resource-usage --gpu-architecture=sm_53 matrixVectorMultiplication2.cu -o matrixVectorMultiplication2
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z26matrixVectorMultiplicationPKfS0_Pfii' for 'sm_53'
ptxas info : Function properties for _Z26matrixVectorMultiplicationPKfS0_Pfii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 40 registers, 10000 bytes smem, 352 bytes cmem[0]
real 3m58.486s
user 3m52.844s
sys 0m1.136s
There might be a performance problem with nvcc after all…
Are there cases where #pragma unroll can cause compilation to never complete at all?
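For context on where the time seems to go: with TILE_SIZE = 10000, a bare #pragma unroll asks the compiler to fully expand all 10000 iterations of the inner loop in the kernel below. Capping the unroll factor avoids that. The following is only a workaround sketch, not the original kernel (the shared-memory staging is omitted for brevity), and the factor 4 plus the names matVecBoundedUnroll / TILE_SIZE_SKETCH are arbitrary examples of mine:

// Workaround sketch, not the original kernel: the same column-wise dot product
// with a bounded unroll factor, so ptxas does not expand all 10000 iterations.
// The factor 4 is an arbitrary example value.
#define TILE_SIZE_SKETCH 10000

__global__ void matVecBoundedUnroll(const float *__restrict__ M,
                                    const float *__restrict__ V,
                                    float *R, const int M_Size)
{
    for (int col = blockIdx.x * blockDim.x + threadIdx.x; col < M_Size;
         col += blockDim.x * gridDim.x)
    {
        float tmpSum = 0.0f;

        #pragma unroll 4   // bounded unroll instead of full expansion
        for (int k = 0; k < TILE_SIZE_SKETCH; ++k)
        {
            tmpSum += M[(k * M_Size) + col] * V[k];   // M is stored column-wise
        }

        R[col] = tmpSum;
    }
}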
Code: NVIDIA, please check if this is reproducible on a Jetson Nano…
const int TILE_SIZE = 10000;

__global__ void matrixVectorMultiplication(const float *__restrict__ M, const float *__restrict__ V, float *R, const int M_Size, const int W_Size)
{
    __shared__ float VS[TILE_SIZE];

    const int COL = blockIdx.x * blockDim.x + threadIdx.x;

    /**
     * Load the entire vector V into shared memory before processing
     */
    for (int j = threadIdx.x; j < TILE_SIZE / 4; j = j + blockDim.x)
    {
        const float4 valueA = reinterpret_cast<const float4 *>(V)[j];
        VS[4 * j + 0] = valueA.x;
        VS[4 * j + 1] = valueA.y;
        VS[4 * j + 2] = valueA.z;
        VS[4 * j + 3] = valueA.w;
    }

    __syncthreads();

    for (int col = COL; col < M_Size; col = col + blockDim.x)
    {
        float tmpSum = 0.0f;

        #pragma unroll
        for (int k = 0; k < TILE_SIZE; ++k)
        {
            // M is stored column-wise
            tmpSum += M[(k * M_Size) + col] * VS[k];
        }

        R[col] = tmpSum;
    }
}
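For anyone trying to reproduce the compile time, a minimal host-side harness along these lines can be appended to the same .cu file. The column count and launch configuration are hypothetical example values, not the ones from my run; a single block of 256 threads is used so the kernel's column loop (which strides by blockDim.x) covers the whole result without overlap:

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main()
{
    const int M_Size = 2048;      // hypothetical number of columns / result elements
    const int W_Size = TILE_SIZE; // vector length, assumed equal to TILE_SIZE

    // All-ones inputs so every result element should equal TILE_SIZE.
    std::vector<float> hM(static_cast<size_t>(TILE_SIZE) * M_Size, 1.0f);
    std::vector<float> hV(TILE_SIZE, 1.0f);
    std::vector<float> hR(M_Size, 0.0f);

    float *dM, *dV, *dR;
    cudaMalloc(&dM, hM.size() * sizeof(float));
    cudaMalloc(&dV, hV.size() * sizeof(float));
    cudaMalloc(&dR, hR.size() * sizeof(float));
    cudaMemcpy(dM, hM.data(), hM.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dV, hV.data(), hV.size() * sizeof(float), cudaMemcpyHostToDevice);

    matrixVectorMultiplication<<<1, 256>>>(dM, dV, dR, M_Size, W_Size);
    cudaDeviceSynchronize();

    cudaMemcpy(hR.data(), dR, hR.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("R[0] = %f (expected %d with all-ones inputs)\n", hR[0], TILE_SIZE);

    cudaFree(dM);
    cudaFree(dV);
    cudaFree(dR);
    return 0;
}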