CUDA double for loop inside kernel: is there a limit?

Hi,

I am new to CUDA. I was doing some experiments with a CUDA kernel and noticed that if I put two for loops inside a kernel launched with 907 * 32 threads, my system hangs.
I am using a Tesla C1060 with 4 GB of system memory and a 2.6 GHz Intel Core 2. OS: Windows 7 64-bit.

Following are the kernel details.

Dimensions:
Block dim: (512, 1, 1)
Grid dim: (57, 1)
Kernel:
__global__ void ApplyConvolution(
    float* pfInputProjections,   // 907 input images, each 912 * 32 * sizeof(float)
    float* pfConvolutionKernel,  // a 1D array of size 1024 * sizeof(float)
    int    nProjectionWidth,     // 912
    float* pfOutputProjections ) // 907 output images, each 912 * 32 * sizeof(float)
{
    unsigned nThreadIndex = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;

    if( nThreadIndex >= (32 * 907) ) // 57 * 512 = 29184 threads launched, only 29024 needed
        return;

    for( int i = 0; i < 912; i++ )      // first loop
    {
        for( int j = 0; j < 50; j++ )   // second loop
        {
            pfOutputProjections[i] = pfInputProjections[i] * pfConvolutionKernel[i];
        }
    }
}

Note: if I set the second for loop's upper bound to 5, the kernel executes properly. If I set it to 50 as shown above, the system hangs and resets, which looks like a watchdog timer reset.
Why is this happening? Is there any limit on for loops inside a kernel? Any help would be appreciated.

It is possible that the compiler optimized the j loop away. The "loop-invariant code motion" and "dead code elimination" passes of the compiler can completely remove the second for loop, since its body computes and stores the same value on every iteration. I wonder what the purpose of that loop is?

Hi,

Thanks for the reply. It seems the compiler is not eliminating the j loop: if I give a higher j loop bound, the kernel execution time increases.

My program crashed because of a watchdog timer reset, as described in this post: http://forums.nvidia.com/lofiversion/index.php?t71855.html

I have figured out the issue: the long run time is caused by heavy global memory access, which triggers the watchdog timer reset.
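In hindsight, part of the problem is that the kernel ignores nThreadIndex: all 29024 threads run the full 912 * 50 loop and keep rewriting the same 912 outputs. A sketch of a per-thread version I could try, where each thread handles one element (the kernel indexing here is just a placeholder for illustration, not the real convolution):

```cuda
__global__ void ApplyConvolution( float* pfInputProjections,
                                  float* pfConvolutionKernel,
                                  int    nProjectionWidth,   // 912
                                  float* pfOutputProjections )
{
    unsigned nThreadIndex = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x;

    if( nThreadIndex >= 32 * 907 )
        return;

    // One element per thread instead of 912 * 50 iterations per thread:
    // each thread now touches global memory a handful of times, so the
    // kernel finishes long before the display watchdog fires.
    pfOutputProjections[nThreadIndex] =
        pfInputProjections[nThreadIndex] *
        pfConvolutionKernel[nThreadIndex % nProjectionWidth];
}
```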

Thank you.

Good that you found this out… You should always check for errors using "err = cudaThreadSynchronize()" – it can save a lot of time…
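For reference, a minimal host-side checking pattern along those lines, assuming the kernel and buffers from the first post are already set up (cudaThreadSynchronize was the current API on this CUDA version; later releases deprecate it in favor of cudaDeviceSynchronize):

```cuda
#include <cstdio>

// Launch asynchronously, then synchronize and check the returned status.
ApplyConvolution<<<57, 512>>>( pfInputProjections, pfConvolutionKernel,
                               912, pfOutputProjections );

cudaError_t err = cudaThreadSynchronize();   // blocks until the kernel finishes
if( err != cudaSuccess )
    printf( "Kernel failed: %s\n", cudaGetErrorString( err ) );
```

Without the synchronize, the launch returns immediately and a watchdog reset or launch failure only surfaces at some later, unrelated CUDA call.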
