Possible NVCC compiler bug Two 'breaks' in different loops :(

Hi,

I am new here.

I discovered this quite unpleasant nvcc compiler bug not long ago.

A kernel with two “break” operations in different loops works incorrectly.

Sometimes my system hangs (only the reset button can help), sometimes the kernel terminates as expected but the computation results are incorrect.

I am using:

SDK: NVIDIA_CUDA_SDK_1.0

TOOLKIT: NVIDIA_CUDA_Toolkit_1.0_suse_10.2_x86_64

DRIVER: NVIDIA-Linux-x86_64-100.14.11-pkg2

OS: openSUSE 10.2 (X86-64)

Video card: GeForce 8800GTS with 640 Mb RAM

Hardware: 2048 MB RAM, Intel Q6600 2.4 GHz

On the other system the execution of a kernel with two breaks in different loops leads to reboot !

Recompiling the source code under CUDA 1.1 do not solve the problem.

Second system:

SDK: NVIDIA_CUDA_SDK_1.1

TOOLKIT: NVIDIA_CUDA_Toolkit_1.1_Suse10.2_x86_64

DRIVER: NVIDIA Driver for Linux with CUDA Support (169.04)

OS: openSUSE 10.3 (X86-64)

Video card: GeForce 8800GTX with 768 Mb RAM

Hardware: 2048 MB RAM, Athlon 64 X2 4600+

Here is a code sample with the bug:

Code description:

I created a large array of ‘unsigned ints’, initialized them with 0’s and copied the array into the device’s memory.

Each of 48 threads I created gets it’s own part of the array to process.

If any thread finds an array member with value != 0 it writes an ‘1’ to an array in the shared memory and exits the first loop with ‘break’ statement.

Second loop tests if any member of an array in the shared memory is != 0.

If such value is found - the loop will be terminated with the second ‘break’ statement.

If I remove one or two breaks in this code the kernel works perfectly !

I zipped the source code and the makefile into BugReport01.zip

Sorry for my English. I do my best. ;)

#ifndef _TEMPLATE_KERNEL_H_

#define _TEMPLATE_KERNEL_H_

#include <stdio.h>

__global__ void

MyKernel(unsigned int* g_input, unsigned int operand_length)

{

   // Shared Memory (quite small -  num_threads * sizeof( int) )

    extern  __shared__  int shared_array[];

   // Thread ID

    const unsigned int tid = threadIdx.x;

    

    // Number of threads

    const unsigned int num_threads = blockDim.x;

   // each thread gets its own portion of data

    unsigned int pointer_position;

   // if all members of array 'g_input' == 0 then is_not_null will be 0

    // else is_not_null will be 1

    unsigned int is_not_null;

    

    // each thread writes results to dedicated position of shared_array (shared_array[tid])

    shared_array[tid] = 0;

   for (unsigned int i = 0; i < operand_length; ++i)

    {

        // each thread gets its own portion of data

        pointer_position = tid * operand_length + i;

       // if at least one  member of array is !=0 then

        if (g_input[pointer_position] != 0)

        {

            // write an 1 into shared_array[tid] ...

            shared_array[tid] = 1;

            // ... and exit the loop

           // This is one of two 'break' in my code that cause malfunctions

            // if you remove it the kernel will work perfectly

            break;

        }

    }

    __syncthreads();

   // The first thread (with tid = 0) collects all results from 'shared_array'

    // Other threads have nothing to do

    if (tid == 0)

    {

        is_not_null = 0;

        for (unsigned int i = 0; i < num_threads; ++i)

        {

            // If at least one thread detected an number != 0 in array 'g_input'

            // (at least one member of shared_array is 1)

            if (shared_array[i] != 0)

            {

                // at least one member of array is not null !

                s_not_null = 1;

               // exit the loop

               // This is one of two 'break' in my code that cause malfunctions

                // if you remove it the kernel will work perfectly

                break;

            }

        }

    }

    __syncthreads();

}

#endif // #ifndef _TEMPLATE_KERNEL_H_

BugReoprt01.zip (2.79 KB)

I’m attempting to replicate this problem, however its not clear what the desired behavior of your test app should be?

When I build & attempt to run your test app on Linux-x86_64 against CUDA_1.1, it just runs indefinitely, however there is never any instability.

Also, please generate and attach an nvidia-bug-report.log.

thanks,
Lonni

Hi,

here I am at last.

This code example does nothing useful. It’s a part of a larger project, I simplified the code as much as possible to present the bug.

When executed the code causes one system to reboot, and the other system stops responding. It’s a single purpose of the code :)

Sometimes (1 out 10 tries) the application runs without any instabilities.

If you try to execute the app 4 or 5 times in a row it will crash for sure (at least on my both computers).

If you won’t be able to replicate my problem, please try to remove both “break” statements from the code (in BugReport01_kernel.cu), and compare application’s execution time.

Without "break"s the application runs much faster (in that rare case my PC survived the execution of this code).

Here is it. I compressed the file because the forum’s software doesn’t allow me to attach a .log file.

Thanks !
bug_report_logfile.zip (33.3 KB)

I’ve run your test app 10 times in a row, and there was no instability at any time.

I’m not sure that I understand how removing the break statements from BugReport01_kernel.cu to make the application run faster relates to the instability you’ve reported here. Can you clarify?

Does this problem persist if X is NOT running while your app is running?
Have you verified that you’re using the latest motherboard BIOS?
Does this problem persist with a more recent kernel (2.6.18 is rather old)?
Does this problem persist if your kernrel is not using vesafb?