Hi,
I am new here.
I discovered this quite unpleasant nvcc compiler bug not long ago.
A kernel with two “break” operations in different loops works incorrectly.
Sometimes my system hangs (only the reset button can help), sometimes the kernel terminates as expected but the computation results are incorrect.
I am using:
SDK: NVIDIA_CUDA_SDK_1.0
TOOLKIT: NVIDIA_CUDA_Toolkit_1.0_suse_10.2_x86_64
DRIVER: NVIDIA-Linux-x86_64-100.14.11-pkg2
OS: openSUSE 10.2 (X86-64)
Video card: GeForce 8800GTS with 640 Mb RAM
Hardware: 2048 MB RAM, Intel Q6600 2.4 GHz
On the other system the execution of a kernel with two breaks in different loops leads to reboot !
Recompiling the source code under CUDA 1.1 do not solve the problem.
Second system:
SDK: NVIDIA_CUDA_SDK_1.1
TOOLKIT: NVIDIA_CUDA_Toolkit_1.1_Suse10.2_x86_64
DRIVER: NVIDIA Driver for Linux with CUDA Support (169.04)
OS: openSUSE 10.3 (X86-64)
Video card: GeForce 8800GTX with 768 Mb RAM
Hardware: 2048 MB RAM, Athlon 64 X2 4600+
Here is a code sample with the bug:
Code description:
I created a large array of ‘unsigned ints’, initialized them with 0’s and copied the array into the device’s memory.
Each of 48 threads I created gets it’s own part of the array to process.
If any thread finds an array member with value != 0 it writes an ‘1’ to an array in the shared memory and exits the first loop with ‘break’ statement.
Second loop tests if any member of an array in the shared memory is != 0.
If such value is found - the loop will be terminated with the second ‘break’ statement.
If I remove one or two breaks in this code the kernel works perfectly !
I zipped the source code and the makefile into BugReport01.zip
Sorry for my English. I do my best. ;)
#ifndef _TEMPLATE_KERNEL_H_
#define _TEMPLATE_KERNEL_H_
#include <stdio.h>
__global__ void
MyKernel(unsigned int* g_input, unsigned int operand_length)
{
// Shared Memory (quite small - num_threads * sizeof( int) )
extern __shared__ int shared_array[];
// Thread ID
const unsigned int tid = threadIdx.x;
// Number of threads
const unsigned int num_threads = blockDim.x;
// each thread gets its own portion of data
unsigned int pointer_position;
// if all members of array 'g_input' == 0 then is_not_null will be 0
// else is_not_null will be 1
unsigned int is_not_null;
// each thread writes results to dedicated position of shared_array (shared_array[tid])
shared_array[tid] = 0;
for (unsigned int i = 0; i < operand_length; ++i)
{
// each thread gets its own portion of data
pointer_position = tid * operand_length + i;
// if at least one member of array is !=0 then
if (g_input[pointer_position] != 0)
{
// write an 1 into shared_array[tid] ...
shared_array[tid] = 1;
// ... and exit the loop
// This is one of two 'break' in my code that cause malfunctions
// if you remove it the kernel will work perfectly
break;
}
}
__syncthreads();
// The first thread (with tid = 0) collects all results from 'shared_array'
// Other threads have nothing to do
if (tid == 0)
{
is_not_null = 0;
for (unsigned int i = 0; i < num_threads; ++i)
{
// If at least one thread detected an number != 0 in array 'g_input'
// (at least one member of shared_array is 1)
if (shared_array[i] != 0)
{
// at least one member of array is not null !
s_not_null = 1;
// exit the loop
// This is one of two 'break' in my code that cause malfunctions
// if you remove it the kernel will work perfectly
break;
}
}
}
__syncthreads();
}
#endif // #ifndef _TEMPLATE_KERNEL_H_
BugReoprt01.zip (2.79 KB)