The device code result is wrong without a "printf"

I am a beginner in China, and I am coding the stencil_1d example from the NVIDIA "CUDA C/C++ Basics" PDF.
The input array size is 16 and every element is 1.

//#ifndef __CUDACC__  
//#define __CUDACC__
//#endif

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

#include<malloc.h>
// the following includes allow __syncthreads() to be used
#include <cuda.h>
#include <device_functions.h>
#include <cuda_runtime_api.h>
#include <device_launch_parameters.h>

#define BLOCKSIZE 16
#define DATASIZE (BLOCKSIZE * sizeof(int))
#define RADIUS 3

__global__ void stencil_1d(int *in, int *out)
{
	__shared__ int temp[BLOCKSIZE + 2 * RADIUS]; // temporary array: 22 elements
	int gindex = threadIdx.x + blockDim.x*blockIdx.x; // global thread index
	int lindex = threadIdx.x + RADIUS; // corresponding index into the temporary array

	// read input data into (the temporary array in) shared memory
	temp[lindex] = in[gindex];
	if (gindex < RADIUS)
	{
		temp[lindex - RADIUS] = in[gindex - RADIUS];
		temp[lindex + BLOCKSIZE] = in[gindex + BLOCKSIZE];
	}

	__syncthreads();
	//cudaThreadSynchronize(); // CUDA's newer synchronization function??

	// apply the stencil
	int sum = 0;
	for (int offset = -RADIUS; offset <= RADIUS; offset++)
	{
		sum += temp[lindex + offset];
	}
	
	out[gindex] = sum;
	//printf("Test%6d-%4d\n", gindex, out[gindex]); // ---------- Problem: with this line the result is right; without it, the result is wrong. Nothing else changes. ----------
}

int main()
{
	int *gpu_data_in, *gpu_data_out;
	int *cpu_data;

	cpu_data = (int *)malloc(DATASIZE);
	for (int i = 0; i < BLOCKSIZE; i++) // initialize the host array
	{
		cpu_data[i] = 1;
	}

	cudaMalloc((int **)&gpu_data_in, DATASIZE);
	cudaMalloc((int **)&gpu_data_out, DATASIZE);

	cudaMemcpy(gpu_data_in, cpu_data, DATASIZE, cudaMemcpyHostToDevice); // copy host data to the device

	stencil_1d << <1, BLOCKSIZE >> > (gpu_data_in, gpu_data_out);

	cudaMemcpy(cpu_data, gpu_data_out, DATASIZE, cudaMemcpyDeviceToHost); // copy device results back to the host

	for (int i = 0; i < BLOCKSIZE; i++)
	{
		printf("%d\t", cpu_data[i]);
		if ((i + 1) % 8 == 0)
			printf("\n");
	}

	cudaFree(gpu_data_in);
	cudaFree(gpu_data_out);

	return 0;
}

This line is the problem:

printf("Test%6d-%4d\n", gindex, out[gindex]); // when I add this line the result is right; without it, the result is wrong. The rest of the code is unchanged.

For example, when BLOCKSIZE is 16, the result is

1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1

and this is wrong (the run without the printf), and

Test  0-4
Test  1-5
Test  2-6
Test  3-7
Test  4-7
Test  5-7
Test  6-7
Test  7-7
Test  8-7
Test  9-7
Test  10-7
Test  11-7
Test  12-7
Test  13-6
Test  14-5
Test  15-4
4 5 6 7 7 7 7 7
7 7 7 7 7 6 5 4

(with the printf) the result is right.

But I can't understand why. How can I fix it without the printf?


Cache data in shared memory
Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
Compute blockDim.x output elements
Write blockDim.x output elements to global memory
Each block needs a halo of radius elements at each boundary

If the problem goes away when you add a call to printf(), it is likely a Heisenbug (https://en.wikipedia.org/wiki/Heisenbug). Possibly a race condition, or data corruption. What happens when you run the code under control of cuda-memcheck? Does it report any issues?

Sorry, I don't know what "cuda-memcheck" is.
Do you mean, is there any error while the code is running?
There isn't, just this one problem.

But there is a red squiggly line under "__syncthreads();".

And I'll have a look at your link now, thank you~
^。^

I am using Visual Studio 2015 on Windows 7, and my GPU is a GT 630M.

I would recommend making use of the copious documentation that ships with CUDA. For example:

http://docs.nvidia.com/cuda/cuda-memcheck

Thanks a lot~~~
^。^

Your code has errors.

If you are having trouble with a CUDA code, before asking for help on any public forum I always recommend proper CUDA error checking, and running your code with cuda-memcheck.

If you don’t know what proper CUDA error checking is, google “proper CUDA error checking” and take the first hit. Study it, and add it to your code.

Also run your code with cuda-memcheck.

When I do that, it gives me error output like this:

$ cuda-memcheck ./t17
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000090 in stencil_1d(int*, int*)
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x1050d5ffffc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x203185]
=========     Host Frame:./t17 [0x187b1]
=========     Host Frame:./t17 [0x36353]
=========     Host Frame:./t17 [0x2def]
=========     Host Frame:./t17 [0x2cdf]
=========     Host Frame:./t17 [0x2d04]
=========     Host Frame:./t17 [0x2ba0]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= Invalid __global__ read of size 4
=========     at 0x00000090 in stencil_1d(int*, int*)
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0x1050d5ffff8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x203185]
=========     Host Frame:./t17 [0x187b1]
=========     Host Frame:./t17 [0x36353]
=========     Host Frame:./t17 [0x2def]
=========     Host Frame:./t17 [0x2cdf]
=========     Host Frame:./t17 [0x2d04]
=========     Host Frame:./t17 [0x2ba0]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= Invalid __global__ read of size 4
=========     at 0x00000090 in stencil_1d(int*, int*)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x1050d5ffff4 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x203185]
=========     Host Frame:./t17 [0x187b1]
=========     Host Frame:./t17 [0x36353]
=========     Host Frame:./t17 [0x2def]
=========     Host Frame:./t17 [0x2cdf]
=========     Host Frame:./t17 [0x2d04]
=========     Host Frame:./t17 [0x2ba0]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
1       1       1       1       1       1       1       1
1       1       1       1       1       1       1       1
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2ed9a3]
=========     Host Frame:./t17 [0x3885f]
=========     Host Frame:./t17 [0x2bba]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaFree.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2ed9a3]
=========     Host Frame:./t17 [0x40a36]
=========     Host Frame:./t17 [0x2c17]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaFree.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2ed9a3]
=========     Host Frame:./t17 [0x40a36]
=========     Host Frame:./t17 [0x2c23]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= ERROR SUMMARY: 6 errors
$

There is no point in wondering why your code doesn’t behave correctly, until you’ve at least removed all coding errors, and cuda-memcheck reports “no errors”.

It's fairly evident to me that you have a coding error here, for example:

if (gindex < RADIUS)
        {
                temp[lindex - RADIUS] = in[gindex - RADIUS];
                                           ^^^^^^^^^^^^^^^
                temp[lindex + BLOCKSIZE] = in[gindex + BLOCKSIZE];
        }

As a simple example, one possible value for gindex is zero. If gindex is zero, then gindex will be less than RADIUS, and the body of the if-statement will be executed for that thread.

The first line of code in the body of the if-statement will generate a global memory index of gindex - RADIUS. Since gindex is zero, gindex - RADIUS will be a negative number. This will be an invalid (out-of-bounds) index, when used on the array in.

I note that your RADIUS is 3, therefore 3 threads will generate an invalid negative index (those threads whose gindex values are 0, 1, or 2), and cuda-memcheck happens to be reporting 3 instances of out-of-bounds indexing.

Thanks a lot for your help. In

if (gindex < RADIUS)
        {
                temp[lindex - RADIUS] = in[gindex - RADIUS];
                                           ^^^^^^^^^^^^^^^
                temp[lindex + BLOCKSIZE] = in[gindex + BLOCKSIZE];
        }

I replaced

temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCKSIZE] = in[gindex + BLOCKSIZE];

with

temp[lindex - RADIUS] = 0;
temp[lindex + BLOCKSIZE] = 0;

and now it runs correctly. I had suspected an out-of-bounds access on the array in, but because it worked with the printf() statement, I wrongly assumed it was correct. Now I know where the problem is, and I will study your advice after lunch.

Thanks a lot again~