The device code result is wrong without a "printf"

I am a beginner in China, and I am coding the stencil_1d example from the NVIDIA "CUDA C/C++ Basics" PDF.
The input array size is 16 and every element is 1.

//#ifndef __CUDACC__  
//#define __CUDACC__
//#endif

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

#include<malloc.h>
// the following includes allow __syncthreads() to be used
#include <cuda.h>
#include <device_functions.h>
#include <cuda_runtime_api.h>
#include <device_launch_parameters.h>

#define BLOCKSIZE 16
#define DATASIZE (BLOCKSIZE * sizeof(int))
#define RADIUS 3

__global__ void stencil_1d(int *in, int *out)
{
	__shared__ int temp[BLOCKSIZE + 2 * RADIUS]; // temporary array: 22 elements
	int gindex = threadIdx.x + blockDim.x*blockIdx.x; // global thread index
	int lindex = threadIdx.x + RADIUS; // corresponding index into the temporary array

	// read input data into (the temporary array in) shared memory
	temp[lindex] = in[gindex];
	if (gindex < RADIUS)
	{
		temp[lindex - RADIUS] = in[gindex - RADIUS];
		temp[lindex + BLOCKSIZE] = in[gindex + BLOCKSIZE];
	}

	__syncthreads();
	//cudaThreadSynchronize(); // CUDA's newer synchronization function??

	// apply the stencil
	int sum = 0;
	for (int offset = -RADIUS; offset <= RADIUS; offset++)
	{
		sum += temp[lindex + offset];
	}
	
	out[gindex] = sum;
	//printf("Test%6d-%4d\n", gindex, out[gindex]); // ---------- Problem: with this line the result is right; without it, the result is wrong. Nothing else changes. ----------
}

int main()
{
	int *gpu_data_in, *gpu_data_out;
	int *cpu_data;

	cpu_data = (int *)malloc(DATASIZE);
	for (int i = 0; i < BLOCKSIZE; i++) // initialize the host array
	{
		cpu_data[i] = 1;
	}

	cudaMalloc((int **)&gpu_data_in, DATASIZE);
	cudaMalloc((int **)&gpu_data_out, DATASIZE);

	cudaMemcpy(gpu_data_in, cpu_data, DATASIZE, cudaMemcpyHostToDevice); // copy host data to the device

	stencil_1d << <1, BLOCKSIZE >> > (gpu_data_in, gpu_data_out);

	cudaMemcpy(cpu_data, gpu_data_out, DATASIZE, cudaMemcpyDeviceToHost); // copy device results back to the host

	for (int i = 0; i < BLOCKSIZE; i++)
	{
		printf("%d\t", cpu_data[i]);
		if ((i + 1) % 8 == 0)
			printf("\n");
	}

	cudaFree(gpu_data_in);
	cudaFree(gpu_data_out);

	return 0;
}

This line is the problem:

printf("Test%6d-%4d\n", gindex, out[gindex]); // when I add this line the result is right; without it, the result is wrong. The rest of the code is unchanged.

For example, when BLOCKSIZE is 16, the result is

1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1

and this is wrong (the run without the printf), and

Test  0-4
Test  1-5
Test  2-6
Test  3-7
Test  4-7
Test  5-7
Test  6-7
Test  7-7
Test  8-7
Test  9-7
Test  10-7
Test  11-7
Test  12-7
Test  13-6
Test  14-5
Test  15-4
4 5 6 7 7 7 7 7
7 7 7 7 7 6 5 4

(with the printf) the result is right.

But I can't understand why. How can I fix it without the printf?


Cache data in shared memory
Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
Compute blockDim.x output elements
Write blockDim.x output elements to global memory
Each block needs a halo of radius elements at each boundary

If the problem goes away when you add a call to printf(), it is likely a Heisenbug (https://en.wikipedia.org/wiki/Heisenbug). Possibly a race condition, or data corruption. What happens when you run the code under control of cuda-memcheck? Does it report any issues?

Sorry, I don't know what "cuda-memcheck" is.
Do you mean, is there any error while the code is running?
There isn't, just this one problem.

But there is a red squiggly line under "__syncthreads();".

And I'll have a look at your link now, thank you~
^。^

I am using Visual Studio 2015 on Windows 7, and my GPU is a GT 630M.

I would recommend making use of the copious documentation that ships with CUDA. For example:

http://docs.nvidia.com/cuda/cuda-memcheck

Thanks a lot~~~
^。^

Your code has errors.

If you are having trouble with a CUDA code, before asking for help on any public forum I always recommend proper CUDA error checking, and running your code with cuda-memcheck.

If you don’t know what proper CUDA error checking is, google “proper CUDA error checking” and take the first hit. Study it, and add it to your code.

Also run your code with cuda-memcheck.

When I do that, it gives me error output like this:

$ cuda-memcheck ./t17
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000090 in stencil_1d(int*, int*)
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x1050d5ffffc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x203185]
=========     Host Frame:./t17 [0x187b1]
=========     Host Frame:./t17 [0x36353]
=========     Host Frame:./t17 [0x2def]
=========     Host Frame:./t17 [0x2cdf]
=========     Host Frame:./t17 [0x2d04]
=========     Host Frame:./t17 [0x2ba0]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= Invalid __global__ read of size 4
=========     at 0x00000090 in stencil_1d(int*, int*)
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0x1050d5ffff8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x203185]
=========     Host Frame:./t17 [0x187b1]
=========     Host Frame:./t17 [0x36353]
=========     Host Frame:./t17 [0x2def]
=========     Host Frame:./t17 [0x2cdf]
=========     Host Frame:./t17 [0x2d04]
=========     Host Frame:./t17 [0x2ba0]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= Invalid __global__ read of size 4
=========     at 0x00000090 in stencil_1d(int*, int*)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x1050d5ffff4 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x203185]
=========     Host Frame:./t17 [0x187b1]
=========     Host Frame:./t17 [0x36353]
=========     Host Frame:./t17 [0x2def]
=========     Host Frame:./t17 [0x2cdf]
=========     Host Frame:./t17 [0x2d04]
=========     Host Frame:./t17 [0x2ba0]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
1       1       1       1       1       1       1       1
1       1       1       1       1       1       1       1
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2ed9a3]
=========     Host Frame:./t17 [0x3885f]
=========     Host Frame:./t17 [0x2bba]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaFree.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2ed9a3]
=========     Host Frame:./t17 [0x40a36]
=========     Host Frame:./t17 [0x2c17]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaFree.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2ed9a3]
=========     Host Frame:./t17 [0x40a36]
=========     Host Frame:./t17 [0x2c23]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
=========     Host Frame:./t17 [0x29b9]
=========
========= ERROR SUMMARY: 6 errors
$

There is no point in wondering why your code doesn’t behave correctly, until you’ve at least removed all coding errors, and cuda-memcheck reports “no errors”.

It's fairly evident to me that you have a coding error here, for example:

if (gindex < RADIUS)
        {
                temp[lindex - RADIUS] = in[gindex - RADIUS];
                                           ^^^^^^^^^^^^^^^
                temp[lindex + BLOCKSIZE] = in[gindex + BLOCKSIZE];
        }

As a simple example, one possible value for gindex is zero. If gindex is zero, then gindex will be less than RADIUS, and the body of the if-statement will be executed for that thread.

The first line of code in the body of the if-statement will generate a global memory index of gindex - RADIUS. Since gindex is zero, gindex - RADIUS will be a negative number. This will be an invalid (out-of-bounds) index, when used on the array in.

I note that your RADIUS is 3, therefore 3 threads will generate an invalid negative index (those threads whose gindex values are 0, 1, or 2), and cuda-memcheck happens to be reporting 3 instances of out-of-bounds indexing.

Thanks a lot for your help. In

if (gindex < RADIUS)
        {
                temp[lindex - RADIUS] = in[gindex - RADIUS];
                                           ^^^^^^^^^^^^^^^
                temp[lindex + BLOCKSIZE] = in[gindex + BLOCKSIZE];
        }

I replaced

temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCKSIZE] = in[gindex + BLOCKSIZE];

with

temp[lindex - RADIUS] = 0;
temp[lindex + BLOCKSIZE] = 0;

and now it runs correctly. I had suspected an out-of-bounds access on the array in, but because it worked with the printf() statement, I wrongly assumed it was correct. Now I know where the problem is, and I will study your advice after lunch.

Thanks a lot again~