Very simple CUDA program bad output

AlexMagsam · July 3, 2017, 4:39pm

Hello,
I have been working on a CUDA project although I’ve been experiencing errors along the way so I decided to reduce the program down to its simplest form to test the functionality. There is still an incorrect output for this very simple program, which has created the idea that maybe the issue lies in the TX2 board itself or problems with CUDA running on the board.

Sample programs do work on the board such as deviceQuery,vectorAdd,bandwidthTest and so on. Maybe there is actually an error within this program that I am not seeing, but there are no problems compiling and no issues with cuda-memcheck. Below is a sample code ready to be compiled. Let me know if you receive similar or different output.

//#include <cufft.h>
//#include <cuda_runtime.h>
//#include <cuda.h>
//#include <cuda_device_runtime_api.h>

#include <iostream>
#include <stdlib.h>
#include <stdio.h>

#define height (2048ULL)
#define width (2448ULL)
#define size (height*width)

__global__ void datatransfer(float *f)
{

int x = (blockIdx.x * blockDim.x) + threadIdx.x;
int y = (blockIdx.y * blockDim.y) + threadIdx.y;

if (x>0 && x<width && y>0 && y< height) {
f[width*y+x] =  1.0f;

}

}

int main() 
{

float *array = new float;
float *dev_array;

for (int i=0;i<size;i++)
	{
	array[i] = i;
	}

std::cout<<" "<<std::endl;
std::cout<<"Cuda Array test: " <<std::endl;
for (int i=0;i<5;i++) {
	std::cout<<array[10*i] <<"  "<< array[20*i] <<"  "<< array[30*i] <<"  "<< array[40*i] <<"  "<< array[50*i] <<std::endl;
	} 

if (cudaMalloc((void **)&dev_array,size* sizeof(float)) != cudaSuccess)
	{
	fprintf(stderr,"Cuda Error: failed to allocate dev_array");
	}

cudaMemcpy(dev_array,array,size* sizeof(float),cudaMemcpyHostToDevice);
if (cudaGetLastError() != cudaSuccess) 
	{
	fprintf(stderr, "Cuda Error: Failed to copy to dev_array\n");	
	}

dim3 threadsPerBlock(32,32);
dim3 numBlocks((width+threadsPerBlock.x-1)/threadsPerBlock.x,(height+threadsPerBlock.y-1)/threadsPerBlock.y);

datatransfer<<<numBlocks,threadsPerBlock>>>(dev_array);

if (cudaGetLastError() != cudaSuccess) 
	{
	fprintf(stderr, "Cuda Error: Failed to execute kernel\n");	
	}

cudaMemcpy(array,dev_array,size*sizeof(float),cudaMemcpyDeviceToHost);
if (cudaGetLastError() != cudaSuccess) 
	{
	fprintf(stderr, "Cuda Error: Failed to copy to f2_array\n");	
	}

std::cout<<" "<<std::endl;
std::cout<<"Cuda New Array test: " <<std::endl;
for (int i=0;i<5;i++) {
	std::cout<<array[10*i] <<"  "<< array[20*i] <<"  "<< array[30*i] <<"  "<< array[40*i] <<"  "<< array[50*i] <<std::endl;
	} 

if (cudaFree(dev_array) != cudaSuccess) 
	{
	fprintf(stderr,"Failed to free dev_array\n");
	}

return 0;

}

nvcc f_to_f2.cu -I/usr/local/cuda/include -I/usr/include -L/usr/local/cuda/lib64 -lcudart -o f_to_f2

cuda-memcheck ./f_to_f2
========= CUDA-MEMCHECK

Cuda Array test:
0 0 0 0 0
10 20 30 40 50
20 40 60 80 100
30 60 90 120 150
40 80 120 160 200

Cuda Float2 Array test:
0 0 0 0 0
10 20 30 40 50
20 40 60 80 100
30 60 90 120 150
40 80 120 160 200
========= ERROR SUMMARY: 0 errors

Robert_Crovella · July 3, 2017, 4:49pm

In what way is the output incorrect? What do you believe is correct output?

AlexMagsam · July 3, 2017, 5:17pm

Shouldn’t the output be all ones? I apologize if I am missing something that is right in front of my face, but from the examples I’ve been studying this seems like the values stored in dev_array (1’s) should be copied to the variable array.

Robert_Crovella · July 3, 2017, 5:29pm

In your kernel code, change this:

if (x>0 && x<width && y>0 && y< height) {

to this:

if (x>=0 && x<width && y>=0 && y< height) {

You’ll note that I made that change from your code when I posted my example here:

https://devtalk.nvidia.com/default/topic/1016289/cuda-programming-and-performance/troubles-converting-float-array-to-float2-array/post/5177132/#5177132

In C and C++, usual indexing starts from 0. So your image array has a 0’th row (the first row is index 0) and your original thread check prevented any modifications to that row. That row has indices all the way up to 2448, and your test output printout only displayed outputs whose index was as high as 5*50 i.e. 250. So all of your test outputs are in the first row, and your kernel explicitly does not touch the first row (or the first column, for that matter).

Topic		Replies	Views
Troubles converting float array to float2 array CUDA Programming and Performance	11	3535	July 3, 2017
Need Help. CUDA kernel fails randomly CUDA Programming and Performance cuda , kernel	3	506	July 27, 2022
First CUDA program -- Integrating CUDA with existing code base -- not working. CUDA Programming and Performance	5	1212	June 13, 2017
My first program with CUDA need some help CUDA Programming and Performance	3	2563	August 10, 2009
Simple question on passing to the kernel CUDA Programming and Performance	15	3392	January 15, 2012
Help with cuda 2d array CUDA Programming and Performance	6	7452	September 29, 2014
Simple long operation test CUDA Programming and Performance	8	1361	April 19, 2013
Help with strange error CUDA Programming and Performance	8	2095	February 25, 2010
CUDA kernels giving bad results CUDA Programming and Performance	7	12507	February 17, 2011
simple program does not work - why? CUDA Programming and Performance	7	3398	June 13, 2008

Very simple CUDA program bad output

Related topics