Newbie problem with copying between host and device My image is just gr{a|e}y

darthcookie · May 28, 2011, 1:56pm

Hi guys,

I’m programming a sort of real time raycaster working with voxels in an octree, but if I copy the calculated data from the device to the host it is just complete gray (the image is represented as an array of unsigned chars in the format RGBA). First I thought it was a problem with communicating between cuda c, c++ (which is compiled to a dll) and c# (which uses this dll) but i wrote a little program which uses these functions directly and it doesn’t work too (the raycasting algorithm isn’t the problem too, I’ve changed it to code which just colors the pixel computed by the kernel red and it is still gray). So I think the problem must be the data exchange from the device to the host.

Here is my current kernel invocation code:

unsigned char* h_image = (unsigned char*)malloc(sizeof(unsigned char) * width * height * 4);

unsigned char* d_image;

cudaMalloc((void**)&d_image, sizeof(unsigned char) * width * height * 4)

cudaDeviceProp deviceProp; 

cudaGetDeviceProperties(&deviceProp,  0);

		

dim3 blockDim(sqrt((float)deviceProp.maxThreadsPerBlock),sqrt((float)deviceProp.maxThreadsPerBlock));

dim3 gridDim(imageWidth/blockDim.x, imageHeight/blockDim.y);

sendRayKernel<<<gridDim,blockDim>>>(d_image, d_root, d_cam, scale);

cudaMemcpy(d_image,h_image, sizeof(unsigned char) * imageWidth * imageHeight * 4, cudaMemcpyDeviceToHost);

return h_image;

and the red painting kernel:

__global__ void sendRayKernel(unsigned char* image, node* root, cam* camera, int scale)

{

	vector3 pixel;

	pixel.x = (blockIdx.x * blockDim.x) + threadIdx.x;

	pixel.y = (blockIdx.y * blockDim.y) + threadIdx.y;

	pixel.z = 0;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4] = 255;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 1] = 0;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 2] = 0;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 3] = 255;

}

All elements of the array should be 255 or 0 but all are 205 after running this code (and some more for initialising some vars and writing the array to a file, printing some of the chars or drawing a picture of them).

Please help me.

tera · May 28, 2011, 2:43pm

Check return values of all CUDA calls for errors. Probably your kernel is not launching at all.

I find this line highly suspicious:

dim3 blockDim(sqrt((float)deviceProp.maxThreadsPerBlock), sqrt((float)deviceProp.maxThreadsPerBlock));

Assuming you have a compute capability 1.x device, sqrt(deviceProp.maxThreadsPerBlock)â‰ˆ22.63. Now since the C99 standard does not guarantee a direction of rounding, you might find that your compiler rounds this to 23. The kernel launch would then fail because of an invalid launch configuration.

Even if the compiler happens to round towards zero you would want blockDim.x to be a multiple of 16 (or 32 for compute capability 2.x devices) to achieve coalesced memory accesses. So the line should probably be replaced with either

dim3 blockDim(32, deviceProp.maxThreadsPerBlock/32);

or even simpler

dim3 blockDim(32, 16);

since this kernel would not profit from a larger blocksize anyway.

darthcookie · May 28, 2011, 3:20pm

I’ve checked all CUDA calls and

cudaMemcpy(d_image,h_image, sizeof(unsigned char) * imageWidth * imageHeight * 4, cudaMemcpyDeviceToHost)

returns “cudaErrorInvalidValue” but I don’t understand what value is invalid.

Could you please help me?

btw: I’ve changed the highly suspicious line too…

tera · May 28, 2011, 3:27pm

The cudaMemcpy() call has [font=“Courier New”]d_image[/font] and [font=“Courier New”]h_image[/font] exchanged.

darthcookie · May 28, 2011, 4:04pm

OK I’ve exchenged them, this is very confusing, because serveral sources (for example this one: http://www.clear.rice.edu/comp422/resources/cuda/html/group__CUDART__MEMORY_g48efa06b81cc031b2aa6fdc2e9930741.html) mention that the first argument is the destination and the second the source, but others say it is swapped. Now the cudaMemcpy returns cudaSuccess but the picture is still full off 205.

tera · May 28, 2011, 6:19pm

I don’t see any more problems with the code at the moment. Can you post some self-contained example?

darthcookie · May 29, 2011, 10:31am

The kernel:

#include <stdio.h>

#include <cuda.h>

typedef struct vector3

{

	int x;

	int y;

	int z;

	__device__ vector3 operator+(const vector3& a) const

	{

		vector3 vReturn = {a.x + x, a.y + y, a.z + z};

		return vReturn;

	}

	__device__ vector3 operator-(const vector3& a) const

	{

		vector3 vReturn = {a.x - x, a.y - y, a.z - z};

		return vReturn;

	}

	__device__ bool operator<(const vector3& a) const

	{

		return(x*x + y*y + z*z) < (a.x * a.x + a.y * a.y + a.z * a.z);

	}

} vector3;

typedef struct color

{

	unsigned char r;

	unsigned char g;

	unsigned char b;

	unsigned char a;

} color;

typedef struct cam

{

	vector3 eye;

	vector3 screen;

	int screenWidth;

	int screenHeight;

} cam;

__global__ void sendRayKernel(unsigned char* image, cam* camera)

{

	vector3 pixel;

	pixel.x = (blockIdx.x * blockDim.x) + threadIdx.x;

	pixel.y = (blockIdx.y * blockDim.y) + threadIdx.y;

	pixel.z = 0;

	

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4] = 255;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 1] = 0;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 2] = 0;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 3] = 255;

	//should be all red now...

}

And the host code:

#include <kernel.cu>

#include <stdio.h>

#include <string.h>

#include <stdlib.h>

#include <math.h>

unsigned char* initImage(int width, int height)

{

	return (unsigned char*)malloc(sizeof(unsigned char) * width * height * 4);

}

unsigned char* initDeviceImage(int width, int height)

{

	unsigned char* d_image;

	if(cudaMalloc((void**)&d_image, sizeof(unsigned char) * width * height * 4) != cudaSuccess)

	{

		FILE *f = fopen("C:\debug.txt","a");

		if(f)

		{

			fprintf(f, "Failed to allocate device memory for d_image.");				

		}

	}

	return d_image;

}

cam* initCam(int eyeX, int eyeY, int eyeZ, int screenX, int screenY, int screenZ, int screenWidth, int screenHeight)

{

	cam* h_camera = (cam*)malloc(sizeof(cam));

	h_camera->eye.x = eyeX;

	h_camera->eye.x = eyeY;

	h_camera->eye.z = eyeZ;

	h_camera->screen.x = screenX;

	h_camera->screen.y = screenY;

	h_camera->screen.z = screenZ;

	h_camera->screenWidth = screenWidth;

	h_camera->screenHeight = screenHeight;

	cam* d_camera;

	cudaMalloc((void**)&d_camera, sizeof(cam));

	cudaMemcpy(d_camera, h_camera, sizeof(cam), cudaMemcpyHostToDevice);

	free(h_camera);

	return d_camera;

}

unsigned char* renderImage(int imageWidth, int imageHeight, cam* d_cam, unsigned char* h_image, unsigned char* d_image)

{	

	dim3 blockDim(32,16);

	dim3 gridDim(imageWidth/blockDim.x, imageHeight/blockDim.y);

	sendRayKernel<<<gridDim,blockDim>>>(d_image, d_cam);

	

	if(cudaMemcpy(h_image,d_image, sizeof(unsigned char) * imageWidth * imageHeight * 4, cudaMemcpyDeviceToHost) != cudaSuccess)

	{

		FILE *f = fopen("C:\debug.txt","a");

		if(f)

		{

			fprintf(f, "Error while copying image to host.");				

		}

	}

	return h_image;

}

//quick and dirty test program

int main( int argc, char** argv) 

{

	unsigned char* h_image = initImage(512,512);

	unsigned char* d_image = initDeviceImage(512,512);

	cam* camera = initCam(-250, -250, -500, 0, 0, -250, 500, 500);

	

	h_image = renderImage(512,512,camera, h_image, d_image);

	FILE* f = fopen("C:\debug_data.txt","a");

	if(f)

	{

		for(int i = 0; i < 512; i++)

		{

			for(int j = 0; j < 512; j++)

			{

				fprintf(f, "(%i|%i|%i|%i)", h_image[i * 4 * 512 + j * 4], h_image[i * 4 * 512 + j * 4 + 1], h_image[i * 4 * 512 + j * 4 + 2], h_image[i * 4 * 512 + j * 4 + 3]);

			}

			fprintf(f, "\n");

		}

	}

	free(h_image);

	cudaFree(d_image);

	cudaFree(cam);

	return 0;

}

The output data produced in C:\debug_data.txt are 4,25MB of “(205|205|205|205)” where it should be “(255|0|0|255)”.

darthcookie · May 30, 2011, 8:14pm

Sorry for the double post

I’ve just played around with the code I’ve posted here and i saw that visual studio throws an exception to me: It says Microsoft C+±Exception: cudaError_enum … (some memory position I can’t remember it)
Could anyone explain this to me? (Is there a typo somewhere? And what enum does cuda mean?)

tera · May 30, 2011, 9:13pm

Sorry, I have no experience with Visual Studio.

I notice that you have

cudaFree(cam);

which can’t work as [font=“Courier New”]cam[/font] isn’t a variable, and that you call initCam with an image size of 500x500 even though the image is 512x512. Other than that, I can’t see a problem.

Topic		Replies	Views
program crash when copying from device to host <br /> CUDA Programming and Performance	11	1877	March 31, 2009
unsigned char pointer in a kernel CUDA Programming and Performance	6	4823	March 18, 2009
cudaMemcpy() returns success and copy incorrect data CUDA Programming and Performance	3	2134	March 4, 2017
Cuda kernel have different results between host machine and docker container CUDA Programming and Performance cuda , kernel	3	260	December 26, 2023
Second call to cudaMemcpy3D crashes my app First call is successful CUDA Programming and Performance	14	8807	November 30, 2010
cudaMemcpy problem problem with using cudaMemcpy CUDA Programming and Performance	12	5895	January 5, 2010
OpenCV Image loading in CUDA texture CUDA Programming and Performance	11	2302	October 12, 2021
Concurrent Kernel executions & Data Transfers CUDA Programming and Performance cuda	3	597	March 8, 2023
cudaMemcpy error, all data not being transferred. but cudaMemcpy returns cudaSuccess CUDA Programming and Performance	5	15213	August 7, 2019
CUDA crash and black screen after addAtomic CUDA Programming and Performance	25	6398	July 23, 2009

Newbie problem with copying between host and device My image is just gr{a|e}y

Related topics