Newbie problem with copying between host and device My image is just gr{a|e}y

Hi guys,

I’m programming a sort of real time raycaster working with voxels in an octree, but if I copy the calculated data from the device to the host it is just complete gray (the image is represented as an array of unsigned chars in the format RGBA). First I thought it was a problem with communicating between cuda c, c++ (which is compiled to a dll) and c# (which uses this dll) but i wrote a little program which uses these functions directly and it doesn’t work too (the raycasting algorithm isn’t the problem too, I’ve changed it to code which just colors the pixel computed by the kernel red and it is still gray). So I think the problem must be the data exchange from the device to the host.

Here is my current kernel invocation code:

unsigned char* h_image = (unsigned char*)malloc(sizeof(unsigned char) * width * height * 4);

unsigned char* d_image;

cudaMalloc((void**)&d_image, sizeof(unsigned char) * width * height * 4)

cudaDeviceProp deviceProp; 

cudaGetDeviceProperties(&deviceProp,  0);

		

dim3 blockDim(sqrt((float)deviceProp.maxThreadsPerBlock),sqrt((float)deviceProp.maxThreadsPerBlock));

dim3 gridDim(imageWidth/blockDim.x, imageHeight/blockDim.y);

sendRayKernel<<<gridDim,blockDim>>>(d_image, d_root, d_cam, scale);

cudaMemcpy(d_image,h_image, sizeof(unsigned char) * imageWidth * imageHeight * 4, cudaMemcpyDeviceToHost);

return h_image;

and the red painting kernel:

__global__ void sendRayKernel(unsigned char* image, node* root, cam* camera, int scale)

{

	vector3 pixel;

	pixel.x = (blockIdx.x * blockDim.x) + threadIdx.x;

	pixel.y = (blockIdx.y * blockDim.y) + threadIdx.y;

	pixel.z = 0;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4] = 255;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 1] = 0;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 2] = 0;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 3] = 255;

}

All elements of the array should be 255 or 0 but all are 205 after running this code (and some more for initialising some vars and writing the array to a file, printing some of the chars or drawing a picture of them).

Please help me.

Check return values of all CUDA calls for errors. Probably your kernel is not launching at all.

I find this line highly suspicious:

dim3 blockDim(sqrt((float)deviceProp.maxThreadsPerBlock), sqrt((float)deviceProp.maxThreadsPerBlock));

Assuming you have a compute capability 1.x device, sqrt(deviceProp.maxThreadsPerBlock)≈22.63. Now since the C99 standard does not guarantee a direction of rounding, you might find that your compiler rounds this to 23. The kernel launch would then fail because of an invalid launch configuration.

Even if the compiler happens to round towards zero you would want blockDim.x to be a multiple of 16 (or 32 for compute capability 2.x devices) to achieve coalesced memory accesses. So the line should probably be replaced with either

dim3 blockDim(32, deviceProp.maxThreadsPerBlock/32);

or even simpler

dim3 blockDim(32, 16);

since this kernel would not profit from a larger blocksize anyway.

I’ve checked all CUDA calls and

cudaMemcpy(d_image,h_image, sizeof(unsigned char) * imageWidth * imageHeight * 4, cudaMemcpyDeviceToHost)

returns “cudaErrorInvalidValue” but I don’t understand what value is invalid.

Could you please help me?

btw: I’ve changed the highly suspicious line too…

The cudaMemcpy() call has [font=“Courier New”]d_image[/font] and [font=“Courier New”]h_image[/font] exchanged.

OK I’ve exchenged them, this is very confusing, because serveral sources (for example this one: http://www.clear.rice.edu/comp422/resources/cuda/html/group__CUDART__MEMORY_g48efa06b81cc031b2aa6fdc2e9930741.html) mention that the first argument is the destination and the second the source, but others say it is swapped. Now the cudaMemcpy returns cudaSuccess but the picture is still full off 205.

I don’t see any more problems with the code at the moment. Can you post some self-contained example?

The kernel:

#include <stdio.h>

#include <cuda.h>

typedef struct vector3

{

	int x;

	int y;

	int z;

	__device__ vector3 operator+(const vector3& a) const

	{

		vector3 vReturn = {a.x + x, a.y + y, a.z + z};

		return vReturn;

	}

	__device__ vector3 operator-(const vector3& a) const

	{

		vector3 vReturn = {a.x - x, a.y - y, a.z - z};

		return vReturn;

	}

	__device__ bool operator<(const vector3& a) const

	{

		return(x*x + y*y + z*z) < (a.x * a.x + a.y * a.y + a.z * a.z);

	}

} vector3;

typedef struct color

{

	unsigned char r;

	unsigned char g;

	unsigned char b;

	unsigned char a;

} color;

typedef struct cam

{

	vector3 eye;

	vector3 screen;

	int screenWidth;

	int screenHeight;

} cam;

__global__ void sendRayKernel(unsigned char* image, cam* camera)

{

	vector3 pixel;

	pixel.x = (blockIdx.x * blockDim.x) + threadIdx.x;

	pixel.y = (blockIdx.y * blockDim.y) + threadIdx.y;

	pixel.z = 0;

	

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4] = 255;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 1] = 0;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 2] = 0;

	image[(camera->screenWidth * 4) * pixel.y + pixel.x * 4 + 3] = 255;

	//should be all red now...

}

And the host code:

#include <kernel.cu>

#include <stdio.h>

#include <string.h>

#include <stdlib.h>

#include <math.h>

unsigned char* initImage(int width, int height)

{

	return (unsigned char*)malloc(sizeof(unsigned char) * width * height * 4);

}

unsigned char* initDeviceImage(int width, int height)

{

	unsigned char* d_image;

	if(cudaMalloc((void**)&d_image, sizeof(unsigned char) * width * height * 4) != cudaSuccess)

	{

		FILE *f = fopen("C:\debug.txt","a");

		if(f)

		{

			fprintf(f, "Failed to allocate device memory for d_image.");				

		}

	}

	return d_image;

}

cam* initCam(int eyeX, int eyeY, int eyeZ, int screenX, int screenY, int screenZ, int screenWidth, int screenHeight)

{

	cam* h_camera = (cam*)malloc(sizeof(cam));

	h_camera->eye.x = eyeX;

	h_camera->eye.x = eyeY;

	h_camera->eye.z = eyeZ;

	h_camera->screen.x = screenX;

	h_camera->screen.y = screenY;

	h_camera->screen.z = screenZ;

	h_camera->screenWidth = screenWidth;

	h_camera->screenHeight = screenHeight;

	cam* d_camera;

	cudaMalloc((void**)&d_camera, sizeof(cam));

	cudaMemcpy(d_camera, h_camera, sizeof(cam), cudaMemcpyHostToDevice);

	free(h_camera);

	return d_camera;

}

unsigned char* renderImage(int imageWidth, int imageHeight, cam* d_cam, unsigned char* h_image, unsigned char* d_image)

{	

	dim3 blockDim(32,16);

	dim3 gridDim(imageWidth/blockDim.x, imageHeight/blockDim.y);

	sendRayKernel<<<gridDim,blockDim>>>(d_image, d_cam);

	

	if(cudaMemcpy(h_image,d_image, sizeof(unsigned char) * imageWidth * imageHeight * 4, cudaMemcpyDeviceToHost) != cudaSuccess)

	{

		FILE *f = fopen("C:\debug.txt","a");

		if(f)

		{

			fprintf(f, "Error while copying image to host.");				

		}

	}

	return h_image;

}

//quick and dirty test program

int main( int argc, char** argv) 

{

	unsigned char* h_image = initImage(512,512);

	unsigned char* d_image = initDeviceImage(512,512);

	cam* camera = initCam(-250, -250, -500, 0, 0, -250, 500, 500);

	

	h_image = renderImage(512,512,camera, h_image, d_image);

	FILE* f = fopen("C:\debug_data.txt","a");

	if(f)

	{

		for(int i = 0; i < 512; i++)

		{

			for(int j = 0; j < 512; j++)

			{

				fprintf(f, "(%i|%i|%i|%i)", h_image[i * 4 * 512 + j * 4], h_image[i * 4 * 512 + j * 4 + 1], h_image[i * 4 * 512 + j * 4 + 2], h_image[i * 4 * 512 + j * 4 + 3]);

			}

			fprintf(f, "\n");

		}

	}

	free(h_image);

	cudaFree(d_image);

	cudaFree(cam);

	return 0;

}

The output data produced in C:\debug_data.txt are 4,25MB of “(205|205|205|205)” where it should be “(255|0|0|255)”.

Sorry for the double post

I’ve just played around with the code I’ve posted here and i saw that visual studio throws an exception to me: It says Microsoft C+±Exception: cudaError_enum … (some memory position I can’t remember it)
Could anyone explain this to me? (Is there a typo somewhere? And what enum does cuda mean?)

Sorry, I have no experience with Visual Studio.

I notice that you have

cudaFree(cam);

which can’t work as [font=“Courier New”]cam[/font] isn’t a variable, and that you call initCam with an image size of 500x500 even though the image is 512x512. Other than that, I can’t see a problem.