Getting an Error Using cudaMalloc3D

Hello,

So I’m brand new to coding, and I’ve sort of been thrown into the deep end here in that I’ve been tasked to write a simulation using CUDA. I’m currently trying to allocate a 3-D array in GPU memory, and I’ve copied the source code given in the programming guide more or less as is. However, when I try to compile it, the compiler reports an error on line 11 (the “char* devPtr = devPitchedPtr.ptr;” line) as follows.

“error : a value of type “void *” cannot be used to initialize an entity of type “char *””

I have no idea what to do about this. Does anyone have any ideas?

Code in its entirety is below. Thanks for your time.


#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);

__global__ void ThreeDimensional(cudaPitchedPtr devPitchedPtr, 
								 int width, int height, int depth)
{
	char* devPtr = devPitchedPtr.ptr;
	size_t pitch = devPitchedPtr.pitch;
	size_t slicePitch = pitch * height;
	for (int z = 0; z < depth; ++z) {
		char* slice = devPtr + z * slicePitch;
		for (int y = 0; y < height; ++y) {
			float* row = (float*)(slice + y*pitch);
			for (int x=0; x < width; ++x) {
				float element = row[x];
			}
		}
	}
}

int main()
{
	int width = 3, height = 3, depth = 3;
	cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);

	cudaPitchedPtr devPitchedPtr;
	cudaMalloc3D(&devPitchedPtr, extent);
	
	ThreeDimensional<<<1, 1>>>(devPitchedPtr, width, height, depth);

	return(0);
}

Modify that line as follows:

char* devPtr = (char *)devPitchedPtr.ptr;

For the usage here, this type cast is safe, and it is the typical/expected pattern for pitched address computations:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g32bd7a39135594788a542ae72217775c

Hi Txbob,

Thank you for the prompt response! The change works, but what exactly does it change about the code functionally?

Thanks again.

Best,

-BB

It doesn’t do anything for code generation insofar as that particular line of code is concerned (it is just an assignment statement). In C, a pointer is simply a number. Casting a pointer from one type to another does not change its numerical value; it simply tells the compiler that this number will now be used to reference a different type.

This concept is not unique to CUDA, so you can probably gain additional background by studying pointers in C.

The particular error reported for the code is due to the fact that CUDA is a dialect of C++, which requires an explicit cast when assigning from a ‘void *’. Any introductory text on C++ should cover this, possibly while discussing the use of malloc().

Okay, thanks for the pointers (ha), I’ll take a look at some tutorials on the subject. If you guys have time, I’m also having a bit of difficulty parallelizing the following code. It works fine if I set the dim3 argument in the kernel call to dim3(16, 1, 1), i.e. I’m just calling 16 threads all indexed by x, but if I try using dim3(4, 4, 1) or something (so a 4x4 block of threads), I can only fill in the first 4 elements of the d_out array. If I do dim3(8, 2, 1) it fills in the first 8 elements. So I’m not able to use the threadIdx.y component. I tried to simply translate the method of indexing over the memory space they give in the programming guide, but I can’t figure out what I’m missing.

Edit: I should point out that this was just an exercise in using the cudaMallocPitch function for me. Vestigial stuff aside, I’m just allocating space on the GPU, trying to write to all of it, copying back to the host, and then trying to display it.

__global__ void function(float* d_in, float* d_out, int width, int height, size_t pitch)
{  
	int x = threadIdx.x;
	int y = threadIdx.y;

	float* inData = (float*)((char*)d_in + y * pitch);
	float in = inData[x];

	float* outData = (float*)((char*)d_out + y * pitch);
	outData[x] = 3;  //simply trying to write 3 to each value of the "outgoing" array
}

int main()
{
	const int width = 4, height = 4;

	//generate input and output arrays

	float h_in[width][height];
	for (int j = 0; j < height; j++)
	{
		for (int i = 0; i < width; i++)
		{
			h_in[i][j] = i+j;
		}
	}

	float h_out[width * height];

	//declare GPU pointers

	float *d_in;
	float *d_out;

	//declare pitch
	
	size_t pitch;

	//allocate device memory for source and destination

	cudaMallocPitch(&d_in, &pitch, width * sizeof(float), height);

	cudaMallocPitch(&d_out, &pitch, width * sizeof(float), height);

	//copy memory from host to device

	cudaMemcpy(d_in, h_in, width*height * sizeof(float), cudaMemcpyHostToDevice);

	//launch kernel

	function<<<1, dim3(4, 4, 1)>>>(d_in, d_out, width, height, pitch);

	//copy results back to host

	cudaMemcpy(h_out, d_out, width * height * sizeof(float), cudaMemcpyDeviceToHost);

	//print incoming array and outgoing array.

	for (int i = 0; i < height; i++) {
		printf("\n");
		for (int j = 0; j < width; j++) {
			printf("%.1f " , h_in[i][j]);
		}
	}

	printf("\n \n");

    for (int i = 0; i < width * height; i++) {
        printf("%.1f" , h_out[i]);
        printf(((i % width) != (width-1)) ? "\t" : "\n");
	}

	
	printf("\n");

	//free device memory

	cudaFree(d_in);
	cudaFree(d_out);
	return(0);
}

This seems somewhat “tedious”. I’m not sure I would want to structure code this way. But as a learning exercise, here is one possible approach:

  1. Create a globally unique thread index. For arbitrary 2 dimensional block/grid shapes, that could be something like this:
int idx = threadIdx.x + blockDim.x*blockIdx.x + (threadIdx.y + blockDim.y*blockIdx.y)*(gridDim.x*blockDim.x);

The creation of a globally unique index isolates the remainder of the steps from the vagaries of arbitrary grid shapes.

  2. Decompose the globally unique index into row and column indices:
int row = idx/width;
int col = idx%width;
  3. Compute a pitched index into the array, based on the computed row and column indices, using the methodology given in the documentation:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g32bd7a39135594788a542ae72217775c

// T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;

float *inData = (float *)((char*)d_in + row*pitch);    // effectively computing just row pointers here
float *outData = (float *)((char*)d_out + row*pitch);  // and here
float in = inData[col];
outData[col] = 3;

The “tedium” can be lessened if instead of allowing our code to take arbitrary grid shapes, we actually shape the grid to match the data. In that case, the row and col indices can be more directly computed:

int row = threadIdx.y+blockDim.y*blockIdx.y;
int col = threadIdx.x+blockDim.x*blockIdx.x;

Among other objections to the “tedium”, modulo arithmetic (and integer division) tend to be “slow” on GPUs.
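Putting the shorter, “shaped grid” variant together, a minimal kernel sketch might look like this (my own sketch: the bounds check is an addition of mine, protecting against launches with more threads than data elements, and the single `pitch` parameter for both arrays is purely for brevity — as noted elsewhere in this thread, separate allocations can have different pitches):

```cuda
__global__ void function(float *d_in, float *d_out, int width, int height, size_t pitch)
{
    int col = threadIdx.x + blockDim.x * blockIdx.x;   // x maps to columns
    int row = threadIdx.y + blockDim.y * blockIdx.y;   // y maps to rows

    if (row < height && col < width) {                 // guard against extra threads
        float *inData  = (float *)((char *)d_in  + row * pitch);
        float *outData = (float *)((char *)d_out + row * pitch);
        outData[col] = inData[col] + 3;
    }
}
```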

So I’ve used the methodology you suggested, and wrote a global function as follows.

__global__ void function(float* d_in, float* d_out, size_t pitch, int width, int height)
{  
	int col = threadIdx.x + blockDim.y*blockIdx.y;
	int row = threadIdx.y + blockDim.x*blockIdx.x;

	float *inData = (float*)((char*)d_in + row * pitch);
	float *outData = (float*)((char*)d_out + row * pitch);

	float in = inData[col];
	outData[col] = 92;	
}

However, whenever I launch a kernel that uses say a 4x4 thread block instead of a 16 x 1 thread block, I only write to 4 elements of the outData block, instead of the 16 I’m trying to. Is there an error in the code above, or would it be somewhere else potentially?

You have not used the methodology I suggested. I never wrote this code anywhere:

int col = threadIdx.x + blockDim.y*blockIdx.y;
int row = threadIdx.y + blockDim.x*blockIdx.x;

If you want to use arbitrary grid/block shapes, you must go through the longer method I proposed.

The shorter method I proposed that computes row and col directly can only work when your 2D grid structure matches your 2D data set structure.

So if you create a pitched allocation of 4 rows of width 4, you can only use the shorter method if you create a 4x4 grid or threadblock.
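As a sketch of what such a matching launch configuration might look like (my own fragment, assuming the d_in/d_out/width/height/pitch variables from the code above; the ceiling division is there so partial blocks still cover the edges, which in turn requires a bounds check inside the kernel):

```cuda
// one thread per element: enough 16x16 blocks to cover a width x height data set
dim3 block(16, 16);
dim3 grid((width  + block.x - 1) / block.x,    // ceil(width  / block.x)
          (height + block.y - 1) / block.y);   // ceil(height / block.y)
function<<<grid, block>>>(d_in, d_out, width, height, pitch);
```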

First, I was having difficulty getting this reply box to load. In most browsers, after I logged in, the Add Reply part of the website didn’t load. Is this common?

Second, I modified my code as follows.

__global__ void function(float* d_in, float* d_out, int width, int height, size_t pitch)
{  
	int idx = threadIdx.x + blockDim.x*blockIdx.x + (threadIdx.y + blockDim.y*blockIdx.y)*(gridDim.x*blockDim.x);
	int row = idx/width;
	int col = idx%width;


	float* inData = (float*)((char*)d_in + row * pitch);
	float in = inData[col];

	float* outData = (float*)((char*)d_out + row * pitch);
	outData[col] = in + 44;
}

When I run this (with height and width set to 4), I have 16 elements in the memory that I’ve allocated using cudaMallocPitch, but I only write to the first 4, regardless of what thread dimensions I use (16x1 or 4x4).

However, if I run the code as follows.

__global__ void function(float* d_in, float* d_out, int width, int height, size_t pitch)
{  
	int x = threadIdx.x;
	int y = threadIdx.y;

	float* inData = (float*)((char*)d_in + y * pitch);
	float in = inData[x];

	float* outData = (float*)((char*)d_out + y * pitch);
	outData[x] = in * 2;	
}

Then using 16x1 thread dimensions works, but using 4x4 thread dimensions doesn’t. Am I making a mistake in how I copy the data into and out of the device? I’m using cudaMemcpy to copy in and out. Should I be using something else?

Thanks for all of your help on this, I really appreciate the patience.

I use chrome with this board. Not sure what others use.

Yes, in typical usage, we use cudaMemcpy2D, not cudaMemcpy, for device allocations made with cudaMallocPitch.

Here’s a fully worked example using the “long” variant I suggested:

$ cat t357.cu
#include <stdio.h>

__global__ void function(float* d_in, float* d_out, int width, int height, size_t in_pitch, size_t out_pitch)
{
    int idx = threadIdx.x + blockDim.x*blockIdx.x + (threadIdx.y + blockDim.y*blockIdx.y)*(gridDim.x*blockDim.x);
    int row = idx/width;
    int col = idx%width;
// T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;

    float *inData = (float *)((char*)d_in + row*in_pitch);    // effectively computing just row pointers here
    float *outData = (float *)((char*)d_out + row*out_pitch);  // and here
    float in = inData[col];
    outData[col] = in + 3;
}

int main()
{
        const int width = 4, height = 4;
        //generate input and output arrays
        float h_in[width][height];
        for (int j = 0; j < height; j++)
        {
                for (int i = 0; i < width; i++)
                {
                        h_in[i][j] = i+j;
                }
        }
        float h_out[width * height];
        //declare GPU pointers
        float *d_in;
        float *d_out;
        //declare pitch
        size_t in_pitch;
        size_t out_pitch;
        //allocate device memory for source and destination
        cudaMallocPitch(&d_in, &in_pitch, width * sizeof(float), height);
        cudaMallocPitch(&d_out, &out_pitch, width * sizeof(float), height);
        //copy memory from host to device
        cudaMemcpy2D(d_in, in_pitch, h_in, width*sizeof(float), width*sizeof(float), height, cudaMemcpyHostToDevice);
        //launch kernel
#ifndef USE_FLAT
        function<<<1, dim3(4, 4, 1)>>>(d_in, d_out, width, height, in_pitch, out_pitch);
#else
        function<<<1, 16>>>(d_in, d_out, width, height, in_pitch, out_pitch);
#endif
        //copy results back to host
        cudaMemcpy2D(h_out, width*sizeof(float),  d_out, out_pitch, width*sizeof(float), height, cudaMemcpyDeviceToHost);
        //print incoming array and outgoing array.
        for (int i = 0; i < height; i++) {
                printf("\n");
                for (int j = 0; j < width; j++) {
                        printf("%.1f " , h_in[i][j]);
                }
        }

        printf("\n \n");

    for (int i = 0; i < width * height; i++) {
        printf("%.1f" , h_out[i]);
        printf(((i % width) != (width-1)) ? "\t" : "\n");
        }
        printf("\n");
        //free device memory
        cudaFree(d_in);
        cudaFree(d_out);
        return(0);
}
$ nvcc -o t357 t357.cu
$ ./t357

0.0 1.0 2.0 3.0
1.0 2.0 3.0 4.0
2.0 3.0 4.0 5.0
3.0 4.0 5.0 6.0

3.0     4.0     5.0     6.0
4.0     5.0     6.0     7.0
5.0     6.0     7.0     8.0
6.0     7.0     8.0     9.0

$ nvcc -o t357 t357.cu -DUSE_FLAT
$ ./t357

0.0 1.0 2.0 3.0
1.0 2.0 3.0 4.0
2.0 3.0 4.0 5.0
3.0 4.0 5.0 6.0

3.0     4.0     5.0     6.0
4.0     5.0     6.0     7.0
5.0     6.0     7.0     8.0
6.0     7.0     8.0     9.0

$
  1. In this example, it should not be necessary to have a separate pitch variable for input and output arrays. But in the general case, you cannot assume that pitches will be the same, or re-use pitch variables for separate arrays.

  2. I’ve omitted proper cuda error checking for brevity of presentation. But I recommend it any time you’re having trouble with a cuda code. If you’re not sure what proper cuda error checking is, google “proper cuda error checking” and take the first hit.
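As a sketch of what that error checking can look like (one common pattern; the macro name and exact message format are my own, not from any particular library), each runtime API call gets wrapped so a failure is reported immediately with its location:

```cuda
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",           \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// usage, wrapping each runtime call:
// CUDA_CHECK(cudaMallocPitch(&d_in, &in_pitch, width * sizeof(float), height));
```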