pointer in global device memory

Hi,

I want to use a pointer in global device memory to store matrix values. I want it in global memory because ultimately I want it to persist between kernel calls (in a loop) without having to transfer the result back to the host at each loop iteration.

I can successfully allocate the device memory (with cudaMallocPitch or cudaMalloc), transfer data to the device, and even back to the host (with cudaMemcpy2D or cudaMemcpy). But what I cannot do is access or change this data inside a kernel. There is no compilation error, just no effect when reading or writing the data at run time…

I have found two workarounds for the problem:

WA1: declare the device global variable as a fixed-size array, transfer data to the device with cudaMemcpyToSymbol, and copy it back to the host with cudaMemcpyFromSymbol.

The problem here is that I cannot use pitched memory, which is precisely what I want to do.

WA2: pass a pointer as an argument to the kernel (memory allocated with cudaMallocPitch or cudaMalloc, and data copied with cudaMemcpy2D or cudaMemcpy).

Solutions based on this approach seem to be favoured by most people; I guess that is because the first examples in the CUDA_C_Programming_Guide use it.

It seems to work fine, but I actually do not understand why, as the CUDA_C_Best_Practices_Guide v3.2 (section 3.2.2.4) says that “Shared memory holds the parameters or arguments that are passed to kernels at launch”. Why would the data be persistent between kernel launches then? Isn't that what global memory is there for? Also, pitched memory is supposed to help coalescing of global memory accesses, so there must be a way to achieve what I am trying to do.

Here is a minimal piece of code that reproduces my problem.

My config: Ubuntu 10.10, CUDA 3.2, GeForce 9500 (compute capability 1.0)

Your help will be much appreciated.

#include <stdio.h>
#include <cutil.h>
#include <iostream>
#include <assert.h>
#include <vector>
#include <fstream>
#include <cutil_inline.h>

#define NBPOP	1024
#define NBDIM	20

//--------------------------------------------------------------------------
__device__ float d_Population[NBPOP*NBDIM];
__device__ float* d_Population2;
__device__ float d_E1;
__device__ float d_E2;
__device__ float d_E3;
__device__ float d_E4;

//--------------------------------------------------------------------------
//GPU KERNELS
__global__ void TestKernel(int count)
{
	int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
	if(tid<count)
	{
		if(tid==0)
		{
			d_E1=d_Population[tid*NBDIM+0];
			d_E2=d_Population[tid*NBDIM+1];
		}
		for(int j=0;j<NBDIM;j++)
			d_Population[tid*NBDIM+j]+=3;
		if(tid==0)
		{
			d_E3=d_Population[tid*NBDIM+0];
			d_E4=d_Population[tid*NBDIM+1];
		}
	}
}

//--------------------------------------------------------------------------
__global__ void TestKernel2(int count,size_t pitch)
{
	int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
	if(tid<count)
	{
		if(tid==0)
		{
			d_E1=d_Population2[tid*pitch+0];
			d_E2=d_Population2[tid*pitch+1];
		}
		for(int j=0;j<NBDIM;j++)
			d_Population2[tid*pitch+j]+=3;
		if(tid==0)
		{
			d_E3=d_Population2[tid*pitch+0];
			d_E4=d_Population2[tid*pitch+1];
		}
	}
}

//--------------------------------------------------------------------------
__global__ void TestKernel3(float* devPtr,int count,size_t pitch)
{
	int tid = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
	if(tid<count)
	{
		if(tid==0)
		{
			d_E1=devPtr[tid*pitch+0];
			d_E2=devPtr[tid*pitch+1];
		}
		for(int j=0;j<NBDIM;j++)
			devPtr[tid*pitch+j]+=3;
		if(tid==0)
		{
			d_E3=devPtr[tid*pitch+0];
			d_E4=devPtr[tid*pitch+1];
		}
	}
}

//--------------------------------------------------------------------------
////////////////////////////////////////////////////////////////////////////////
// Main program
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
	std::cout << "Initializing device data...\n";
	CUT_DEVICE_INIT(argc, argv);

	//grid dimension
	int dataAmount = NBPOP;
	dim3 dimBlocs(64,1,1);
	int nbblocs=(dataAmount+dimBlocs.x-1)/dimBlocs.x;
	dim3 dimGrid;
	if (nbblocs>65535)
	{
		dimGrid.x=65535;
		dimGrid.y=(nbblocs+65535-1)/65535;
	}
	else
	{
		dimGrid.x=nbblocs;
		dimGrid.y=1;
	}

	float h_Population[NBPOP*NBDIM];
	float h_Population2[NBPOP*NBDIM];
	for(int i=0;i<NBPOP;i++)
	{
		for(int j=0;j<NBDIM;j++)
		{
			h_Population[i*NBDIM+j]=i*NBDIM+j;
			h_Population2[i*NBDIM+j]=-1;
		}
	}
	float h_E1,h_E2,h_E3,h_E4;
	h_E1=0;
	h_E2=0;
	h_E3=0;
	h_E4=0;

	//COPY TO DEVICE
	cudaError_t cerr;
	std::cout << "sizeof(h_Population) " << sizeof(h_Population) << "\n";
	std::cout << "sizeof(float) * NBPOP * NBDIM " << sizeof(float) * NBPOP * NBDIM << "\n";

	//DESIRED VERSION
	//h_E1,h_E2,h_E3,h_E4 are rubbish; h_Population2 holds the values of h_Population (i.e. the data transfer worked fine but the kernel had no effect)
	size_t pitch;
	cerr=cudaMallocPitch(&d_Population2, &pitch, NBDIM * sizeof(float), NBPOP); //can we do this directly?
	std::cout << "pitch " << pitch << "\n";
	std::cout << "pitch/sizeof(float) " << pitch/sizeof(float) << "\n";
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	//Copies hostArray onto the pre-allocated device memory
	cerr=cudaMemcpy2D(d_Population2, pitch, &h_Population[0], NBDIM * sizeof(float), NBDIM * sizeof(float), NBPOP, cudaMemcpyHostToDevice); //ok
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));

	// kernel call GPU
	TestKernel2<<<dimGrid, dimBlocs>>>(NBPOP,pitch/sizeof(float));
	CUT_CHECK_ERROR("TestKernel2() execution failed\n");
	CUDA_SAFE_CALL( cudaThreadSynchronize() );
	// Copy the data back to the host
	cerr=cudaMemcpyFromSymbol(&h_E1,d_E1,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	cerr=cudaMemcpyFromSymbol(&h_E2,d_E2,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	cerr=cudaMemcpyFromSymbol(&h_E3,d_E3,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	cerr=cudaMemcpyFromSymbol(&h_E4,d_E4,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	std::cout << "E1 : " << h_E1 << " E2 : " << h_E2 << " E3 : " << h_E3 << " E4 : " << h_E4 << "\n";
	std::cout << "h_Population2[0] before: " << h_Population2[0] << "\n";
	cerr=cudaMemcpy2D(&h_Population2[0], NBDIM * sizeof(float), d_Population2, pitch, NBDIM * sizeof(float), NBPOP, cudaMemcpyDeviceToHost); //ok
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	std::cout << "h_Population2[0] after: " << h_Population2[0] << "\n";

	cudaFree(d_Population2);

/*
	//WORKAROUND 1
	//h_E1=0,h_E2=1,h_E3=3,h_E4=4. h_Population2 holds the values of h_Population shifted by 3. Everything as expected.
	cerr=cudaMemcpyToSymbol(d_Population,&h_Population[0],sizeof(h_Population));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	// kernel call GPU
	TestKernel<<<dimGrid, dimBlocs>>>(NBPOP);
	CUT_CHECK_ERROR("TestKernel() execution failed\n");
	CUDA_SAFE_CALL( cudaThreadSynchronize() );
	// Copy the data back to the host
	cerr=cudaMemcpyFromSymbol(&h_E1,d_E1,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	cerr=cudaMemcpyFromSymbol(&h_E2,d_E2,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	cerr=cudaMemcpyFromSymbol(&h_E3,d_E3,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	cerr=cudaMemcpyFromSymbol(&h_E4,d_E4,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	std::cout << "E1 : " << h_E1 << " E2 : " << h_E2 << " E3 : " << h_E3 << " E4 : " << h_E4 << "\n";
	cerr=cudaMemcpyFromSymbol(&h_Population2[0],d_Population,sizeof(h_Population2));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
*/

/*
	//WORKAROUND 2
	//h_E1=0,h_E2=1,h_E3=3,h_E4=4. h_Population2 holds the values of h_Population shifted by 3. Everything as expected.
	size_t pitch2;
	float* devPtr;
	cerr=cudaMallocPitch(&devPtr, &pitch2, NBDIM * sizeof(float), NBPOP);
	std::cout << "pitch2 " << pitch2 << "\n";
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	//Copies hostArray onto the pre-allocated device memory
	cerr=cudaMemcpy2D(devPtr, pitch2, &h_Population[0], NBDIM * sizeof(float), NBDIM * sizeof(float), NBPOP, cudaMemcpyHostToDevice); //this one instead
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));

	// kernel call GPU
	TestKernel3<<<dimGrid, dimBlocs>>>(devPtr,NBPOP,pitch2/sizeof(float));
	CUT_CHECK_ERROR("TestKernel3() execution failed\n");
	CUDA_SAFE_CALL( cudaThreadSynchronize() );
	//TestKernel3<<<dimGrid, dimBlocs>>>(devPtr,NBPOP,pitch2/sizeof(float));
	//CUT_CHECK_ERROR("TestKernel3() execution failed\n");
	//CUDA_SAFE_CALL( cudaThreadSynchronize() );
	// Copy the data back to the host
	cerr=cudaMemcpyFromSymbol(&h_E1,d_E1,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	cerr=cudaMemcpyFromSymbol(&h_E2,d_E2,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	cerr=cudaMemcpyFromSymbol(&h_E3,d_E3,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	cerr=cudaMemcpyFromSymbol(&h_E4,d_E4,sizeof(float));
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	std::cout << "E1 : " << h_E1 << " E2 : " << h_E2 << " E3 : " << h_E3 << " E4 : " << h_E4 << "\n";
	cerr=cudaMemcpy2D(&h_Population2[0], NBDIM * sizeof(float), devPtr, pitch2, NBDIM * sizeof(float), NBPOP, cudaMemcpyDeviceToHost); //???
	if( cerr != cudaSuccess ) puts(cudaGetErrorString( cerr ));
	cudaFree(devPtr);
*/

	//logfile
	std::string filename="/home/ubuntu/NVIDIA_GPU_Computing_SDK/logGPU.txt";
	std::ofstream fichier;
	fichier.open(filename.c_str());
	fichier.precision(15);
	for (int i=0; i < NBPOP; i++)
	{
		for (int j=0; j < NBDIM; j++)
		{
			fichier << i << "\t" << j << "\t" << h_Population2[i*NBDIM+j] << "\n";
		}
	}
	fichier.close();

	//cleaning
	std::cout << "Cleaning up...\n";
	cudaThreadExit();
	cutilExit(argc,argv);
}

template2.cu (8.57 KB)

I guess one man’s minimalism is another’s gauche rococo. The 147 lines of commented-out code included for our reading pleasure were a particularly nice touch.

Global memory pointers and their manipulation inside kernels are an absolutely fundamental element of CUDA. The fact that you can’t make your code work only means your code or your understanding of how pointers work is wrong. I very much doubt you will find anyone bored or patient enough to wade through the 350 lines of mostly irrelevant code and white space to tell you where you are going wrong. I certainly won’t be.

The CUDA SDK contains about 80 complete working CUDA codes, just about all of which contain working examples of reading and writing global memory in device code. Maybe you could look at those for some hints.

The problem is that you are trying to pass device pointers to host functions, which does not work:

__device__ float* d_Population2;

cudaMallocPitch(&d_Population2, &pitch, NBDIM * sizeof(float), NBPOP);
cudaMemcpy2D(d_Population2, pitch, &h_Population[0], NBDIM * sizeof(float), NBDIM * sizeof(float), NBPOP, cudaMemcpyHostToDevice);
TestKernel2<<<dimGrid, dimBlocs>>>(NBPOP,pitch/sizeof(float));

d_Population2 is a device variable, but a host variable is expected here.

Apparently, in your case &d_Population2 (the address of d_Population2 on the device) by pure chance also happens to be an address on the host that you have write permission for, so you don’t get an error message; however, the pointer still ends up on the host instead of on the device. So you need to create a host variable and then transfer it to the device:

float *h_Population2;
__device__ float* d_Population2;

cudaMallocPitch(&h_Population2, &pitch, NBDIM * sizeof(float), NBPOP);
cudaMemcpy2D(h_Population2, pitch, &h_Population[0], NBDIM * sizeof(float), NBDIM * sizeof(float), NBPOP, cudaMemcpyHostToDevice);
cudaMemcpyToSymbol("d_Population2", &h_Population2, sizeof(h_Population2), 0, cudaMemcpyHostToDevice);
TestKernel2<<<dimGrid, dimBlocs>>>(NBPOP,pitch/sizeof(float));

Note that when you have pointers, there are (at least) three ways to distribute them between host and device memory:

1. Both pointer and memory pointed to on the host.
2. Pointer on the host, memory pointed to on the device.
3. Both pointer and memory pointed to on the device.

malloc() gives you pointers of the first type (since it knows nothing about CUDA).

cudaMalloc() gives you pointers of the second type (by writing through the pointer-to-pointer you pass as its first argument).

Kernels work with pointers of the third type. So you need to transfer the pointer itself from the host to the device. One way to achieve that is to pass it as an argument to the kernel; the runtime library then takes care of the host-to-device copy of the pointer. Another way is to use cudaMemcpyToSymbol(), which apparently is what you want to do.
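To make the difference concrete, here is a minimal sketch of both routes (all names are made up for illustration; the string form of the symbol argument is what CUDA 3.2 accepts):

__device__ float* g_ptr;                                    // device-side pointer variable (case 3)

__global__ void KernelByArg(float* p) { p[threadIdx.x] = 1.0f; }      // route 1: pointer arrives as a kernel argument
__global__ void KernelBySymbol()      { g_ptr[threadIdx.x] += 1.0f; } // route 2: pointer read from the __device__ variable

// host code:
float* p = 0;                                               // case 2: pointer on the host...
cudaMalloc((void**)&p, 64 * sizeof(float));                 // ...memory pointed to on the device
KernelByArg<<<1, 64>>>(p);                                  // route 1: the runtime copies the pointer for you
cudaMemcpyToSymbol("g_ptr", &p, sizeof(p), 0, cudaMemcpyHostToDevice); // route 2: copy the pointer itself, not the data
KernelBySymbol<<<1, 64>>>();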

Your workarounds seem to indicate that you are aware of the issue. I don’t know whether it was an oversight or a fundamental misunderstanding, so I have tried to give a bit more detail than was probably necessary.

EDIT: Once again Avidday was quicker. This time, however, I was able to prove him wrong.

I have to agree about the whitespace, though; it makes reading the code really annoying.

Thanks a lot tera! Your answer is spot on and very valuable to me! And thanks a lot for the great explanation, which was definitely needed. I wish the official documentation were that concise.

PS: Sorry for the extra whitespace. It was added after I pressed the send button; it is definitely not in the attached file from which the code was copy-pasted…

PPS: For the record, for my given problem, using pitched memory makes very little difference versus non-pitched memory, even though my GPU has compute capability 1.0. For a 1024x20 float matrix (1024x64 pitched) it is even slightly worse; for a 1024x63 float matrix it is slightly better. So I assume there is a tradeoff between coalescing and the amount of extra padded data…?
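To put rough numbers on that padding overhead: a 20-float row is 80 bytes, which cudaMallocPitch rounded up to a 256-byte (64-float) pitch on my card, so about 69% of each pitched row is padding. A 63-float row is 252 bytes, presumably padded to the same 256 bytes, i.e. only about 1.6% overhead, which would explain why pitching only pays off in the second case.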

I’d like to ask about the third case. How can I pass the pointer from the device back to the host? I am working on this now, but have not gotten any results.

I’m also wondering about this third case.

I can use cudaMemcpyToSymbol to copy into a struct that is already allocated (at compile time) in the kernel file, but how do I allocate memory for a pointer defined in the kernel file, and then copy data into that memory from the host?

So I would have something like this in kernel.cu

__device__ float *dev_data;

and in main.c

float *data, *dev_data;

cudaMemcpyToSymbol ("dev_data", data, nBytes, 0, cudaMemcpyHostToDevice);

But that doesn’t work, presumably because *dev_data doesn’t have any storage allocated on the device (the call returns “Error 11 - invalid argument”). If I do a cudaMalloc(&dev_data, nBytes) in main.c, the dev_data there is not the same as the dev_data in kernel.cu.

For background, “data” is one of many arrays (the sizes of which are determined at run time) that provide background information for computations carried out by many (potentially millions of) calls to several different kernels. I could put all the pointers in the kernel calls, à la “CUDA By Example”, but that would give me argument lists spanning half a page just to provide data that seldom or never changes. It would be much simpler and clearer, I think, to have one function do all the data copying as part of initialization, but how?

Thanks,

James

It’s not entirely clear to me what you are trying to achieve. Why would you want to allocate memory on the device from a kernel when, in the next step, you copy into the allocated memory from the host? You could just as well both allocate and copy on the host…

It is, however, possible to do what you are asking for by using cudaMemcpyFromSymbol() on the host to find out the address (and size) of the memory that was allocated from inside a kernel.
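A minimal sketch of what I mean (hypothetical names, error checking omitted): the kernel stores the pointer and size in __device__ variables, and the host reads them back with cudaMemcpyFromSymbol() and can then use the pointer in an ordinary cudaMemcpy():

__device__ float* g_buf;      // set from inside a kernel
__device__ size_t g_bufSize;  // likewise

// host side, after the kernel that filled in g_buf/g_bufSize has completed:
float* buf;
size_t bufSize;
cudaMemcpyFromSymbol(&buf,     "g_buf",     sizeof(buf));      // read back the device pointer
cudaMemcpyFromSymbol(&bufSize, "g_bufSize", sizeof(bufSize));  // read back the size
// buf now holds a device address that the host API can use:
cudaMemcpy(h_dst, buf, bufSize, cudaMemcpyDeviceToHost);       // h_dst: a sufficiently large host buffer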

Sorry, I obviously wasn’t clear enough. I want to allocate memory on the device from host C code, then reference it inside kernels (there will be several different ones running at different points), say as “dev_data[idx]”. As a concrete example, say I’m computing a problem on an irregularly spaced 2D grid, the coordinates of which are read as program input and placed in (malloc’ed) arrays Xp and Yp. On the host side I do cudaMalloc and cudaMemcpy to get copies of those arrays onto the device, and I have device pointers for that memory on the host side, say in dev_Xp and dev_Yp. But how do I get names to reference that memory on the device, other than by passing the pointer values as parameters?

Currently I’m using a workaround in which I define a structure that holds pointers to the various arrays, keep a copy on both host and device, fill the host copy with the values returned from the cudaMalloc calls, do a cudaGetSymbolAddress on the device copy, and then a cudaMemcpy to transfer the host copy to the device. It works, but I think it’s rather sloppy and less than clear to anyone who tries to read the code.
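In outline, the workaround looks something like this (simplified to two arrays; all names are illustrative):

struct DevArrays { float *Xp; float *Yp; };
__device__ DevArrays dev_arrays;   // referenced inside kernels as dev_arrays.Xp[idx] etc.

// host-side initialization:
DevArrays h_arrays;
cudaMalloc((void**)&h_arrays.Xp, nPoints * sizeof(float));
cudaMalloc((void**)&h_arrays.Yp, nPoints * sizeof(float));
cudaMemcpy(h_arrays.Xp, Xp, nPoints * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(h_arrays.Yp, Yp, nPoints * sizeof(float), cudaMemcpyHostToDevice);
DevArrays *pSym;
cudaGetSymbolAddress((void**)&pSym, "dev_arrays");
cudaMemcpy(pSym, &h_arrays, sizeof(h_arrays), cudaMemcpyHostToDevice); // copies only the pointers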

I believe the second argument should be &data, and the third should be sizeof(float*), because you are copying only the pointer, not the contents.
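Spelled out, the whole sequence would be something like this (a sketch; d_data is a hypothetical host-side variable that receives the device address from cudaMalloc):

float *d_data;
cudaMalloc((void**)&d_data, nBytes);                      // allocate the device memory
cudaMemcpy(d_data, data, nBytes, cudaMemcpyHostToDevice); // copy the contents
cudaMemcpyToSymbol("dev_data", &d_data, sizeof(float*), 0, cudaMemcpyHostToDevice); // copy just the pointer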

That works! Now why didn’t I think of that? Seems obvious once I see it.

Thanks,

James