Strange memory gremlins: getting pwned by pointers

Hi. In the process of learning CUDA I’ve run into a strange problem, and I’ve been unable to figure out what exactly is going wrong. I’ve written a simple program, which is very close to (and takes a lot from) one of the examples in the programming guide. In the first case, all goes as expected, but in the second case, the output of the third vector (the result of the kernel calculation) is the same as it was before the kernel call. With the second version, I get the warning, “Advisory: Cannot tell what pointer points to, assuming global memory space” when I compile. Anyone have any ideas what is going on here, and why it matters that I pass the vectors individually rather than as an array?

Thanks much,

Paul

Working code:

[codebox]#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	if (i < N)
		C[i] = A[i] + B[i];
}

using namespace std;

int main(int argc, const char* argv[])
{
	int N = 10;
	size_t size = N * sizeof(float);

	float* h_A;
	float* h_B;
	float* h_C;
	float* d_A;
	float* d_B;
	float* d_C;

	// Allocate vector A in host and device memory, set default value
	cudaMallocHost((void**)&h_A, size);
	cudaMalloc((void**)&d_A, size);
	for (int itr = 0; itr < N; itr++)
	{
		h_A[itr] = 5.5;
	}
	// End Allocation

	// Allocate vector B in host and device memory, set default value
	cudaMallocHost((void**)&h_B, size);
	cudaMalloc((void**)&d_B, size);
	for (int itr = 0; itr < N; itr++)
	{
		h_B[itr] = 5.5;
	}
	// End Allocation

	// Allocate vector C in host and device memory, set default value
	cudaMallocHost((void**)&h_C, size);
	cudaMalloc((void**)&d_C, size);
	for (int itr = 0; itr < N; itr++)
	{
		h_C[itr] = 5.5;
	}
	// End Allocation

	cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
	cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
	cudaMemcpy(d_C, h_C, size, cudaMemcpyHostToDevice);

	// Invoke kernel
	int threadsPerBlock = 256;
	int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
	VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

	cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
	for (int incr = 0; incr < N; incr++)
	{
		printf("[%i]\t%f\t%f\t%f\n", incr, h_A[incr], h_B[incr], h_C[incr]);
	}

	// Free device memory
	cudaFree(d_A);
	cudaFree(d_B);
	cudaFree(d_C);

	// Free host memory
	cudaFreeHost(h_A);
	cudaFreeHost(h_B);
	cudaFreeHost(h_C);

	return 0;
}

[/codebox]

Misbehaving Code:

[codebox]#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Device code
__global__ void VecAdd(float** C, int N)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	if (i < N)
		C[2][i] = C[0][i] + C[1][i];
}

using namespace std;

int main(int argc, const char* argv[])
{
	int N = 10;
	size_t size = N * sizeof(float);

	float* hostPtr[3];
	float* devPtr[3];

	for (int i = 0; i < 3; i++)
	{
		printf("%s\n", cudaGetErrorString(cudaMallocHost((void**)&hostPtr[i], size)));
		printf("%s\n", cudaGetErrorString(cudaMalloc((void**)&devPtr[i], size)));
		for (int itr = 0; itr < N; itr++)
		{
			hostPtr[i][itr] = 5.5;
		}
	}

	for (int i = 0; i < 3; i++)
	{
		cudaMemcpy(devPtr[i], hostPtr[i], size, cudaMemcpyHostToDevice);
	}

	// Invoke kernel
	int threadsPerBlock = 256;
	int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
	VecAdd<<<blocksPerGrid, threadsPerBlock>>>(devPtr, N);

	cudaMemcpy(hostPtr[2], devPtr[2], size, cudaMemcpyDeviceToHost);
	for (int incr = 0; incr < N; incr++)
	{
		printf("[%i]\t%f\t%f\t%f\n", incr, hostPtr[0][incr], hostPtr[1][incr], hostPtr[2][incr]);
	}

	// Free device memory
	cudaFree(devPtr[0]);
	cudaFree(devPtr[1]);
	cudaFree(devPtr[2]);

	// Free host memory
	cudaFreeHost(hostPtr[0]);
	cudaFreeHost(hostPtr[1]);
	cudaFreeHost(hostPtr[2]);

	return 0;
}

[/codebox]

My guess would be that devPtr is still a host array even though it contains pointers to device memory. So in your kernel you are attempting to index into host memory, which is not allowed.

So how would I fix that?

I did more testing and it turns out that under emulation on my laptop the error does not occur. I haven’t been able to compile for emulation on our remote tesla node though, so I’m not sure if that will show up there as well.

nvcc tries to track the memory space (shared versus global) of pointers in device code, but sometimes it gets confused, as it does in your case. I assume your use of pointers to pointers is the culprit. When it can’t figure it out, it throws up its hands and just assumes the pointer refers to the global memory space, and you see this advisory. Unfortunately, no cast exists that would disambiguate the pointer and eliminate the advisory.

In your case, this assumption ought to be correct. Does your code work as expected otherwise?

Negative. After the kernel call, the memory that is supposed to have been changed has not been. So the result vector, d_C, each element of which ought to be d_A + d_B, retains the same values it had before the kernel call (note: it works in the case where I’m using three individual vectors, but fails when I try to pass an array of vectors to the kernel).

If I can’t get rid of the warning but end up with otherwise functional code, then that doesn’t bother me too much. I just want to get the incorrect value thing worked out.

As jgoffeney points out, the problem stems from the fact that you’ve created a host array of device pointers, when in fact what you intend is a device array of device pointers.

In other words, instead of

// the array devPtr lives in the host memory space

float* devPtr[3];

you need to do something like this:

// devPtr will point to an array of pointers which live in the device memory space

float **devPtr = 0;

cudaMalloc((void**)&devPtr, sizeof(float *) * 3);

Then call cudaMalloc for each of the three arrays. Make sense? It gets tricky, though. You will need to copy the results of the three cudaMallocs to the pointers at devPtr[0], devPtr[1], and devPtr[2] using cudaMemcpy. In total, you have four device arrays: three arrays of floats, and then one more array of float *, where each element points to one of the three originals.
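A minimal sketch of that scheme (names like `stage` are illustrative, error checking omitted): record each device address in a temporary host array as you allocate, then publish all three pointers to the device-resident array with a single cudaMemcpy.

```
// Sketch only: builds a device array of three device float pointers.
#include <cuda_runtime.h>

float **makeDevicePtrArray(float *stage[3], size_t size /* bytes per float array */)
{
    float **devPtr = 0;   // will point to device memory holding three float*

    cudaMalloc((void**)&devPtr, 3 * sizeof(float*));
    for (int i = 0; i < 3; i++)
        cudaMalloc((void**)&stage[i], size);   // the three float arrays

    // One host-to-device copy publishes all three pointers; after this,
    // a kernel can safely dereference devPtr[0..2].
    cudaMemcpy(devPtr, stage, 3 * sizeof(float*), cudaMemcpyHostToDevice);
    return devPtr;
}
```

Note that the host keeps its own copy of the three pointers in `stage`: host-side cudaMemcpy and cudaFree calls on the float data must go through `stage[i]`, never through `devPtr[i]`, which lives in device memory.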

Actually this is one thing I have preached “do not do” for a long time. What happens is that introductory programming classes, and perhaps the “Numerical Recipes in C” authors, have miseducated people about this for some time.

The thing is, C only has (some) support for static multidimensional arrays. You, like many others, have found that with an identical syntax you can make an array of arrays look like a dynamic multidimensional array. Sorry, it is not one.

Simply avoid arrays of arrays like the Hell unless you really need them (rows with very different lengths, which is certainly not the case in linear algebra codes). If they can already lower your performance on typical platforms, they will be performance killers on a CUDA architecture (no cache, remember…?).

So instead of writing C[i][j], write C[i*N+j]. It is up to you to write the code that handles the multidimensional indices: C only has 1D dynamic arrays. Ok, it looks worse, but C is not Fortran 90, and this is the correct way to operate in scientific codes. It will save you a lot of trouble.

[quote name=‘JaredHoberock’ post=‘559600’ date=‘Jun 30 2009, 12:08 AM’]

As jgoffeney points out, the problem stems from the fact that you’ve created a host array of device pointers, when in fact what you intend is a device array of device pointers.

In other words, instead of
[/quote]

[codebox]int N = 10;
	size_t size = N * sizeof(float);

	float* hostPtr[3];
	float** devPtr = 0;
	float* devPtr2 = 0;

	cudaMalloc((void**)&devPtr, 3*sizeof(float*));

	for (int i = 0; i < 3; i++)
	{
		printf("%s\n", cudaGetErrorString(cudaMallocHost((void**)&hostPtr[i], size)));
		cudaMalloc((void**)&devPtr2, size);
		cudaMemcpy(devPtr[i], devPtr2, sizeof(float*), cudaMemcpyDeviceToDevice);
		printf("\n");
		for (int itr = 0; itr < N; itr++)
		{
			hostPtr[i][itr] = 5.5;
		}
	}

[/codebox]

Thanks,

Paul

Ok, I got the whole thing working correctly. For the sake of posterity and anyone else who gets as confused over this stuff as I’ve been, here’s the full working code:

[codebox]#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Device code
__global__ void VecAdd(float** C, int N)
{
	int i = threadIdx.x + blockIdx.x * blockDim.x;
	if (i < N)
		C[2][i] = C[0][i] + C[1][i];
}

// Store device pointer B into the device-resident pointer array A at index i.
__global__ void PtrSet(float** A, float* B, int i)
{
	A[i] = B;
}

using namespace std;

int main(int argc, const char* argv[])
{
	int N = 10;
	size_t size = N * sizeof(float);

	float* hostPtr[3];
	float** devPtr = 0;   // device array of device pointers
	float* devPtr2[3];    // host copies of the same device pointers

	cudaMalloc((void**)&devPtr, 3*sizeof(float*));

	for (int i = 0; i < 3; i++)
	{
		printf("%s\n", cudaGetErrorString(cudaMallocHost((void**)&hostPtr[i], size)));
		printf("%s\n", cudaGetErrorString(cudaMalloc((void**)&devPtr2[i], size)));

		PtrSet<<<1,1>>>(devPtr, devPtr2[i], i);
		//devPtr[i] = devPtr2[i];   // segfaults: devPtr lives on the device
		//cudaMemcpy(devPtr[i], devPtr2, sizeof(float*), cudaMemcpyDeviceToDevice);   // same problem

		for (int itr = 0; itr < N; itr++)
		{
			hostPtr[i][itr] = 5.5;
		}
	}

	printf("\nstarting cudaMemcpy of hostPtr to devPtr\n");
	for (int i = 0; i < 3; i++)
	{
		cudaMemcpy(devPtr2[i], hostPtr[i], size, cudaMemcpyHostToDevice);
	}
	printf("copy complete\n");

	// Invoke kernel
	int threadsPerBlock = 256;
	int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
	VecAdd<<<blocksPerGrid, threadsPerBlock>>>(devPtr, N);

	printf("VecAdd call complete\n");
	cudaMemcpy(hostPtr[2], devPtr2[2], size, cudaMemcpyDeviceToHost);
	for (int incr = 0; incr < N; incr++)
	{
		printf("[%i]\t%f\t%f\t%f\n", incr, hostPtr[0][incr], hostPtr[1][incr], hostPtr[2][incr]);
	}

	// Free device memory
	cudaFree(devPtr2[0]);
	cudaFree(devPtr2[1]);
	cudaFree(devPtr2[2]);
	cudaFree(devPtr);

	// Free host memory
	cudaFreeHost(hostPtr[0]);
	cudaFreeHost(hostPtr[1]);
	cudaFreeHost(hostPtr[2]);

	return 0;
}

[/codebox]

The interesting thing to note is that I have to keep the addresses of the allocated device memory stored both on the host and on the device. To do a cudaMemcpy, etc., from the host side, I have to use devPtr2[i] as the destination (or source, as the case may be) rather than devPtr[i]; if I try to access devPtr[i], I get segmentation faults, since it’s stored on the device rather than the host. So devPtr2 is my three-element array of addresses on the host, and devPtr is my three-element array of addresses on the device (accessible only from the device). To set devPtr[i] to a freshly allocated block of memory, devPtr itself must be passed to a kernel and the assignment done there. I did try just passing devPtr[i] and got another segmentation fault, so you have to pass the whole thing.

Anyway, thanks for the help on this, guys. I guess this is why we have the 2D and 3D array malloc functions built in.

Paul

No… quoting the CUDA Programming Guide:

These functions are there to allow optimized (coalesced) memory accesses across all the rows of the array. They are in no way related to arrays of arrays.
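For reference, a rough sketch of what that looks like in use (cudaMallocPitch is the real runtime call; the sizes and variable names here are made up): the 2D array is still one flat block, with each row padded out to a pitch the runtime chooses for coalescing.

```
// Sketch of pitched 2D allocation (illustrative sizes, error checks omitted).
#include <cuda_runtime.h>

int width = 10, height = 3;   // elements per row, number of rows
float *d_C = 0;
size_t pitch = 0;             // bytes per padded row, set by the runtime

cudaMallocPitch((void**)&d_C, &pitch, width * sizeof(float), height);

// In device code, row i starts pitch bytes after row i-1 -- one flat block:
//   float *row = (float*)((char*)d_C + i * pitch);
//   row[j] = ...;            // the pitched equivalent of C[i*N+j]
```

So even the built-in helpers hand back a single linear allocation plus a stride, not an array of row pointers.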