beginners problem - global memory damage?

usr666 · September 22, 2008, 10:04pm

problem is probably very obvious but it’s my first cuda program but i can’t figure out what’s wrong with it. It works fine with 32000 elements in array but when I’m running it with 64000 I’m getting errors when coping data from device do host. Probably I’ve made some when accessing global memory. Running in emulation mode it works fine. I’m using GF 8600M GT as cuda device.

#include <iostream>

#include <vector>

#include <iostream>

template <typename T,std::size_t COUNT>

class memblock {

	public:

  memblock() : m_ptr(NULL) {

  	if(cudaMalloc((void**)&m_ptr,sizeof(T)*COUNT) != cudaSuccess)

    throw(std::bad_alloc());

  }

  ~memblock() {

  	cudaFree(m_ptr);

  }

 void send(T* tab) {

  	if(cudaMemcpy(m_ptr,tab,sizeof(T)*COUNT,cudaMemcpyHostToDevice) != cudaSuccess)

    std::cerr<<"send error\n";

  }

 void get(T* tab) {

  	if(cudaMemcpy(tab,m_ptr,sizeof(T)*COUNT,cudaMemcpyDeviceToHost) != cudaSuccess)

    std::cerr<<"get error\n";

  }

 operator T* () { return m_ptr; }

	private:

  T* m_ptr;

};

struct __align__(16) point {

	float x;

	float y;

	float r;

	float any;

};

ostream& operator<< (ostream& strm,const point& pt) {

	strm<<"| "<<pt.x<<' '<<pt.y<<' '<<pt.r<<' '<<pt.any<<" |";

	return strm;

}

__host__ __device__ bool test(const point& pt1,const point& pt2) {

	const float a = (pt1.x-pt2.x)*(pt1.x-pt2.x);

	const float b = (pt1.y-pt2.y)*(pt1.y-pt2.y);

	const float c = (pt1.r+pt2.r)*(pt1.r+pt2.r);

	return ( a + b ) < c;

}

__global__ void compute(point* points,const unsigned size) {

	const unsigned pos = threadIdx.x + blockIdx.x * blockDim.x;

	point current = points[pos];

	for(unsigned i=0;i<size;++i) {

  if( (i != pos) && test(current,points[i]) ) current.any += 1;

	}

	points[pos] = current;

}

int main() {

	const unsigned count = 64000;

	std::vector<point> tab(count);

	std::vector<point> tab1(count);

	for(unsigned i=0;i<count;++i) {

  tab[i].x = rand()%100;

  tab[i].y = rand()%100;

  tab[i].r = rand()%10;

  tab[i].any = 0;

	}

	memblock<point,count> pts;

	pts.send(&tab[0]);

	compute<<<count/320,320>>>(pts,count);

	pts.get(&tab1[0]);

	float total = 0;

	for(unsigned i=0;i<count;++i)

  total += tab1[i].any;

	std::cout<<total<<'\n';

	std::cout<<tab1[count-1]<<'\n';

	return 0;

}

jack · September 22, 2008, 11:22pm

What kind of errors are you getting?

usr666 · September 22, 2008, 11:35pm

there’s error when coping device to host memory. but only if kernel is actually performing test() function. without it there’s no error and tab1 is filled with proper data no matter how large it is.

SPWorley · September 23, 2008, 1:25am

Interesting C++ memory wrapper for your first CUDA program! But that’s not a wrong design idea.

You’re passing in pts as an argument to the global kernel.
pts is a class defined on the host.
It looks like you realize this and have even overloaded the * operator to return the internal device pointer m_ptr, but you’re still passing in pts itself as an argument to the kernel.
I’m surprised this compiles! Don’t you mean to be using *pts to get that device pointer?

Admittedly I haven’t tried to compile it myself, but that’s what I picked up on in a 1 minute scan anyway. In emulation mode, host and device pointers both live in the same address space, so memory access can work there but not on a real device.

theMarix · September 23, 2008, 5:59am

You should put a cudaThreadSynchronize after the kernel call and check it’s returned error code. Your problem is probably either withing your kernel or in the data you pass to your kernel.

usr666 · September 23, 2008, 5:16pm

I’ve added code for testing return value from cudaThreadSynchronize() and run it with arrays of size 32000, 48000 and 64000 elements.

32000 and 48000 where computed correctly.

64000 returned error.

It is somehow related to this loop:

for(unsigned i=0;i<size;++i) {

 if( (i != pos) && test(current,points[i]) ) current.any += 1;

}

If I replace test(current,points[i]) with true kernel executes without error.

Arr there any limitations to loop size or operator use inside global functions?

Is there any way to debug loops inside kernel code without use od device emulation?

SPWorley don’t be surprised. There is no operator* () in class memblock. Instead there is operator T* () so there is implicit conversion to temporary pointer to device memory.

More explicit call to kernel would look like this

compute<<<count/320,320>>>(static_cast<point*>(pts),count);

Topic		Replies	Views
memory access error CUDA Programming and Performance	11	1406	January 12, 2013
GPU Transfer problems GPU won't correctly read data out from Device to Host CUDA Programming and Performance	15	2640	August 2, 2010
strange behavior with device emulation CUDA Programming and Performance	5	2698	May 20, 2008
cudaError_enum Strange kernel failure CUDA Programming and Performance	6	9445	October 3, 2009
cudaError at memory locat.... HELP CUDA Programming and Performance	7	22644	December 9, 2009
How to implement a generic array in unified memory? CUDA Programming and Performance	7	2035	July 2, 2017
Cuda constant memory CUDA Programming and Performance cuda , kernel	5	3068	September 7, 2023
constant memory usage problem unexpected behavior using constant memory CUDA Programming and Performance	15	9710	April 1, 2009
cuda Error at memory location CUDA Programming and Performance	1	2585	April 5, 2012
Cuda application crashes works fine for small data and crashes for big data CUDA Programming and Performance	3	417	October 12, 2021

beginners problem - global memory damage?

Related topics