Beginner's problem - global memory damage?

The problem is probably very obvious, but this is my first CUDA program and I can't figure out what's wrong with it. It works fine with 32000 elements in the array, but when I run it with 64000 I get errors when copying data from device to host. I've probably made some mistake when accessing global memory. Running in emulation mode it works fine. I'm using a GF 8600M GT as the CUDA device.

#include <iostream>

#include <vector>

#include <cstdlib>

template <typename T,std::size_t COUNT>

class memblock {
public:
  memblock() : m_ptr(NULL) {
  	if(cudaMalloc((void**)&m_ptr,sizeof(T)*COUNT) != cudaSuccess)
      std::cerr<<"alloc error\n";
  }

  ~memblock() {
  	cudaFree(m_ptr);
  }

  void send(T* tab) {
  	if(cudaMemcpy(m_ptr,tab,sizeof(T)*COUNT,cudaMemcpyHostToDevice) != cudaSuccess)
      std::cerr<<"send error\n";
  }

  void get(T* tab) {
  	if(cudaMemcpy(tab,m_ptr,sizeof(T)*COUNT,cudaMemcpyDeviceToHost) != cudaSuccess)
      std::cerr<<"get error\n";
  }

  operator T* () { return m_ptr; }

private:
  T* m_ptr;
};


struct __align__(16) point {
	float x;
	float y;
	float r;
	float any;
};


std::ostream& operator<< (std::ostream& strm,const point& pt) {
	strm<<"| "<<pt.x<<' '<<pt.y<<' '<<pt.r<<' '<<pt.any<<" |";
	return strm;
}


__host__ __device__ bool test(const point& pt1,const point& pt2) {
	const float a = (pt1.x-pt2.x)*(pt1.x-pt2.x);
	const float b = (pt1.y-pt2.y)*(pt1.y-pt2.y);
	const float c = (pt1.r+pt2.r)*(pt1.r+pt2.r);
	return ( a + b ) < c;
}


__global__ void compute(point* points,const unsigned size) {
	const unsigned pos = threadIdx.x + blockIdx.x * blockDim.x;
	point current = points[pos];
	for(unsigned i=0;i<size;++i) {
  	if( (i != pos) && test(current,points[i]) ) current.any += 1;
	}
	points[pos] = current;
}


int main() {
	const unsigned count = 64000;
	std::vector<point> tab(count);
	std::vector<point> tab1(count);
	for(unsigned i=0;i<count;++i) {
  	tab[i].x = rand()%100;
  	tab[i].y = rand()%100;
  	tab[i].r = rand()%10;
  	tab[i].any = 0;
	}

	memblock<point,count> pts;
	pts.send(&tab[0]);
	compute<<<count/256,256>>>(pts,count); // launch configuration was lost from the original post; 256 threads per block assumed
	pts.get(&tab1[0]);

	float total = 0;
	for(unsigned i=0;i<count;++i)
  	total += tab1[i].any;
	std::cout<<total<<'\n';

	return 0;
}


What kind of errors are you getting?

There's an error when copying device-to-host memory, but only if the kernel actually performs the test() function. Without it there's no error and tab1 is filled with proper data, no matter how large it is.

Interesting C++ memory wrapper for your first CUDA program! And it's not a bad design idea.

You’re passing in pts as an argument to the global kernel.
pts is a class defined on the host.
It looks like you realize this and have even overloaded the * operator to return the internal device pointer m_ptr, but you’re still passing in pts itself as an argument to the kernel.
I’m surprised this compiles! Don’t you mean to be using *pts to get that device pointer?

Admittedly I haven't tried to compile it myself, but that's what I picked up on in a one-minute scan. In emulation mode, host and device pointers both live in the same address space, so a memory access can work there but not on a real device.

You should put a cudaThreadSynchronize() after the kernel call and check its returned error code. Your problem is probably either within your kernel or in the data you pass to your kernel.
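A minimal sketch of that check (the names `blocks` and `threads` are placeholders for whatever launch configuration is being used; note that cudaThreadSynchronize() is the call from CUDA toolkits of that era, later deprecated in favor of cudaDeviceSynchronize()):

```cpp
compute<<<blocks,threads>>>(pts,count);   // kernel launch is asynchronous

// catches launch errors (e.g. invalid grid/block configuration)
cudaError_t err = cudaGetLastError();
if(err != cudaSuccess)
	std::cerr<<"launch error: "<<cudaGetErrorString(err)<<'\n';

// blocks until the kernel has finished, then reports execution errors
err = cudaThreadSynchronize();
if(err != cudaSuccess)
	std::cerr<<"kernel error: "<<cudaGetErrorString(err)<<'\n';
```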

I've added code for testing the return value from cudaThreadSynchronize() and ran it with arrays of 32000, 48000 and 64000 elements.

32000 and 48000 were computed correctly.

64000 returned error.

It is somehow related to this loop:

for(unsigned i=0;i<size;++i) {
	if( (i != pos) && test(current,points[i]) ) current.any += 1;
}


If I replace test(current,points[i]) with true, the kernel executes without error.

Are there any limitations on loop size or operator use inside __global__ functions?

Is there any way to debug loops inside kernel code without the use of device emulation?

SPWorley, don't be surprised. There is no operator* () in class memblock. Instead there is operator T* (), so there is an implicit conversion to the pointer to device memory.

A more explicit call to the kernel would look like this: