Device memory access bug?

Perhaps the problem is an alignment problem.

Try using MemAllocPitch for 2D memory.

MemAllocPitch3D does not exist ?

And try one of the memcpy 3d methods or so later on…

Just guessing what could be wrong… could also be memory corrupt of graphics card ?

Try other cuda software/kernels and see if they have problems too… otherwise problem is probably with your code still…

Soon I will be able to run kernels from my own api calls (in a different language).

Perhaps later I will try to run your kernel from my own api implementation.

I am too scared to run anybodies kernel in visual studio… too scared that it might fok something up ! External Image =D

But running ptx code should be a bit more safe ? External Image

Then again… ptx is still a script and gets compiled by the nvcuda.dll ptx compiler or whatever… so if there are bugs in the compiler those could be exploited to do nasty stuff.

I think there might also be “binary versions” which are precompiled, less portable but perhaps more safe… they can be loaded into device memory… like an image ?!?

How it exactly work I don’t know yet… I would suspect there is still some string searching in their for the kernel entry point… or perhaps that’s provided via binary too.

Binary is interesting stuff.

Perhaps the problem is an alignment problem.

Try using MemAllocPitch for 2D memory.

MemAllocPitch3D does not exist ?

And try one of the memcpy 3d methods or so later on…

Just guessing what could be wrong… could also be memory corrupt of graphics card ?

Try other cuda software/kernels and see if they have problems too… otherwise problem is probably with your code still…

Soon I will be able to run kernels from my own api calls (in a different language).

Perhaps later I will try to run your kernel from my own api implementation.

I am too scared to run anybodies kernel in visual studio… too scared that it might fok something up ! External Image =D

But running ptx code should be a bit more safe ? External Image

Then again… ptx is still a script and gets compiled by the nvcuda.dll ptx compiler or whatever… so if there are bugs in the compiler those could be exploited to do nasty stuff.

I think there might also be “binary versions” which are precompiled, less portable but perhaps more safe… they can be loaded into device memory… like an image ?!?

How it exactly work I don’t know yet… I would suspect there is still some string searching in their for the kernel entry point… or perhaps that’s provided via binary too.

Binary is interesting stuff.

[quote name=‘qiminglu’ date=‘15 June 2011 - 09:21 PM’ timestamp=‘1308190913’ post=‘1252378’]

__global__ void 

  kernel_verify( int * b , double * c , int * counter )

{

  int i = (blockIdx.x*BLOCKSIZE*BLOCKSIZE) + (blockIdx.y*BLOCKSIZE) + threadIdx.x;

b[i] = 1;   //__syncthreads();

  c[i] = 2.0; //__syncthreads();

int iix = b[i];

  double oox = c[i];

if( iix!=1 )              atomicAdd(counter, 1);

  if( fabs(oox-2.0)>0.001 ) atomicAdd(counter+1, 1);

}

One more finding, if argument c is of (int *) type, or (float *) type, there are no corruptions in either arrays. So appears this misbehave occurs when the kernel is writing to a double type pointee and int/float/or other 4 byte type pointee in an interleaved way.

By the way, when the corruption happens, the integer value read back is

0x 4000 0000 0000 0000

which is exactly the double value of

2.0

I’m wondering if someone could possibly repeat this and help me make sure it’s not a hiccup of my hardware here?

[quote name=‘qiminglu’ date=‘15 June 2011 - 09:21 PM’ timestamp=‘1308190913’ post=‘1252378’]

__global__ void 

  kernel_verify( int * b , double * c , int * counter )

{

  int i = (blockIdx.x*BLOCKSIZE*BLOCKSIZE) + (blockIdx.y*BLOCKSIZE) + threadIdx.x;

b[i] = 1;   //__syncthreads();

  c[i] = 2.0; //__syncthreads();

int iix = b[i];

  double oox = c[i];

if( iix!=1 )              atomicAdd(counter, 1);

  if( fabs(oox-2.0)>0.001 ) atomicAdd(counter+1, 1);

}

One more finding, if argument c is of (int *) type, or (float *) type, there are no corruptions in either arrays. So appears this misbehave occurs when the kernel is writing to a double type pointee and int/float/or other 4 byte type pointee in an interleaved way.

By the way, when the corruption happens, the integer value read back is

0x 4000 0000 0000 0000

which is exactly the double value of

2.0

I’m wondering if someone could possibly repeat this and help me make sure it’s not a hiccup of my hardware here?

External Image Exactly! I bet it was some alignment issue here. Thanks for pointing that out and I’m gonna give it a try first thing tomorrow External Image

External Image Exactly! I bet it was some alignment issue here. Thanks for pointing that out and I’m gonna give it a try first thing tomorrow External Image

No I think you might be using the wrong copy function:

Maybe memcpy is only ment for device memory to device memory copy.

You should try some other copy functions which are ment for device to host and host to device ?

Maybe memcpy is ok… not sure… try other copy function might help.

Maybe the corruption is not real and counter is wrongly copied ?

But according to you the double ended up in the integer ? which is kinda weird External Image

Oh well I am at a lost here… but soon I might be able to try your kernel in ptx form so… time will tell External Image :)

No I think you might be using the wrong copy function:

Maybe memcpy is only ment for device memory to device memory copy.

You should try some other copy functions which are ment for device to host and host to device ?

Maybe memcpy is ok… not sure… try other copy function might help.

Maybe the corruption is not real and counter is wrongly copied ?

But according to you the double ended up in the integer ? which is kinda weird External Image

Oh well I am at a lost here… but soon I might be able to try your kernel in ptx form so… time will tell External Image :)

This function might be of some use to check base address of device pointers:

cuMemGetAddressRange

Driver API Call, maybe Runtime API has something similiar.

Not sure exactly what it does… but it probably returns the base of arbitrary pointers for memory allocations.

So if memory block is allocated and a pointer falls in it’s range, then the base address of the memory block would be returned…

So this could give some idea if perhaps something is wrong or misaligned.

Try subtracting pointers from each other or so…

This function might be of some use to check base address of device pointers:

cuMemGetAddressRange

Driver API Call, maybe Runtime API has something similiar.

Not sure exactly what it does… but it probably returns the base of arbitrary pointers for memory allocations.

So if memory block is allocated and a pointer falls in it’s range, then the base address of the memory block would be returned…

So this could give some idea if perhaps something is wrong or misaligned.

Try subtracting pointers from each other or so…