[bugreport] __alignof(CUdeviceptr) == 4, should be 8


I don’t know where exactly to post it.

I have a 64-bit computer running openSUSE 10.3, with the newest CUDA installed.

The problem is that pointers on the device are 8 bytes long, but on the host CUdeviceptr is 4 bytes long. So if I do something like this:

cuParamSeti(my_func, offset, pointer1);

offset += __alignof(CUdeviceptr); //or sizeof(CUdeviceptr);

cuParamSeti(my_func, offset, pointer2);

offset += __alignof(CUdeviceptr);

cuParamSetSize(my_func, offset);

Then the kernel crashes (and the screen blinks).

If I write

offset += 8;

it works (which is a little weird, as the extra 4 bytes are completely random ;) ).

(relaying an internal answer)

The size of pointers on the device (at least, when passed as parameters) is 8 bytes, to be compatible with 64 bit and future hardware. If you want to use the driver API to pass parameters you need to take this into account and assume pointer arguments are 8 bytes.
(and thus advance the offset passed to cuParamSeti by 8 after a pointer argument)
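
To illustrate why a 4-byte advance breaks, here is a host-only sketch in C. It does not call CUDA at all: `param_buf`, `host_write_u32`, `device_read_u64` and `pack_two` are hypothetical stand-ins for the kernel's parameter block, for a 32-bit write such as cuParamSeti performs, and for the 8-byte load the device is described as doing for each pointer argument. Unwritten bytes are filled with 0xCC so they show up as visible junk.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel parameter block (not a CUDA type). */
typedef struct {
    unsigned char bytes[32];
} param_buf;

/* Mimics a 32-bit write at a byte offset, as a cuParamSeti-style call would do. */
static void host_write_u32(param_buf *b, int offset, uint32_t value)
{
    memcpy(b->bytes + offset, &value, sizeof value);
}

/* Mimics the device loading the 8-byte pointer argument in slot `index`. */
static uint64_t device_read_u64(const param_buf *b, int index)
{
    uint64_t v;
    memcpy(&v, b->bytes + 8 * index, sizeof v);
    return v;
}

/* Writes two 32-bit handles `stride` bytes apart; everything the host never
   writes stays at the 0xCC fill pattern. */
static void pack_two(param_buf *b, uint32_t p1, uint32_t p2, int stride)
{
    memset(b->bytes, 0xCC, sizeof b->bytes);
    host_write_u32(b, 0, p1);
    host_write_u32(b, stride, p2);
}
```

With a stride of 4, both handles land inside slot 0 and slot 1 is pure fill, so the second pointer the device sees is garbage. With a stride of 8, each handle lands in its own slot, but half of every slot is still fill, which matches the observation that the remaining 4 bytes are random.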

Doesn’t “nvcc -m32” solve that issue?

I had a related problem with NVCC in 64-bit mode, even working with 8-byte offsets. It broke my kernel when I reached a certain number of parameters for reasons I didn’t quite understand.

As you rightly point out, half of the address is junk. I went to -m32 and never looked back.

I’ve had this same problem for over a month now, passing pointers as 8 bytes (64 bits) doesn’t even remotely work, similarly 32bit pointers only work some of the time…

I have ~20-30 unique kernel functions at present (this will only keep growing), most of which reference at least one pointer - generally 2-3, ‘most’ of them work with 32bit pointers, NONE of them work with 64bit pointers, and there are some which work with 32bit pointers ‘only’ if the order of my parameters ensures pointers are at the beginning, and other variables are at the end (eg: foo(int *blah, int *blah2, uint4 blah4))

This is a very ‘very’ frustrating problem, and has caused me delays on the order of weeks… (this error occurs both on Linux AND Windows).

Post a repro case if you think it’s broken. I’ve never seen it break and I bet you’re confused as to some of the rules (which is mostly our fault–the section in 2.0 on how to pass arguments with the driver API is terrible and has been cleaned up substantially for 2.1, I think). If you can show me what you’re doing I can try to clarify things.

Hmm, you’re correct (of course). My preconception that pointers were 32-bit, the fact that the programming guide does not specifically say they’re 64-bit, and the fact that in ‘some’ cases using 32-bit pointers worked (‘why’ this works is still beyond me) led me into all the problems I’m having now.

Previously, attempting to pass 32bit pointers like so would result in errors in some kernels (yet still work in ‘most’ of mine? why?):

[codebox]cuParamSeti(kernel_func, offset, device_ptr);

offset += __alignof(CUdeviceptr);[/codebox]

Now, passing pointers as 64bit pointers like so seems to be working for all the kernels I’ve ported so far:

[codebox]cuParamSeti(kernel_func, offset, 0);

cuParamSeti(kernel_func, offset + 4, device_ptr);

offset += 8;[/codebox]

It’s good to hear the programming guide has been revised for 2.1 - I’m still curious why most of my existing kernels work using my first code example to pass pointer values… (all of which have pointers either at the beginning or end of the parameter list)

Edit: Fixed missing parameter offset

Second Edit: It seems to work regardless of whether I put the 32-bit pointer in the lower or the upper half of the 64-bit pointer address…?

I don’t know why it would work whether it’s in the upper or lower half of the eight-byte region–that’s a bit odd, but you shouldn’t be doing that anyway…

Basically, the CUdeviceptr rules are identical to the pointer size and alignment rules of the platform you’re on. On a 32-bit platform, that means the offset after using a CUdeviceptr advances by 4 (basically the effective sizeof(CUdeviceptr)), and they have to be four-byte aligned. On a 64-bit platform, the offset advances by 8 and they have to be eight-byte aligned. That is true even though sizeof(CUdeviceptr) is 4.
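
The rule above can be sketched as a small offset helper. This is only an illustration in plain C: `ALIGN_UP`, `pointer_arg_offset` and `offset_after_pointer` are names made up here, and `ptr_size` is assumed to be 4 on a 32-bit platform and 8 on a 64-bit one, as described in the post.

```c
#include <stddef.h>

/* Round `offset` up to the next multiple of `alignment` (a power of two). */
#define ALIGN_UP(offset, alignment) \
    (((size_t)(offset) + (size_t)(alignment) - 1) & ~((size_t)(alignment) - 1))

/* Offset at which a pointer argument has to start when `bytes_before` bytes
   of other arguments precede it. */
static size_t pointer_arg_offset(size_t bytes_before, size_t ptr_size)
{
    return ALIGN_UP(bytes_before, ptr_size);
}

/* Offset at which the argument following that pointer may start
   (before applying that argument's own alignment). */
static size_t offset_after_pointer(size_t bytes_before, size_t ptr_size)
{
    return pointer_arg_offset(bytes_before, ptr_size) + ptr_size;
}
```

For a kernel taking (int n, int *p) on a 64-bit platform, n sits at offset 0, p has to start at 8 (leaving 4 bytes of padding), and the running offset afterwards is 16. On a 32-bit platform the same list gives 4 and 8.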

I’ve been using:

offset += sizeof(void*);

which is the only semi-portable code I could come up with on both 32- and 64-bit platforms. This should handle alignment too. It would be nice to have an official method supported by NVIDIA along with the rules for parameter passing. I too have had problems on 64-bit Linux and solved them by juggling parameters. This kind of experimental approach is not very productive.

I have these sorts of problems as well. Try compiling your .cu file with -ptx:

nvcc -ptx file.cu

It outputs assembly code.

Look for the .param parts of your function, it gives more hints of what to do. Hope it helps.

.param .align 8 .b8 __cudaparm_oscillate_sizeDataBlock[8];

.param .s32 __cudaparm_oscillate_n;

.param .s32 __cudaparm_oscillate_h;

.param .u64 __cudaparm_oscillate_e;

Also the cuda document in the cuda sdk contains information about PTX
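
As a worked example, the offsets implied by the .param lines above can be computed mechanically. The sketch below is plain C with no CUDA calls; `param_desc`, `layout_params` and `oscillate_params` are illustrative names, and it assumes that the parameters are laid out in the order listed and that each scalar is aligned to its own size unless an explicit .align is given.

```c
#include <stddef.h>

/* One kernel parameter as read off a PTX .param line. */
typedef struct {
    const char *name;
    size_t size;   /* bytes */
    size_t align;  /* bytes, power of two */
} param_desc;

/* Fills offsets[i] for every parameter and returns the total size in bytes,
   i.e. the value that would be handed to cuParamSetSize. */
static size_t layout_params(const param_desc *params, size_t count, size_t *offsets)
{
    size_t offset = 0;
    for (size_t i = 0; i < count; ++i) {
        /* Round up to this parameter's alignment before placing it. */
        offset = (offset + params[i].align - 1) & ~(params[i].align - 1);
        offsets[i] = offset;
        offset += params[i].size;
    }
    return offset;
}

/* The four parameters from the PTX listing above. */
static const param_desc oscillate_params[] = {
    { "sizeDataBlock", 8, 8 },  /* .param .align 8 .b8 ...[8] */
    { "n",             4, 4 },  /* .param .s32 */
    { "h",             4, 4 },  /* .param .s32 */
    { "e",             8, 8 },  /* .param .u64 */
};
```

Under those assumptions the offsets come out as 0, 8, 12 and 16, for a total of 24 bytes. Here the two 4-byte ints happen to fill exactly one 8-byte unit, so the 64-bit parameter e needs no padding in front of it; with only one int before it, 4 bytes of padding would be required.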


I am having the same problem on 64-bit Linux (I have not tested on 64-bit Windows but likely it’s a 64-bit and not a Linux specific issue).

I get nowhere passing CUdeviceptr as an 8-byte value (or using an 8-byte offset), but I am able to make my kernels work by passing them as 4-byte values only when they are at beginning of the argument list. If I insert any intervening 4-byte values like floats or ints, I get garbage output in my arrays.

Obviously, there is some 64-bit issue with parameter passing in CUDA 2.0 on 64-bit platforms. It looks like NVIDIA is having a look and you can search for my recent posts to find the thread.

If you’ve figured out some workaround, please let me know, as this is a show stopper for me.

Hi Michael, did you try what I suggested? nvcc -ptx thefile.cu and look at the ptx to get any alignments and sizes.

In the cuda.h file, CUdeviceptr is defined as

typedef unsigned int CUdeviceptr;

and this is what the memory functions return, so it is 32 bits. The nvcc compiler, however, makes pointers 64 bits:

__global__ void testmem(int *begin, int *end)
{
    int bs = (end - begin) / numThreads();
    for (int *i = begin + getutid()*bs, *e = begin + (getutid()+1)*bs; i < e; i++)
    {
        // DEBUG_CUDA("index pos = " << (i - begin));
        *i = 56;
    }
}

ptx file

.entry testmem


.reg .u16 %rh<5>;

.reg .u32 %r<21>;

.reg .u64 %rd<19>;

.reg .pred %p<4>;

.param .u64 __cudaparm_testmem_begin;

.param .u64 __cudaparm_testmem_end;

64 bit unsigned ints.

I am running ubuntu 8.10 linux amd64, but haven’t done much cuda yet.

If I compile with -m32, then the pointers are 32 bits.

Hi - somewhat late in the day to answer this but it may help somebody out there…

Never use cuParamSeti to pass pointers. cuParamSeti always uses a 32-bit representation for both host and device. The host may be 64-bit and/or the device may be 64-bit (Fermi devices compiled with nvcc in 64-bit mode).

Use cuParamSetv to pass pointers, make sure you pass the right size, and increment your stack size (offset) by the current sizeof(void*).

ALSO, be very careful that you ALIGN your variable to the right amount for (void*) - which on 64-bit systems will be an 8byte boundary. Yes, this does mean you may leave some bytes of unused stuff in your parameter stack.
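
The align, copy, advance sequence described above might look like the following. This is a host-only sketch in plain C, not driver API code: `param_stack`, `push_param` and `pack_example` are hypothetical names, and the example kernel signature (int n, T *data, float scale) is made up. It also assumes the device pointer size equals the host's sizeof(void*), which is the case the post describes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side parameter stack; stands in for the driver's state. */
typedef struct {
    unsigned char bytes[64];
    size_t offset;  /* running offset; at the end, the total parameter size */
} param_stack;

/* Align the running offset, copy `numbytes` from `value`, then advance.
   Returns the offset the value was written at. */
static size_t push_param(param_stack *s, const void *value,
                         size_t numbytes, size_t align)
{
    size_t at = (s->offset + align - 1) & ~(align - 1);
    memset(s->bytes + s->offset, 0, at - s->offset);  /* zero the padding */
    memcpy(s->bytes + at, value, numbytes);
    s->offset = at + numbytes;
    return at;
}

/* Packs the arguments of a hypothetical kernel(int n, T *data, float scale)
   and returns the total size. */
static size_t pack_example(param_stack *s, int n, void *data, float scale)
{
    s->offset = 0;
    push_param(s, &n, sizeof n, sizeof n);
    push_param(s, &data, sizeof(void *), sizeof(void *));  /* full pointer width */
    push_param(s, &scale, sizeof scale, sizeof scale);
    return s->offset;
}
```

With the real driver API, each `push_param` would correspond to aligning the offset and then calling cuParamSetv(func, offset, &value, sizeof(value)), with the final offset passed to cuParamSetSize. On a 64-bit host the layout above is n at 0, data at 8, scale at 16, so bytes 4 to 7 are the unused padding the post mentions.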

I tried to answer some of this in blog entry at: http://www.visionexperts.blogspot.com/

Best regards and happy parallel coding!

Jason Dale