Parameter passing bug in CUDA 2.0 x86_64 CUDA 2.0 compiler, parameter passing

The following simple kernel works fine on Windows XP SP3 (32-bit), CUDA 2.0:

extern “C” global void vadd (const float A, float a, const float B, float C, int N)

{

const int i = blockIdx.x * blockDim.x + threadIdx.x;

if (i < N)

C[i] = a*A[i] + B[i];

}

However, under RH Linux (x86_64) the I find garbage in C.

If I move parameter “a” after the array pointers, it works however:

extern “C” global void vadd (const float A, const float B, float C, float a, int N)

{

const int i = blockIdx.x * blockDim.x + threadIdx.x;

if (i < N)

C[i] = a*A[i] + B[i];

}

In both cases, the parameter block takes up 32 bytes total.

It appears there may be a bug in the CUDA compiler related to the alignment of arguments.

I tried padding the argument when passing it on the host side to no avail.

By the way, I’m compiling the kernel with:

$ nvcc -use_fast_math -cubin vadd.cu

Compiler version is release 2.0, v0.2.1221 x86_64.

Could someone from NVIDIA take a look?

Thanks.

Are you calling this from driver API or runtime API?

I’m using the driver API.

Thanks for having a look. If you need more info let me know.

I bet it’s a bug in your host-side code. In fact, I bet I know exactly what it is! You’re assuming you’re passing a CUdeviceptr, using sizeof(CUdeviceptr), and aligning it as such. This is not the right thing to do! Instead, you need to pass it a CUdeviceptr as a void* and align it as a void*, which means it’s actually sizeof(void*) and eight-byte-aligned on 64-bit platforms.

(I realize that the documentation is completely baffling/incorrect on this front, and I think changes to that got into 2.1)

Unfortunately that’s not the issue. I always pass CUdeviceptr with cuParamSeti as a void* and update the offset index by sizeof(void*). In fact, that’s the only portable way I could come up when using the same code base across 32- and 64-bit platforms. And I agree it’s not elegant and CUDA should address this.

However, I am not explicitly aligning the data. I assume that any reasonable compiler will align native types, and CUdeviceptr is, well, just a pointer under the hood. In this case, CUdeviceptr should already be aligned to an 8 byte boundary, no? Hmmm…I better check cuda.h. If it’s not aligned, CUDA should pad it out on 64-bit platforms.

Also, I am getting the correct count for the parameter block (three 8-byte pointers and for A, B, and C and two 4-byte integers gives me a 32-byte chunk). I don’t know of any way of aligning the final chunk since CUDA seems to handle that when passing the parameters to the device side.

Any other clues?

If you are compiling your main with gcc, are you passing the flag -malign-double ?

I am using gcc, but not -malign-double. In fact, I’m using only float (and float literals) on the both the host and device side as in my example (i.e., no hidden conversions to double are happening).

I tried passing the pointer with cuParamSetv instead of cuParamSeti since I can now see that in cuda.h the pointer is declared as an unsigned int (4-bytes). I copied the CUdeviceptr into a void* and then passed it off as 8-bytes. Still no luck…this time a seg fault intstead of just garbage output.

How does this work for anyone on 64-bit platforms? I assume CUDA is using the the 4-byte CUdeviceptr as an offset into a memory region on the device.

What is the standard procedure you are using in the runtime to pass a CUdeviceptr? Whatever you are doing in the runtime must be working.

Thanks.

I ran into the same problems on a 64 Byt-System (AMD Opteron): I converted some object oriented sample code (original from T.B. in his [post=“491467”]post[/post] - Thank you very much T.B. for this great example!) from runtime API to driver API. Trying to launch the following global function resulted in CUDA_ERROR_LAUNCH_FAILED. This error occured as soon as I tried to acces an element of [font=“Courier New”]result[/font] (even without calling the method [font=“Courier New”]Get()[/font]):

[codebox]

int align(int offset, int alignment) {

    return ((offset + alignment - 1) / alignment) * alignment;

}

/* test <<< 1, N >>> (device_data, device_result); */

cuFuncSetBlockShape(cuFunction, N, 1, 1);

offset = 0;

cuParamSetv(cuFunction, offset, &device_data, sizeof(void*));

offset = align(offset + sizeof(void*), __alignof(void*));

//before: cuParamSetv(cuFunction, offset, &device_data, sizeof(device_data));

//before: offset = align(offset + sizeof(device_data), __alignof(device_result));

cuParamSetv(cuFunction, offset, &device_result, sizeof(void*));

offset += sizeof(void*);

//before: cuParamSetv(cuFunction, offset, &device_result, sizeof(device_result));

//before: offset += sizeof(device_result);

cuParamSetSize(cuFunction, offset);

cuLaunch(cuFunction);

[/codebox]

I post this in response to MichaelChampigny’s last [post=“488889”]post[/post] and in order to receive feedback about the quality of this code especially regarding its expected future compatibility.

Additionally, I’d like to mention a few related issues / questions which I would be glad if they could be fixed / answered in a pleasing way, at least with CUDA 2.1:

[list=1]

[*] Passing paramters in the driver API: Passing parameters (especially pointers) this way is not very intuitive: Given that a CUdeviceptr is 4 Bytes long, one would expect that pointers in global and device functions were also of this size. At least for pointers as parameters of global functions this does not seem to be the case. Is there a reason why it should not work as described by the lines commented out with "//before: " in the above code snippet?

[] Probably related, sizeof(void) used within a global function returns 8 (on the Opteron system) and not as one might expect 4 (like for sizeof(CUdeviceptr)). At least when calling nvcc with -cubin, there was an opportunity to switch to the corresponding sizeof values for the GPU…

[*] Are GPU-pointers really 8 Bytes long (as supposed by MichaelChampigny in his last [post=“488889”]post[/post]) or are they only 4 Bytes long?

[*] I didn’t check lately but I’m not aware of any graphic card with more than 4 GB of memory per GPU. Thus, 4 byte pointers were enough, currently. On the other hand, if one wanted to encourage porting larger applications to CUDA, larger memories and thus larger pointers would be desirable…

[quote name=‘Carsten Scholtes’ date=‘Jan 22 2009, 09:50 AM’ post=‘495126’]

I ran into the same problems on a 64 Byt-System (AMD Opteron): I converted some object oriented sample code (original from T.B. in his [post=“491467”]post[/post] - Thank you very much T.B. for this great example!) from runtime API to driver API. Trying to launch the following global function resulted in CUDA_ERROR_LAUNCH_FAILED. This error occured as soon as I tried to acces an element of [font=“Courier New”]result[/font] (even without calling the method [font=“Courier New”]Get()[/font]):

[codebox]

int align(int offset, int alignment) {

    return ((offset + alignment - 1) / alignment) * alignment;

}

/* test <<< 1, N >>> (device_data, device_result); */

cuFuncSetBlockShape(cuFunction, N, 1, 1);

offset = 0;

cuParamSetv(cuFunction, offset, &device_data, sizeof(void*));

offset = align(offset + sizeof(void*), __alignof(void*));

//before: cuParamSetv(cuFunction, offset, &device_data, sizeof(device_data));

//before: offset = align(offset + sizeof(device_data), __alignof(device_result));

cuParamSetv(cuFunction, offset, &device_result, sizeof(void*));

offset += sizeof(void*);

//before: cuParamSetv(cuFunction, offset, &device_result, sizeof(device_result));

//before: offset += sizeof(device_result);

cuParamSetSize(cuFunction, offset);

cuLaunch(cuFunction);

[/codebox]

/quote]

Nice find Carsten. That works. I’m no longer seeing issues on 64-bit platforms when passing device pointers. It’s pretty ugly and hopefully CUDA 2.1 will make this easier.

Also, __alignof() is not portable though it’s available with most compilers. You can implement this with offsetof() if you really need to.

Thanks again.