CUDA and 64bit Linux problem

When I compile and run the attached CUDA code on a G80 GTS 640 MB with an AMD Opteron under 64-bit Linux,
the second kernel parameter appears to get corrupted.

What the code should do:

  • allocate memory on the device
  • launch the kernel with an int parameter (a dummy to trigger the error) and a device pointer to the allocated memory
  • in the kernel, write a value (666 in this case) to the memory the second parameter points to
  • copy the device memory back to the host

However, the number written to the device memory (666) isn’t returned; garbage is returned instead.
When I compile for emulation and printf the pointer parameter inside the global function, its value differs from the address stored in the pointer on the host.

If we push a “long” value to the kernel as the first argument (without changing the kernel signature to “long”, so the parameter stays an “int”), everything works.

Could someone please check whether they can reproduce the problem on their machine? Or is there an error in my code?

I would be very glad for any help or pointers on how to correct the problem.

Bjoern

Hi. I’m a new programmer, and some of the CUDA functions you used were unknown to me. The program you attached showed the behavior you described on my machine as well. However, the version below works fine. Hope it helps.

// Host program
#include <cstdlib>
#include <iostream>

#include "cuda_runtime.h"
#include "kernel_t.h"

using namespace std;

int main( int, char** ) {
    // Count the available devices.
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) {
        cerr << "There is no device." << endl;
        return EXIT_FAILURE;
    }

    // Pick the first device with compute capability 1.0 or higher.
    int dev;
    for (dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, dev);
        if (deviceProp.major >= 1) {
            break;
        }
    }
    if (dev == deviceCount) {
        cerr << "There is no device supporting CUDA." << endl;
        return EXIT_FAILURE;
    }
    else {
        cudaSetDevice(dev);
    }

    int i = 42;
    int result = call_kernel(i);
    cout << "result " << result << " (should be 666)" << endl;
    return EXIT_SUCCESS;
}

// kernel_t.h
#ifndef kernel_t_H
#define kernel_t_H

int call_kernel(int i);

#endif

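// CUDA source defining the kernel and call_kernel()
#include <stdio.h>

#include "kernel_t.h"

// The int parameter is a dummy; the kernel only writes through j.
__global__ void global_function (int i, int *j) {
    *j = 666;
}

int call_kernel(int i) {
    int *d_jp = 0;
    cudaMalloc((void**)&d_jp, sizeof(int));

    dim3 grid_dim  (1);
    dim3 block_dim (1);
    // The execution configuration is <<<grid, block>>>.
    global_function<<<grid_dim, block_dim>>>(i, d_jp);
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    int result = 0;
    // Copy back exactly one int; sizeof(long) is 8 bytes on 64-bit Linux
    // and would overrun both the 4-byte allocation and result.
    cudaMemcpy(&result, d_jp, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_jp);
    return result;
}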



I have a similar problem. I am using a framework which itself uses cudaSetupArgument() and cudaLaunch().

My programs run fine on every 32-bit Linux system I have access to, but fail on every 64-bit system.

Can someone from NVIDIA please confirm or deny whether there is a problem with cudaSetupArgument() on 64-bit systems? :(

Changing my code to use the <<<>>> calls would probably mean a complete rewrite of my programs because of the framework's highly generic approach (and I don’t even want to guess how much time that would take).

We are not yet aware of any issue. Would you be able to provide a simple example demonstrating the problem? If you are a registered developer, the best way is to file a bug directly on the registered developer site.


bknafla, you’re misaligning the pointer in your second call to cudaSetupArgument.
It should be at an 8-byte boundary.
Using the execution configuration <<< >>> prevents such problems.

Thank you for your fast reply. :) There is an example attached to the first post.

Are the alignment requirements documented somewhere? I couldn’t find them in the programming guide.