Bug in 1-block, 1-thread example

Hi, I have a problem running some simple code.

The code is really simple. It uses only 1 block and 1 thread. The kernel writes the values 1 and 2 into a two-element array.

I cannot get the values back on the host. I believe my memory allocation and argument passing are right.

Any help?

kernel part:

extern "C"
__global__ void cpyTest(int* answer)
{
	answer[0] = 1;
	answer[1] = 2;
}

Part of host code:

CUdeviceptr d_0;
CU_SAFE_CALL(cuMemAlloc( &d_0, sizeof(int) * 2));

// Calling kernel
int sf = sizeof(int);
CU_SAFE_CALL(cuFuncSetBlockShape(cpyTest, 1, 1, 1));
CU_SAFE_CALL(cuParamSeti(cpyTest, 0, d_0));
CU_SAFE_CALL(cuParamSetSize(cpyTest, sf*2));
CU_SAFE_CALL(cuLaunchGrid(cpyTest, 1, 1));

int *h_0 = (int*)malloc(sf * 2);
CU_SAFE_CALL(cuMemcpyDtoH(h_0, d_0, sf * 2));
printf("answer = %d, %d\n", h_0[0], h_0[1]);

After running, printf wrote:

answer = -1213066928, 134817120

It seems that cuMemcpyDtoH doesn't copy from device to host. Any ideas are appreciated.

S.

A simpler version using the runtime API also gives a strange result.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <cuda.h>
#include <cutil.h>
#include <math_functions.h>

extern "C"
__global__ void
cpyTest(int* answer)
{
	answer[0] = 1;
	answer[1] = 2;
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main(int argc, char** argv)
{
	int sf = sizeof(int);
	int* d_0;
	CUDA_SAFE_CALL(cudaMalloc( (void**) &d_0, sf * 2));

	dim3 threads(1, 1);
	dim3 grids(1, 1);

	// Calling kernel
	cpyTest<<<grids, threads>>>(d_0);

	int *h_0 = (int*)malloc(sf);
	CU_SAFE_CALL(cudaMemcpy(h_0, d_0, sf, cudaMemcpyDeviceToHost));
	printf("answer = %d, %d\n", h_0[0], h_0[1]);

	free(h_0);
	CUT_EXIT(argc, argv);
}

The answer should be 1, 2, but it prints:

answer = 1, 828337523

Press ENTER to exit…

Thanks,

S.

cudaMemcpy(h_0, d_0, sf, cudaMemcpyDeviceToHost)

should be

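cudaMemcpy(h_0, d_0, sf*2, cudaMemcpyDeviceToHost)

The host buffer needs the same change: malloc(sf) should be malloc(sf * 2), since h_0[1] is read afterwards.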
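For reference, here is a minimal sketch of the runtime API version with both the device allocation and the copy sized for two ints. The helper name runCpyTest and the explicit cudaGetLastError check are my own additions, not part of the original post, and I have replaced the cutil macros with plain return-code checks. Treat it as one possible way to write it, not the definitive fix.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Same kernel as in the question: a single thread writes two ints.
__global__ void cpyTest(int* answer)
{
    answer[0] = 1;
    answer[1] = 2;
}

// Hypothetical helper (not from the original post): runs the kernel and
// copies both results into out[0] and out[1].
// Returns 0 on success, 1 on any CUDA error.
int runCpyTest(int out[2])
{
    const size_t bytes = 2 * sizeof(int);   // room for both elements
    int* d_0 = NULL;

    if (cudaMalloc((void**)&d_0, bytes) != cudaSuccess)
        return 1;

    cpyTest<<<1, 1>>>(d_0);                 // 1 block, 1 thread

    // A failed launch is reported here, separately from the copy.
    cudaError_t err = cudaGetLastError();

    // This blocking copy waits for the kernel to finish, then
    // transfers both ints (bytes, not sizeof(int)).
    if (err == cudaSuccess)
        err = cudaMemcpy(out, d_0, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_0);

    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

From main you could declare int h_0[2], call runCpyTest(h_0), and print h_0[0] and h_0[1] only when it returns 0. Checking the launch result separately helps tell a kernel that never ran apart from a copy that was too short.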