Can't pass a __device__ array to __global__

Hi, I want to know why my code doesn't work when I try to pass two arrays from a device function to a global one through the function square(). If I comment out this function, the program works perfectly…

Here is the code. When I do the same with a simple int or float element, it works too; the problem is only with the x and y arrays.

#include <math.h>
#include <stdio.h>

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// global device variables
__device__ int N=4 ;
__device__ int sum = 0 ;
__device__ float *x ;
__device__ float *y ;

// Device function
__device__ float square ()
{
	// local variables
	double pi= 3.1416;
	double deltaphi=0.0,r=0.0,theta=0.0,angulo_deseado=0.0;
	int i=0;

	switch (N)
	{
	case 4:
		theta = 0.7853;
		// ...
	case 12:
		// ...
	case 20:
		// ...
	}

	for(i=0; i <N; i++){
		angulo_deseado = deltaphi*i + (pi/4);
		// ...
	}

	return 0;
}


// kernel function - accumulator
#define E 10

__global__ void kernel( float *c){
	square ();
	while(sum < E){
		c[sum]= x[sum];
		sum++; }
}


int main(void){
	float c[E],*dev_c;

	cudaMalloc((int**)&dev_c, E*sizeof(int));
	//cudaMalloc((int**)&y, E*sizeof(int));

	cudaMemcpy(dev_c, c, E*sizeof(int), cudaMemcpyHostToDevice);
	kernel<<<1,E>>>(dev_c );
	cudaMemcpy(c, dev_c, E*sizeof(int), cudaMemcpyDeviceToHost);

	for(int i=0; i<E; i++){
		printf("%f\n",  c[i]);
	}

	return 0;}

I appreciate all your suggestions, and forgive my bad English; I'm just a Colombian chick trying to understand the CUDA world.


Jenn T.

The problem is that x and y are not initialized, so writes to the arrays go to random addresses in memory.

The commented-out call to cudaMalloc() indicates that you were (partly) aware of the problem. That call cannot work, however, because cudaMalloc() takes a host pointer (to another pointer) as its first argument, whereas &y would be a device pointer. (In fact, addresses of device variables cannot even be taken like that in host code; you would need to use cudaGetSymbolAddress() instead.) The correct way to allocate a variable-size array on the device is to first call cudaMalloc() with a pointer variable in host memory, and then either copy that pointer to the device or pass it as an argument to the kernel.
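A minimal sketch of both options, assuming the same x and E as in the question. The kernel names, the check() helper and run_pointer_demo() are made up for illustration, and the values written are arbitrary:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define E 10

// Device-side pointer as in the question. It holds no valid address
// until the host stores one in it.
__device__ float *x;

// Writes through the global pointer x.
__global__ void fill_via_symbol()
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < E)
        x[idx] = 2.0f * idx;
}

// Same work, but the pointer arrives as a kernel argument.
__global__ void fill_via_argument(float *data)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < E)
        data[idx] = 3.0f * idx;
}

// Hypothetical helper: copies the device buffer back and compares
// every element against factor * index.
static bool check(const float *dev, float factor)
{
    float host[E];
    if (cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost) != cudaSuccess)
        return false;
    for (int i = 0; i < E; i++)
        if (host[i] != factor * i)
            return false;
    return true;
}

// Hypothetical driver: returns 0 on success, 1 on any failure.
int run_pointer_demo()
{
    float *dev_x = NULL;  // pointer variable in host memory
    if (cudaMalloc((void **)&dev_x, E * sizeof(float)) != cudaSuccess)
        return 1;

    // Option 1: store the pointer value in the __device__ variable x.
    bool ok = cudaMemcpyToSymbol(x, &dev_x, sizeof(dev_x)) == cudaSuccess;
    if (ok) {
        fill_via_symbol<<<1, E>>>();
        ok = check(dev_x, 2.0f);
    }

    // Option 2: hand the same pointer to the kernel directly.
    if (ok) {
        fill_via_argument<<<1, E>>>(dev_x);
        ok = check(dev_x, 3.0f);
    }

    cudaFree(dev_x);
    return ok ? 0 : 1;
}

int main()
{
    int status = run_pointer_demo();
    printf("%s\n", status == 0 ? "ok" : "failed");
    return status;
}
```

Passing the pointer as a kernel argument is usually the simpler of the two, since it needs no global variable at all.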

There is another problem in the code above: the loop

while(sum < E) {
        c[sum]= x[sum];
        sum++;
}

with sum as a global variable only happens to work through undefined behavior that you cannot rely on, because all threads read and increment the same variable without any synchronization. Use local (automatic) variables for loop counters instead. In this example, you likely don't want a loop at all; just have each thread do a different piece of work:

int idx = threadIdx.x + blockDim.x * blockIdx.x;
if (idx < E) {
        c[idx]= x[idx];
}

Thank you so much for your quick answer. However, the problem seems to remain. I just want to avoid the cudaMalloc() call because of time constraints; I thought there would be a way to create the data array directly on the device without calling cudaMalloc(). Is there any way?

For a fixed-size array, just declare the array in global memory (as a __device__ array) and use cudaMemcpyToSymbol() / cudaMemcpyFromSymbol() to access it from the host. For a variable-size array, there is no way around some kind of memory allocation call.
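A minimal sketch of the fixed-size variant, assuming E is known at compile time as in the question. The scale() kernel and run_symbol_demo() are hypothetical names, and the data is arbitrary:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define E 10

// Fixed-size array in global memory; no cudaMalloc() is needed.
__device__ float x[E];

// Each thread scales one element of the global array in place.
__global__ void scale(float factor)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < E)
        x[idx] *= factor;
}

// Hypothetical driver: returns 0 on success, 1 on any failure.
int run_symbol_demo()
{
    float host[E];
    for (int i = 0; i < E; i++)
        host[i] = (float)i;

    // Host -> device: fill the __device__ array.
    if (cudaMemcpyToSymbol(x, host, sizeof(host)) != cudaSuccess)
        return 1;

    scale<<<1, E>>>(3.0f);

    // Device -> host: read the results back from the symbol.
    if (cudaMemcpyFromSymbol(host, x, sizeof(host)) != cudaSuccess)
        return 1;

    for (int i = 0; i < E; i++)
        if (host[i] != 3.0f * i)
            return 1;
    return 0;
}

int main()
{
    int status = run_symbol_demo();
    printf("%s\n", status == 0 ? "ok" : "failed");
    return status;
}
```

The size has to be a compile-time constant here, which is the trade-off for skipping the allocation call.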