Using arrays-what's the correct way ?

After going through the forums, I have realized that it’s much better to use 1D arrays. I have allocated a 1D array as follows:

[codebox]

/* fill zeros for U,V,Uin,Vin*/

   //Host arrays

double *U, *V, *Uin, *Vin;

U = (double*) malloc (sizez * sizeof(double));

V = (double*) malloc (sizez * sizeof(double));

Uin = (double*) malloc (sizez * sizeof(double));

Vin = (double*) malloc (sizez * sizeof(double));

	

for(v = 0; v < sizez; ++v)

{

	U[v] = 0;

	V[v] = 0;

	Uin[v] = 0;

	Vin[v] = 0;

	// printf("U[%d] = %f, V[%d] = %f, Uin[%d] = %f, Vin[%d] = %f \n", v, U[v], v, V[v], v, Uin[v], v, Vin[v]);

}



    //device arrays

double *U_d, *V_d, *Uin_d, *Vin_d;

cudaMalloc((void**) &U_d, sizeof(double) * sizez);

cudaMalloc((void**) &V_d, sizeof(double) * sizez);

cudaMalloc((void**) &Uin_d, sizeof(double) * sizez);

cudaMalloc((void**) &Vin_d, sizeof(double) * sizez);



cudaMemcpy(U_d, U, sizeof(double) * sizez, cudaMemcpyHostToDevice);

cudaMemcpy(V_d, V, sizeof(double) * sizez, cudaMemcpyHostToDevice);

cudaMemcpy(Uin_d, Uin, sizeof(double) * sizez, cudaMemcpyHostToDevice);

cudaMemcpy(Vin_d, Vin, sizeof(double) * sizez, cudaMemcpyHostToDevice);

// Check values on device

   for(v = 0; v < sizez; ++v)

{

	printf("U_d[%d] = %f, V_d[%d] = %f, Uin_d[%d] = %f, Vin_d[%d] = %f \n", v, U_d[v], v, V_d[v], v, Uin_d[v], v, Vin_d[v]);

}

[/codebox]

Questions:

  1. Is this the right way? Is there a better way to allocate arrays?

  2. The printf for the loop in the device arrays gives me a segmentation fault, when I run the program on the device. No errors when I run it in emulation mode. My device is a Tesla C1060. Can someone explain why ??

Thanks

  1. It is the right way.

If an array is copied from host to device or back often (when those transfers create a bottleneck) you might consider using cudaMallocHost. This will allocate your host arrays in a special way that will allow faster HtD/DtH transfers with the cost of potentially less stable system (if you do it to much). This is called pinned memory or page-locked memory, look it up. But your method is the default one and is correct.

  1. You can’t call host functions from device or vice versa, with the sole exceptions of calling kernels (global functions) from the host. There are more reasons why printf cannot work on device but this one is sufficent - device and host functions are separate domains, as is device and host memory.

It works in emulation mode because emulation makes device code execute on the host and this domain division isn’t rigorously checked.

Thanks for the reply. AT the same time, if I don’t do a printf and just assign another value to the variable U_d, I get a segmentation fault. My debugger shows that a segmentation fault is generated when I assign a value, for eg: Uin_d[v] = 1;

Is there a different way to change the values inside an array?

The host and device have entirely separate address spaces. You cannot dereference a device pointer on the host. And thus any read or write of Uin_d[v] should be segfaulting.

The only way to et values to/from the device from/to the host is to use cudaMemcpy.

Hmm So I changed my code and did it this way. Now the array does not get updated with new values?? Any suggestions??

[codebox]// includes, system

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

#include <time.h>

#include <assert.h>

// includes, project

#include <cufft.h>

#include <cutil.h>

void checkCUDAError(const char *msg)

{

cudaError_t err = cudaGetLastError();

if( cudaSuccess != err) 

{

    fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );

    exit(-1);

}                         

}

// Kernel that executes on the CUDA device

global void alloc_array(double *a, int N)

{

int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (idx<N)

{

a[idx] = a[idx] + 10;

// printf("blockIdx.x =%d, blockDim.x = %d, threadIdx.x  = %d\n",blockIdx.x, blockDim.x, threadIdx.x );

}

}

int main()

{

/******************** Get Device info *******************/

// check the compute capability of the device

int cuda_device = 0;

int num_devices=0;

// float elapsed_time;



CUDA_SAFE_CALL( cudaGetDeviceCount(&num_devices) );

if(0==num_devices)

{

    printf("your system does not have a CUDA capable device\n");

    return 1;

}



// check if the command-line chosen device ID is within range, exit if not

if( cuda_device >= num_devices )

{

    printf("choose device ID between 0 and %d\n", num_devices-1);

    return 1;

}

cudaSetDevice( cuda_device );

cudaDeviceProp device_properties;

CUDA_SAFE_CALL( cudaGetDeviceProperties(&device_properties, cuda_device) );

if( (1 == device_properties.major) && (device_properties.minor < 1))

{

	printf("%s does not have compute capability 1.1 or later\n\n", device_properties.name);

}



printf("running on: %s\n\n", device_properties.name );



/******************* acquired device info *************/



int sizex, sizey, sizez;

sizex = 43;

sizey = 72;

sizez = sizex * sizey;

int v;

/* fill zeros for U,V,Uin,Vin*/

double *U, *V, *Uin, *Vin;

CUDA_SAFE_CALL(cudaMallocHost((void**)&U, sizeof(double) * sizez));

CUDA_SAFE_CALL(cudaMallocHost((void**)&V, sizeof(double) * sizez));

CUDA_SAFE_CALL(cudaMallocHost((void**)&Uin, sizeof(double) * sizez));

CUDA_SAFE_CALL(cudaMallocHost((void**)&Vin, sizeof(double) * sizez));



// Check for any CUDA errors

checkCUDAError("cudaMallocHost calls");



for(v = 0; v < sizez; ++v)

{

	U[v] = 0;

	V[v] = 0;

	Uin[v] = 0;

	Vin[v] = 0;

	// printf("U[%d] = %f, V[%d] = %f, Uin[%d] = %f, Vin[%d] = %f \n", v, U[v], v, V[v], v, Uin[v], v, Vin[v]);

}



double *U_d, *V_d, *Uin_d, *Vin_d;

CUDA_SAFE_CALL(cudaMalloc((void**)&U_d, sizeof(double) * sizez));

CUDA_SAFE_CALL(cudaMalloc((void**)&V_d, sizeof(double) * sizez));

CUDA_SAFE_CALL(cudaMalloc((void**)&Uin_d, sizeof(double) * sizez));

CUDA_SAFE_CALL(cudaMalloc((void**)&Vin_d, sizeof(double) * sizez));



// Check for any CUDA errors

checkCUDAError("cudaMalloc calls");



cudaMemcpy(U_d, U, sizeof(double) * sizez, cudaMemcpyHostToDevice);

cudaMemcpy(V_d, V, sizeof(double) * sizez, cudaMemcpyHostToDevice);

cudaMemcpy(Uin_d, Uin, sizeof(double) * sizez, cudaMemcpyHostToDevice);

cudaMemcpy(Vin_d, Vin, sizeof(double) * sizez, cudaMemcpyHostToDevice);



// Check for any CUDA errors

checkCUDAError("cudaMemcpy calls");



// Do calculation on device:  

int block_size;

int n_blocks = 30;

block_size = sizez/n_blocks + (sizez%n_blocks == 0? 0:1);

// int n_blocks = sizez/block_size + (sizez%block_size == 0 ? 0:1);

printf("n_blocks = %d\t block_size = %d\n", n_blocks, block_size); 

alloc_array <<< n_blocks, block_size >>> (U_d, sizez);



// Copy U_d to host

cudaMemcpy(U, U_d, sizeof(double) * sizez, cudaMemcpyDeviceToHost);

// Check values on host

for(v = 0; v < sizez; ++v)

{

	printf("U[%d] = %f, V[%d] = %f, Uin[%d] = %f, Vin[%d] = %f \n", v, U[v], v, V[v], v, Uin[v], v, Vin[v]);

}



cudaFree(U_d);

cudaFree(Uin_d);

cudaFree(V_d);

cudaFree(Vin_d);

cudaFree(U);

cudaFree(V); 

cudaFree(Uin); 

cudaFree(Vin);

return 0;

}

[/codebox]

Are you compiling with the option -arch sm_13?

If you are not, your use of doubles will result in undefined behavior.

Thank you for letting me know about this option. This fixed the issues. I am surprised that this option is not mentioned in the Programming guide or the reference manual.

Thanks External Media