constant memory usage problem unexpected behavior using constant memory

Dave_K · March 27, 2009, 2:57pm

When I try to add a value in constatnt memory to another constant it gets what seem to be the wrong answer.

I am new to Cuda and probably doing something wrong - but I am baffled and would appreciate someone setting me straight.

I tried to make the code example concise.

Thanks,
Dave

Emulator output (expected):

0 1.000000 +1 = 2.000000 ?
1 2.000000 +1 = 3.000000 ?
2 3.000000 +1 = 4.000000 ?
3 4.000000 +1 = 5.000000 ?
4 5.000000 +1 = 6.000000 ?
5 6.000000 +1 = 7.000000 ?

Release mode output:

0 1.000000 +1 = 1.000000 ?
1 2.000000 +1 = 1.000000 ?
2 3.000000 +1 = 1.000000 ?
3 4.000000 +1 = 1.000000 ?
4 5.000000 +1 = 1.000000 ?
5 6.000000 +1 = 1.000000 ?

Press ENTER to exit…

//
// Main Cuda routines for LLE Ray Trace
//
#include <stdlib.h> // System include files
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <float.h>
#include <cutil_inline.h> // Cuda utility routines that came with the examples

device constant double my_consts[6];
global void kernel(double d1, double d2)
{
unsigned int tid = threadIdx.x; //Ray Id
double t=my_consts[tid];
d1[tid] = t;
d2[tid] = t+1;
}
void runTest( int argc, char argv);

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main( int argc, char** argv)
{
runTest( argc, argv);

cutilExit(argc, argv);
}
////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void
runTest( int argc, char** argv)

{
//
// Use command-line specified CUDA device, otherwise use device with highest Gflops/s
//
if( cutCheckCmdLineFlag(argc, (const char**)argv, “device”) )
cutilDeviceInit(argc, argv);
else
cudaSetDevice( cutGetMaxGflopsDeviceId() );
//
// Allocate
//
double h_dat[6]={1.0,2.0,3.0,4.0,5.0,6.0};
cutilSafeCall(cudaMemcpyToSymbol(my_consts, h_dat, 48,0, cudaMemcpyHostToDevice));

double* h_out1 = (double*) malloc(48);
double* h_out2 = (double*) malloc(48);
double* d_dat1;
cutilSafeCall( cudaMalloc( (void**) &d_dat1, 48));
double* d_dat2;
cutilSafeCall( cudaMalloc( (void**) &d_dat2, 48));

dim3 grid(1, 1, 1);
dim3 block(6,1,1);
//
// Execute the kernel
//
kernel<<< grid,block >>>(d_dat1,d_dat2);
cudaThreadSynchronize();
cutilCheckMsg(“Kernel execution failed”);
//
// Copy result deposit back to host
//
cutilSafeCall( cudaMemcpy( h_out1, d_dat1, 48,
cudaMemcpyDeviceToHost) );
cutilSafeCall( cudaMemcpy( h_out2, d_dat2, 48,
cudaMemcpyDeviceToHost) );
for( unsigned int i = 0; i < 6; i++)
{
printf(“%i %f +1 = %f ?\n”,i,h_out1[i],h_out2[i]);
}
cudaThreadExit();
}

shifter1 · March 27, 2009, 3:30pm

Are you using a device with compute capability 1.2 or greater? 9000 series cards and lower cannot use doubles. Try running your code with floats. (ie, you only need to malloc 24 bytes).

Noel_Lopes · March 27, 2009, 4:49pm

Well it might not be this, but you shouldn’t do it:
double* h_out1 = (double*) malloc(48);

instead you should do:
double* h_out1 = (double*) malloc(6*sizeof(double));

or preferably
#define ARRAY_SYZE (6)
//…
double* h_out1 = (double*) malloc(ARRAY_SYZE*sizeof(double));

Same goes for cudaMalloc, …

Everything looks fine otherwise.

Dave_K · March 27, 2009, 5:15pm

I am using a Tesla C1060 1.3 card. I tried it with floats and it DOES works properly.

It least I think I am using that card. I do have a slower card in the machine as well - but the faster card is supposed to be selected by the initial logic:

if( cutCheckCmdLineFlag(argc, (const char**)argv, “device”) )

  cutilDeviceInit(argc, argv);

else

  cudaSetDevice( cutGetMaxGflopsDeviceId() );

Is there a way to verify?

Dave_K · March 27, 2009, 5:19pm

Thanks Noel. They were initially set as you suggest - I just tried to cut the problem source as short as I could.

Dave_K · March 27, 2009, 5:35pm

I verfied it is using the right card by debugging through cutGetMaxGflopsDeviceId.

Although float works - I need double precision and that is why I specifically chose the Tesla card - so I am still searching for a fix.

Thanks for the help Shifter1

shifter1 · March 27, 2009, 6:12pm

This might just be crazy talk, but I think you also need to add a compiler flag to enable the use of double precision floats.

EDIT: https://www.cs.virginia.edu/~csadmin/wiki/i…ouble-precision

Dave_K · March 27, 2009, 7:06pm

Good thought - but already do the equivalent for the PC intel compiler I am using by supplying $(CUDA_BIN_PATH)\nvcc.exe" -arch sm_13 …

SPWorley · March 28, 2009, 12:44am

From a quick scan, it looks like you’re copying to the constant symbol correctly, I don’t see anything obviously wrong from just an eyeball pass.
As a sanity check, try using a regular array (passed by pointer as a kernel argument). That will make sure that your problems really are with constants and not with double precision support.

I know you say that you’re passing the -arch sm_13 flags properly, but that’s still the most common failure when you’re dealing with DP, so you may double and triple check that it’s really being applied.

Dave_K · March 30, 2009, 4:39pm

As a sanity check, try using a regular array (passed by pointer as a kernel argument). That will make sure that your problems really are with constants and not with double precision support.

I did as you suggested and it verfies the problem is only with constants. The code I used follows and returns the correct vaules for both emudebug and release modes. But the original version with the constants still fails.

I know you say that you’re passing the -arch sm_13 flags properly, but that’s still the most common failure when you’re dealing with DP, so you may double and triple check that it’s really being applied.

[/quote]

I again verified. Testing with the regualer passed arrays confirmed this.

I even tried installing the latest and greatest CUDA device drivers and software (2.1). All to no avail.

Any other ideas out there?

//

// Main Cuda routines for LLE Ray Trace

//

#include <stdlib.h> // System include files

#include <stdio.h>

#include <string.h>

#include <math.h>

#include <float.h>

#include <cutil_inline.h> // Cuda utility routines that came with the examples

device constant double my_consts[6];

//global void kernel(double *d1, double *d2)

global void kernel(double *dd, double *d1, double *d2)

{

unsigned int tid = threadIdx.x; //Ray Id

// double t=my_consts[tid];

double t=dd[tid];

d1[tid] = t;

d2[tid] = t+(double)1.0;

}

void runTest( int argc, char** argv);

////////////////////////////////////////////////////////////////////////////////

// Program main

////////////////////////////////////////////////////////////////////////////////

int

main( int argc, char** argv)

{

runTest( argc, argv);

cutilExit(argc, argv);

}

////////////////////////////////////////////////////////////////////////////////

//! Run a simple test for CUDA

////////////////////////////////////////////////////////////////////////////////

void

runTest( int argc, char** argv)

{

//

// Use command-line specified CUDA device, otherwise use device with highest Gflops/s

//

if( cutCheckCmdLineFlag(argc, (const char**)argv, “device”) )

  cutilDeviceInit(argc, argv);

else

  cudaSetDevice( cutGetMaxGflopsDeviceId() );

//

// Allocate host grid and ray data structures

//

double h_dat[6]={1.0,2.0,3.0,4.0,5.0,6.0};

cutilSafeCall(cudaMemcpyToSymbol(my_consts, h_dat, 48,0, cudaMemcpyHostToDevice));

double* h_out1 = (double*) malloc(48);

double* h_out2 = (double*) malloc(48);

double* d_dat1;

cutilSafeCall( cudaMalloc( (void**) &d_dat1, 48));

double* d_dat2;

cutilSafeCall( cudaMalloc( (void**) &d_dat2, 48));

double* d_dat;

cutilSafeCall( cudaMalloc( (void**) &d_dat, 48));

cutilSafeCall( cudaMemcpy( d_dat, h_dat, 48,

                            cudaMemcpyHostToDevice) );

dim3 grid(1, 1, 1);

dim3 block(6,1,1);

//

// Execute the kernel

//

// kernel<<< grid,block >>>(d_dat1,d_dat2);

kernel<<< grid,block >>>(d_dat,d_dat1,d_dat2);

  cudaThreadSynchronize();

cutilCheckMsg(“Kernel execution failed”);

//

// Copy result deposit back to host

//

cutilSafeCall( cudaMemcpy( h_out1, d_dat1, 48,

                         cudaMemcpyDeviceToHost) );

cutilSafeCall( cudaMemcpy( h_out2, d_dat2, 48,

                         cudaMemcpyDeviceToHost) );

for( unsigned int i = 0; i < 6; i++)

{

  printf("%i %f +1 = %f ?\n",i,h_out1[i],h_out2[i]);

}

cudaThreadExit();

}

Dave_K · March 31, 2009, 9:06pm

I am really stuck on this one.

Jamie_K · March 31, 2009, 11:31pm

Bizarre. Everything looks right. I only have a 1.1 device so I can’t replicate the problem.

tmurray · April 1, 2009, 12:09am

mfatica and I agree that this looks like our bug–we’ve filed this one and will let you know.

Gimurk · April 1, 2009, 4:41am

is that the true?

there is also a Tesla C1060 in my workstation, i even can’t copy data to a constant array although the syntax i am using is similar to Dave K’s…

[codebox]include <stdio.h>

include <assert.h>

include <cuda.h>

define SIZE 1024*1000

define N 1000

constant int c[N];

int main(void) {

int *a_h, *b_h, *c_h;  // pointers to host memory

int *b_d;        // pointers to device memory

int i;



// allocate arrays on host

a_h = (int *)malloc(sizeof(int)*SIZE);

b_h = (int *)malloc(sizeof(int)*SIZE);

c_h = (int *)malloc(sizeof(int)*N);

// allocate arrays on device

cudaMalloc((void **)&b_d, sizeof(int)*SIZE);

// initialize host data

for(i=0; i<SIZE; i++) {

	a_h[i] = N+i;

	b_h[i] = i;

}

for(i=0; i<N; i++) {

	c_h[i] = 1;

}

// copy data from host to device

cudaMemcpy(b_d, b_h, sizeof(int)*SIZE, cudaMemcpyHostToDevice);

cudaMemcpyToSymbol(c, c_h, sizeof(int)*N, 0, cudaMemcpyHostToDevice);

printf(“c_h[0] = %d, c[0] = %d\n”, c_h[0], c[0]);

…

}[/codebox]

note: i debug it using emu mode

Dave_K · April 1, 2009, 4:49pm

OK. Thanks.

It seems to work OK in single precision so I will work from there for now - but the real application is double precision so I am anxiusly awaiting a fix.

tmurray · April 1, 2009, 4:52pm

Compiler team seems to have already fixed it between the 2.2 beta release and now, so you should get the fix with 2.2 final. I’ll double check to make sure that the problem is actually fixed.