Constant Memory problem

Dear All

http://cuda-programming.blogspot.pt/2013/01/what-is-constant-memory-in-cuda.html

I am having problems using constant memory. I done as the link above (complete example). The only difference is that I passed the constant memory pointer (or offset of it) as a parameter to inside the kernels and use it inside. I am getting wrong results contrary to before done that. My processing capability is 3.0 and 3.5.

__constant__ __device__ unsigned char input1[65512]; //for one antenna, must be allocated for more antennas

void main()
{

unsigned char frameori[65512];

//fill frameori with data

cudaMemcpyToSymbol((void *)input1, (void *)frameori, 512+sizeof(float)*(NRSAMPLES*2*NUMBEROFANTENNAELEMENTS+NRSAMPLES*2*NUMBEROFANTENNAELEMENTS+SUBSET+NTAPS*NUSERS*NUMBEROFANTENNAELEMENTS*4)+sizeof(int)*NTAPS*NUSERS*NUMBEROFANTENNAELEMENTS,0, cudaMemcpyHostToDevice);

cudaDeviceSynchronize();

delaytran=(int *)(input1+512+sizeof(float)*(NRSAMPLES*2*nant+NRSAMPLES*2*nant+SUBSET+ntaps*nusers*nant*2));
   	   tapreal=(float *)(input1+512+sizeof(int)*ntaps*nusers*nant+sizeof(float)*(NRSAMPLES*2*nant+NRSAMPLES*2*nant+SUBSET+ntaps*nusers*nant*2));
          tapimag=(float *)(input1+512+sizeof(int)*ntaps*nusers*nant+sizeof(float)*(NRSAMPLES*2*nant+NRSAMPLES*2*nant+SUBSET+ntaps*nusers*nant*3));

ciclo4<<<NRSAMPLES/32,32,0,stream[z5]>>>((complex1 *)(input1+512+sizeof(complex1)*(NRSAMPLES*nant)), timetotal,timeuser,comp1[z5],maxdelay+atraso,g,ntaps,nant,nusers,z5,
            		   tapreal,tapimag,delaytran);

}

__global__ void ciclo4(complex1 *frame1,complex1 *timetotal,complex1 *timeuser,complex1 *comp1,int maxdelayatraso,int g,
		int ntaps,int nant,int nusers,int z5,
		   float *tapreal,float *tapimag,int *delaytran)
{
	int i1=blockIdx.x * blockDim.x + threadIdx.x;
    complex1 const7,const8,const9;
    complex1 *inri;
	complex1 *user,*total;
	int nr,t1,delay2;

	const9.r=0;
	const9.i=0;
	      for(nr=0 ; nr < nant ; nr++)
    {
		  inri=frame1+NRSAMPLES*nr;
		  user=(timeuser+NRSAMPLES*(z5*nant+nr));
          total=(timetotal+NRSAMPLES*nr);

for(t1=0; t1 < ntaps ;t1++)
{
delay2=maxdelayatraso-delaytran[t1+g*ntaps+nr*nusers*ntaps];
if ((i1-delay2) >= 0)
{
const7.r=(inri+i1-delay2)->r-total[i1-delay2].r+user[i1-delay2].r;
 const7.i=(inri+i1-delay2)->i-total[i1-delay2].i+user[i1-delay2].i;
const8.r=*(tapreal+g*ntaps+t1+nr*nusers*ntaps);
const8.i=-*(tapimag+g*ntaps+t1+nr*nusers*ntaps);

const9.r+=const7.r* const8.r - const7.i * const8.i;
const9.i+=const7.r * const8.i + const7.i * const8.r;
}
}
} //NRANTENNAS
comp1[i1].r=const9.r;
comp1[i1].i=const9.i;
}

Thanks

Luis Gonçalves

don’t pass the constant memory pointer to the kernel. This is effectively violating the CUDA rule that host code cannot take the address of a device variable (or function).

The constant memory variable has module/translation-unit scope and can be used directly without passing it explicitly. If you need to use an offset version, pass the offset only, and add that offset to the variable (pointer) in the kernel code directly.

May I make pointer conversions inside kernels with the pointer of the __constant _ variable? eg

__constant__ unsigned char input1[1000];

__global__ void kernel()
{
    float a;

   a=*((float *)(input1+offset)+offsetfloat);

}

Yes, you can create a pointer pointing to a location inside the constant array in the device code. Here is an example from the CUDA math library (file math_functions_dbl_ptx3.h):

static __constant__ double __cudart_sin_cos_coeffs[16] =
{
  [...]
};

static __forceinline__ double __internal_sin_cos_kerneld(double x, int i)
{    
  const double *coeff = __cudart_sin_cos_coeffs + 8 * (i & 1);
  [...]
}

However, the code you show looks risky. All data on the GPU must be naturally aligned, that is the alignment must be a multiple of the size of each data item. Depending on the value of ‘offset’ in your code, the resulting pointer may not be suitably aligned to access a four-byte ‘float’.

“offset” is a multiple of “sizeof(float)”. Is it aligned? I hope so.

Note that it is also possible that input1 itself is not 4-byte aligned. It would probably be best to declare input1 as an array of ‘float’. You could also use an array of ‘uchar4’, or use the align attribute with the ‘unsigned char’ array.

The code bellow is giving the following error. I double checked and I think that the indexation of input1 is right.

cudaCheckError() failed at D:/zipback/user/cuda/kernel.cu:1115
: invalid device symbol

static __align__(4) __constant__ unsigned char input1[65512]; 

int main( int argc, char *argv[ ] )
{

for(z5=0;z5 < z7;z5++)
{
         	signature1<<<20,SYMB,0,stream[z5]>>>(comp1[z5],(framecod)+((z5 << 1))*(SUBSET),(framecod)+((z5 << 1)+1)*(SUBSET),real_codigo,imag_codigo,delaytran,amptran,fasetran,g,const1,ntaps,nant,nusers,z7,z5);
}
}


__global__ void	signature1(complex1 *comp1, float *pont1,float *pont2,float *real_codigo,float *imag_codigo,
		int delaytran, int amptran, int fasetran,int g,float const1,int ntaps,int nant,int nusers,int z7,int z5)
 {

	int nu,k,k1,nr,shift,tran1;
	float const5, const6,const7,const8,const9;

	k=blockIdx.x * blockDim.x + threadIdx.x;
   // k1=blockDim.x*gridDim.x;

	    comp1[k].r=0;
	    comp1[k].i=0;
	    	if (k>=HALFSUB){ k1=NRSAMPLES-SUBSET+k;

        	}
        	else
        		k1=k;
for(nr=0;nr<nant;nr++)
{
	const6=0;
	const7=0;
	        shift=g*ntaps+nr*nusers*ntaps;
	        tran1=SUBSET*(z7*nr+z5)+k;
      	    for(nu=0; nu < ntaps ; nu++)
      		{
      		    const5=(float)(k1)*const1*((float)(*((int *)(input1+delaytran+sizeof(int)*(shift+nu)))))+*((float *)(input1+fasetran+sizeof(float)*(shift+nu)));
      		    sincosf(const5, &const8, &const9);
      		    const5=*((float *)(input1+amptran+sizeof(float)*(shift+nu)));
				const6+=const5*const9;
      		    const7+=const5*const8;
      		}
                   *(real_codigo+tran1)=(const6* *(pont1+k)-const7* *(pont2+k));
                   *(imag_codigo+tran1)=(const6* *(pont2+k)+const7* *(pont1+k));
}

}

Which line in the above snippet corresponds to kernel.cu:1115 ? The code above is not a buildable sample code I could use to try and reproduce the issue. The compiler presumably also tells you the symbol name of whatever objects is thinks is not a device object. What is that symbol name, relative to the code you posted?

The error is a runtime error. I have some code that I get from internet to report errors. The code get last error.

The line is in after call kernel signature in main. Then the error is in signature kernel.

The text of the error message suggests it pertains to a cudaMemcpyToSymbol() or similar API call, not a kernel invocation.

You are right. See bellow the error code. The transfer size is less than 64K

cudaMemcpyToSymbol((void )input1, (void )frameori, 512+sizeof(float)(NRSAMPLES2NUMBEROFANTENNAELEMENTS+NRSAMPLES2NUMBEROFANTENNAELEMENTS+SUBSET
+NTAPS
NUSERSNUMBEROFANTENNAELEMENTS4)+sizeof(int)NTAPSNUSERS*NUMBEROFANTENNAELEMENTS,0, cudaMemcpyHostToDevice);

I have never use cudaMemcpyToSymbol(), but this doesn’t look right to me:

cudaMemcpyToSymbol((void *)input1, (void *)frameori, ...);

From what I can tell from the documentation, the first argument should be just the symbol name:

cudaMemcpyToSymbol(input1, (void *)frameori, ...);

Thanks