Tables not correct when using __constant__

Hi,

I am currently implementing a lookup-based implementation of AES ctr encryption using CUDA.

The lookup-tables are defined as following:

device u32 Te0[256] =
{
0xc66363a5U, 0xf87c7c84U, 0xee777799U, 0xf67b7b8dU,
… (256 entries in total)
}

This works perfectly, but I want to increase speed by having it in constant memory. Defining the table as

device constant u32 Te0[256] =
{
0xc66363a5U, 0xf87c7c84U, 0xee777799U, 0xf67b7b8dU,
… (256 entries in total)
}

compiles, and runs about 10-15 times faster, however the output is wrong. Am I using the constant qualifier wrong, or could anyone give me any hints on how to use constant memory?

If required I can post the whole code.

Sincerly Magne Eimot

Try just “constant u32 Te0 …” (i.e. remove the device).

I have never used constant. BUt your explanation is quite baffling. Is it based on any of your strange experience with CUDA?

This is how we use constants:

float h_x_plane1_x1 = (x_plane[0] - x1); float h_x_planeN_x1 = ((x_plane[0]+dimx*dx) - x1);

CUDA_SAFE_CALL(cudaMemcpyToSymbol("x_plane1_x1", &h_x_plane1_x1, sizeof(h_x_plane1_x1)));

Maybe you can do something with it. I think it is the same for an array of constants.

oops forgot to mention:

__device__ __constant__ float x_plane1_x1;

this part is inside a header file like we use it.

Ur usage is fine. But I would be surprised if we cannot declared look-up tables in constant memory in the data-segment. It would hinder readability.

In the worst case, one has to declare this constant table in host memory and then do a symbol copy as per jordy…, and get it done.

Can some1 more knowledgeable comment here?

I only said to remove the device because in my code that successfully uses constants, I don’t have it :)

Maybe it is a compiler issue. Compile with -keep and check the .ptx and see if your data is there.

I see… :)

Actually, I remembered it like this:

When you say “device” – it tells the compiler that it is a GPU element.

And, you further qualify it with “constant”, “global”, etc… Thats why I was surprised by your answer. ANyway, Never mind…

I once tried that without the device and gave a lot of errors on my side. Tooklit version 1.0

Yes I’m still using the old toolkit.

OK, this has gotten a little off topic since my original post. Check the manual section 4.2.2.2 (CUDA 1.1 at least). It specifically states that the use of device with constant is optional. In both cases, the declared variable resides in the constant memory space on the device.

To get back on topic for the OP, you are going to need to narrow your problem down to a minimal reproduction and post the code here. I just wrote a test using an initialized constant array and had no problems whatsoever.

#include <stdio.h>

#  define CUDA_SAFE_CALL( call) do {                                         \

    cudaError err = call;                                                    \

    if( cudaSuccess != err) {                                                \

        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",        \

                __FILE__, __LINE__, cudaGetErrorString( err) );              \

    exit(EXIT_FAILURE);                                                      \

    } } while (0)

#ifdef NDEBUG

#define CUT_CHECK_ERROR(errorMessage)

#else

 #  define CUT_CHECK_ERROR(errorMessage) do {                                 \

    cudaThreadSynchronize();                                                \

    cudaError_t err = cudaGetLastError();                                    \

    if( cudaSuccess != err) {                                                \

        fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n",    \

                errorMessage, __FILE__, __LINE__, cudaGetErrorString( err) );\

        exit(EXIT_FAILURE);                                                  \

    } } while (0)

#endif

__device__ __constant__ int constA[32] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};

__constant__ int constB[32] = {10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131};

__global__ void copy_gmemA(int* g_odata)

	{

	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	g_odata[idx] = constA[threadIdx.x];

	}

__global__ void copy_gmemB(int* g_odata)

	{

	const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

	g_odata[idx] = constB[threadIdx.x];

	}

int main()

	{

	int *d_odata, *h_odata;

	int len = 32;

	int num_threads = 32;

	CUDA_SAFE_CALL( cudaMalloc((void**)&d_odata, sizeof(int)*(len)) );

	h_odata = (int *)malloc(sizeof(int) * len);

	dim3  threads(num_threads, 1, 1);

	dim3  grid(1, 1, 1);

	

	copy_gmemA<<< grid, threads >>>(d_odata);

	CUDA_SAFE_CALL( cudaMemcpy(h_odata, d_odata, sizeof(int)*len, cudaMemcpyDeviceToHost) );

	printf("A: ");

	for (int i = 0; i < 32; i++)

  printf("%d ", h_odata[i]);

	printf("\n\n");

	copy_gmemB<<< grid, threads >>>(d_odata);

	CUDA_SAFE_CALL( cudaMemcpy(h_odata, d_odata, sizeof(int)*len, cudaMemcpyDeviceToHost) );

	printf("A: ");

	for (int i = 0; i < 32; i++)

  printf("%d ", h_odata[i]);

	

	return 0;

	}

When I run this, I get the expected output:

$ ./constant_test 

A: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 

A: 10 11 12 13 14 15 16 17 18 19 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131

Ok, I have had the same problem for a long time, with 1.0 and 1.1 version (as far as I remember). At some point, and copying line by line the SDK examples, I realized that the only way it works is to put the copy command, the constant declaration and the kernel in the same .cu file. Sincerely, I don’t understand why, but after a long trial and error process it is the only way it works.

Did anyone else experience the same solution? Could anyone else having the same problem please try it to see if we are talking about the same problem?

This is expected behavior. The programming guide states that variables declared constant have implied static storage in the file that they are defined in. The same goes for texture references and shared memory declarations. All must be in the same .cu file in order to be accessed by both host and device code in that file. The simple (though inelegant) solution is to include all of your .cu files into one "big.cu’ file and only compile that one.

is it always the case with the 2.2 ?

In fact I have a very strange problem.

I use constant memory.

All the declarations are done in a .cu file which is included in all the files using those constants.

It seemed to be ok, but now I have a very strange behavior.

I have sometimes wrong results in certain part of my program.

If I rebbot my PC and launch the .exe I have wrong results. Then, if I launch an other exe (a previous

without the problem), and I re-lauch after the first one, I have the good result.

So my results are different after a boot or a launch of another program.

Of course in EMuDebug, nothing appears.

Can it come from my declaration of the .cu in an included file ?

Does someone had the same problem, or Ihave an idea about this behaviour?

Thanks in advance.

I have to agree, definitely not elegant :) I used to do that but I get pleasure watching the compilation go smoothly for many many files :)

My solution to this issue is to implement host functions like InitConst*(…) in the .cu file where kernel implementation and constant definition is and call this init function from another .cu file (ex: main).