Constant and shared memory

Hello

I have two questions about constant and shared memory in CUDA.

(1) Can I copy a float array from host memory to device constant memory? If so, how? Could anyone give me a simple example that copies a float array from host memory to device constant memory?

(2) I want to create shared memory dynamically. Suppose I want to create two 2D float arrays in shared memory whose sizes depend on a variable, like

  __shared__ float array1;

  __shared__ float array2;

I know the size has to be supplied dynamically when the kernel function is launched, but I ran into problems when compiling. Could anyone give me a very simple example of how to do that?

Thank you very much.

Declaration (at file scope, outside any function):

__constant__ float dFloats[1024];

In main():

float hFloats[1024] = {123.4f, 1.0f}; // fill hFloats on the host…

CUDA_SAFE_CALL(cudaMemcpyToSymbol(dFloats, hFloats, 1024*sizeof(float), 0, cudaMemcpyHostToDevice));

Your kernel can then read (but not write) the dFloats array.
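For anyone who wants something that compiles on its own, here is a minimal sketch of the same idea as a complete program. The CHECK macro, the kernel name doubleFromConstant and the helper runConstantDemo are made up for this example (CHECK simply stands in for the SDK's CUDA_SAFE_CALL), so treat it as an illustration rather than the one correct way to do it.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper, standing in for the SDK's CUDA_SAFE_CALL.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

const int N = 1024;

// Constant memory is declared at file scope with a compile-time size.
__constant__ float dFloats[N];

// Each thread reads one element from constant memory and doubles it.
__global__ void doubleFromConstant(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = 2.0f * dFloats[i];
}

// Returns the number of mismatching elements (0 means everything matched).
int runConstantDemo()
{
    static float hFloats[N];
    static float hOut[N];
    for (int i = 0; i < N; ++i)
        hFloats[i] = (float)i;

    // Host -> constant memory: symbol, source, byte count, byte offset, direction.
    CHECK(cudaMemcpyToSymbol(dFloats, hFloats, N * sizeof(float), 0,
                             cudaMemcpyHostToDevice));

    float *dOut = NULL;
    CHECK(cudaMalloc((void **)&dOut, N * sizeof(float)));
    doubleFromConstant<<<N / 256, 256>>>(dOut);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(hOut, dOut, N * sizeof(float), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dOut));

    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (hOut[i] != 2.0f * hFloats[i])
            ++mismatches;
    return mismatches;
}

int main(void)
{
    int bad = runConstantDemo();
    printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Note that the kernel refers to dFloats by name instead of receiving it as a pointer argument; the output buffer is ordinary global memory.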

#2

No, you can’t dynamically allocate shared memory.
But you could statically allocate a large chunk and use whatever portion of it you need. If the dimensions keep changing, you could switch to a 1D array and do the index arithmetic yourself, i.e. array1[x+y*xsize].
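A rough sketch of that suggestion, assuming an upper bound of 1024 floats for the static chunk (MAX_ELEMS, mirrorRows and runStaticSharedDemo are names invented for this example). The kernel stores an xsize-by-ysize matrix in a fixed-size shared array with the array1[x+y*xsize] indexing and writes each row back mirrored, so every thread reads a value that another thread stored.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define MAX_ELEMS 1024  // assumed upper bound: 1024 floats = 4 KB of shared memory

// One block handles the whole matrix; only xsize*ysize floats of the chunk are used.
__global__ void mirrorRows(const float *in, float *out, int xsize, int ysize)
{
    __shared__ float array1[MAX_ELEMS];  // size fixed at compile time
    int x = threadIdx.x;
    int y = threadIdx.y;
    bool inside = (x < xsize && y < ysize);

    if (inside)
        array1[x + y * xsize] = in[x + y * xsize];
    __syncthreads();  // reached by every thread in the block

    if (inside)
        out[x + y * xsize] = array1[(xsize - 1 - x) + y * xsize];
}

// Returns the number of wrong elements, or -1 on a size or CUDA error.
int runStaticSharedDemo(int xsize, int ysize)
{
    int n = xsize * ysize;
    if (n <= 0 || n > MAX_ELEMS)
        return -1;

    std::vector<float> hIn(n), hOut(n);
    for (int i = 0; i < n; ++i)
        hIn[i] = (float)i;

    float *dIn = NULL, *dOut = NULL;
    cudaMalloc((void **)&dIn, n * sizeof(float));
    cudaMalloc((void **)&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(xsize, ysize);
    mirrorRows<<<1, block>>>(dIn, dOut, xsize, ysize);

    cudaError_t err = cudaMemcpy(hOut.data(), dOut, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int y = 0; y < ysize; ++y)
        for (int x = 0; x < xsize; ++x)
            if (hOut[x + y * xsize] != hIn[(xsize - 1 - x) + y * xsize])
                ++mismatches;
    return mismatches;
}

int main(void)
{
    printf("mismatches: %d\n", runStaticSharedDemo(16, 8));
    return 0;
}
```

The cost of this approach is that every block reserves the full MAX_ELEMS floats of shared memory, even when the matrix is much smaller.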

(1) Constant memory does not depend on the element type. Just make sure the declaration is placed outside any function. It seems it cannot be allocated dynamically. Copying data into constant memory from another memory region should use cudaMemcpyToSymbol(), for example:

[codebox]__constant__ float cons[256];

float *vals; // host pointer to at least 256 floats

cudaMemcpyToSymbol(cons, vals, sizeof(float) * 256);[/codebox]

(2) An array in shared memory that is sized dynamically through the execution configuration must be one-dimensional. I suggest indexing it as a 1D array rather than as a 2D array.

I tried the following code, but the result is very strange:

1.000000000
2.000000000
3.000000000
4.000000000
5.000000000
6.000000000
7.000000000
8.000000000
9.000000000
10.000000000

Does anyone have any idea what is wrong?

[codebox]__global__ void kernel(float *devicearray, const float *arrayD)
{
    for(int loop=0; loop<10; loop++)
        *(devicearray+loop) = *(arrayD+loop)*10;
    return;
};

__constant__ float arrayD[100];

int main(void)
{
    cudaError error;
    float arrayH[10];
    float *devicearray;

    for(int loop=0; loop<10; loop++)
        *(arrayH+loop) = loop+1;

    cudaMemcpyToSymbol(arrayD, arrayH, sizeof(float)*10, cudaMemcpyHostToDevice);

    cudaMalloc((void **)&devicearray, sizeof(float)*10);

    kernel<<<1,1>>>(devicearray, arrayD);

    error = cudaGetLastError();
    if(error!=cudaSuccess)
        printf("%s\n",cudaGetErrorString(error));

    cudaMemcpy(arrayH, devicearray, sizeof(float)*10, cudaMemcpyDeviceToHost);

    for(int loop=0; loop<10; loop++)
        printf("%10.9f\n", *(arrayH+loop));

    cudaFree(devicearray);
    return 0;
};[/codebox]

That can’t work: the compiler has no way of knowing that the second kernel argument points to constant memory.

You have to use the global variable “arrayD” directly in your kernel code.

I also tried using the global variable directly, but it gives the same result.

Can anyone help me with this problem?

I would deeply appreciate it. Thanks.

You can choose the amount of shared memory dynamically, before the kernel is launched. In other words, array sizes don’t have to be hardcoded.

Examples are in the SDK; search for the file sharedmem.cuh.
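For reference, a minimal sketch of that mechanism, assuming the two logical 2D arrays from the original question have the same rows-by-cols shape (addMirrored, smem and runDynamicSharedDemo are names invented here). The kernel declares one unsized extern __shared__ array, the third launch parameter gives its size in bytes, and the kernel carves the two arrays out of it by pointer offset.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Unsized shared array; its byte size comes from the third launch parameter.
extern __shared__ float smem[];

__global__ void addMirrored(const float *a, const float *b, float *out,
                            int rows, int cols)
{
    int n = rows * cols;
    float *array1 = smem;      // first rows*cols floats
    float *array2 = smem + n;  // next rows*cols floats

    int c = threadIdx.x;
    int r = threadIdx.y;
    int idx = r * cols + c;    // manual 2D -> 1D indexing

    array1[idx] = a[idx];
    array2[idx] = b[idx];
    __syncthreads();

    // Combine with an element of array2 that a different thread stored.
    out[idx] = array1[idx] + array2[(rows - 1 - r) * cols + c];
}

// Returns the number of wrong elements, or -1 on a CUDA error.
int runDynamicSharedDemo(int rows, int cols)
{
    int n = rows * cols;
    size_t bytes = n * sizeof(float);
    std::vector<float> hA(n), hB(n), hOut(n);
    for (int i = 0; i < n; ++i) {
        hA[i] = (float)i;
        hB[i] = (float)(1000 + i);
    }

    float *dA = NULL, *dB = NULL, *dOut = NULL;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dOut, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(cols, rows);
    // Third launch parameter: shared memory bytes for both arrays together.
    addMirrored<<<1, block, 2 * bytes>>>(dA, dB, dOut, rows, cols);

    cudaError_t err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            float expected = hA[r * cols + c] + hB[(rows - 1 - r) * cols + c];
            if (hOut[r * cols + c] != expected)
                ++mismatches;
        }
    return mismatches;
}

int main(void)
{
    printf("mismatches: %d\n", runDynamicSharedDemo(8, 16));
    return 0;
}
```

All extern __shared__ declarations in a kernel start at the same address, which is why the second array is reached through an offset into the same buffer rather than through a second declaration.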

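You’re not passing the right arguments to the copy-to-symbol function. cudaMemcpyToSymbol() takes five arguments: the symbol, the source pointer, the byte count, a byte offset (probably 0 in your case) and the copy direction. You’re giving it only four, so cudaMemcpyHostToDevice ends up in the offset position.

See the CUDA reference manual for the function definition.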
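To make the role of the offset argument concrete, here is a small sketch (table, dumpTable and runOffsetDemo are hypothetical names). It zeroes a 16-element constant array, then copies four floats into it starting at element 8, with the offset given in bytes. In the call from the question, cudaMemcpyHostToDevice (enum value 1) sits in that fourth position, so the runtime would read it as a 1-byte offset.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ float table[16];

// Copies the constant array to global memory so the host can inspect it.
__global__ void dumpTable(float *out)
{
    out[threadIdx.x] = table[threadIdx.x];
}

// Returns the number of wrong elements, or -1 on a CUDA error.
int runOffsetDemo()
{
    float zeros[16] = {0.0f};
    float part[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float back[16];

    // Offset 0: overwrite the whole array with zeros.
    cudaMemcpyToSymbol(table, zeros, sizeof(zeros), 0, cudaMemcpyHostToDevice);
    // Offset of 8 floats, expressed in bytes: fills table[8..11].
    cudaMemcpyToSymbol(table, part, sizeof(part), 8 * sizeof(float),
                       cudaMemcpyHostToDevice);

    float *dOut = NULL;
    cudaMalloc((void **)&dOut, sizeof(back));
    dumpTable<<<1, 16>>>(dOut);
    cudaError_t err = cudaMemcpy(back, dOut, sizeof(back), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess)
        return -1;

    int mismatches = 0;
    for (int i = 0; i < 16; ++i) {
        float expected = (i >= 8 && i < 12) ? (float)(i - 7) : 0.0f;
        if (back[i] != expected)
            ++mismatches;
    }
    return mismatches;
}

int main(void)
{
    printf("mismatches: %d\n", runOffsetDemo());
    return 0;
}
```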

You made a couple of mistakes.

Here is a modified version of your code that works:

[codebox]#include <cstdio>

// Declare the constant array before the kernel so the kernel can use it directly.
__constant__ float arrayD[100];

__global__ void kernel(float *devicearray)
{
    // Read from the constant-memory symbol itself, not from a pointer argument.
    for(int loop=0; loop<10; loop++)
        *(devicearray+loop) = *(arrayD+loop)*10;
}

int main(void)
{
    cudaError error;
    float arrayH[10];
    float *devicearray;

    for(int loop=0; loop<10; loop++)
        *(arrayH+loop) = loop+1;

    // Fourth argument is the byte offset, fifth is the copy direction.
    cudaMemcpyToSymbol(arrayD, arrayH, sizeof(float)*10, 0, cudaMemcpyHostToDevice);

    cudaMalloc((void **)&devicearray, sizeof(float)*10);

    kernel<<<1,1>>>(devicearray);

    error = cudaGetLastError();
    if(error!=cudaSuccess)
        printf("%s\n",cudaGetErrorString(error));

    cudaMemcpy(arrayH, devicearray, sizeof(float)*10, cudaMemcpyDeviceToHost);

    for(int loop=0; loop<10; loop++)
        printf("%10.9f\n", *(arrayH+loop));

    cudaFree(devicearray);
    return 0;
}[/codebox]
