constant vs shared memory

Hi,

question 1:

According to the programming guide, shared memory is efficient (2 clock cycles to issue a read or write operation) as long as there are no bank conflicts. Constant memory, on the other hand, is said to be cached. My first question: how many clock cycles are needed to issue a read operation on constant memory?

question 2:

In the kernel, all threads in one block access (read-only) one shared array ‘AS[1000]’:

__global__ void
testKernel( float* g_idata1, float* g_idata2, float* g_odata)
{
    __shared__ float AS[1000];

    // To load data into the shared memory
    for (int i = 0; i < 1000; ++i)
    {
        AS[i] = g_idata1[i];
    }

    for (int i = 0; i < 1000; ++i)
    {
        for (int j = 0; j < 1000; ++j)
        {
            g_odata[j + i * 1000] = max(AS[i], g_idata2[j]);
        }
    }

    ......
}

Since I use 512 threads per block, it seems there will be bank conflicts when accessing the shared array AS[1000]. Am I right? Because AS[1000] is only ever read, I want to know whether it would be more efficient in my case to store it in a constant array instead of shared memory. If so, how should I use constant memory? I want to declare and initialize one constant array on the host side like the following:

extern __constant__ float AS[1000] = {...};

And in the kernel code, I access it:

extern __constant__ float AS[1000];

__global__ void
testKernel(float* g_idata, float* g_odata)
{
    for (int i = 0; i < 1000; ++i)
    {
        for (int j = 0; j < 1000; ++j)
        {
            g_odata[j + i * 1000] = max(AS[i], g_idata[j]);
        }
    }

    ......
}

Because I can only work in emulation mode for now, I don’t know whether this is the correct way to use constant memory. Thanks.
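
On the bank-conflict question: as far as the programming guide describes it, when every thread of a (half-)warp reads the same shared-memory word, that word is broadcast, so reading AS[i] with the same i in all threads should not cause bank conflicts. Below is a minimal sketch of the shared-memory variant, not a definitive implementation. The names maxTableShared and runMaxTableShared, the 64 x 512 launch configuration, and the way rows and columns are split across blocks and threads are all assumptions made for illustration; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 1000;  // array length taken from the question

// Hypothetical kernel: one possible way to split the work across threads.
__global__ void maxTableShared(const float* g_idata1, const float* g_idata2,
                               float* g_odata)
{
    __shared__ float AS[N];

    // Each thread copies a strided slice, so the block fills AS once in total
    // instead of every thread copying all N elements.
    for (int i = threadIdx.x; i < N; i += blockDim.x)
        AS[i] = g_idata1[i];
    __syncthreads();  // AS must be fully written before any thread reads it

    // Rows are split across blocks, columns across threads.
    for (int i = blockIdx.x; i < N; i += gridDim.x)
        for (int j = threadIdx.x; j < N; j += blockDim.x)
            // All threads of the block read the same AS[i] here.
            g_odata[j + i * N] = fmaxf(AS[i], g_idata2[j]);
}

// Hypothetical host helper: runs the kernel and returns the N*N result.
// Return codes of the CUDA calls are ignored to keep the sketch short.
std::vector<float> runMaxTableShared(const std::vector<float>& a,
                                     const std::vector<float>& b)
{
    std::vector<float> out(static_cast<size_t>(N) * N);
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;

    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));
    cudaMalloc(&d_out, out.size() * sizeof(float));
    cudaMemcpy(d_a, a.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    maxTableShared<<<64, 512>>>(d_a, d_b, d_out);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(out.data(), d_out, out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return out;
}
```

Calling runMaxTableShared(a, b) with two host vectors of length 1000 returns the table with element [j + i * 1000] equal to max(a[i], b[j]). The __syncthreads() call matters once the load is shared between threads: without it, a thread could read an AS element that another thread has not written yet.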

Can anyone tell me whether the above way of using a constant variable is correct? Must I load values into constant memory using ‘cudaMemcpyToSymbol’?

I’m using cudaMemcpyToSymbol() to populate/update my constant area. I got excellent performance using the constant buffer for the case where all of the threads read the same constant elements at the same time. If I understand the programming guide correctly, for that specific case, the speed is the same as reading from a register…

John
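
For reference, here is a minimal sketch of the cudaMemcpyToSymbol() approach, under a few assumptions: the constant array is declared at file scope in the same .cu file as the kernel (so no extern is involved), it is filled at run time from a host array, and the names maxTableConst and runMaxTableConst as well as the 64 x 512 launch configuration are made up for illustration. Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 1000;  // array length taken from the question

// File-scope constant array, visible to kernels in this file.
// It is filled from the host with cudaMemcpyToSymbol() before the launch.
__constant__ float AS[N];

// Hypothetical kernel: rows are split across blocks, columns across threads.
__global__ void maxTableConst(const float* g_idata, float* g_odata)
{
    for (int i = blockIdx.x; i < N; i += gridDim.x)
        for (int j = threadIdx.x; j < N; j += blockDim.x)
            // Every thread of the block reads the same AS[i] at the same
            // time, which is the access pattern described above.
            g_odata[j + i * N] = fmaxf(AS[i], g_idata[j]);
}

// Hypothetical host helper: uploads a into constant memory, runs the kernel
// on b, and returns the N*N result. Return codes are ignored for brevity.
std::vector<float> runMaxTableConst(const std::vector<float>& a,
                                    const std::vector<float>& b)
{
    std::vector<float> out(static_cast<size_t>(N) * N);
    float *d_b = nullptr, *d_out = nullptr;

    // Copy the host data into the __constant__ symbol (host to device by default).
    cudaMemcpyToSymbol(AS, a.data(), N * sizeof(float));

    cudaMalloc(&d_b, N * sizeof(float));
    cudaMalloc(&d_out, out.size() * sizeof(float));
    cudaMemcpy(d_b, b.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    maxTableConst<<<64, 512>>>(d_b, d_out);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(out.data(), d_out, out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_b);
    cudaFree(d_out);
    return out;
}
```

The kernel never receives AS as a parameter; it refers to the file-scope symbol directly, and the host updates it by calling cudaMemcpyToSymbol() again whenever the values change.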