Hi,
Question 1:
According to the programming guide, shared memory is more efficient (it is said to take 2 clock cycles to issue a read or write operation) as long as there are no bank conflicts. As for constant memory, the guide says it is cached. My first question is: how many clock cycles are needed to issue a read operation on constant memory?
Question 2:
In the kernel below, all threads in one block read (read-only access) from a single shared array AS[1000]:
__global__ void
testKernel(float* g_idata1, float* g_idata2, float* g_odata)
{
    __shared__ float AS[1000];

    // Load data into shared memory
    for (int i = 0; i < 1000; ++i)
    {
        AS[i] = g_idata1[i];
    }

    for (int i = 0; i < 1000; ++i)
    {
        for (int j = 0; j < 1000; ++j)
        {
            g_odata[j + i * 1000] = max(AS[i], g_idata2[j]);
        }
    }
    ......
}
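For completeness, this is roughly how I launch the kernel on the host (the grid size and the pointer names d_idata1, d_idata2, d_odata are just placeholders):

dim3 block(512);   // I use 512 threads per block
dim3 grid(1);      // placeholder grid size
testKernel<<<grid, block>>>(d_idata1, d_idata2, d_odata);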
Since I use 512 threads per block, it seems there will be bank conflicts when the threads access the shared array AS[1000]. Am I right? Because AS[1000] is only ever read, I would like to know whether it would be more efficient in my case to store it in a constant array instead of shared memory. If it is, how should I use the constant memory? I want to declare and initialize the constant array in the host code like the following:
extern __constant__ float AS[1000] = {...};
And in the kernel code, I access it:
extern __constant__ float AS[1000];

__global__ void
testKernel(float* g_idata, float* g_odata)
{
    for (int i = 0; i < 1000; ++i)
    {
        for (int j = 0; j < 1000; ++j)
        {
            g_odata[j + i * 1000] = max(AS[i], g_idata[j]);
        }
    }
    ......
}
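I have also been considering filling the constant array at run time from the host instead of with an initializer, roughly like the sketch below. cudaMemcpyToSymbol is only my guess at the right call, and hostAS, d_idata, and d_odata are made-up names; I am not sure whether this variant is correct either:

#include <cuda_runtime.h>

__constant__ float AS[1000];   // defined in the same .cu file as the kernel

__global__ void
testKernel(float* g_idata, float* g_odata)
{
    for (int i = 0; i < 1000; ++i)
        for (int j = 0; j < 1000; ++j)
            g_odata[j + i * 1000] = max(AS[i], g_idata[j]);
}

int main()
{
    // hostAS is a made-up name; the real data would be filled in here.
    float hostAS[1000];
    for (int i = 0; i < 1000; ++i)
        hostAS[i] = (float)i;

    // Copy the host data into the __constant__ array before the kernel launch.
    cudaMemcpyToSymbol(AS, hostAS, sizeof(hostAS));

    float *d_idata, *d_odata;
    cudaMalloc((void**)&d_idata, 1000 * sizeof(float));
    cudaMalloc((void**)&d_odata, 1000 * 1000 * sizeof(float));
    cudaMemset(d_idata, 0, 1000 * sizeof(float));

    testKernel<<<1, 512>>>(d_idata, d_odata);
    cudaDeviceSynchronize();

    cudaFree(d_idata);
    cudaFree(d_odata);
    return 0;
}

If only one of these two ways (compile-time initializer vs. run-time copy) is actually valid, that is what I would like to know.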
Because I can only work under the emulation mode right now, I cannot verify whether this is the correct way to use constant memory. Thanks.