Hi,
I’m puzzled that global memory access turned out to be faster than texture memory access in my experiments.
I defined a 3D floating point texture (3x3x3) as the following:
[codebox]// 3D cube dimensions
int x = 3;
int y = 3;
int z = 3;
const cudaExtent cubeSize = make_cudaExtent(x, y, z);

// Host data
float* data = (float*)malloc(x * y * z * sizeof(float));
for (int i = 0; i < x * y * z; i++)
    data[i] = static_cast<float>(i);

// Create 3D array
cudaArray* buffer = 0;
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaMalloc3DArray(&buffer, &channelDesc, cubeSize);
checkCudaError("allocating memory");

// Copy data to 3D array
cudaMemcpy3DParms copyParams = {0};
copyParams.srcPtr = make_cudaPitchedPtr((void*)data, cubeSize.width * sizeof(float),
                                        cubeSize.width, cubeSize.height);
copyParams.dstArray = buffer;
copyParams.extent = cubeSize;
copyParams.kind = cudaMemcpyHostToDevice;
cudaMemcpy3D(&copyParams);
checkCudaError("copying memory");

// Bind array to 3D texture
cudaBindTextureToArray(cube, buffer, channelDesc);
checkCudaError("binding texture");[/codebox]
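The snippet above assumes a file-scope texture reference named `cube` and an error-checking helper `checkCudaError`; neither appears in the code I posted. For completeness, they look roughly like this (the texture declaration must match what `tex3D` and `cudaBindTextureToArray` expect; the helper body is just a sketch, only its name is taken from the code above):

[codebox]#include <cstdio>
#include <cuda_runtime.h>

// File-scope 3D texture reference, required by cudaBindTextureToArray / tex3D
texture<float, 3, cudaReadModeElementType> cube;

// Minimal error-check helper (sketch; only the name appears in the post)
void checkCudaError(const char* msg)
{
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("CUDA error while %s: %s\n", msg, cudaGetErrorString(err));
}[/codebox]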
Furthermore, I allocated what I assume is global memory:
[codebox]int x = 3;
int y = 3;
int z = 3;

// Host data
float* data = (float*)malloc(x * y * z * sizeof(float));
for (int i = 0; i < x * y * z; i++)
    data[i] = static_cast<float>(i);

// Allocate device memory and copy the data over
float* globalmem;
cudaMalloc((void**)&globalmem, x * y * z * sizeof(float));
cudaMemcpy(globalmem, data, x * y * z * sizeof(float), cudaMemcpyHostToDevice);[/codebox]
Finally I defined the following kernels to read from the corresponding memory:
[codebox]__global__
void texture_kernel(int x0, int y0, int z0)
{
    float score = 1;
    for (int x = 0; x < x0; x++)
    {
        for (int y = 0; y < y0; y++)
        {
            for (int z = 0; z < z0; z++)
            {
                score = x + y + z + tex3D(cube, z, y, x);
            }
        }
    }
}[/codebox]
[codebox]__global__
void globalmem_kernel(float* globalmem, int x0, int y0, int z0)
{
    float score = 1;
    for (int x = 0; x < x0; x++)
    {
        for (int y = 0; y < y0; y++)
        {
            for (int z = 0; z < z0; z++)
            {
                score = x + y + z + *(globalmem + y0 * x0 * x + y0 * y + z);
            }
        }
    }
    *(globalmem) = score;
}[/codebox]
For evaluation, I launched each kernel several times and measured the runtime:
[codebox]dim3 dimBlock(128);
dim3 dimGrid(100);
globalmem_kernel<<<dimGrid, dimBlock>>>(globalmem, x, y, z);
cudaThreadSynchronize();[/codebox]
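How the runtime is measured also matters: with a CPU timer, launch overhead and any pending work can leak into the measurement. A sketch using CUDA events, which time on the GPU itself (my addition, not part of the original code):

[codebox]cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
globalmem_kernel<<<dimGrid, dimBlock>>>(globalmem, x, y, z);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

// Elapsed kernel time in milliseconds
float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]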
For global memory I got an average runtime of 3.753 ms, but for texture memory I got an average runtime of 9.813 ms.
So global memory is roughly 2.6 times faster than texture memory here. Until now, I thought texture memory was the fastest memory on the GPU.
What’s wrong with this?
By computing a score in the loops and writing the final score back to memory, I intended to prevent the compiler from optimizing the reads away.
Maybe somebody can spot the mistake in my experiment. Otherwise I have to assume that texture memory is not the fastest memory on the GPU.
Thanks in advance,
Daniel