Texture vs. Global Memory

Hi,

In my experiments, global memory access turned out to be faster than texture memory access, and I’m wondering why.

I defined a 3D floating point texture (3x3x3) as the following:

[codebox] // 3D texture reference (declared at file scope)

texture<float, 3, cudaReadModeElementType> cube;

// 3D Cube

int x = 3;

int y = 3;

int z = 3;

const cudaExtent cubeSize = make_cudaExtent(x,y,z);

// Data

float* data = (float*)malloc(x * y * z * sizeof(float));

for(int i = 0; i < x * y * z; i ++)

	*(data + i) = static_cast<float>(i);

// Create 3D array

cudaArray* buffer = 0;

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();

cudaMalloc3DArray(&buffer, &channelDesc, cubeSize);

checkCudaError("allocating memory");

// Copy data to 3D array

cudaMemcpy3DParms copyParams = {0};

copyParams.srcPtr = make_cudaPitchedPtr((void*)data, cubeSize.width * sizeof(float), cubeSize.width, cubeSize.height);

copyParams.dstArray = buffer;

copyParams.extent = cubeSize;

copyParams.kind = cudaMemcpyHostToDevice;

cudaMemcpy3D(&copyParams);

checkCudaError("copying memory");

// Bind array to 3D texture

cudaBindTextureToArray(cube, buffer, channelDesc);

checkCudaError("binding texture");[/codebox]

Further, I allocated some (I assume) global memory:

[codebox] int x = 3;

int y = 3;

int z = 3;

// Data

float* data = (float*)malloc(x * y * z * sizeof(float));

for(int i = 0; i < x * y * z; i ++)

	*(data + i) = static_cast<float>(i);

// Allocate device memory

float* globalmem;

cudaMalloc((void**)&globalmem,x*y*z*sizeof(float));

cudaMemcpy(globalmem, data, x * y * z * sizeof(float), cudaMemcpyHostToDevice);[/codebox]

Finally I defined the following kernels to read from the corresponding memory:

[codebox]__global__

void texture_kernel(int x0, int y0, int z0)

{

float score = 1;

for(int x = 0; x < x0; x ++)

{

	for(int y = 0; y < y0; y ++)

	{

		for(int z = 0; z < z0; z ++)

		{

			score = x + y + z + tex3D(cube,z,y,x);

		}

	}

}

}[/codebox]

[codebox]__global__

void globalmem_kernel(float* globalmem, int x0, int y0, int z0)

{

float score = 1;

for(int x = 0; x < x0; x ++)

{

	for(int y = 0; y < y0; y ++)

	{

		for(int z = 0; z < z0; z ++)

		{

			score = x + y + z + *(globalmem + z0 * y0 * x + z0 * y + z);

		}

	}

}

*(globalmem) = score;

}[/codebox]

For evaluation, I started several kernels and measured the runtime:

[codebox] dim3 dimBlock(128);

dim3 dimGrid(100);

globalmem_kernel<<<dimGrid, dimBlock>>>(globalmem, x,y,z);

cudaThreadSynchronize();

[/codebox]
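As an aside, host-side timing around an asynchronous kernel launch can be misleading; a common alternative is timing with CUDA events. A minimal sketch, reusing the kernel, launch configuration, and variable names from the snippet above:

```cuda
// Hypothetical timing harness using CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
globalmem_kernel<<<dimGrid, dimBlock>>>(globalmem, x, y, z);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // wait until the kernel has finished

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);  // GPU-side elapsed time in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);
```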

For global memory I got an average runtime of 3.753ms.

But for texture memory, I got an average runtime of 9.813ms.

Thus, global memory access is almost three times faster than texture memory access. Up to now, I thought texture memory was the fastest memory on the GPU.

What’s wrong with that?

By computing the score in the loop and writing the final score back to memory, I avoid the compiler optimizing the reads away!

Maybe somebody can find the mistake in my experiment. Otherwise I’ll have to assume that texture memory is not the fastest memory on the GPU.

Thanks in advance,

Daniel

This is a weird test case: you’ve got 128 threads reading from and writing to the same memory location simultaneously; that can’t be good.

N.

Texture memory is also global memory, just with a cache in front of it. So if you read each texture element only once, on top of fetching it from global memory you also pay for a cache miss. I am surprised it caused that much overhead, though.

Your test case is not a good way to evaluate texture speed. Textures should be used when you are reusing the same data, so that the second, third, … access avoids a global fetch.
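To illustrate the reuse point, here is a hedged sketch (kernel name and launch details are made up; it reuses the `cube` texture from the original post) where neighboring threads fetch overlapping texels, which is the access pattern the texture cache actually helps with:

```cuda
// Sketch: each thread reads three neighboring texels along z, so adjacent
// threads fetch overlapping data -- this is where the texture cache pays off.
__global__ void stencil_kernel(float* out, int x0, int y0, int z0)
{
    int z = blockIdx.x * blockDim.x + threadIdx.x;
    if (z < 1 || z >= z0 - 1) return;

    float sum = 0.0f;
    for (int x = 0; x < x0; x++)
        for (int y = 0; y < y0; y++)
            sum += tex3D(cube, z - 1, y, x)   // likely already cached by the
                 + tex3D(cube, z,     y, x)   // neighboring thread's fetches
                 + tex3D(cube, z + 1, y, x);
    out[z] = sum;
}
```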

Hi,

the test case is similar to my application, so I guess using texture memory was the wrong approach.
I think I’ll have to use global memory and try to optimize the memory accesses instead.

Anyway, thanks for your help!
Daniel

If that is truly your memory access pattern for your real application, constant memory will serve you much better. It is optimized for cases where all threads in a warp read from exactly the same element.
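To make the constant-memory suggestion concrete, a minimal sketch (the array name is made up; `__constant__` variables live at file scope and are filled with `cudaMemcpyToSymbol`):

```cuda
// Hypothetical constant-memory version of the original kernel. Since every
// thread in a warp reads the same element per iteration, constant memory
// broadcasts the value in a single transaction.
__constant__ float const_cube[27];   // 3 * 3 * 3 floats

__global__ void constmem_kernel(float* out, int x0, int y0, int z0)
{
    float score = 1;
    for (int x = 0; x < x0; x++)
        for (int y = 0; y < y0; y++)
            for (int z = 0; z < z0; z++)
                score = x + y + z + const_cube[z0 * y0 * x + z0 * y + z];
    out[0] = score;
}

// Host side:
//   cudaMemcpyToSymbol(const_cube, data, x * y * z * sizeof(float));
```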

If your real dataset is bigger than the 64 KB of constant memory, then doing periodic coalesced reads into shared memory and having the inner loops read from that shared memory is advised.
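A hedged sketch of that staging pattern (tile size and names are made up; it assumes the block size equals the tile size): each block performs one coalesced read of a tile into shared memory, then the inner loop runs against the on-chip copy.

```cuda
#define TILE 128   // arbitrary tile size, assumed equal to blockDim.x

// Sketch: stage a tile of global memory into shared memory with one
// coalesced read per thread, then let the loop reuse the on-chip copy.
__global__ void staged_kernel(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];   // coalesced: consecutive threads, consecutive addresses
    __syncthreads();                 // the tile must be complete before anyone reads it

    float score = 0.0f;
    for (int k = 0; k < TILE; k++)   // repeated reads now hit shared memory
        score += tile[k];

    if (i < n)
        out[i] = score;
}
```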

Textures are actually most useful when you are unable to coalesce a warp-wide memory read because of a semi-random access pattern that still has spatial locality among the threads’ reads. The texture cache is not large enough to give any performance gain from temporal locality.