Texture vs. Global Memory

Hi,

In my experiments, global memory access turned out to be faster than texture memory access, and I’m wondering why.

I defined a 3D floating point texture (3x3x3) as the following:

[codebox] // 3D texture reference (declared at file scope)

texture<float, 3, cudaReadModeElementType> cube;

// 3D Cube

int x = 3;

int y = 3;

int z = 3;

const cudaExtent cubeSize = make_cudaExtent(x,y,z);

// Data

float* data = (float*)malloc(x * y * z * sizeof(float));

for(int i = 0; i < x * y * z; i ++)

	*(data + i) = static_cast<float>(i);

// Create 3D array

cudaArray* buffer = 0;

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();

cudaMalloc3DArray(&buffer, &channelDesc, cubeSize);

checkCudaError("allocating memory");

// Copy data to 3D array

cudaMemcpy3DParms copyParams = {0};

copyParams.srcPtr = make_cudaPitchedPtr((void*)data, cubeSize.width * sizeof(float), cubeSize.width, cubeSize.height);

copyParams.dstArray = buffer;

copyParams.extent = cubeSize;

copyParams.kind = cudaMemcpyHostToDevice;

cudaMemcpy3D(&copyParams);

checkCudaError("copying memory");

// Bind array to 3D texture

cudaBindTextureToArray(cube, buffer, channelDesc);

checkCudaError("binding texture");[/codebox]

Further, I allocated some (I assume) global memory:

[codebox] int x = 3;

int y = 3;

int z = 3;

// Data

float* data = (float*)malloc(x * y * z * sizeof(float));

for(int i = 0; i < x * y * z; i ++)

	*(data + i) = static_cast<float>(i);

// Allocate device memory

float* globalmem;

cudaMalloc((void**)&globalmem,x*y*z*sizeof(float));

cudaMemcpy(globalmem, data, x * y * z * sizeof(float), cudaMemcpyHostToDevice);[/codebox]

Finally I defined the following kernels to read from the corresponding memory:

[codebox]__global__

void texture_kernel(int x0, int y0, int z0)

{

float score = 1;

for(int x = 0; x < x0; x ++)

{

	for(int y = 0; y < y0; y ++)

	{

		for(int z = 0; z < z0; z ++)

		{

			score = x + y + z + tex3D(cube,z,y,x);

		}

	}

}

}[/codebox]

[codebox]__global__

void globalmem_kernel(float* globalmem, int x0, int y0, int z0)

{

float score = 1;

for(int x = 0; x < x0; x ++)

{

	for(int y = 0; y < y0; y ++)

	{

		for(int z = 0; z < z0; z ++)

		{

			score = x + y + z + *(globalmem + z0 * y0 * x + z0 * y + z);

		}

	}

}

*(globalmem) = score;

}[/codebox]

For evaluation, I started several kernels and measured the runtime:

[codebox] dim3 dimBlock(128);

dim3 dimGrid(100);

globalmem_kernel<<<dimGrid, dimBlock>>>(globalmem, x,y,z);

cudaThreadSynchronize();

[/codebox]
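As an aside, host-side timing around an asynchronous kernel launch can be misleading; a common alternative is timing with CUDA events. A minimal sketch, reusing the kernel, launch configuration, and variable names from the snippet above:

```cuda
// Hypothetical timing harness using CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
globalmem_kernel<<<dimGrid, dimBlock>>>(globalmem, x, y, z);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // wait until the kernel has finished

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);  // GPU-side elapsed time in ms

cudaEventDestroy(start);
cudaEventDestroy(stop);
```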

For global memory I got an average runtime of 3.753ms.

But for texture memory, I got an average runtime of 9.813ms.

Thus, global memory access is almost three times faster than texture memory access. Up to now, I thought texture memory was the fastest memory on the GPU.

What’s wrong with that?

By computing the score in the loop and writing the final score back to memory, I avoid the compiler optimizing the reads away!

Maybe somebody can find the mistake in my experiment. Otherwise I’ll have to assume that texture memory is not the fastest memory on the GPU.

Thanks in advance,

Daniel

This is a weird test case: you’ve got 128 threads reading from and writing to the same memory location simultaneously; that can’t be good.

N.

Texture memory is also global memory, just with a cache in front of it. So if you read each texture element only once, on top of fetching it from global memory you also pay for a cache miss. I am surprised it caused that much overhead, though.

Your test case is not a good way to evaluate texture speed. Textures should be used when you are reusing the same data, so that the second, third, … access avoids a global fetch.
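To illustrate the reuse point, here is a hedged sketch (kernel name and launch details are made up; it reuses the `cube` texture from the original post) where neighboring threads fetch overlapping texels, which is the access pattern the texture cache actually helps with:

```cuda
// Sketch: each thread reads three neighboring texels along z, so adjacent
// threads fetch overlapping data -- this is where the texture cache pays off.
__global__ void stencil_kernel(float* out, int x0, int y0, int z0)
{
    int z = blockIdx.x * blockDim.x + threadIdx.x;
    if (z < 1 || z >= z0 - 1) return;

    float sum = 0.0f;
    for (int x = 0; x < x0; x++)
        for (int y = 0; y < y0; y++)
            sum += tex3D(cube, z - 1, y, x)   // likely already cached by the
                 + tex3D(cube, z,     y, x)   // neighboring thread's fetches
                 + tex3D(cube, z + 1, y, x);
    out[z] = sum;
}
```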

Hi,

the test case is similar to my application, so I guess using texture memory was the wrong approach.
I think I’ll have to use global memory and try to optimize the memory accesses instead.

Anyway, thanks for your help!
Daniel

If that is truly your memory access pattern for your real application, constant memory will serve you much better. It is optimized for cases where all threads in a warp read from exactly the same element.
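To make the constant-memory suggestion concrete, a minimal sketch (the array name is made up; `__constant__` variables live at file scope and are filled with `cudaMemcpyToSymbol`):

```cuda
// Hypothetical constant-memory version of the original kernel. Since every
// thread in a warp reads the same element per iteration, constant memory
// broadcasts the value in a single transaction.
__constant__ float const_cube[27];   // 3 * 3 * 3 floats

__global__ void constmem_kernel(float* out, int x0, int y0, int z0)
{
    float score = 1;
    for (int x = 0; x < x0; x++)
        for (int y = 0; y < y0; y++)
            for (int z = 0; z < z0; z++)
                score = x + y + z + const_cube[z0 * y0 * x + z0 * y + z];
    out[0] = score;
}

// Host side:
//   cudaMemcpyToSymbol(const_cube, data, x * y * z * sizeof(float));
```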

If your real dataset is bigger than the 64 KB of constant memory, then doing periodic coalesced reads into shared memory and having the inner loops read from that shared memory is advised.
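A hedged sketch of that staging pattern (tile size and names are made up; it assumes the block size equals the tile size): each block performs one coalesced read of a tile into shared memory, then the inner loop runs against the on-chip copy.

```cuda
#define TILE 128   // arbitrary tile size, assumed equal to blockDim.x

// Sketch: stage a tile of global memory into shared memory with one
// coalesced read per thread, then let the loop reuse the on-chip copy.
__global__ void staged_kernel(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];   // coalesced: consecutive threads, consecutive addresses
    __syncthreads();                 // the tile must be complete before anyone reads it

    float score = 0.0f;
    for (int k = 0; k < TILE; k++)   // repeated reads now hit shared memory
        score += tile[k];

    if (i < n)
        out[i] = score;
}
```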

Textures are actually most useful when you are unable to coalesce a warp-wide memory read because of a semi-random access pattern that still has spatial locality among the threads’ reads. The texture cache is not large enough to give any performance gain from temporal locality.