texture memory vs global memory

Hi,

Until now I have been using global memory directly for large amounts of data.
If I understand correctly, though, texture memory basically uses global memory and, as a benefit, is cached. Wouldn't that mean that in most cases it is best to use textures? What are the downsides? Would there be a problem if I wanted a 3D array, or couldn't I just use an array of textures? I'm asking because a selling point of CUDA is not having to abuse textures for GPGPU the way it was done with shaders.
Any insights would be very helpful. I may just be grasping the concept of CUDA texture memory wrong; I haven't worked much with the texture part.

Thanks.

Well…

  1. A global memory read is faster than a texture fetch if the access is coalesced (see the toy sketch right after this list).

  2. Global memory can be written and then read back consistently.

  3. Sometimes you'd want to leave something uncached so you can cache more of something else.

  4. Repeatedly binding textures results in API overhead (albeit small).
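
To make point 1 concrete, here is a toy sketch (my own made-up kernels, not from anybody's real code) of what coalescing means on G80: consecutive threads of a half-warp touching consecutive addresses get combined into one memory transaction, while a scattered pattern does not.

// Illustration only: coalesced vs. non-coalesced global memory reads.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                  // consecutive threads -> consecutive addresses: coalesced
}

__global__ void copy_scattered(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];   // addresses within a half-warp are spread out: not coalesced
}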

The mantra says that if you are going to access neighboring memory locations with 2D spatial locality, you should get a performance increase when using texture memory instead of global memory (remember the texture cache is 2D).
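
For illustration only, here is a minimal sketch of that 2D case using the texture reference API with a cudaArray (kernel and variable names are made up, the host part is shown as comments, and the defaults of point filtering and clamp addressing are assumed):

texture<float, 2, cudaReadModeElementType> tex2Dref;   // hypothetical texture reference

__global__ void cross_average(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // Neighboring threads fetch neighboring texels, which is what the 2D texture cache likes.
    out[y*w + x] = 0.25f * (tex2D(tex2Dref, x-1, y) + tex2D(tex2Dref, x+1, y)
                          + tex2D(tex2Dref, x, y-1) + tex2D(tex2Dref, x, y+1));
}

// Host side, roughly (error checking omitted):
//   cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
//   cudaArray *arr;  cudaMallocArray(&arr, &desc, w, h);
//   cudaMemcpyToArray(arr, 0, 0, h_data, w*h*sizeof(float), cudaMemcpyHostToDevice);
//   cudaBindTextureToArray(tex2Dref, arr);
//   dim3 block(16, 16), grid((w+15)/16, (h+15)/16);
//   cross_average<<<grid, block>>>(d_out, w, h);
//   cudaUnbindTexture(tex2Dref);  cudaFreeArray(arr);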

I believe the main downside is that textures are read-only. AFAIK, there is currently no support for 3D textures.

asadafag, I would be interested in seeing some benchmarks comparing coalesced reads from global memory with neighboring reads from texture memory. If I understand correctly, with coalescing you hide latency better within a warp, while successful use of the texture cache means there is minimal latency to hide in the first place. Which one is better, I dunno.

I guess the bottom line is: if your memory accesses have 2D spatial locality, you should try using texture memory (if only out of curiosity). If it does (or does not) help, we would be interested in knowing why! :D

My test code… (windows)

// d.cu
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

#define N   2097152
#define M   4
#define nth 256

struct str0 {
    int *src;
    int n;
    int stride;
};

texture<int,1> texsrc;

// Texture version: four strided fetches through the texture cache, one global store.
__global__ static void sum3_v0(str0 p)
{
    int *src   = p.src;
    int n      = p.n;
    int stride = p.stride;
    int bid    = blockIdx.x;
    int thid   = threadIdx.x;
    int id     = bid*nth + thid;
    if (id >= n) return;
    // Other access patterns tried:
    //src[id*4] += src[id*4+1] + src[id*4+2] + src[id*4+3];
    //src[(id^1234)] = src[id+stride];
    //int sid = id; id <<= 2;
    //src[sid] = tex1Dfetch(texsrc,id) + tex1Dfetch(texsrc,id+1) + tex1Dfetch(texsrc,id+2) + tex1Dfetch(texsrc,id+3);
    src[id] = tex1Dfetch(texsrc,id) + tex1Dfetch(texsrc,id+stride)
            + tex1Dfetch(texsrc,id+stride*2) + tex1Dfetch(texsrc,id+stride*3);
}

// Global memory version: the same four strided (coalesced) reads, one store.
__global__ static void sum3_v1(int *src, int n, int stride)
{
    int bid  = blockIdx.x;
    int thid = threadIdx.x;
    int id   = bid*nth + thid;
    if (id >= n) return;
    // Other access patterns tried:
    //src[id] = tex1Dfetch(texsrc,id) + tex1Dfetch(texsrc,id+stride) + tex1Dfetch(texsrc,id+stride*2) + tex1Dfetch(texsrc,id+stride*3);
    //src[id*4] = tex1Dfetch(texsrc,id*4) + tex1Dfetch(texsrc,id*4+1) + tex1Dfetch(texsrc,id*4+2) + tex1Dfetch(texsrc,id*4+3);
    //src[id+stride] += src[bid] + src[id+stride*2] + src[id+stride*3];
    //src[id] = src[(id^1234)+stride];
    src[id] = src[id] + src[id+stride] + src[id+stride*2] + src[id+stride*3];
}

int main()
{
    int *a, *b;
    int *da, *db;
    cudaMalloc((void**)&da, sizeof(int)*N*M);  a = (int*)malloc(sizeof(int)*N*M);
    cudaMalloc((void**)&db, sizeof(int)*N*M);  b = (int*)malloc(sizeof(int)*N*M);
    for (int i = 0; i < M*N; i++) a[i] = i/M;
    for (int i = 0; i < M*N; i++) b[i] = i%N;
    cudaMemcpy(da, a, sizeof(int)*N*M, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, sizeof(int)*N*M, cudaMemcpyHostToDevice);
    cudaBindTexture(0, texsrc, db, sizeof(int)*N*M);

    int tryn = 100;
    int t0 = GetTickCount();
    str0 s;
    s.src    = da;
    s.n      = N;
    s.stride = N;
    for (int i = 0; i < tryn; i++)
        sum3_v0<<<(N+nth-1)/nth, nth, 0>>>(s);
    cudaThreadSynchronize();   // kernel launches are asynchronous; wait before reading the timer
    int t1 = GetTickCount();
    for (int i = 0; i < tryn; i++)
        sum3_v1<<<(N+nth-1)/nth, nth, 0>>>(db, N, N);
    cudaThreadSynchronize();
    int t2 = GetTickCount();
    printf("v0: %.6lfms\n", (double)(t1-t0)/tryn);
    printf("v1: %.6lfms\n", (double)(t2-t1)/tryn);
    cudaUnbindTexture(texsrc);
    cudaFree(db);
    cudaFree(da);
    free(b);
    free(a);
    return 0;
}

Now the performance suddenly became the same on my machine!?

Maybe last time I used a poor access pattern.

I dunno whether my benchmark will be useful. Better to test it yourself to be sure.

Perhaps there would be a difference in performance if you used a 2D data matrix and/or 2D grid. I guess it really depends on your code (access patterns and such).
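
In case it helps, this is roughly what I mean by a 2D grid over a row-major matrix (toy kernel, names made up). With plain global memory only the x direction is coalesced, which is exactly the situation where a 2D texture might pay off:

__global__ void scale_matrix(float *m, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        m[y * width + x] *= s;                       // threads along x hit consecutive addresses
}

// Host:
//   dim3 block(16, 16);
//   dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
//   scale_matrix<<<grid, block>>>(d_m, width, height, 2.0f);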

Is there anyone with benchmark code that actually runs at the maximum device bandwidth (e.g. 86.40GB/s for a GeForce 8800 GTX)? I am just curious, because I have code that runs at around 60GB/s for both global and texture memory access and I don't seem to be able to squeeze out more.

We achieved the following on an 8800GTX's global memory:

coalesced scatter of elements of…

–uint: 63.5GB/sec

–uint2: 70.6GB/sec

–uint4: 68.6GB/sec

–struct of 2 uints, not aligned: 15.0GB/sec

random scatter, but 1-to-1 mapping, of elements of…

–uint2: 1.4GB/sec

–struct of 2 uints not aligned: 1.2GB/sec

HTH.

Thanks. That's similar to what we get here.

Coalesced code for reading floats from a long vector, with the optimal number of threads and blocks (found by more or less exhaustive benchmarking) and good alignment, gives me a little over 80GB/s for normal reads and around 70GB/s for textures… though some of that might be due to the slightly factory-overclocked EVGA 8800GTX. This is the maximum I have ever gotten from the board.

#define tx          threadIdx.x
#define bx          blockIdx.x
#define NUM_BLOCKS  gridDim.x
#define NUM_THREADS blockDim.x

texture<float, 1, cudaReadModeElementType> texRef;

__global__ void data_kernel(const float *d_data,
                            const int    data_per_block,
                            const int    lines_per_block,
                            const int    extra_line_work)
{
    int i;

    /* Alloc shared memory */
    extern __shared__ float buffer[];

    /* Starting point for this block */
    const int iStart = data_per_block * bx;

    /* Read the data (full lines) */
    for (i = 0; i < lines_per_block; i++)
    {
#ifndef USE_TEXTURES
        buffer[tx] = d_data[iStart + i*NUM_THREADS + tx];
#else
        buffer[tx] = tex1Dfetch(texRef, (int)(iStart + i*NUM_THREADS + tx));
#endif
    }

    /* Read the rest */
    if (tx < extra_line_work)
    {
#ifndef USE_TEXTURES
        buffer[tx] = d_data[iStart + i*NUM_THREADS + tx];
#else
        buffer[tx] = tex1Dfetch(texRef, (int)(iStart + i*NUM_THREADS + tx));
#endif
    }
}
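
The host side is not shown above; a rough sketch of how the timings could be turned into GB/s might look like this (variable names such as NELEM, num_blocks and num_threads are made up, error checking is omitted, and the launch parameters are assumed to be set so that each of the NELEM floats is read exactly once per launch):

// Inside main(), after d_data has been filled with NELEM floats:
int niter = 100;
int t0 = GetTickCount();
for (int i = 0; i < niter; i++)
    data_kernel<<<num_blocks, num_threads, num_threads * sizeof(float)>>>(
        d_data, data_per_block, lines_per_block, extra_line_work);
cudaThreadSynchronize();                      // wait for all queued kernels to finish
int t1 = GetTickCount();
double seconds   = (t1 - t0) / 1000.0 / niter;                 // time per launch
double gigabytes = (double)NELEM * sizeof(float) / 1.0e9;      // bytes read per launch
printf("effective bandwidth: %.2f GB/s\n", gigabytes / seconds);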

Serban

Thanks :) So do I understand correctly that you are using 1D threads and 1D blocks? I am not quite sure about it, but you are not making use of the data written into the shared memory location, right? Maybe I am too paranoid, but how do you make sure that the compiler doesn't optimize that away? However, 80 GB/s makes me feel quite confident that the compiler didn't do that. How large was the data you were actually using?

Yes, since I work with compressed formats, all is 1D.

I had my share of problems with the compiler optimizing out parts of my code, but I noticed that, besides what goes to output, what goes to shared or global memory is not optimized out. So the code is ok, even if it only prefetches the data.
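
A toy illustration of that pitfall (my own made-up kernels, not Serban's code):

// With no observable side effect, the whole loop may be dead-code-eliminated,
// and the "benchmark" ends up measuring nothing.
__global__ void bad_bench(const float *in, int n)
{
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += in[i];
    // acc is never stored anywhere, so the compiler is free to drop everything
}

// A single store to global memory keeps the reads alive.
__global__ void good_bench(const float *in, float *out, int n)
{
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += in[i];
    out[threadIdx.x] = acc;
}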

You can get the 80GB/s only if you have VERY LARGE data, 100MB+.

Another way to confirm both the bandwidth and the size effect is to benchmark some memory-bound CUBLAS operations, like SDOT. For me, the peak is 20GFlops (corresponding to 80GB/s) for vectors with at least 50,000,000 elements (so 200MB). 1 million elements gets me around 15GFlops (so 60GB/s), 100,000 gives around 4GFlops (16GB/s) and 10,000 gives about 0.5GFlops (2GB/s). So the bandwidth decreases rather drastically for small datasets.
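
(For anyone wondering how GFlops maps to GB/s here: SDOT does 2 flops per element pair and reads 8 bytes of data, i.e. 4 bytes per flop, so bandwidth ≈ GFlops × 4: 20 GFlops ≈ 80 GB/s, 15 ≈ 60, 4 ≈ 16, 0.5 ≈ 2.)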

Regards,

Serban