texture memory vs global memory

Hi,

Until now I have been using global memory directly for large amounts of data.
If I understand correctly, though, texture memory basically uses global memory and, as a benefit, is cached. Wouldn't that mean that in most cases it is best to use textures? What are the downsides? Would there be a problem if I wanted a 3D array, or couldn't I just use an array of textures? I'm asking because a selling point of CUDA is not having to abuse textures for GPGPU the way it was done with shaders.
Any insights would be very helpful. I may just be grasping the concept of CUDA texture memory wrong; I haven't worked much with the texture part.

Thanks.

Well…

  1. A global memory read is faster than a texture fetch if the access is coalesced (see the toy sketch right after this list).

  2. Global memory can be written and then read back consistently.

  3. Sometimes you'd want to leave something uncached so you can cache more of something else.

  4. Repeatedly binding textures results in API overhead (albeit small).
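
To make point 1 concrete, here is a toy sketch (my own made-up kernels, not from anybody's real code) of what coalescing means on G80: consecutive threads of a half-warp touching consecutive addresses get combined into one memory transaction, while a scattered pattern does not.

// Illustration only: coalesced vs. non-coalesced global memory reads.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                  // consecutive threads -> consecutive addresses: coalesced
}

__global__ void copy_scattered(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];   // addresses within a half-warp are spread out: not coalesced
}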

The mantra says that if you are going to access neighboring memory locations with 2D spatial locality, you should get a performance increase when using texture memory instead of global memory (remember the texture cache is 2D).
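
For illustration only, here is a minimal sketch of that 2D case using the texture reference API with a cudaArray (kernel and variable names are made up, the host part is shown as comments, and the defaults of point filtering and clamp addressing are assumed):

texture<float, 2, cudaReadModeElementType> tex2Dref;   // hypothetical texture reference

__global__ void cross_average(float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    // Neighboring threads fetch neighboring texels, which is what the 2D texture cache likes.
    out[y*w + x] = 0.25f * (tex2D(tex2Dref, x-1, y) + tex2D(tex2Dref, x+1, y)
                          + tex2D(tex2Dref, x, y-1) + tex2D(tex2Dref, x, y+1));
}

// Host side, roughly (error checking omitted):
//   cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
//   cudaArray *arr;  cudaMallocArray(&arr, &desc, w, h);
//   cudaMemcpyToArray(arr, 0, 0, h_data, w*h*sizeof(float), cudaMemcpyHostToDevice);
//   cudaBindTextureToArray(tex2Dref, arr);
//   dim3 block(16, 16), grid((w+15)/16, (h+15)/16);
//   cross_average<<<grid, block>>>(d_out, w, h);
//   cudaUnbindTexture(tex2Dref);  cudaFreeArray(arr);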

I believe the main downside is that textures are read-only. AFAIK, there is currently no support for 3D textures.

asadafag, I would be interested in seeing some benchmarks comparing coalesced reads from global memory with neighboring reads from texture memory. If I understand correctly, with coalescing you hide latency better within a warp, while successful use of the texture cache means there is minimal latency to hide in the first place. Which one is better, I dunno.

I guess the bottom line is: if your memory accesses have 2D spatial locality, you should try using texture memory (if only out of curiosity). If it does (or does not) help, we would be interested in knowing why! :D

My test code… (windows)

// d.cu
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

#define N   2097152
#define M   4
#define nth 256

struct str0 {
    int *src;
    int n;
    int stride;
};

texture<int,1> texsrc;

// Texture version: four strided fetches through the texture cache, one global store.
__global__ static void sum3_v0(str0 p)
{
    int *src   = p.src;
    int n      = p.n;
    int stride = p.stride;
    int bid    = blockIdx.x;
    int thid   = threadIdx.x;
    int id     = bid*nth + thid;
    if (id >= n) return;
    // Other access patterns tried:
    //src[id*4] += src[id*4+1] + src[id*4+2] + src[id*4+3];
    //src[(id^1234)] = src[id+stride];
    //int sid = id; id <<= 2;
    //src[sid] = tex1Dfetch(texsrc,id) + tex1Dfetch(texsrc,id+1) + tex1Dfetch(texsrc,id+2) + tex1Dfetch(texsrc,id+3);
    src[id] = tex1Dfetch(texsrc,id) + tex1Dfetch(texsrc,id+stride)
            + tex1Dfetch(texsrc,id+stride*2) + tex1Dfetch(texsrc,id+stride*3);
}

// Global memory version: the same four strided (coalesced) reads, one store.
__global__ static void sum3_v1(int *src, int n, int stride)
{
    int bid  = blockIdx.x;
    int thid = threadIdx.x;
    int id   = bid*nth + thid;
    if (id >= n) return;
    // Other access patterns tried:
    //src[id] = tex1Dfetch(texsrc,id) + tex1Dfetch(texsrc,id+stride) + tex1Dfetch(texsrc,id+stride*2) + tex1Dfetch(texsrc,id+stride*3);
    //src[id*4] = tex1Dfetch(texsrc,id*4) + tex1Dfetch(texsrc,id*4+1) + tex1Dfetch(texsrc,id*4+2) + tex1Dfetch(texsrc,id*4+3);
    //src[id+stride] += src[bid] + src[id+stride*2] + src[id+stride*3];
    //src[id] = src[(id^1234)+stride];
    src[id] = src[id] + src[id+stride] + src[id+stride*2] + src[id+stride*3];
}

int main()
{
    int *a, *b;
    int *da, *db;
    cudaMalloc((void**)&da, sizeof(int)*N*M);  a = (int*)malloc(sizeof(int)*N*M);
    cudaMalloc((void**)&db, sizeof(int)*N*M);  b = (int*)malloc(sizeof(int)*N*M);
    for (int i = 0; i < M*N; i++) a[i] = i/M;
    for (int i = 0; i < M*N; i++) b[i] = i%N;
    cudaMemcpy(da, a, sizeof(int)*N*M, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, sizeof(int)*N*M, cudaMemcpyHostToDevice);
    cudaBindTexture(0, texsrc, db, sizeof(int)*N*M);

    int tryn = 100;
    int t0 = GetTickCount();
    str0 s;
    s.src    = da;
    s.n      = N;
    s.stride = N;
    for (int i = 0; i < tryn; i++)
        sum3_v0<<<(N+nth-1)/nth, nth, 0>>>(s);
    cudaThreadSynchronize();   // kernel launches are asynchronous; wait before reading the timer
    int t1 = GetTickCount();
    for (int i = 0; i < tryn; i++)
        sum3_v1<<<(N+nth-1)/nth, nth, 0>>>(db, N, N);
    cudaThreadSynchronize();
    int t2 = GetTickCount();
    printf("v0: %.6lfms\n", (double)(t1-t0)/tryn);
    printf("v1: %.6lfms\n", (double)(t2-t1)/tryn);
    cudaUnbindTexture(texsrc);
    cudaFree(db);
    cudaFree(da);
    free(b);
    free(a);
    return 0;
}

Now the performance suddenly became the same on my machine!?

Maybe last time I used a poor access pattern.

I dunno whether my benchmark will be useful. Better to test it yourself to be sure.

Perhaps there would be a difference in performance if you used a 2D data matrix and/or 2D grid. I guess it really depends on your code (access patterns and such).
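
In case it helps, this is roughly what I mean by a 2D grid over a row-major matrix (toy kernel, names made up). With plain global memory only the x direction is coalesced, which is exactly the situation where a 2D texture might pay off:

__global__ void scale_matrix(float *m, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        m[y * width + x] *= s;                       // threads along x hit consecutive addresses
}

// Host:
//   dim3 block(16, 16);
//   dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
//   scale_matrix<<<grid, block>>>(d_m, width, height, 2.0f);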

Is there anyone with benchmark code that actually runs at the maximum device bandwidth (e.g. 86.40GB/s for a GeForce 8800 GTX)? I am just curious, because I have code that runs at around 60GB/s for both global and texture memory access and I don't seem to be able to squeeze out more.

We achieved the following on an 8800GTX's global memory:

coalesced scatter of elements of…

–uint: 63.5GB/sec

–uint2: 70.6GB/sec

–uint4: 68.6GB/sec

–struct of 2 uints, not aligned: 15.0GB/sec

random scatter, but 1-to-1 mapping, of elements of…

–uint2: 1.4GB/sec

–struct of 2 uints not aligned: 1.2GB/sec

HTH.

Thanks. That's similar to what we get here.

Coalesced code for reading floats from a long vector, with the optimal number of threads and blocks (found by more or less exhaustive benchmarking) and good alignment, gives me a little over 80GB/s for normal reads and around 70GB/s for textures… though some of that might be due to the slightly factory-overclocked EVGA 8800GTX. This is the maximum I have ever gotten from the board.

#define tx          threadIdx.x
#define bx          blockIdx.x
#define NUM_BLOCKS  gridDim.x
#define NUM_THREADS blockDim.x

texture<float, 1, cudaReadModeElementType> texRef;

__global__ void data_kernel(const float *d_data,
                            const int    data_per_block,
                            const int    lines_per_block,
                            const int    extra_line_work)
{
    int i;

    /* Alloc shared memory */
    extern __shared__ float buffer[];

    /* Starting point for this block */
    const int iStart = data_per_block * bx;

    /* Read the data (full lines) */
    for (i = 0; i < lines_per_block; i++)
    {
#ifndef USE_TEXTURES
        buffer[tx] = d_data[iStart + i*NUM_THREADS + tx];
#else
        buffer[tx] = tex1Dfetch(texRef, (int)(iStart + i*NUM_THREADS + tx));
#endif
    }

    /* Read the rest */
    if (tx < extra_line_work)
    {
#ifndef USE_TEXTURES
        buffer[tx] = d_data[iStart + i*NUM_THREADS + tx];
#else
        buffer[tx] = tex1Dfetch(texRef, (int)(iStart + i*NUM_THREADS + tx));
#endif
    }
}
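
The host side is not shown above; a rough sketch of how the timings could be turned into GB/s might look like this (variable names such as NELEM, num_blocks and num_threads are made up, error checking is omitted, and the launch parameters are assumed to be set so that each of the NELEM floats is read exactly once per launch):

// Inside main(), after d_data has been filled with NELEM floats:
int niter = 100;
int t0 = GetTickCount();
for (int i = 0; i < niter; i++)
    data_kernel<<<num_blocks, num_threads, num_threads * sizeof(float)>>>(
        d_data, data_per_block, lines_per_block, extra_line_work);
cudaThreadSynchronize();                      // wait for all queued kernels to finish
int t1 = GetTickCount();
double seconds   = (t1 - t0) / 1000.0 / niter;                 // time per launch
double gigabytes = (double)NELEM * sizeof(float) / 1.0e9;      // bytes read per launch
printf("effective bandwidth: %.2f GB/s\n", gigabytes / seconds);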

Serban

Thanks :) So do I understand correctly that you are using 1D threads and 1D blocks? I am not quite sure about it, but you are not making use of the data written into the shared memory location, right? Maybe I am too paranoid, but how do you make sure that the compiler doesn't optimize that away? However, 80 GB/s makes me feel quite confident that the compiler didn't do that. How large was the data you were actually using?

Yes, since I work with compressed formats, all is 1D.

I had my share of problems with the compiler optimizing out parts of my code, but I noticed that, besides what goes to output, what goes to shared or global memory is not optimized out. So the code is ok, even if it only prefetches the data.
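
A toy illustration of that pitfall (my own made-up kernels, not Serban's code):

// With no observable side effect, the whole loop may be dead-code-eliminated,
// and the "benchmark" ends up measuring nothing.
__global__ void bad_bench(const float *in, int n)
{
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += in[i];
    // acc is never stored anywhere, so the compiler is free to drop everything
}

// A single store to global memory keeps the reads alive.
__global__ void good_bench(const float *in, float *out, int n)
{
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += in[i];
    out[threadIdx.x] = acc;
}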

You can get the 80GB/s only if you have VERY LARGE data, 100MB+.

Another way to confirm both the bandwidth and the size effect is to benchmark some memory-bound CUBLAS operations, like SDOT. For me, the peak is 20GFlops (corresponding to 80GB/s) for vectors with at least 50,000,000 elements (so 200MB). 1 million elements gets me around 15GFlops (so 60GB/s), 100,000 gives around 4GFlops (16GB/s) and 10,000 gives about 0.5GFlops (2GB/s). So the bandwidth decreases rather drastically for small datasets.
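
(For anyone wondering how GFlops maps to GB/s here: SDOT does 2 flops per element pair and reads 8 bytes of data, i.e. 4 bytes per flop, so bandwidth ≈ GFlops × 4: 20 GFlops ≈ 80 GB/s, 15 ≈ 60, 4 ≈ 16, 0.5 ≈ 2.)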

Regards,

Serban