I am trying to compare the performance of texture fetch and usual memory fetch

I am trying to compare the performance of texture fetch and usual memory fetch. I dont know where i am going wrong, The texture fetch is 0.02ms slower than normal memory fetches, but texture fetches are supposed to be faster than global memory access. Could you please help with this

[codebox]

#include<stdio.h>

#include<cuda.h>

texture<int, 1, cudaReadModeElementType> texref;

global void fetchfromtexture(int n, int *a)

{

int idx = blockIdx.x*blockDim.x + threadIdx.x;

if(idx < n)

{

int x = tex1D(texref, idx);

a[idx] = x;

}

}

global void fetchfromram(int n, int *a_d, int *a_d2)

{

int idx = blockIdx.x*blockDim.x + threadIdx.x;

if(idx < n)

a_d2[idx]=a_d[idx];

}

main()

{

int n, *a_h, *a_h2, *a_d, *a_d2, i;

float time, time2;

cudaEvent_t start,stop,start2,stop2;

cudaEventCreate(&start);

cudaEventCreate(&stop);

cudaEventCreate(&start2);

cudaEventCreate(&stop2);

printf("Enter array size: ");

scanf("%d",&n);

size_t size = sizeof(int)*n;

a_h=(int *)malloc(size);

a_h2=(int *)malloc(size);

for(i=0;i<n;i++)

a_h[i]=i*2;

printf("\n the array is");

for(i=0;i<n;i++)

 printf("%d  ", a_h[i]);

cudaMalloc((void**)&a_d,size);

cudaMalloc((void**)&a_d2,size);

cudaMemset(a_d, 0, size);

cudaMemset(a_d2, 0, size);

cudaMemcpy(a_d,a_h,size,cudaMemcpyHostToDevice);

int nb=n/32+((n%32==0)?0:1);

cudaArray* cuArray;

cudaChannelFormatDesc channelDesc=cudaCreateChannelDesc(32,0,0,0,cudaChannelFormat

KindUnsigned);

cudaMallocArray(&cuArray, &channelDesc, n, 1);

cudaMemcpyToArray(cuArray, 0, 0, a_h, size, cudaMemcpyHostToDevice);

texref.filterMode = cudaFilterModePoint;

texref.normalized = false;

texref.addressMode[0] = cudaAddressModeWrap;

texref.addressMode[1] = cudaAddressModeWrap;

cudaBindTextureToArray(texref, cuArray, channelDesc);

cudaEventRecord(start,0);

fetchfromtexture<<<nb,32>>>(n,a_d);

cudaThreadSynchronize();

cudaEventRecord(stop,0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&time,start,stop);

cudaEventRecord(start,0);

fetchfromram<<<nb,32>>>(n,a_d,a_d2);

cudaThreadSynchronize();

cudaEventRecord(stop,0);

cudaEventSynchronize(stop);

cudaEventElapsedTime(&time2,start,stop);

printf("\n\n\nThe values fetched from texture is: ");

cudaMemcpy(a_h2, a_d, size, cudaMemcpyDeviceToHost);

for(i=0;i<n;i++)

printf("  %d  ", a_h2[i]);

printf("\nAnd time taken was %.2fms", time);

for(i=0;i<n;i++)

a_h2[i]=0;

printf("\n\n\nThe values fetched from global memory is: ");

cudaMemcpy(a_h2, a_d2, size, cudaMemcpyDeviceToHost);

for(i=0;i<n;i++)

printf("  %d  ", a_h2[i]);

printf("\nAnd time taken was %.2fms\n", time2);

cudaFreeArray(cuArray);

cudaUnbindTexture(texref);

return 0;

}

[/codebox]

i coded with reference to the example program given in programming guide

Try to measure some misaligned and/or non-linear accesses as well. In the perfectly coalesced case, there is no difference in bandwidth.

I understand your frustration, I have been trying to decipher efficient texture access but the nvidia SDK examples aren’t clear and don’t explain the reasoning for what they are doing. I think that this is one of the last big problems that CUDA has to hurdle before it really takes over any industry

One thing I do have to say sharat is while textures are meant to give some locality, its only faster under the condition that your memory is close in 2D space. Nobody except for a few NVidia employees know the exact implementation, but I would suggest that while you are waiting for a real response to this topic you read up on space filling curves. The two which are normally used for textures are Z order and Hilbert Curves. If you could respond and let me know what kind of pattern your accessing your memory in, either I or another forum member may be able to give you tips, or at least tell you if your on the right track. I know for myself thats sometimes all I need.
I believe that I once read in ATI’s documentation that they use Hilbert curves, but nvidia isn’t as transparent (I wish).

If any forum member or Nvidia employee is willing to offer some example code with a brief description, I am sure that many other forum members such as myself would greatly appreciate it.

Thanks!
Clamport

I have tested with different program which is of non coalesced access, The program finds the maximum value in the array. Same problem again, texture memory is 0.02ms slower.

Thanks for understanding my problem. I dont have knowledge about Z-order and hilbert curve, I will try to get some help and let u know.

Z-order isn’t for 2d cudaArray ?(here we are working in 1d). I don’t think texture is supposed to be faster than global mem. It’s said somewhere in the Programming Guide that texture is a cache with constant latency. this latency is speeder than the glob memory’s one for 1 fetch, but not for a whole warp’s fetch when you have coalesced read. For the non-coalesced case, if your fetch aren’t “close”, you have a cache miss so it’s surpising than it’s slower too.
For my opinion, 1d texture isn’t a good idea because if you have 1d locality, you have coalescing. (The only intersting case would be on an old card of compute capability 1.0 or 1.1 where coalescing require accessing data in sequence). The main advantage of 1D texture resides in texture’s natives functions (interpolation, closest point, etc…)

If you wan’t to compare texture fetch and global memory fetch performance, make a 2d test.

1d textures can be very useful if you are reading linearly through an unaligned array. They basically double the throughput on 1.2/1.3 devices as coalesced unaligned reads are turned into two memory accesses, while with the texture only one goes to memory and the other one comes from the cache.

You ignore cases where coalescing is impossible. I have kernels that only need to read in particles in the neighborhood of a given particle, thus the accesses are semi-random. Using tex1Dfetch is a factor of 5 faster than using uncoalesced memory reads.

To the OP: it is best to think of tex1Dfetch as an “efficient uncoalesced memory read”. If all accesses within the warp have decent locality, the texture fetches will provide extreme performance benefits over global memory reads. If you already have coalesced reads, there will be no performance benefit.

Fermi’s L1 cache makes things even more interesting, one I still don’t have a good grasp of. Sometimes tex1Dfetch is faster and sometimes global mem reads through L1 are faster…

Alright this is where my confusion begins. If your talking about particles, I assume your working in 2D, correct? If that is so, why are you using tex1Dfetch? From what I have read in the CUDA documentation tex1Dfetch offers locality along the same row but not column wise, or am I mistaken and ALL textures have 2D locality.

I would be very interested to see your code, if you would mind posting some snippits so us not fortunate enough to completely understand texture fetches might be able to benefit from it.

Thanks,

Clamport

Sure, I’m happy to elaborate some.

I’m actually working in 3D. The same principles apply in 2D. tex2D is of little use for particle simulations because particles aren’t nicely laid out in a rectangular grid :) tex2D would be useful for simulations on a lattice, though.

Here is the short version: for the long version, you can read my paper (link at end).

I’ve got an array of particles (demonstrated in 2D for simplicity):

d_pos = [ (0,0) (1,0.1) (10, 5) (11,6) (-1,-0.5)]

In real simulations, there are 10,000 - 100,000+ particles in 3D. The science of the models we run (this is Molecular Dynamics by the way, if you want to look it up) requires that we only compute interactions between pairs of particles with a distance rcut of one another. Using that rcut, we generate a neighbor list that lists all the neighbors within rcut for each particle. The algorithms actually used to generate the nlist are too long to describe here: see the paper if you are interested. Suffice to say that it is a slow inefficient process, so we include “extra” neighbors so that the list only needs to be updated every so often.

So the nlist looks like this: each column lists the indices of the neighbors for the particle at that column. I.e. particle 0 has neighbors 1 and 4, assuming rcut=2 for this example. Real simulations have nlists around ~100 per particle.

d_n_neigh = [2 2 1 1 2]

d_nlist = [ 1 0 3 2 0

			4 4	 1]

Now, computing the interactions between neighbors is easy and efficient given this list:

__global__ compute_interations(...)

{

int i = blockIdx.x*blockDim.x + threadIdx.x;

my_pos = pos[i];

n_neigh = d_n_neigh[i];

for cur_neigh = 0 to n_neigh-1

	j = d_nlist[cur_neigh][i]

	neigh_pos = pos[j]

	.. compute and tally interaction between i and j ...

end

.. write out final result for i ..

}

Now, I think everyone can agree that neigh_pos = pos[j] will be uncoalesced and that there is absolutely no way to make it coalesced. The reads depend on the order of the particles in the list and the order they end up in the neighbor list. If this read is performed with tex1Dfetch (and we apply a little trick), performance is 3x-5x faster than using global memory reads. The little trick is that in these simulations, particles diffuse everywhere and their order is essentially random. By reordering the particles along a hilbert curve periodically (every 300 steps), maximum performance is attained. The reordering puts particles that are near each other in space near each other in the array - maximizing the texture hit rate in the tex1Dfetch as much as one can.

For all the gory details, read my paper on the subject: http://codeblue.umich.edu/hoomd-blue/about.html#paper . You can also download the code from the website and browse it if you wish. All the cuda kernels are nicely tucked in the libhomd/cuda directory.

Regarding the portion of the cuda programming guide you referenced, I think that either paragraph is a bit misleading or it is implicitly referring to tex2D only. tex1Dfetch definitely only performs well with 1D locality in the reads. tex2D performs well with 2D locality and tex3D performs well with 3D locality in the reads. tex2Dfetch is a convenience for indexing and actually only has 1D locality along the rows.

The conventional wisdom on the forums (and has been loosely alluded to by NVIDIA) is that textures stored in cudaArray are on a Z-order curve, providing the effect of good 2D locality in an otherwise 1D locality hardware cache. The hilbert curve I use to reorder the particles is the exact same principle, and is just a few % faster than the same applied with a Z-order curve. The difference, as I noted at the beginning, is that the built in tex2D and tex3D hardware only works when accessing a dense lattice and I have an array of off-lattice coordinates.

On behalf of all of us who didn’t quite understand textures ( and some who still don’t :) ), thank you for your post Anderson.
~Clamport