Texture Memory Cache Broadcast mechanism?

Short question: I know that the constant memory cache provides a broadcast mechanism, just as shared memory does. But what about the texture memory cache? Can it broadcast data to the threads in a warp?


Here is a little benchmark I did a while back. Check out the warp-coherent results, since those have all threads in a warp reading the same values from the array.

http://forums.nvidia.com/index.php?showtop...76&#entry256376

In short, texture memory reads are decently fast when all warps read the same value, but the total memory throughput is still only ~70 GiB/s. I've never seen texture reads perform faster than that.

Thank you, I found a lot of useful information in that thread.

I have one more question, but I need to describe my problem a little. I use the GPU for raytracing. The domain is spatially subdivided into a hierarchy described by a tree. Threads traverse this tree, and since the threads within a block have similar directions, their traversal sequences are similar. Therefore, most of the time they access the same items in memory. BUT not all the time.

I will try to draw a little ASCII diagram :)


   +--------------------------+
   |                          |
   V                          |
 traverse node                |
 (threads may take            |
  different branches)         |
   |                          |
   V                          |
 READ ------------------------+

I don't know if it helps, but ASCII is fun… anyway, the point is that there is a loop and the threads can go through different branches, but they will always hit the READ function, where they can be synchronized. Right now I don't synchronize them, and each thread reads from global (or texture) memory independently. But most of the time, they read the same values!

So here is my idea: I will allocate two arrays in shared memory, one for addresses and one for values. Each thread stores the address of its required item into the first array. Then only the first thread in the block reads all the values and writes them into the second array. The trick is that when this thread reads an item required by several threads, it stores the item at all the corresponding positions in the array of values. So there should not be any redundant global memory accesses.

In (pseudo)code:

// s_address: shared array of size THREADCOUNT (one slot per thread in the block)
// s_value:   shared array of size THREADCOUNT

__device__ int READ(int address)
{
    // Each thread stores the address of the item it needs.
    s_address[tid] = address;
    s_value[tid] = EMPTY;
    __syncthreads();  // make sure all addresses are visible

    if (tid == 0) {
        // Iterate over all threads.
        for (t = 0; t < THREADCOUNT; t++) {
            // Check if the value required by thread t still has to be read.
            if (s_value[t] == EMPTY) {
                // Read the value required by thread t from global memory.
                tadr = s_address[t];
                val = g_mem[tadr];
                // Store it for every thread that requested the same address.
                for (i = t; i < THREADCOUNT; i++)
                    if (s_address[i] == tadr)
                        s_value[i] = val;
            }
        }
    }
    __syncthreads();  // wait until thread 0 has filled s_value

    return s_value[tid];
}

I hope it should work; I wanted to hear another opinion. The main advantage is that only one thread reads values that are required by many threads, so it should save some time. Maybe there is a better way to do it, or maybe I'm completely wrong.

If you could give me some advice, or just say "dude, you totally misunderstood CUDA", I would appreciate it!

Thank you


I wouldn’t say that you totally misunderstand CUDA. Maybe just a little bit ;)

Let me start by saying that I think your idea has some merit. It is definitely worth writing a minimal benchmark to compare your idea vs. the texture cache. I find that microbenchmarking small pieces like this is often needed in CUDA to decide between different strategies.

With that being said, my guess is that it will be no faster than (and possibly slower than) just using the texture cache with independent threads. I say this only because my experiences with CUDA have taught me that the Keep It Simple method of design usually wins out in terms of performance. Also, the GPU is better able to interleave memory and computation when all warps operate independently without __syncthreads().

But these are the “rules” and your case might be an exception, so by all means test it out. I’d be curious to see the results, myself.

Thanks for the fast reply.

I have already tried using texture memory. I hoped it would give me better performance than global memory because of the cache, but it does not. In some cases (when the tree and the number of triangles in the scene are small), texture memory wins. However, with large scenes the cache misses are very common, so the overhead of the cache causes a slowdown and even non-cached global memory gives better results. At this point, I have to admit that I'm only getting approx. 5x better performance than an identical implementation on the CPU. Quite a shame :).

The memory reads are not coalesced at all, and I don't know of a simple way to coalesce them, but I don't think there is one.

Do you have any other advice or suggestions?