Texture Memory Cache Broadcast mechanism?

Short question: I know that the constant memory cache provides a broadcast mechanism, just as shared memory does. But what about the texture memory cache? Can it broadcast data to the threads in a warp?


Here is a little benchmark I did a while back. Check out the warp-coherent results, since those have all threads in a warp reading the same values from the array.

http://forums.nvidia.com/index.php?showtop...76&#entry256376

In short, texture memory reads are decently fast when all warps read the same value, but the total memory throughput is still only ~70 GiB/s. I've never seen texture reads perform faster than that.

Thank you, I found a lot of useful information in that thread.

I have one more question, but I need to describe my problem a little. I use the GPU for raytracing. The domain is spatially subdivided into a hierarchy described by a tree. Threads traverse this tree, and since the threads within a block have similar directions, their traversal sequences are similar. Therefore, most of the time they access the same items in memory. BUT not all the time.

I will try to draw a little ASCII diagram :)


   +--------------------------+
   |                          |
   V                          |
 traverse node                |
 (threads may take            |
  different branches)         |
   |                          |
   V                          |
 READ ------------------------+

I don't know if it helps, but ASCII is fun… anyway, the point is that there is a loop and the threads can go through different branches, but they will always hit the READ function, where they can be synchronized. Right now I don't synchronize them, and each thread reads from global (or texture) memory independently. But most of the time, they read the same values!

So here is my idea: I will allocate two arrays in shared memory, one for addresses and one for values. Each thread stores the address of its required item into the first array. Then only the first thread in the block reads all the values and writes them into the second array. The trick is that when this thread reads an item required by several threads, it stores the item at all the corresponding positions in the array of values. So there should not be any redundant global memory accesses.

In (pseudo)code:

// s_address: shared array of size THREADCOUNT (one slot per thread in the block)
// s_value:   shared array of size THREADCOUNT

__device__ int READ(int address)
{
    // Each thread stores the address of the item it needs.
    s_address[tid] = address;
    s_value[tid] = EMPTY;
    __syncthreads();  // make sure all addresses are visible

    if (tid == 0) {
        // Iterate over all threads.
        for (t = 0; t < THREADCOUNT; t++) {
            // Check if the value required by thread t still has to be read.
            if (s_value[t] == EMPTY) {
                // Read the value required by thread t from global memory.
                tadr = s_address[t];
                val = g_mem[tadr];
                // Store it for every thread that requested the same address.
                for (i = t; i < THREADCOUNT; i++)
                    if (s_address[i] == tadr)
                        s_value[i] = val;
            }
        }
    }
    __syncthreads();  // wait until thread 0 has filled s_value

    return s_value[tid];
}

I hope it should work; I wanted to hear another opinion. The main advantage is that only one thread reads values that are required by many threads, so it should save some time. Maybe there is a better way to do it, or maybe I'm completely wrong.

If you could give me some advice, or just say "dude, you totally misunderstood CUDA", I would appreciate it!

Thank you


I wouldn’t say that you totally misunderstand CUDA. Maybe just a little bit ;)

Let me start by saying that I think your idea has some merit. It is definitely worth writing a minimal benchmark to compare your idea vs. the texture cache. I find that microbenchmarking small pieces like this is often needed in CUDA to decide between different strategies.

With that being said, my guess is that it will be no faster than (and possibly slower than) just using the texture cache with independent threads. I say this only because my experiences with CUDA have taught me that the Keep It Simple method of design usually wins out in terms of performance. Also, the GPU is better able to interleave memory and computation when all warps operate independently without __syncthreads().

But these are the “rules” and your case might be an exception, so by all means test it out. I’d be curious to see the results, myself.

Thanks for the fast reply.

I have already tried using texture memory. I hoped it would give me better performance than global memory because of the cache, but it does not. In some cases (when the tree and the number of triangles in the scene are small), texture memory wins. However, with large scenes the cache misses are very common, so the overhead of the cache causes a slowdown and even non-cached global memory gives better results. At this point, I have to admit that I'm only getting approx. 5x better performance than an identical implementation on the CPU. Quite a shame :).

The memory reads are not coalesced at all, and I don't know of a simple way to coalesce them, but I don't think there is one.

Do you have any other advice or suggestions?