Texture cache coherency

jpape · November 14, 2007, 9:27pm

I’d like some more information on the texture cache coherency model.

When are cached locations invalidated (other than a replacement)?
Would another block’s write to global memory invalidate the other SMs’ (and its own) texture caches?
Is there a way to force an address in the texture cache to become invalid, forcing it to re-fetch from global mem?

MisterAnderson42 · November 14, 2007, 10:08pm

No clue
Nope, you can only be sure the cache is up to date by letting the kernel finish and making a new kernel call.
Nope, but things are invalidated faster than you think from all the reads going on. Just imagine: 24 warps running on a multiproc, each thread reading a float4. One warp reads in index 0, then does something with it. While that warp is calculating, all the other warps are being time sliced and performing their reads: do the math and you will see that all 24 warps are accessing about 12k of memory, which is larger than the 8k cache. By the time things come around to the first warp, the value it read has been flushed from the cache by all the other reads.

The texture cache is mainly useful for those memory patterns that you can keep data-locality among the threads of a warp, but cant make the read coalesced due to algorithmic constraints.

Believe the programming guide when it says you cannot do interblock communication.

jpape · November 14, 2007, 10:20pm

Thanks for the response. As you guessed, this is for interblock communication. For my special case, I think I can get non-cacheable interblock communication working reliably , and I have a tendency to view the programming guide’s assertion that interblock communication isn’t supported as a challenge.

My application has block-boundary-to-block-boundary dependency as well as good shared-memory locality within a block. I would like to produce some data in one block and keep running that block while another consumes the data. I believe this would be more efficient than producing some data, ending the block, and doing a new kernel call to establish new blocks since the old blocks would have to save shared-mem context to global memory and the new blocks would have to reestablish shared-mem state between kernel calls.

seb · November 14, 2007, 10:32pm

If you produce data in one block and consume it in another block you have to use global memory anyway to transfer the data because shared memory is only shared within the same block.

What prevents you from producing and consuming the data within the same block? This way shared memory will be helpful.

jpape · November 14, 2007, 10:38pm

I will also be producing and consuming data within the same block, but I need to do inter-block producer-consumer to utilize multiple SMs.

Also, the dependency relationship is only one-way, so hopefully the producer block will fire off posted global writes, the consumer will have a high initial wait due to global memory write->read latency, and then the consumer block will be able to stream subsequent consumption since the producer will produce subsequent data during the load-latency of the consumer’s consume.

MisterAnderson42 · November 14, 2007, 11:34pm

But what if the device starts running all of your consumers first? All the producers will be sitting in the scheduler queue waiting for the consumers to complete… deadlock, unless you have a very small number of consumers (i.e. 1).

jpape · November 14, 2007, 11:43pm

I think I have that covered since I’m using a GPU with atomic updates. Instead of using gridDim.x to determine which data a block is producing and consuming, one thread in each block will do an atomic read-increment to a block_id counter in global memory, write the read-value to shared memory, __syncthreads(), and all threads will read the value to know which “block” they are.

Since, once running, blocks run to completion (see http://forums.nvidia.com/index.php?showtopic=51249 ), I no-longer care what order the blocks begin execution. The 1st block to execute becomes “block” 0, 2nd becomes “block” 1, etc.

I’m not sure whether there’s a good way to this dynamic block-assignment without the atomic operations.

jpape · November 14, 2007, 11:50pm

Well, nevermind, my GPU doesn’t support atomic readIncrement. I’ll have to think about this a little more. Of course the non-robust solution would be to use few blocks and assume/hope that they will all run simultaneously.

jpape · November 15, 2007, 12:01am

I also forgot to mention that the first producer has no dependencies, so if the blocks were issued in order of gridDim.x (for 1D grid), which is not guaranteed, then my algorithm would also be ok.

seb · November 15, 2007, 12:51am

I still don’t see the benefit in this. Or is it just a nice exercise?

You will have to share the data via global memory and therefore will have no benefit over splitting the problem into multiple kernels.

However I think I get your idea and with the help of atomic writes (and reads?) it could indeed be possible to create a consumer-producer relationship between blocks within one kernel call. However as I understand it this would only work once so producers would run once and then consumers would consume. As I understand it there would be no way to create a loop of produce->consume->produce->consume that would make any sense.

jpape · November 15, 2007, 2:28am

This is for a class project, and we’ve constrained the problem to make it a little tougher to see if the tougher problem can still be implemented well on a GPU.

We’re implementing the Smith-Waterman genome-matching algorithm, which compares a DNA sequence against a large database of known sequences. If I were doing this for any other purpose, I think the most natural partitioning is to partition the known sequences across blocks and do the entire sequence-sequence match within one block, which wouldn’t require any block-block communication. Alternatively, we’d pipeline the sequences through blocks. However, since this is an exercise, we’ve decided to see how fast we can do a single sequence-sequence compare in the GPU, and we want to do so across multiple blocks to maximize parallelism in the SMs.

The algorithm basically generates a scoring matrix between the two sequences being compared. Each element of the matrix depends on the values of the three elements above, left, and above-left. Suppose we partitioned this matrix into 4 quadrants, we could perform this calculation in 3 kernel calls. The 1st call would be 1 block and calculate the upper-left quadrant. The 2nd call would be 2 blocks (upper-right and lower-left), where each of the blocks depends on values of the upper-left quadrant’s edges. The 3rd call would be 1 block to calculate the lower-right quadrant, which depends on lower-left and upper-right.

We only need 2 SM’s for this case since our largest kernel only has 2 blocks in it. Suppose the 1st kernel’s block landed on SM0. When it is done computing (all of the computing is done in shared memory), it has to write the upper-left quadrant’s boundary elements to global memory so the 2 blocks in kernel 2 can access them. Now, in the 2nd kernel call, one of the two blocks may land on the same SM as the 1st kernel. That block has to read the upper-left quadrant’s elements from global memory to re-establish it’s faster shared-memory copy. Rather than saving shared-mem state to global memory at the end of the 1st kernel and then having to re-establish the shared-mem from global memory for the 2nd kernel, we think it would be better to write enough boundary elements to global memory to allow another block to satisfy it’s dependencies, but continue working into the next quadrant out of our already-existing shared-memory image.

With this approach, we assign the upper-left and upper-right quadrants to block0 and lower-left and lower-right quadrants to block1. When block0 finishes the upper-left quadrant, it writes it’s lower-edge elements to global memory and continues working on the upper-right quadrant (without having to write/read its right-edge to/from global memory). Block1 will see block0’s lower-edge elements become valid and start working on the lower-left quadrant. Similarly, when block 0 finishes its upper-right quadrant, it writes results to global mem, block1 sees the edge-elements are valid, and it completes the lower-right quadrant.

Basically, we’re just trying to optimize out a shared->global save state and a global->shared restore state as would be required if we used multiple kernel calls. Does any of this make sense?

External Media

Neeraj_Kulkarni · November 15, 2007, 9:12am

This is for a class project, and we’ve constrained the problem to make it a little tougher to see if the tougher problem can still be implemented well on a GPU.

We’re implementing the Smith-Waterman genome-matching algorithm, which compares a DNA sequence against a large database of known sequences. If I were doing this for any other purpose, I think the most natural partitioning is to partition the known sequences across blocks and do the entire sequence-sequence match within one block, which wouldn’t require any block-block communication. Alternatively, we’d pipeline the sequences through blocks. However, since this is an exercise, we’ve decided to see how fast we can do a single sequence-sequence compare in the GPU, and we want to do so across multiple blocks to maximize parallelism in the SMs.

The algorithm basically generates a scoring matrix between the two sequences being compared. Each element of the matrix depends on the values of the three elements above, left, and above-left. Suppose we partitioned this matrix into 4 quadrants, we could perform this calculation in 3 kernel calls. The 1st call would be 1 block and calculate the upper-left quadrant. The 2nd call would be 2 blocks (upper-right and lower-left), where each of the blocks depends on values of the upper-left quadrant’s edges. The 3rd call would be 1 block to calculate the lower-right quadrant, which depends on lower-left and upper-right.

We only need 2 SM’s for this case since our largest kernel only has 2 blocks in it. Suppose the 1st kernel’s block landed on SM0. When it is done computing (all of the computing is done in shared memory), it has to write the upper-left quadrant’s boundary elements to global memory so the 2 blocks in kernel 2 can access them. Now, in the 2nd kernel call, one of the two blocks may land on the same SM as the 1st kernel. That block has to read the upper-left quadrant’s elements from global memory to re-establish it’s faster shared-memory copy. Rather than saving shared-mem state to global memory at the end of the 1st kernel and then having to re-establish the shared-mem from global memory for the 2nd kernel, we think it would be better to write enough boundary elements to global memory to allow another block to satisfy it’s dependencies, but continue working into the next quadrant out of our already-existing shared-memory image.

With this approach, we assign the upper-left and upper-right quadrants to block0 and lower-left and lower-right quadrants to block1. When block0 finishes the upper-left quadrant, it writes it’s lower-edge elements to global memory and continues working on the upper-right quadrant (without having to write/read its right-edge to/from global memory). Block1 will see block0’s lower-edge elements become valid and start working on the lower-left quadrant. Similarly, when block 0 finishes its upper-right quadrant, it writes results to global mem, block1 sees the edge-elements are valid, and it completes the lower-right quadrant.

Basically, we’re just trying to optimize out a shared->global save state and a global->shared restore state as would be required if we used multiple kernel calls. Does any of this make sense?

External Media

[snapback]279698[/snapback]

jpape,

We have been working on molecular sequence alignment techniques on the gpu for last few months. I have implemented the original smith - waterman

without gotoh’s optimization and have gained a scale up of over 12 times.

I have implemented a serial vector update technique that calculate MAX(H i-l , H i-k , H i-i j-1) But the problem always was the Block synchronization so i could not move to the next minor diagonal. Hence the __syncblock 101 problem !!

It seems we have a common interest here.

We are also trying to work on variants on N-BLAST and trying to optimize it further using data primitives on the GPU. We are also working on ORF and RNA splicers and seeing if they can be mapped to the GPU.

Cheers,

Neeraj

jpape · November 15, 2007, 4:18pm

I’ll be sure to let you know what my results are. We only have 4 weeks to complete the project, so I don’t know whether we’ll be able to achieve much speedup.

Are you using single-affine, double-affine?

Are you testing the performance of a bunch of sequences or just a single sequence?

Neeraj_Kulkarni · November 16, 2007, 8:17am

We are having gap open penalties and gap extension penalties as same, for now. (The original Smith Waterman algorithm). We are using the linear approach.

As for testing we are firing two dummy sequences of varied lengths and optimizing it further. We might use curated databases later.

Topic		Replies	Views
Doubts related to CUDA CUDA Programming and Performance	17	11875	November 18, 2010
__syncblocks 101 Primitives for Interblock syncronization CUDA Programming and Performance	16	10111	February 29, 2008
global mem reads coalesced per block or warp? CUDA Programming and Performance	5	5548	March 6, 2007
Newbie - Need to use shared mem? CUDA Programming and Performance	27	15122	December 17, 2008
Shared memory question CUDA Programming and Performance	27	7530	June 23, 2008
About texture cache and spatial locality CUDA Programming and Performance	15	11371	July 24, 2009
Texture Cache Coherency CUDA Programming and Performance	5	2208	April 28, 2009
Texture memory coherency with respect to concurrent kernel execution CUDA Programming and Performance	4	3868	June 9, 2011
CUDA Memory Consistency CUDA Programming and Performance	23	55840	March 8, 2007
Good way to do sync between/across cores and threadblocks? CUDA Programming and Performance	14	13269	February 8, 2011

Texture cache coherency

Related topics