Bypassing cache in Fermi

In the programming guide, I saw an option among the compiler options for avoiding use of the L1 cache. Is there a way to specify this behavior for a particular variable (array)?
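
For reference, I believe the option in question is the one that changes the default load caching mode for the whole compilation, i.e. passing “-Xptxas -dlcm=cg” to nvcc, which makes every global load bypass L1 and cache only in L2. What I’m after is per-variable control rather than this all-or-nothing switch.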

Don’t know if this is available in CUDA, but you can do it in PTX. See Chapter 8.7.5.1 in ptx_isa_2.1.pdf.
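That section covers the PTX cache operators on loads. A rough sketch of what they look like (untested, paraphrased from the ISA document):

ld.global.ca.f32  %f1, [%rd1];   // .ca (the default): cache at all levels, L1 and L2
ld.global.cg.f32  %f2, [%rd1];   // .cg: cache globally, i.e. in L2 only, bypassing L1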

Just use the “volatile” keyword on the variable.

This does not actually seem to work, at least with CUDA 3.1. I used volatile, but the compiled PTX still contains only ‘ld.global’ loads, which go through the cache.
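
A minimal repro, in case anyone wants to check their own toolchain (hypothetical sketch; compile with “nvcc -arch=sm_20 -ptx” and inspect the generated load):

__global__ void copy_volatile(volatile const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Per the observation above, in CUDA 3.1 this load still shows up
    // as a plain ld.global in the PTX, with no cache operator attached.
    out[i] = in[i];
}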

Just wondering,

I understand why one would keep some variables out of the cache (to preserve cache locality for the data we know we want to stay there), but why would we want to disable the L1 cache altogether?

My own reason for doing that would be to measure the effect the cache has on my execution times. But I believe his question was in fact about a specific variable.

I think for specific variables the only option that has been put forward is to use inline PTX instructions.
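
To make that concrete, a wrapper along these lines should do it (just a sketch: the name __load_cg is made up, and the “l” constraint assumes a 64-bit build):

__device__ __forceinline__ float __load_cg(const float *address)
{
    float value;
    // ld.global.cg caches in L2 only, bypassing L1 (see the PTX ISA cache operators)
    asm("ld.global.cg.f32 %0, [%1];" : "=f"(value) : "l"(address));
    return value;
}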

You can also read certain values through the texture cache to avoid polluting the L1 cache.
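
For example, with the texture reference API (sketch only; the host-side binding is assumed):

texture<float, 1, cudaReadModeElementType> tex_in;   // file-scope texture reference

__global__ void gather(float *out, const int *indices, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex_in, indices[i]);     // goes through the texture cache, not L1
}

// Host side, before launching:
//   cudaBindTexture(0, tex_in, d_in, n * sizeof(float));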

The most common case for wanting this is in applications with incoherent memory access patterns where the L1 cache doesn’t much help anyway. Memory fetches that go through L1 always result in 128-byte transactions, but accesses that skip L1 and access L2 directly can have smaller granularities, which can help reduce over-fetch in the case of scattered access.
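
To put rough numbers on that: the L1 line size on Fermi is 128 bytes, while L2 can service 32-byte segments. A fully scattered gather of 4-byte words therefore moves 128 bytes per useful word through L1 (32x over-fetch), but only 32 bytes when it skips L1 and hits L2 directly (8x over-fetch).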

–Cliff

A couple of compiler intrinsics would solve this problem at the CUDA C level:

template <class T> T __load(T *address, LOAD_OPTIONS options);
template <class T> void __store(T *address, T value, STORE_OPTIONS options);
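
Usage might then look like this (entirely hypothetical, since no such intrinsics exist today; LOAD_CG and STORE_WB would map onto the PTX .cg and .wb cache operators):

float x = __load(ptr, LOAD_CG);   // fetch through L2 only, bypassing L1
__store(ptr, x, STORE_WB);        // default write-back behavior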
