3GB can it be read as texture?

I have a 3 gigabyte global array. It is currently accessed as unsigned int (4 bytes).
I think it would make sense to read it in 64-byte chunks.
The documentation for K20c says max size of texture is 2**27 but I think this refers
to the size of the index, ie if accessed as 4-byte words the texture can cover
4 * 2**27 bytes. QUESTION am I right?

If so could I define the texture with a 64byte structure and so the max index would be
about 47million (which is less than 2**27). QUESTION would this work?

Finally the examples I have been using date back to CUDA 5.0 or earlier and I suspect
newer versions of CUDA have changed how to program textures. (I noticed cuda-6.0 samples
has a “bindless” texture.) Does anyone have working (CUDA 6) examples of read access
to huge arrays via textures they would be prepared to post?

Many thanks
Bill
http://www.cs.ucl.ac.uk/staff/W.Langdon/

I think the largest textures have uint4 or float4 elements, i.e. 16 bytes per texel.
That will not allow you to address 3GB of memory with one 1D texture.

2D may allow you to address more memory at once, but then you have to deal with
indexing along two axes (causing a bit of overhead)
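
The arithmetic behind this reply can be sketched as follows. This is only a sketch: it assumes the 2**27 figure quoted above is the limit in texels for a 1D texture on linear memory, and it takes "3GB" to mean 3 * 2**30 bytes. The names kMaxTexels1D, texture_bytes and textures_needed are made up for illustration. It is host-side C++ that should also compile inside a .cu file.

```cpp
#include <cstdint>

// Assumption: 2**27 texels is the 1D limit quoted in the question for K20c.
constexpr std::uint64_t kMaxTexels1D = 1ull << 27;

// Bytes that one 1D texture can cover for a given texel size in bytes.
constexpr std::uint64_t texture_bytes(std::uint64_t texel_bytes) {
    return kMaxTexels1D * texel_bytes;
}

// Number of 1D textures needed to cover array_bytes, rounded up.
constexpr std::uint64_t textures_needed(std::uint64_t array_bytes,
                                        std::uint64_t texel_bytes) {
    return (array_bytes + texture_bytes(texel_bytes) - 1) / texture_bytes(texel_bytes);
}
```

Under these assumptions a 16-byte texel (uint4 or float4) gives 2**27 * 16 = 2**31 bytes per texture, so a 3 * 2**30 byte array would need two such textures, and six if read as 4-byte words.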

(1) You could cover 3 GB by using multiple textures. I used this approach on pre-Kepler hardware. This is a bit cumbersome, but I hid the complexity of the “texture-splitting” inside a macro used for read access.

(2) If you are on K20c, you can get the benefits of loads through the texture path with the LDG instruction without setting up any textures. This approach forgoes other benefits of textures such as cheap interpolation. The compiler can generate LDG automatically, but there are no guarantees. You can use cuobjdump --dump-sass to check whether regular LD or LDG is generated. Religious use of the “const” and “__restrict__” modifiers on all applicable pointer arguments helps the compiler generate the instruction (see Programming Guide). The __ldg() intrinsic gives direct access to the instruction, without having to rely on compiler magic.
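
A minimal sketch of approach (2), assuming a device of compute capability 3.5 or higher (K20c qualifies) and compilation with something like nvcc -arch=sm_35. The kernel name copy_ldg and its parameters are hypothetical and only show where the qualifiers and the intrinsic go.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: copies n words, reading the input through the
// read-only (texture) data path.
// const + __restrict__ tell the compiler the input is never written and
// never aliased, which is what lets it choose LDG on its own.
__global__ void copy_ldg(const unsigned int* __restrict__ in,
                         unsigned int* __restrict__ out,
                         size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        // __ldg() requests the LDG instruction explicitly instead of
        // relying on the compiler to infer it.
        out[i] = __ldg(&in[i]);
    }
}
```

Dumping the SASS of the resulting binary with cuobjdump should show whether LDG was actually emitted for the load.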

Dear cbuchner1 and njuffa,
Thank you very much for your helpful replies.
So it sounds like I cannot simply map 3GB via a single 1D texture.

The whole kernel is running very badly. I think this is due to the way it is
reading from the 3GB array. I would guess the additional indexing overhead
of using a 2D texture would not matter. But can I map all the 3GB as a 2D
texture? (Addressing it as uint would need x,y indexes < 28000.)

I have not yet tried const and restrict or __ldg()

I am kinda depressed about how ineffective the L1 cache has been. OK, it's not
an ideal access pattern but for each thread there are up to 8 accesses close
together (<16 words apart). I was thinking the first would cause a cache line
to be loaded but then the following 7 would read directly from it
(without incurring a huge delay).

Check on the texture size limitations for your compute capability. There is a table in an appendix of the Programming Guide that states the limits. You may be able to use a single large texture by using CUDA-arrays instead of normal linear memory. Since LDG gives access to the texture read path, I have not used explicitly bound textures in a while (I don’t typically have need for low-precision interpolation).

The caches on the GPU are tiny, in particular if you consider how many threads share the cache. The amount of cached data per thread works out to a few words. This is a fundamental trade-off GPUs make: By spending only a small fraction of the die area on memory (as opposed to CPUs which spend the majority of the die area on memory), that die area is available for computational resources.

Furthermore, per the Programming Guide appendix G, on sm_3x, the L1 cache is used to cache local memory accesses only; only the L2 cache is used to cache both global and local memory. From your description it sounds like going through the texture path should help (reasonable locality of accesses). Also, consider using the profiler to extract memory efficiency metrics and guide the optimization process.

Dear njuffa,
Thanks again. I have got the texture size limit in
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications
One source of confusion (at least for me) is Table 12 gives the “Maximum width and height”
as pure numbers (without units). I think(?) these are limits on the indexes used when fetching
a texel (rather than size in bytes). So a K20c should be ok with x,y up to 28000.
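
If the limits are indeed in texels, mapping the existing linear word index onto 2D coordinates could look like the sketch below. The row width is a free choice here; the value 32768 used in the note afterwards is only an example and would have to be checked against the table for the device in question. Texel2D, to_2d and rows_needed are hypothetical names.

```cpp
#include <cstdint>

// (x, y) coordinates of one texel.
struct Texel2D { std::uint64_t x, y; };

// Map linear word index i to (x, y) for rows that are `width` texels wide.
constexpr Texel2D to_2d(std::uint64_t i, std::uint64_t width) {
    return Texel2D{ i % width, i / width };
}

// Rows needed to hold n_words texels at the given width, rounded up.
constexpr std::uint64_t rows_needed(std::uint64_t n_words, std::uint64_t width) {
    return (n_words + width - 1) / width;
}
```

Taking 3GB as 3 * 2**30 bytes gives 805306368 four-byte words; at a width of 32768 texels that is 24576 rows. A power-of-two width also lets the compiler turn the divide and modulo into a shift and a mask.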

Ah-ha I had forgotten (in compute level 3.x) that the L1 cache is not used for global memory
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability-3-0
This sounds like it is probably why the kernel is so slow. So textures or LDG may be the way to go.

As an alternative, will memcpy (in device code) allow a single thread to prefetch 16 words
as a single (coalesced) read?

Have tried nvvp (cuda 6.0) and it also suggests a problem with memory access (6 GB/sec
rather than in the region of 100). nvvp is way better than it used to be (before cuda 6)
but still fails sometimes.

Thanks again
Bill

I am fairly certain the maximum texture dimensions are stated in units of texels.

Not knowing anything about your application, I am not sure how realistic this is, but you may want to consider changing your data layout, using vectorized loads (e.g. uint4), blocking in shared memory or registers to improve the effective memory throughput.

I am not sure what you mean by “nvvp … fails sometimes”. If you observe behavior that you consider a bug, I would suggest filing a bug report, or an enhancement request if usage is inconvenient, missing a desired feature etc. Generally speaking, the profiler tends to work better with each new hardware architecture due to improved support from the hardware side.

Dear njuffa,
Thanks for confirming the units of the maximum texture dimensions.

Any thoughts on using memcpy() to prefetch inside a kernel device function?
I essentially want each thread to load up to 64 bytes in one go. (I guess this
would count as a vectorized load?)
(My experiment with memcpy went horribly wrong. I was trying to get nvvp to tell me why,
when everything slowed to a crawl. At present I can't say why, but if I get something
definite I will try and log a bug report.) Anyway nvvp is much better than it used to be.

Yip I agree. It's legacy code, so I’m not 100% sure of it yet. It may be the data
can be re-arranged without a huge knock-on effect.

Thank you
Bill

Well then each thread can do 4 loads via __ldg() of the uint4 or float4 type. I have actually done this before and indeed it does run significantly faster than if I had used non-vectorized loads of floats, or had not used __ldg().

Once you have loaded the uint4 or float4 values you can use reinterpret_cast<> to break them down into smaller values for consumption.

I actually have an application where the initial values are byte sized and this enables me to load 16 values in one go and then extract via casting each of the byte sized values.

A uint4 or float4 is just a struct, and the individual components can be accessed as .x, .y, .z, and .w components. I do not perceive a need to use reinterpret_cast<>, but maybe I misunderstand the use case.

Yes, sorry I was talking about the use case of breaking down a uint4 into 16 uchar1 types using reinterpret_cast<> like this:

uint4 temp_load_val=__ldg(&data[idx]);

uchar4 group0=*reinterpret_cast<uchar4 *>(&temp_load_val.x);
val+=float(group0.x)+float(group0.y)+float(group0.z)+float(group0.w);
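
For comparison, the same unpacking can be written with shifts and masks on the .x, .y, .z and .w members, which avoids the pointer cast. This is a sketch under assumptions: the helper sum_bytes and the kernel sum16 are hypothetical names, the data is assumed to be 16-byte aligned, and __ldg() needs compute capability 3.5 or higher.

```cuda
#include <cuda_runtime.h>

// Sum of the four bytes packed in one 32-bit word, without pointer casts.
__host__ __device__ inline float sum_bytes(unsigned int w)
{
    return float(w & 0xffu) + float((w >> 8) & 0xffu)
         + float((w >> 16) & 0xffu) + float(w >> 24);
}

// Hypothetical kernel: each thread loads 16 bytes with a single __ldg()
// and sums them as 16 separate byte values.
__global__ void sum16(const uint4* __restrict__ data,
                      float* __restrict__ out,
                      size_t n)
{
    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (idx < n) {
        uint4 v = __ldg(&data[idx]);   // one 128-bit load
        out[idx] = sum_bytes(v.x) + sum_bytes(v.y)
                 + sum_bytes(v.z) + sum_bytes(v.w);
    }
}
```

Which form is faster would need measuring; the compiler may well produce the same byte-extract instructions for both.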

Dear CudaaduC and njuffa,
Thanks for your help.
It seems __ldg() is the way to go … I will investigate further.

Nobody likes the idea of using memcpy?
Even with ulong4 each thread would need multiple reads. (Each group of reads straddles up to 64 bytes)
but if __ldg() can force reads via a cache it should be ok to rely on the cache
hardware doing a single prefetch from off-chip memory.

Thanks again
Bill

Loads via __ldg() go through what used to be called texture cache and I think is now called the read-only cache. The important thing to keep in mind is that GPU caches are small. Any data pulled into the cache is bound to disappear from it very quickly. So it is advisable to rely on caching as little as possible while improving effective memory bandwidth through regular and suitable (i.e. base + thread_index) access patterns as much as possible.

In case anyone else wants to know more about __ldg() here is one of njuffa’s earlier posts
https://devtalk.nvidia.com/default/topic/527670/why-l1-cache-hit-ratio-become-zero-on-k20-/

Ps: if you get compiler error
error: identifier “__ldg” is undefined
try adding -arch (e.g. -arch=sm_35; __ldg() needs compute capability 3.5 or higher) to your nvcc command line

Another quick question if I may: am I right in thinking that __ldg() only works on items
of up to 64bits, (eg long long)?
Thank you
Bill

I use it all the time with 128 bit memory chunks, like the uint4 or float4 type.

The supported types are listed in the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ldg-function

Dear CudaaduC and txbob,
Thank you for rapid and helpful reply.
Going from __ldg reading 8 bytes to 16 bytes has sped my kernel by 31%

In case anyone is interested here is a snippet of my code

#define ldg_t uint4
const ldg_t* p = (ldg_t*)&(global[base]);
for(int x=0;x<16*sizeof(uint32_t)/sizeof(ldg_t);x++)
  ((ldg_t*)mycache)[x] = __ldg(&p[x]);
#undef ldg_t

Bill

Am I right in thinking that __ldg() does not provide any checks on its argument?
Whereas a texture may provide defined values outside the texture’s boundaries.

CUDA-MEMCHECK says
Invalid global read of size 16
Address 0x8003fffc0 is out of bounds
Then gives addresses relating to /usr/lib64/libcuda.so
Unfortunately I am having difficulty relating memcheck’s output
to locations in my (buggy) code.
I tried compiling with -G but that made no difference to memcheck’s output
whereas nvcc -Xcompiler -rdynamic -lineinfo -g -G causes memcheck to include the
names of the device functions being called when the memory access failed.

So actually confirming that the error is an Address out of bounds in __ldg(uint4 const *)
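
Since __ldg() is an ordinary load with no clamping, any range check has to be done by hand before the load is issued. A minimal sketch, assuming the array length in words is passed to the kernel, that base is a multiple of 4 words, and that the array itself is 16-byte aligned (a misaligned uint4 load is a separate error). window_in_bounds and load64 are hypothetical names, not part of the code posted above.

```cuda
#include <cuda_runtime.h>

// True when the 16-word (64-byte) window starting at word index base
// lies entirely inside an array of n_words words.
__host__ __device__ inline bool window_in_bounds(size_t base, size_t n_words)
{
    return n_words >= 16 && base <= n_words - 16;
}

// Guarded variant of the 64-byte load: copies four uint4 values into dst,
// or returns false without loading anything if the window would run
// past the end of the array.
__device__ inline bool load64(const unsigned int* __restrict__ global,
                              size_t base, size_t n_words, uint4* dst)
{
    if (!window_in_bounds(base, n_words)) return false;
    const uint4* p = reinterpret_cast<const uint4*>(global + base);
    for (int x = 0; x < 4; ++x) dst[x] = __ldg(&p[x]);
    return true;
}
```

A check like this costs a compare per window, so it may be preferable to keep it only while debugging, or to pad the allocation so the last window always fits.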

Bill