I need to do 1D lerp on a large number of 1D float vectors (possibly up to millions). Every vector has the same length, typically several hundred to a thousand elements. Each block will process many vectors (chosen randomly); sampling points within a vector are usually dense, but there is no locality between vectors.
I want to utilize the texture units to achieve this. However, one 1D-layered texture object only supports an array of up to 16384x2048, so I would need to create hundreds of texture objects. Moreover, the vectors will be updated by another kernel before further lerps. Since layered textures are only supported through cudaArray, re-copying seems mandatory. I considered updating the cudaArray through surface objects, but surfaces do not support the atomic reductions my algorithm needs. Currently I'm trying to use arrays of 2D texture objects backed by pitched linear memory, but I'm not sure it's the right choice.
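To be concrete, here is a minimal sketch of how I create one of those 2D texture objects over pitched linear memory, with each row holding one vector and hardware linear filtering enabled (the function and variable names are just placeholders for this post):

```cuda
#include <cuda_runtime.h>

// Sketch only: one 2D texture object over pitched linear memory.
// makeVecTexture / d_data are placeholder names; d_data and pitchBytes are
// assumed to come from cudaMallocPitch so the pitch alignment requirement holds.
cudaTextureObject_t makeVecTexture(float *d_data, size_t pitchBytes,
                                   int vecLen, int numRows)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType                  = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr       = d_data;
    resDesc.res.pitch2D.desc         = cudaCreateChannelDesc<float>();
    resDesc.res.pitch2D.width        = vecLen;    // elements per vector
    resDesc.res.pitch2D.height       = numRows;   // vectors in this object
    resDesc.res.pitch2D.pitchInBytes = pitchBytes;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0]   = cudaAddressModeClamp;
    texDesc.addressMode[1]   = cudaAddressModeClamp;
    texDesc.filterMode       = cudaFilterModeLinear;   // hardware lerp
    texDesc.readMode         = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}
```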
While considering these possible approaches, I realized I know very little about the details deep inside texture memory, so I have written down some statements based on my understanding. I would really appreciate it if someone with better knowledge could confirm them or clarify my misunderstandings. Thanks very much!
Here are my understandings and questions:
The elements of a cudaArray follow a Z-order curve pattern (as stated in another thread here). Thus, even though the interpolation never crosses layers, locality between layers is still needed for efficient use of memory bandwidth. This is also the case for a 2D texture backed by a cudaArray, even when the lerp is always along one axis (say, one coordinate is i+0.5f). I am not sure about 2D textures backed by pitched memory.
The bandwidth of texture memory is shared with normal global reads. Textures are likely to save bandwidth only through cache hits, or through better coalescing due to the spatial locality of the Z-order curve.
As stated in https://devblogs.nvidia.com/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility/, a kernel can support up to 1 million texture objects. I've checked the value of cudaTextureObject_t; it is numbered sequentially as 1, 2, ... There must be some context-dependent state initialized when creating them. Is it possible to call cudaCreateTextureObject or cudaDestroyTextureObject for a long contiguous sequence rather than one by one? The resource descriptions are identical; the only difference is the pointer address.
Actually, if statement 1 is not true, a layered texture may behave identically to an array of texture objects, probably with better performance, since creating many texture objects consumes more context resources.
The precision of texture interpolation is 8 bits, regardless of the coordinate range. Is it possible to get better precision?
The texture unit is a common and powerful approach in graphics, but sometimes we need more flexibility to better utilize it in scientific computing.
No. The internal computation uses a 1.8 fixed-point format as spelled out in an appendix to the CUDA Programming Guide. For many non-graphics tasks I would suggest doing your own interpolation based on fmaf(): https://devblogs.nvidia.com/lerp-faster-cuda/. Single precision FMAs are too cheap to meter!
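For reference, the fmaf()-based lerp from that post is essentially the following (a minimal sketch):

```cuda
// (1-t)*v0 + t*v1 expressed as two dependent FMAs, per the referenced post.
__device__ __forceinline__ float lerp(float v0, float v1, float t)
{
    return fmaf(t, v1, fmaf(-t, v0, v0));
}
```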
While textures were important for boosting performance in the past due to the way the cache hierarchy was implemented in early GPUs, I would claim that this is often no longer true today. In other words, just programming in a natural way, while paying attention to “const” and “restrict” modifiers is often fully sufficient. If you get desperate you might consider dropping in a few __ldg() intrinsics here and there where applicable. I would suggest giving that a try with a prototype implementation and then let the CUDA profiler guide you to further optimizations.
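For illustration, a trivial kernel written that way might look like this (kernel and parameter names are made up here):

```cuda
// const + __restrict__ on read-only inputs lets the compiler route loads
// through the read-only data cache on its own; __ldg() forces it explicitly.
__global__ void scaleKernel(const float * __restrict__ in,
                            float * __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // the compiler may emit LDG for this load by itself;
        // an explicit alternative would be: out[i] = 2.0f * __ldg(&in[i]);
        out[i] = 2.0f * in[i];
    }
}
```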
If you want to do 1D interpolation on 1D float vectors, I would use a 1D texture or surface. If the total extent of your vectors does not exceed the maximum 1D texture object dimension, it's not obvious to me why you could not use a single texture object for everything.
Having said that, all this effort might yield no perf benefit compared to what njuffa is suggesting.
Thanks njuffa and txbob! You two are always so kind to help~
Actually, doing a manual lerp was my first choice. The profiler told me the program was compute bound. I have a few other arithmetic operations besides the lerp, so I was wondering whether it would be beneficial to move the lerp to the texture unit and save more than ten instructions for address calculation and the lerp itself. Since my innermost loop has only roughly twenty to thirty instructions (partially unrolled, though), I was expecting some performance improvement, because the texture lerp has a dedicated unit that runs in parallel with the ALUs, provided I have enough threads to cover the texture read latencies. But now it seems non-trivial to achieve this.
Well, I don't think texture lerp makes sense for large 1D float vectors. The maximum extent for a 1D texture is 2^27, but a float only has 23 bits of mantissa precision, which can hardly locate the nearest element at that extent, let alone do the lerp. To me, it only makes sense when the texture is fetched with integer indices. What's more, the extent of my volume could be up to 1024^3, which is a million vectors of length 1024! That's why I'm looking at arrays of texture objects, or at using a 2D/3D layered texture but only doing the lerp along one axis. What bothers me here is that I think the wasted bandwidth will probably eat up the benefit gained from the instruction reduction. I don't quite know the extra cost of texture loads; that probably matters too.
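To illustrate the precision point with a toy example (not from my actual code): near index 2^26 the spacing between adjacent representable floats is already 8, so fractional coordinates there cannot even be expressed.

```cuda
#include <cstdio>

int main()
{
    float x = 67108864.0f;   // 2^26; the ULP of a float here is 8.0f
    float y = x + 3.5f;      // 67108867.5 is not representable as a float
    printf("%.1f\n", y);     // prints 67108864.0 -- the offset is lost
    return 0;
}
```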
Hmmm… That enlightens me~ I'll try to work with it, provided the texture really works fine for this case~
That means there must be a lot of computation other than the linear interpolation addressed here. Have you looked into how you might simplify that other computation? Is it floating-point computation that could benefit from -use_fast_math? Have you exhausted possible speed-ups from loop unrolling, function inlining, etc? Have you searched the Best Practices Guide for any items that might be applicable to your use case?
The other computations are quite simple, tens of ffma/fmul/fadd operations. I don't think there is much potential performance benefit there. FTZ seems to have a negligible effect in my case.
BTW, I now realize I may have missed one point here. I wrote a simple microbenchmark and found that the latency of texture reads does not seem to be reduced by cache hits, which makes the threads stall many more cycles before the arithmetic operations, compared with the direct memory access approach. I still need more meticulous tests to check whether this is really the case.
I assume you have already taken care to use FMA instead of individual FADD/FMUL as much as possible? The CUDA compiler will aggressively convert FMUL followed by dependent FADD into FMA, but it will not re-arrange the computation mathematically.
For example, it will not turn (1.0 - r) * a + r * b into fma(r, b, fma(-r, a, a)). This conservative approach preserves the intentions of programmers, who might have chosen a particular arrangement for its numerical properties, given that floating-point arithmetic is not associative. But it differs from the extensive re-association often used by host compilers, sometimes even by default.
Thanks for your suggestion~ But I'm afraid those techniques are hardly applicable here, except for the lerp itself you've mentioned.
Actually, doing the lerp is only a small portion of the interpolation. If I have to do a texture load "tex1D(texObj, pos)" manually, I need to split pos into its integral and fractional parts, compute the address from the integral part, then load two elements, and finally do the lerp. That's more than ten instructions (including some low-throughput instructions such as F2I, F2F, etc.). That's why I'm trying to reduce this part of the work.
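Roughly, the manual equivalent of one filtered tex1D fetch looks like this (a sketch; fetchLerp and vec are made-up names, and bounds handling is omitted):

```cuda
// Sketch of what one filtered tex1D(texObj, pos) fetch costs when done by hand.
__device__ float fetchLerp(const float * __restrict__ vec, float pos)
{
    float p    = pos - 0.5f;   // texel centers sit at i + 0.5f
    float ipos = truncf(p);    // integral part (F2F)
    float frac = p - ipos;     // fractional part
    int   i    = (int)ipos;    // F2I conversion for addressing
    float v0   = vec[i];       // two loads instead of one texture fetch
    float v1   = vec[i + 1];
    return fmaf(frac, v1, fmaf(-frac, v0, v0));   // two-FMA lerp
}
```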
BTW, after more meticulous tests, texture reads do have lower latency on a cache hit, but it is still much longer than a normal memory load (almost doubled?), so I don't see much benefit from using textures here. I will switch back to normal reads.
Yes, I am familiar with that kind of code. Since you seem to be tweaking for the last few percentage points of performance here, it would probably be worthwhile to check whether limiting the depth of dependency chains, increasing instruction parallelism, and mixing integer with floating-point work can buy you anything. The work you describe can usually be expressed in a variety of slightly different ways. Note that float<->int type conversion can sometimes be accomplished by FADD plus re-interpretation of bits, which is free on the GPU since integer and floating-point data is stored in the same registers.
Example 1: For 'float f', 0 <= f < 2**23, 'int(f)' can be computed as '__float_as_int (__fadd_rz (8388608.0f, f)) & 0x007fffff'.
Example 2: For 'float f, intg', 0 <= f < 2**23, the integral portion 'intg' of 'f' can be computed as 'intg=truncf(f)' or as 'intg=__fadd_rz(8388608.0f, f)-8388608.0f'
Example 3: For ‘float f, frac’, 0 <= f < 2**23, the fractional portion ‘frac’ of ‘f’ can be computed as ‘frac=f-truncf(f)’ or as ‘frac=f-(__fadd_rz(8388608.0f, f)-8388608.0f)’
If you need to add 0.5f to the integral portion to address a texture (to hit the center of the texel), you may be able to manually combine that addition with the subtraction in the conversion. If you need to scale 'f' prior to conversion or splitting, you can use __fmaf_rz() instead of __fadd_rz(), applying the scale factor without needing an additional instruction.
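Putting the above into code, a sketch of these helpers might look as follows (the function names are made up for illustration; all variants assume the argument, after any scaling, lies in [0, 2**23)):

```cuda
// Example 1: integer value of f via FADD.RZ plus bit reinterpretation.
__device__ __forceinline__ int float2int_lo23(float f)
{
    return __float_as_int(__fadd_rz(8388608.0f, f)) & 0x007fffff;
}

// Example 2: integral portion of f without a truncf().
__device__ __forceinline__ float trunc_lo23(float f)
{
    return __fadd_rz(8388608.0f, f) - 8388608.0f;
}

// Example 3: fractional portion of f.
__device__ __forceinline__ float fract_lo23(float f)
{
    return f - trunc_lo23(f);
}

// Scaling f prior to the split, folded into the FMA.RZ (assumes 0 <= f*scale < 2**23).
__device__ __forceinline__ float trunc_scaled_lo23(float f, float scale)
{
    return __fmaf_rz(f, scale, 8388608.0f) - 8388608.0f;
}
```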
It’s been almost ten years since I last went through that exercise using textures for underlying storage, but I recall differences +/-5% for the various flavors. This may have been due to compiler artifacts more than due to microarchitecture. I also recall doing some manual strength reduction as the compiler won’t do that for floating-point computation for the reasons I stated earlier.
Feel free to post the relevant block of code if you can/want.