In my CUDA device code, I have to perform integer division. I wonder if it is possible to replace it with floating-point division. For example, replace the function
__device__ int div(int x, int y) {
return x / y;
}
by
__device__ int div(int x, int y) {
return (int)((float)x / (float)y);
}
The reason is that I learned from the NVIDIA forum that integer division needs hundreds of clock cycles, while floating-point division needs only about ten. Another reason is that my kernel faces high register pressure. I know that integer division expands into multiple instructions, and I suspect it may use a lot of registers. But I wonder whether the conversion between integer and float is efficient.
Some additional information: both x and y in my program are positive integers and are not very large (no more than 1e5), so I suspect single-precision floating point has enough precision. Also, y is not a compile-time constant, so the compiler cannot optimize the integer division. Finally, my kernel is compute-bound (not memory-bound) and the integer division sits inside a for loop, so I think it is important to make it faster.
I don’t think that reflects the situation accurately. I assume we are talking 32-bit integer division here? Which GPU architecture?
NVIDIA GPUs have no hardware support for division. All division operations are emulated, and there are multiple versions of integer division and multiple versions of floating-point division. The emulation sequences eat up instructions and registers for temporary variables.
If you want to know whether replacing integer division with floating-point division is beneficial to performance in your use case, I would suggest simply trying it. That should be a 5-minute experiment, given that there is a single division in the kernel that needs replacing.
As your range of possible divisors y is limited, you might create a table of inverses stored in fixed-point precision and replace the division with a table lookup followed by a multiplication. You will probably end up having to do 64-bit multiplications, but that is still fairly fast.
Alternatively, store the inverses as FP32 or FP64 constants and do the multiplication in floating point, casting back to int for the result.
The inverse table may fit into constant memory, which offers cached access.
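To make the fixed-point idea a bit more concrete, here is a rough sketch (untested; the names MAX_VAL, RECIP_SHIFT, d_recip_table, init_recip_table, and udiv_table are made up for the example). It stores m = ceil(2^40 / y) for every possible divisor and turns each division into one 64-bit multiply plus a shift; the 2^40 scale should be enough headroom for operands below 1e5. With 1e5 entries of 8 bytes the full table is too large for the 64 KB of constant memory, so the sketch keeps it in global memory; a small divisor set could go into a __constant__ array instead.

#define MAX_VAL     100000   // upper bound on x and y, per the question
#define RECIP_SHIFT 40       // fixed-point scale, chosen so the truncated result matches x / y for x, y < 1e5

// table of m[y] = ceil(2^RECIP_SHIFT / y), filled once from the host
__device__ unsigned long long d_recip_table[MAX_VAL];

// host-side setup, call once before launching kernels that use the table
void init_recip_table (void)
{
    static unsigned long long h_tab[MAX_VAL];
    h_tab[0] = 0; // divisor 0 is invalid and never looked up
    for (unsigned long long y = 1; y < MAX_VAL; y++) {
        h_tab[y] = ((1ULL << RECIP_SHIFT) + y - 1) / y; // ceil(2^40 / y)
    }
    cudaMemcpyToSymbol (d_recip_table, h_tab, sizeof (h_tab));
}

// x / y via table lookup and 64-bit multiply, for 0 <= x < 1e5 and 1 <= y < 1e5
__device__ unsigned int udiv_table (unsigned int x, unsigned int y)
{
    return (unsigned int)((x * d_recip_table[y]) >> RECIP_SHIFT);
}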
Here is a special division function for unsigned integers less than 100,000 called udiv_lt_100000, together with the test code that checks its correctness exhaustively. I added a define USE_OWN_CVT to turn on float/int and int/float conversions that bypass the GPU conversion instructions. Whether that is faster likely depends on GPU architecture.
#include <stdio.h>
#include <stdlib.h>

#define LEN     (100000)
#define THREADS (128)
#define USE_OWN_CVT (0)

// Macro to catch CUDA errors in CUDA runtime calls
#define CUDA_SAFE_CALL(call)                                          \
do {                                                                  \
    cudaError_t err = call;                                           \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)

// Macro to catch CUDA errors in kernel launches
#define CHECK_LAUNCH_ERROR()                                          \
do {                                                                  \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                 __FILE__, __LINE__, cudaGetErrorString(err) );       \
        exit(EXIT_FAILURE);                                           \
    }                                                                 \
} while (0)
__forceinline__ __device__ float rcp_approx_gpu (float divisor)
{
    float r;
    asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
    return r;
}

/* integer division for dividend and divisor < 100,000 */
__device__ unsigned int udiv_lt_100000 (unsigned int x, unsigned int y)
{
    const float magic = 1.2582912e+7f; // 0x1.8p+23
    const unsigned int magic_i = __float_as_int (magic);
#if USE_OWN_CVT
    float divisor  = __int_as_float (magic_i | y) - magic;
    float dividend = __int_as_float (magic_i | x) - magic;
#else // USE_OWN_CVT
    float divisor  = (float)y;
    float dividend = (float)x;
#endif // USE_OWN_CVT
    float t = rcp_approx_gpu (divisor);
    t = __int_as_float (__float_as_int (t) + 1); // bump reciprocal up so quotient is never too small
    t = t * dividend;
#if USE_OWN_CVT
    t = __fadd_rz (t, magic);
    unsigned int q = __float_as_int (t) & 0x1ffff;
#else // USE_OWN_CVT
    unsigned int q = (unsigned int)t;
#endif // USE_OWN_CVT
    return q;
}
__global__ void divtest (unsigned int *result, int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    for (int i = tid; i < len; i += stride) {
        unsigned int dividend = i;
        unsigned int flag = 0;
        for (unsigned int divisor = 1; divisor < LEN; divisor++) {
            unsigned int ref = dividend / divisor;
            unsigned int res = udiv_lt_100000 (dividend, divisor);
            flag += res != ref;
        }
        result [i] = flag;
    }
}

int main (void)
{
    dim3 dimBlock(THREADS);
    int threadBlocks = (LEN + (dimBlock.x - 1)) / dimBlock.x;
    dim3 dimGrid(threadBlocks);
    unsigned int *r = 0, *d_r = 0;

    r = (unsigned int *) malloc (sizeof (r[0]) * LEN);
    CUDA_SAFE_CALL (cudaMalloc((void**)&d_r, sizeof(d_r[0]) * LEN));
    CUDA_SAFE_CALL (cudaMemset(d_r, 0xff, sizeof(d_r[0]) * LEN)); // all ones
    divtest<<<dimGrid,dimBlock>>>(d_r, LEN);
    CHECK_LAUNCH_ERROR();
    CUDA_SAFE_CALL (cudaMemcpy (r, d_r, sizeof (r[0]) * LEN, cudaMemcpyDeviceToHost));
    for (int i = 0; i < LEN; i++) {
        if (r[i] == 0xffffffff) {
            printf ("counter not written i=%d\n", i);
            return EXIT_FAILURE;
        }
        if (r[i] != 0) {
            printf ("division failures i=%d r=%u\n", i, r[i]);
            return EXIT_FAILURE;
        }
    }
    CUDA_SAFE_CALL (cudaFree (d_r));
    free (r);
    return EXIT_SUCCESS;
}
Thanks a lot! Yes, as written in the two functions, the type is int (32-bit). The GPU is an RTX 3090 (Ampere). It seems that using floating-point division indeed reduces the register usage.
Generally an unsigned integer division will be slightly faster than a signed integer division of the same width. There is additional overhead to get the sign handling correct.
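Since your x and y are positive, the cheapest experiment is to route the existing division through unsigned types so the compiler emits the unsigned sequence; a minimal sketch (the function name is made up):

__device__ int div_pos (int x, int y)
{
    // x and y are known to be positive, so the cheaper unsigned division
    // sequence can be used; the result is unchanged for x, y > 0
    return (int)((unsigned int)x / (unsigned int)y);
}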
For Ampere, the unsigned 32-bit integer division seems to be an inlined sequence of 17 instructions of which 7 are IMAD. This seems to eat up 9 registers (versus the 3 that would be required if there were a hardware division instruction). The fast path of the float division is a called subroutine of ten instructions, plus CAL/RET makes 12 instructions. It appears to use 7 registers. Using it for integer division adds three more conversion instructions for a total of 15.
Therefore, on Ampere, using the floating-point division improves throughput by an estimated 25% and saves two registers compared with the unsigned int division. My special udiv_lt_100000() division routine requires ten inlined instructions when compiled with USE_OWN_CVT = 1 and needs only 5 registers. It should double the throughput compared to the built-in unsigned int division (estimate) and reduce register usage by 4.
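If you want to check the register usage of your own kernel, nvcc -Xptxas -v reports it at compile time, and you can also query it at run time through the runtime API. A minimal sketch, using the divtest kernel from the listing above as a stand-in for your kernel:

// print how many registers per thread the compiler assigned to a kernel
void report_regs (void)
{
    cudaFuncAttributes attr;
    CUDA_SAFE_CALL (cudaFuncGetAttributes (&attr, (const void*)divtest));
    printf ("divtest uses %d registers per thread\n", attr.numRegs);
}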
Thanks so much! Now I have a clear picture about the speed of different situations.
I just figured out one additional thing: for my kernel, the divisor in the for loop stays the same (although it differs between kernel launches). So in this case, is there any trick that can speed up the calculation further? (For example, @cbuchner1 mentioned that we can pre-calculate the inverse of the current divisor, but I am not clear on how that compares with the floating-point operations in your implementation.) Thank you again!
If there is more than one instance of the same divisor used in the kernel, the compiler should be able to exploit that through common subexpression elimination (CSE), provided the division code is inlined. Since 32-bit integer division is inlined code for Ampere from what I can see, you should be good. What, if any, optimizations the compiler can perform across a function boundary in the case of the ‘float’ subroutine, I do not know.
Computing the reciprocal will use the MUFU.RCP instruction no matter whether it is an integer or a floating-point division. It is basically a table lookup plus interpolation on an internal ROM table, so my expectation is that any manual table lookup to replace it would provide only a small improvement. The easiest way to find out is to implement variants and then profile.
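For example, a hoisted-reciprocal variant could look like the sketch below (my illustration; the kernel and array names are made up). It reuses the one-ulp bump from udiv_lt_100000 so that the truncated result should still match the integer quotient for your value ranges, but it is worth verifying exhaustively, as in the test program above, and profiling against the plain division.

// divisor is the same for every loop iteration, so compute its (slightly
// bumped) reciprocal once per thread and replace each division by a multiply;
// assumes 0 < vals[i] < 1e5 and 0 < divisor < 1e5, as stated in the question
__global__ void scale_by_divisor (const int *vals, int *quot, int n, int divisor)
{
    float r = __int_as_float (__float_as_int (1.0f / (float)divisor) + 1);
    int stride = gridDim.x * blockDim.x;
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n; i += stride) {
        quot[i] = (int)((float)vals[i] * r);  // was: vals[i] / divisor
    }
}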
If you are willing to learn to read assembly language, you can use cuobjdump --dump-sass to see what the compiler transforms your code into. Constant propagation and CSE are some of the bread & butter optimizations the CUDA compiler will pursue aggressively.
I would have thought that for a reused floating point divisor, computing the reciprocal and then converting the divisions to multiplications would be an obvious win. (…hasty benchmarking…) But sure enough, it isn’t. Nice work, compiler!
The way the CUDA toolchain has matured, it is usually a good idea to write CUDA code in a straightforward, natural manner. Only after finding proof that the compiler is doing something sub-optimal should one engage in manual source code re-writing. What compilers in general cannot do is optimize based on numeric range limitations (as is the case here). In most programming languages such constraints cannot even be expressed.
The mature CUDA compiler is capable of transforming source code so extensively that on more than one occasion I have had serious trouble following the resulting SASS and identifying bits and pieces of the computation I specified at HLL level, even though I have lots of experience reading SASS.