Need help writing a CUDA kernel for image shifting

Consider the following C code:

void Shift(uint8_t * const dst, uint8_t const * const src, const int xs, const int ys, const int dx, const int dy)
	for (int y = 0; y < ys; y++) {
		for (int x = 0; x < xs; x++) {
			for (int c = 0; c < 3; c++) {
				int y2 = (y + dy + ys) % ys;
				int x2 = (x + dx + xs) % xs;
				dst[(y * xs + x) * 3 + c] = src[(y2 * xs + x2) * 3 + c];

In this code src is the source RGBI image, dst is the destination RGBI image.

Source image dimensions are xs (width) and ys (height) specified in pixels, and they aren’t necessarily power of two or divisible by 8. Function is called like this:

for (int dy = 0; dy < 8; dy++) {
	for (int dx = 0; dx < 8; dx++) {
		Shift(dst, src, xs, ys, dx, dy);

I did write some simpler kernels before, but I must admit I haven’t the slightest idea how grids, blocks, and threads would play into this kind of code.

Perhaps NPP already has a function that can do this and I am trying to reinvent the wheel?

Any help would be greatly appreciated.

I wonder what the purpose of adding ys before doing the modulo by ys is. It seems to me that

(x+a)%a == x%a

But maybe there is some corner case I’m not thinking about.

Writing a CUDA kernel for this shouldn’t be too difficult to get something that is reasonably performant.

For a 2D operation, its a common practice to create a 2D grid of 2D threadblocks, and use canonical 2D index generation in kernel code. The body of the kernel then is created from the body of those 2 for loops. Given that you have a loop in c which would break coalescing, we can further “parallelize” the horizontal dimension.

__global__ void kShift(uint8_t * const dst, uint8_t const * const src, const int xs, const int ys, const int dx, const int dy){

  int ky = blockIdx.y*blockDim.y+threadIdx.y;  // canonical 2D index generation
  int kx = blockIdx.x*blockDim.x+threadIdx.x;
  int y = ky;
  int x = kx/3
  int c = kx - (x*3);
  if ((y < ys) && (x < xs)){
    int y2 = (y + dy + ys) % ys;   // the body of your for-loops
    int x2 = (x + dx + xs) % xs;
    dst[(y * xs + x) * 3 + c] = src[(y2 * xs + x2) * 3 + c];}

For your kernel launch, size your grid like this:

dim3 block(32,32);
dim3 grid(((xs*3)+block.x-1)/block.x, (ys+block.y-1)/block.y);

coded in browser, not tested

@Robert_Crovella Thanks for the code example. I will definitely test it.

Can you clarify what should be the number of threads for the kernel launch, and why you picked 32x32 as a block size?

That is provided for in the block and grid variables I gave. You use them like this:

mykernel<<<grid, block>>>(...);

If its still unclear, I refer you to this CUDA training series

32 threads in the x dimension is useful to promote coalesced access. 32 threads in the y dimension is not critical, you could choose another number like 16, which might actually be faster on some GPUs. If its still unclear, I refer you to the previously mentioned training series, and there are also numerous questions already on various forums covering various aspects of threadblock sizing.

Regarding your question on NPP, there are functions in NPP to copy images or 2D regions, but I don’t know of any that have this modulo/wrap behavior. They may exist of course. NPP is a fairly large API and I don’t have it all memorized. There is documentation available.

In examples and docs I saw that in <<<n, m>>> syntax m denotes threads, and it was usually named accordingly so I didn’t immediately connect the name block you provided with that second argument.

Thanks for the documentation links, I did already read some of it but there is still a lot of new stuff and concepts to absorb.

@Robert_Crovella I forgot to answer this, sorry:

The purpose is for situations where dx and dy are negative. Since I can’t edit the original post for some reason, the calling code should look like this:

for (int dy = 0; dy < 8; dy++) {
	for (int dx = 0; dx < 8; dx++) {
		Shift(dst, src, xs, ys, dx, dy);
+		Shift(dst2, dst, xs, ys, -dx, -dy);

Would the code example you wrote still work correctly in such a case?

Yes. I hadn’t changed that arithmetic at all.

Thank you Robert, the code is working. I marked it as a solution.

Indeed. In (modern) C++, the modulus of a negative number by a positive number is negative, or zero. TIL. (Some humor.)

@Robert_Crovella I must admit I also wasn’t aware of that peculiar bit of C++ standard, thanks for linking to it and I am totally stealing that curse – it is almost as good as the one “I hope you step on a Lego” (bare feet are implied).

While somewhat humorous, the linked posting is presenting a skewed history.

Long before there was an x86 architecture, the division instructions in IBM’s System/360 and DEC’s VAX were already defined in the way that makes the most sense in hardware: (Signed) integer division truncates the quotient (that is, rounds towards zero) and therefore the remainder returned by the division instruction always has the same sign as the dividend.

When division instructions were added to microprocessors later, both Intel’s x86 line and Motorola’s 68K line adopted these pre-existing semantics. There is an old adage in engineering (and banking) that it is more important to be consistent (e.g. arrangement of keys on keyboard) than “correct”, and that may have influenced this decision . Note that the x86 architecture back then was just one architecture of many.

The original ANSI C standard from 1989 left the rounding mode of integer division open, requiring only that (a/b)*b+a%b == a. Thus the sign of the result of the % operator was left open as well. This suggests that there must have been hardware architectures that chose a rounding mode other than truncation for integer division; however, I am not aware of any such architectures. In ISO-C99 the rounding mode for integer division was fixed to truncation, and as a consequence the sign of the of the % operator matches the dividend. C++ simply inherited arithmetic semantics from C, again for reasons of consistency.

Mathematicians have been moaning for decades about the dominant hardware convention, as it does not match the mathematical concept of modulo, which is always positive.

@njuffa I mostly agree with you.

However, if the C/C++ language modulo definition matched mathematical definition, and if C/C++ compilers implemented modulo operation suboptimaly using several CPU instructions instead of doing 1-to-1 mapping to incorrect hardware implementation, then as the usage of correct software implementation increased, the CPU vendors would have been forced to introduce new instruction that does said operation correctly and efficiently in hardware.

I think that is quite doubtful. As GPUs (and new CPU architectures like RISC-V) demonstrate, the decades-old desire to keep the hardware as simple as possible continues and arguably has even intensified. For example, GPUs do not even have any division instructions: both integer and floating-point divisions are implemented purely in software. Truncation is the simplest possible rounding mode, one cannot make hardware any simpler than that.

In other words, where mathematical modulo is desired, software folks are (and have always been) free to implement it in software. Maybe a modulo capability will eventually be added to C and/or C++ as a standard feature, although I have not yet seen such a proposal among the many that are submitted to the relevant ISO committees. Why this did not happen during the genesis of the C language is clear: The language was supposed to provide a thin shell around the hardware capabilities of CPUs at the time, in particular DEC’s popular PDP-11 line.

#include <stdio.h>
#include <stdlib.h>

/* compute mathematical modulus based on Euclidean division */
__device__ int math_modulus (int dividend, int divisor)
    int res, rem;
    rem = dividend % divisor;
    res = (rem < 0) ? ((divisor > 0) ? (rem + divisor) : (rem - divisor)) : rem;
    return res;

__global__ void kernel (int a, int b)
    printf ("% d mod % d = % d\n", a, b, math_modulus (a, b));

int main (void)
    kernel<<<1,1>>>(-5, 3);
    kernel<<<1,1>>>( 5, 3);
    kernel<<<1,1>>>( 5,-3);
    kernel<<<1,1>>>(-3, 5);
    kernel<<<1,1>>>( 3, 5);
    kernel<<<1,1>>>( 3,-5);
    return EXIT_SUCCESS;

@njuffa I disagree about CPUs, Intel and AMD have added quite a few specialized instructions in the last two decades or so.

For example, people were counting number of set bits (pseudo-code):

Count = 0;
For (i=0; i < OperandSize; i++)
{ IF (SRC[ i] = 1) // i’th bit
    THEN Count++; FI;
DEST ← Count;

That has been a single CPU instruction (POPCNT) for quite a while now, and the same goes for many other (packing, unpacking, string comparison, etc) operations which can now be expressed with a single CPU instruction.

If you are working with SIMD, there are even scatter / gather instructions which take indices into an array from a SIMD register and coalesce reads into (or writes from) another SIMD register. I personally suggested those to Intel CPU engineers long time ago at one of the IDF conferences to accelerate various 3D calculations which require noncontiguous memory access.

Before those were added you had to read individual noncontiguous float or DWORD values from memory and then pack them into a SIMD register using several instructions and loops using such memory access pattern were problematic for auto-vectorizer. First implementation did not have better performance, but it made the assembler code more readable and made it easier for their C++ compiler to vectorize code. Subsequent implementations in new CPUs were refined and now those single scatter / gather instructions are faster.

Therefore, I believe that adding proper mathematical modulo would have been a priority if it was standardized and done in software and later recognized as inefficient and impactful.

As for GPUs having division instruction or not, it probably depends on how you define “instruction” – PTX documentation clearly lists div as an instruction.

You can argue that the PTX instruction is virtual and gets translated to actual code run by the GPU, but so are pretty much all x86 CPU instructions – they are decoded from their opcode representation into several architectural uOPs (some of which may require microcode assist to execute) so for all intents and purposes, div on x86 is also virtual and implemented in software which executes on the underlying RISC architecture itself.

It seems we’ll have to agree to disagree, as our design philosophies seem to be quite different.

When we are talking x86, we are talking about a 45 year old architecture that started as CISC and to which multiple kitchen sinks have been added since. I have lost track of how many instructions there are, I think in excess of a 1000 by now.

This is insanity, with various non-orthogonalities at every new stage of development, and I am saying that as someone who worked on multiple x86 processors (the K6-line and the Athlon processors at AMD; I was one of the designers of the 3DNow! SIMD instruction set). I know how x86 instructions are mapped to internal RISC-like ops via hardware or microcode having designed some of those internal ops and having written some of that microcode.

When creating new architectures, the continued trend is towards simple designs, even though there are aberrations like expansive SIMD instruction sets, which I consider an architectural dead end. The people who continue to expand these are on the wrong track, while NVIDIA is on the right track with SIMT (IMHO; obviously I am strongly biased because I worked on and with CUDA at NVIDIA for nine years since its inception in July of 2005)

Simpler architectures may contain a few specialized instructions where this is worthwhile, like POPCNT. And yes, I am fully aware how to compute that in software; faster than what you have shown. But that is not fast enough for people who really need that functionality.

The ISAs of NVIDIA’s GPUs are particularly simple 32-bit designs with 64-bit addressing extensions, and no, the myriad emulated instructions exposed at the PTX level do not count.


I’d say that non-orthogonality, which is not exclusive to instruction sets but also exists in APIs and frameworks (see for example native WinAPI .vs. .Net), is the result of new developers not having a full grasp of the ISA/API they are trying to extend, not a conscious decision to break it.

Regarding expansive SIMD instruction sets, I wrote my fair share of hand-optimized SIMD code from MMX to AVX2 and I believe they have their use and that data level parallelism should be exploited in addition to other techniques such as blocking and thread level parallelism. That said, I believe they are already at their apex, and that adding more specialized SIMD instructions would yield diminishing returns.

On the other hand, every scientist and student has a CPU, while not every scientist and student have a (CUDA compatible, powerful, and affordable enough) GPU. That means that they can easily benefit from speeding up their calculations using SIMD and threading (even by just using auto-vectorization and auto-parallelization of modern compilers without requiring SIMD domain-specific knowledge), while they need considerable investment to be able to do so with a GPU. When considering what I just wrote, please also consider 3rd world countries with low salaries and low tech availability, not just the USA.

As for NVIDIA GPU ISA being RISC and exposing myriad of emulated instructions via PTX versus x86 CPU ISA being RISC (for at least the last decade if not longer) and exposing myriad of emulated instructions via x86 assembly – frankly I don’t see a meaningful distinction so unless you can demonstrate it I will have to agree to disagree.

Finally, this snark:

And yes, I am fully aware how to compute that in software; faster than what you have shown

was totally uncalled for, given that I clearly labeled the example as pseudo-code (which I copied verbatim from the instruction manual), not to mention that both clang and gcc will detect popcnt idiom and replace it with POPCNT instruction on your behalf so there’s no need to boast how you can beat the pseudo-code.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.