Kernel launched in for loop with index offset gives incorrect result?

Trying to reduce simultaneous memory usage by splitting up a big kernel. Output is a float value per pixel coordinate.

This is the pre-split code, with correct result:

    dim3 dimBlock(16, 8);

    dim3 dimGrid((int)ceil(((float)w / (float)dimBlock.x)), (int)ceil(((float)h / (float)dimBlock.y)));
    kernel<<<dimGrid, dimBlock>>>(vars);

    ...
    
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    dostuff();
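(For reference, the float-based ceil in the grid computation above is equivalent to the usual integer ceiling-division idiom; a minimal sketch, with `ceilDiv` being a name of our choosing, not from the post:)

```cpp
#include <cassert>
#include <cmath>

// Integer ceiling division: equivalent to (int)ceil((float)a / (float)b)
// for positive a and b, without the float round trip.
static int ceilDiv(int a, int b) { return (a + b - 1) / b; }
```

With w = 250 and blockDim.x = 16 this yields 16 blocks, i.e. 256 threads along x, so up to 6 threads per row fall outside the image, which is why a bounds check before the per-pixel work is needed.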

This is the split code:

    dim3 dimBlock(16, 8);
    dim3 dimGrid(8, 16);
    int xrept = (int)ceil((float)w / 128);
    int yrept = (int)ceil((float)h / 128);
    for (int i = 0; i < xrept; i++)
    {
        for (int j = 0; j < yrept; j++)
        {
            kernel<<<dimGrid, dimBlock>>>(vars, i, j); // i = xoffset, j = yoffset
            cudaDeviceSynchronize();
        }
    }
...

    int x = blockIdx.x * blockDim.x + threadIdx.x + xoffset * 128;
    int y = blockIdx.y * blockDim.y + threadIdx.y + yoffset * 128;
    dostuff();

First 128 * 128 square is correct, but then it gets messed up. The general shape is still there, but previously smooth edges become gnarly.

What am I doing wrong here?

In both variants, insert a check before dostuff() that x and y are within the limits. The rounding could produce coordinates that are too large.
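To illustrate why the rounding matters: with xrept = ceil(w / 128.0), the last tile reaches past the image edge whenever w is not a multiple of 128. A small sketch (the function names are ours):

```cpp
#include <cassert>
#include <cmath>

// Highest x any launched thread computes in the split variant:
// the last tile ends at xrept * 128 - 1.
static int maxComputedX(int w) {
    int xrept = (int)std::ceil((float)w / 128.0f);
    return xrept * 128 - 1;
}

// The guard both variants need before dostuff():
static bool inBounds(int x, int y, int w, int h) {
    return x < w && y < h;
}
```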

Are the computations of the threads independent? If so, you do not need to synchronize in each iteration. If not, it could lead to errors elsewhere, even with the synchronization shown. Judging from the shown code, you could also move the loops inside the kernel.

But perhaps in the actual code you need the synchronization and the host-side loop to copy away the results or provide new input. As you said, you want to save memory. Or is it about temporary memory?

The check is already there. I’m also testing with a fixed 256 * 256 map.

Threads are independent, however synchronizing once after the loop is like 20x slower than synchronizing inside for some reason.

Moving the loops inside the kernel does not solve the simultaneous memory usage problem?

Is the limited memory for data input (from the host), temporary memory or data output (from the device)?

Do you get Cuda errors returned at any time for any call?

Temporary. I need an intermediate 256*256 matrix of values to calculate the output for one pixel.

No errors in this setup, but if I bump the map to around 300 * 300, in-kernel malloc fails and I get a 700 (an illegal address error). The goal is to support bigger maps, so is there some other way to do it?

If you move the loops inside the kernel, each thread can reuse its 256 x 256 memory for each loop iteration.

The memory would not any longer be bound to a pixel, but to a thread.

But it should not have an effect on correctness, only possibly on performance.

Do not use in-kernel malloc if it can be avoided (it is messy). Allocate at the beginning of the host code.
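A sketch of what host-side pre-allocation could look like: one flat scratch pool, sized for the number of concurrently active threads, with each thread indexing its own slice by global thread id. The sizes and names here are assumptions based on the 256 x 256, 4-byte figures from the discussion:

```cpp
#include <cassert>
#include <cstddef>

// Total bytes for one scratch slice per active thread; this is what a single
// host-side cudaMalloc would reserve instead of per-thread in-kernel malloc.
static std::size_t poolBytes(std::size_t numThreads, std::size_t tileW,
                             std::size_t tileH, std::size_t elemBytes) {
    return numThreads * tileW * tileH * elemBytes;
}

// Byte offset of thread `tid`'s slice; in the kernel, tid would be the
// global thread id, so slices never overlap.
static std::size_t sliceOffset(std::size_t tid, std::size_t tileW,
                               std::size_t tileH, std::size_t elemBytes) {
    return tid * tileW * tileH * elemBytes;
}
```

16384 threads with a 256 x 256 tile of 4-byte elements is exactly 4 GiB, matching the 4 GiB / 16384-thread figure mentioned in the discussion.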

If your data type is 4 bytes long, then e.g. 4 GiB would only allow a 128x128 tile of pixels in flight (128 * 128 * 256 * 256 * 4 bytes), i.e. 16384 active threads. That does not fill a modern GPU:
For maximum performance (memory size + memory bandwidth + computation), probably more than 1 thread, rather 32 or 256 or 1024, should cooperate to compute 1 pixel, if possible. Perhaps even use tensor cores, depending on the calculations.

Does this also happen if you skip the first square? Or is the third one then the gnarly one?

Does it depend on how the temporary memory is initialized?

Perhaps you can store intermediate results of a block (which you can analyze in detail), which succeeds in the first code and fails in the second; or step in with a Cuda debugger?

There could also be a bug or e.g. UB (undefined behavior), which only sometimes shows and which is also within the first code, just not showing.

It’s not easy to use multiple threads as every step uses every intermediate variable, but I’ll give it a go.

Output is wrong whenever any offset is not 0. If I cut it into say 256 * 128, the bottom half will still be wrong but in a slightly different shape. Also the only difference between threads is the x and y values, every other input is global and const.

Lemme try the multiple thread thing first. Thanks!

Except when loading or storing data linearly, Cuda is optimized for handling tens to hundreds of bytes per thread. In your case it is hundreds of kilobytes. You can be happy if one Cuda block (instead of one thread) handles 1 pixel at a time without exceeding the available fast memory.

There are two types of fast memory: shared memory and registers. Depending on your GPU, you have a maximum of 99 KiB of shared memory per block and a maximum of 1 KiB of register space per thread (with 256 threads per block as the maximum in this configuration). And normally (when it can easily be done, which is not the case here) one tries to stay at perhaps a quarter of those maximum sizes to keep everything flowing smoothly.

Which GPU do you use? Is the 256x256 memory with 4 bytes per element? Then you would need 256 KiB of temporary memory per pixel.

How often is each temporary element accessed per computation of a pixel?

Another possibility is to accept that each access goes through the L2 cache. Much slower, but larger. A few MiB size for the whole GPU.

Your algorithm is limited by the memory architecture (sizes + bandwidth), not by computational speed.

You will probably achieve between 100,000 and 10 million output pixels per second (if you need to iterate or do other complicated things in your doStuff, it will be much slower: roughly divide these numbers by the number of iterations), although the GPU could compute in the billions per second.

Temporary memory is [w * h] bools and [(w - 1) * (h - 1) / 4] ints. Splitting the map into 128*128 is enough memory for the use case, performance is not great but acceptable.

I think I’ll probably just iterate through pixels on host at this point.

Do the ints have to be 32 bits?

Why is the temporary memory per pixel dependent on the overall image size?

So your maximum (for use of fast shared memory) is probably around 256x256: 65536 bools need 8 KiB of memory (1 bit per bool), and 255 * 255 / 4 ints * 4 bytes/int ≈ 63.5 KiB.
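The arithmetic behind those numbers, as a quick check (bit-packed bools, 4-byte ints; the helper names are ours):

```cpp
#include <cassert>

// Bytes for w*h bools packed one per bit.
static int packedBoolBytes(int w, int h) { return (w * h + 7) / 8; }

// Bytes for the (w - 1) * (h - 1) / 4 ints at 4 bytes each.
static int intBytes(int w, int h) { return ((w - 1) * (h - 1) / 4) * 4; }
```

8192 bytes is the quoted 8 KiB, 65024 bytes is the quoted 63.5 KiB, and together they fit under the 99 KiB shared-memory limit mentioned above.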

So you can have up to 1024 threads for one pixel, which can help with loading from and storing to memory and otherwise do nothing, or let 1 thread do the computations.

Ok iterating on host made the whole thing 10x slower. Same for child kernels.

Do the ints have to be 32 bits?

16 bits is not big enough. 24 bits is but that feels like too much trouble. I have the option to substitute the ints with [w * h * 3] bools though.

Why is the temporary memory per pixel dependent on the overall image size?

Final value in a pixel is the sum of [w * h] bools divided by (w * h - c). Each bool is calculated based on the x and y values of the pixel, and may be flipped multiple times during the calculation.

The main reason I’m struggling with memory is that each int (which can be an individual thread) covers an unknown number of bools, with possible overlap, so I can’t really split these bools to fit into register or smem.
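For concreteness, the final value as described above might look like this (assumption: `c` is whatever count the denominator correction refers to; the thread does not say what it counts):

```cpp
#include <cassert>
#include <vector>

// Final per-pixel value: the sum of the w*h bools divided by (w*h - c).
static float pixelValue(const std::vector<bool>& flags, int c) {
    int sum = 0;
    for (bool b : flags) sum += b ? 1 : 0;
    return (float)sum / (float)((int)flags.size() - c);
}
```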

How large do you expect w and h to get?

Combining the bools is a fast operation, which can be parallelized.

So you have the fixed/constant x and y coordinates of an output pixel. From it you generate lots of bit flips and then you sum and normalize?

Could one (in theory, just so that I have understood the problem, but perhaps even as an implementation) generate from a pixel coordinate x, y a stream of bit flips, sort those by bit position, and then process each bit (odd or even number of flips)? Or do you need intermediate bit results to generate the bit flips? Where do the ints come into play?

Each int represents a line between two pixels. I need to project this line to the edge of the map and get an area defined by the four points (A, B as end points of the line, A’ projected from A, B’ projected from B), then calculate the bool values of each pixel position in this area.
Like so:

If I can get w and h to 500 it should be enough.

Could you explain the extent of what has to be done by the algorithm?

For each coordinate in w x h, the algorithm creates many lines A, B. Out of those lines you project A’, B’. This gives an area. Then you flip the bits under the area, and at the end sum up and normalize? And output the result at the coordinate from the beginning?

Is each coordinate one of A or B (e.g. A)? And the other B is a given list of points, with the same list for each A?

How many points B are there typically?

Step one is to iterate through all lines and determine which lines should be considered for the following steps. There are a total of [w * h * 3] lines (ready before the kernel launches), which is why the ints can be substituted with [w * h * 3] bools.

Each A consistently has 3 B’s (horizontal, diagonal and vertical). Only some of the lines qualify, and [w * h / 4] is enough for my data.

Sum and normalization of the value is done for the whole map, after all the lines are processed. Since the projection areas of different lines overlap, a value can be flipped multiple times.

Like I said, performance is already acceptable. I’ll be happy if I can just artificially limit the number of threads running concurrently (without affecting the result). Like I can push it up a little bit by forcing sm_52.

Usually one would store the intermediate data in the following way:

Instead of bools for the area, use the 4 coordinates of the corner points. Testing whether a point is inside the rectangle is a computation the GPU can do quickly. And you can have any resolution without needing more memory.

To determine the result of the flipping, stream the same rectangles (given as their corner coordinates) to many threads. Each thread is responsible for one output point.

Afterwards the threads combine their flipping data within a so-called reduction to sum the bits up.
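A host-side sketch of that scheme (not the poster's code; the names and the convex-quad containment test are our assumptions): each output point streams through the same list of quads A, B, B’, A’, flips once per covering quad, and the surviving bit is the parity of the flip count. A reduction would then sum these bits across points.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct P { float x, y; };

// Sign of the cross product (b-a) x (p-a): which side of edge a->b p is on.
static float side(P a, P b, P p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Point-in-convex-quad test: p is inside iff it lies on the same side of all
// four edges (corners given in winding order, e.g. A, B, B', A').
static bool inside(const std::array<P, 4>& q, P p) {
    bool pos = false, neg = false;
    for (int i = 0; i < 4; i++) {
        float s = side(q[i], q[(i + 1) % 4], p);
        if (s > 0) pos = true;
        if (s < 0) neg = true;
    }
    return !(pos && neg);
}

// For one output point: each covering quad flips the bit once, so the bit
// ends up set iff the flip count is odd.
static bool flippedOdd(const std::vector<std::array<P, 4>>& quads, P p) {
    int flips = 0;
    for (const auto& q : quads)
        if (inside(q, p)) flips++;
    return (flips % 2) != 0;
}

// A unit square in winding order, for quick checks.
static std::array<P, 4> unitSquare() {
    return {{{0, 0}, {1, 0}, {1, 1}, {0, 1}}};
}
```

Because the quads are only a few coordinates each, they stream well through many threads, which is the point of the suggestion above.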

The area is not stored, I’m already using the 4 points. The bools are just there to receive the result.

The result is not uniform, so I’m not understanding how I’m to use a stream to do it. It might look something like:

If the bools are not stored as intermediate value, perhaps we are talking about the same thing (which you also implemented)?

I assume each output point is calculated from the same rectangles A, B, A’, B’? With the exception that the point is only repeatedly flipped for rectangles, where the output point lies within the rectangle?

And x and y in your code in the beginning is the x and y coordinate of the output pixel?

What part of the algorithm is also done by the kernel?

The creation (projection) of the rectangles?
The determination whether an output point lies within?
The flipping?
The storing of the result?

If the second code with offset and host side loops gave wrong results, one of the steps in the kernel did not work correctly.

Ok I’m an idiot. Apparently the values aren’t reset to 0 when device memory is freed and reallocated. Thanks lol