Rearranging CUDA array

Hi guys,

I am currently writing code in CUDA that rearranges a large 1D array.
This is part of a bigger program in which we stream data from a camera straight into the GPU, do the data processing on the GPU, and then save the processed data. This step is necessary since the camera produces 8 GB per second, which makes saving the raw data unattractive.

The issue is that the raw data (from the camera) is scrambled, and the rows have to be rearranged according to a certain pattern before it can be used.

Currently I have three different pieces of code to do so (actual code included below):
  • A CUDA kernel that can handle about 1 gigabyte per second of image data:

__global__ void rearrangetosave(void* arr1, void* arr2, void* vector1, size_t y_height, size_t x_width) {
    uint16_t* intArray = (uint16_t*)arr1;
    uint16_t* intArray2 = (uint16_t*)arr2;
    uint16_t* vector11 = (uint16_t*)vector1;
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // grid-stride loop over output rows; each thread copies one row pixel by pixel
    for (int i = index; i < y_height; i += stride) {
        int v11_idx = vector11[i];
        int src_base_idx = x_width * v11_idx;
        int dst_base_idx = x_width * i;
        for (int j = 0; j < x_width; j++) {
            intArray2[dst_base_idx + j] = intArray[src_base_idx + j];
        }
    }
}
  • A version that uses cudaMemcpy and can handle about 1.4 gigabytes per second of image data:
for (int m = 0; m < microscopeheight; m++) {
    int start = microscopewidth * vectoryoumade[m];
    int startArranged = m * microscopewidth;
    // copy one row device-to-device; "scrambled_triggeron" stands in for the
    // source device buffer (its name was lost in the original snippet)
    cudaMemcpy((uint16_t*)unpacked_triggeron + startArranged,
               (uint16_t*)scrambled_triggeron + start,
               microscopewidth * sizeof(uint16_t), cudaMemcpyDeviceToDevice);
}
  • A loop that uses std::copy on the CPU, which can handle about 5.6 gigabytes of image data per second:
for (int m = 0; m < microscopeheight; m++) {
    int start = microscopewidth * vectoryoumade[m];
    int end = start + microscopewidth;
    int startArranged = m * microscopewidth;
    std::copy(arraytorearrange.begin() + start, arraytorearrange.begin() + end, rearrangedarray.begin() + startArranged);
}

The problem I am facing is that we would prefer to keep the data analysis on the GPU, as the next step is a Fourier transform. However, our current code is too slow at the rearranging step.
My question is: how can I either make CUDA copy an entire row in the kernel, instead of going data point by data point? Or how can I multithread the process more efficiently, and increase the speed that way?
Currently, I can only multithread one of the two loops without running into trouble.

Additionally, this is how the vector is produced; the images used are 1984 by 1984 pixels.

  std::vector<uint16_t> vector;
  std::vector<uint16_t> vector2;
  int my_array[] = { 2,18,6,22,3,19,7,23,10,26,14,30,11,27,15,31 };
  int my_array2[] = { 0,16,4,20,1,17,5,21,8,24,12,28,9,25,13,29 };
  for (int N = 0; N < microscopeheight / 32; N++) {
      for (int i = 0; i < 16; ++i) {
          vector.push_back(N * 32 + my_array[i]);
          vector2.push_back(N * 32 + my_array2[i]);
      }
  }
  std::reverse(vector2.begin(), vector2.end());
  vector.insert(vector.end(), vector2.begin(), vector2.end());

Please use proper markup for code; it becomes unreadable otherwise. For markup of code blocks, select the relevant section, then click the </> button at the top of the editor input box.

What GPU hardware do you use? 1000 GB/s is 1 TB/sec. Only the highest end GPUs provide that kind of memory bandwidth. Assuming the number is correct, where is the problem if the raw data rate of the camera is 8 GB/s?

The use of uint16_t in the code suggests each pixel comprises 2 bytes, so this would correspond to about 7.5 MB per image. Since the camera streams data at 8 GB/s, this means the GPU is processing around 1,000 images per second, correct?

Two general thoughts on bulk memory copies:

(1) Generally speaking, when data throughput is critical, wider accesses are preferred over narrower accesses. Is there any reason you could not use one of CUDA’s vector types, such as ushort8, to process multiple pixels at a time? (An illustrative sketch follows after point (2) below.)

(2) Generally speaking, pure bulk memory copies that do not include any computational processing are wasteful and best avoided, on any processing platform. It is better to incorporate data movement into computational steps that need to happen anyhow (this may slightly complicate the code). Is there no chance to do that here?
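
To illustrate point (1): below is an untested sketch of your copy kernel using wide accesses. It uses ushort4 (4 pixels, i.e. 8 bytes, per access), which is a standard CUDA vector type; it assumes x_width is divisible by 4 (true for 1984) and that the buffers come from cudaMalloc, which keeps every row 8-byte aligned.

__global__ void rearrange_vec4(void* arr1, void* arr2, void* vector1,
                               size_t y_height, size_t x_width) {
    const ushort4* src = (const ushort4*)arr1;   // 4 pixels per element
    ushort4* dst = (ushort4*)arr2;
    uint16_t* vector11 = (uint16_t*)vector1;
    size_t vec_width = x_width / 4;              // row length in ushort4 units
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < y_height; i += stride) {
        size_t src_base = vec_width * vector11[i];
        size_t dst_base = vec_width * (size_t)i;
        for (size_t j = 0; j < vec_width; j++) {
            dst[dst_base + j] = src[src_base + j];   // one 8-byte load/store instead of four 2-byte ones
        }
    }
}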

How is that relevant here? Is the camera data delivered to the host system and then copied down to the GPU, or is it delivered straight to the GPU? If it is the latter, any processing of image data on the host would require copying between host and GPU, which is not particularly fast if this is a discrete GPU connected via PCIe.

FWIW, a transfer rate of 5600 GB/sec for a 7.5 MB image suggests you are measuring copy throughput inside the CPU’s L3 cache. Typical copy throughput of system memory even of high-end CPUs would not exceed a few hundreds of GB/s. Given the streaming nature of the use case, how likely is it that data would actually already be resident in L3 at the image scrambling step? Unless there is re-use from other code, it seems that would never be the case, that is, processing would have to rely on the host system memory bandwidth instead.

Apologies, I have altered the first post.
And yes, I miscalculated by a factor of a thousand; the numbers should be correct now.

I am using an NVIDIA RTX A4000 GPU.

(1) Yes, that would be a good idea, and it definitely is possible. The issue is, and this ties into the second point, that the next step in the code is an FFT, and I don’t know how to feed a ushort8 to a CUDA FFT plan.

(2) Well, ideally speaking, I would combine it with the FFT I want to do afterwards, and tell the CUDA code that pixel[x, y] should actually be read as pixel[x, vector[y]].
Unfortunately I do not know how to implement this, but if it is possible, that would by far be the best solution to my issue.
Currently I am transforming the data from the uint16_t format into a cufftReal array, so that it can be sent into the CUDA FFT at the same time. However, I could speed up transforming the data from uint16_t to cufftReal significantly if I did not have to rearrange it.

Would it be possible to tell the cuFFT plan to read the data differently, and forgo the entire rearrangement? It already has to do something similar computationally to be able to perform the FFT itself.

This is how I currently copy the data into a cufftReal array:

__global__ void rearrangetosave(void* arr1, cufftReal* arr2, void* vector1, size_t y_height, size_t x_width) {
    uint16_t* intArray = (uint16_t*)arr1;
    uint16_t* vector11 = (uint16_t*)vector1;
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < y_height; i += stride) {
        int v11_idx = vector11[i];
        int src_base_idx = x_width * v11_idx;
        int dst_base_idx = x_width * i;
        for (int j = 0; j < x_width; j++) {
            arr2[dst_base_idx + j] = intArray[src_base_idx + j];   // implicit uint16_t -> cufftReal conversion
        }
    }
}

As long as alignment requirements are met, one can convert between unsigned short * and ushort8 * as required by the local code. On GPUs, objects of size N bytes must be aligned to an N-byte boundary (so-called natural alignment). So your copy code can treat the data as ushort8 while the FFT sees unsigned short data.

I have zero experience with CUFFT. I think it offers programmer-defined callbacks, but I do not know whether these can be used to change the data access pattern. My guess is that this is not the case. FFTs are generally constrained by bandwidth, not computation, so the FFT code is presumably written for maximum memory throughput. But it would not hurt to take a look at the documentation.
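
If the callback mechanism does allow it, the general shape (based on the cufftXtSetCallback API with a CUFFT_CB_LD_REAL load callback) would be roughly as follows. This is an untested sketch; the RemapInfo struct is made up for illustration, callbacks require the static cuFFT library with relocatable device code, and the per-element division may cost some performance, so verify the details against the documentation:

#include <cufftXt.h>

// Hypothetical callerInfo struct describing the scrambled frame.
struct RemapInfo {
    const uint16_t* raw;     // scrambled camera frame (device pointer)
    const uint16_t* rowmap;  // the index vector ("vectoryoumade")
    size_t width;            // row length in pixels, e.g. 1984
};

// Load callback: cuFFT requests input element `offset`; we remap the row
// through the index vector and convert uint16_t to cufftReal on the fly.
__device__ cufftReal remap_load(void* dataIn, size_t offset,
                                void* callerInfo, void* sharedPtr) {
    const RemapInfo* info = (const RemapInfo*)callerInfo;
    size_t row = offset / info->width;
    size_t col = offset % info->width;
    return (cufftReal)info->raw[(size_t)info->rowmap[row] * info->width + col];
}

__device__ cufftCallbackLoadR d_remap_load_ptr = remap_load;

// Host side: fetch the device function pointer and attach it to the plan.
cufftCallbackLoadR h_remap_load_ptr;
cudaMemcpyFromSymbol(&h_remap_load_ptr, d_remap_load_ptr, sizeof(h_remap_load_ptr));
cufftXtSetCallback(plan, (void**)&h_remap_load_ptr, CUFFT_CB_LD_REAL,
                   (void**)&d_remap_info);   // d_remap_info: device pointer to a RemapInfo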

Re-arranging memory layout through index vectors really is not a good fit for the memory subsystem of a GPU, which prefers linear access across threads (sometimes called the “base plus thread-index” access pattern). Why does the camera provide frames in such an inconvenient format? Is this something you can change or configure? Or is this simply a third-party device to which your code must adapt?

The RTX A4000 has a theoretical memory bandwidth of 448 GB/sec according to the TechPowerUp database. Realistically achievable bandwidth should be about 85% of that, or 380 GB/sec. Even with access using one level of indirection through an index vector, I have a hard time seeing how this results in an effective bandwidth of 1.4 GB/sec, or half a percent; it seems way too low to me. What is the total number of threads in the kernel configuration for the copy kernel?

In the meantime, you may want to take a closer look with the CUDA profilers (Nsight Systems and Nsight Compute).

> As long as alignment requirements are met, one can convert between unsigned short * and ushort8 * as required by the local code. On GPUs, objects of size N bytes must be aligned to an N-byte boundary (so-called natural alignment). So your copy code can treat the data as ushort8 while the FFT sees unsigned short data.

I see; it most definitely is a good thing to implement. The issue is that cuFFT only takes cufftComplex or cufftReal, so a conversion into one of those would still be necessary.
However, and this brings me back to the initial problem, I can’t multithread the rearrangement properly. As seen in the code snippets above, I can only multithread the process in the x or y direction and still get a proper image out. And here is the reason that the process is so slow: the GPU can handle 1024 threads per block and can launch about 2 billion blocks in the x dimension.
However, I can only use 1984 of these and still get a proper image out, because I am not able to index the kernel properly otherwise.

But here is the thing: I cannot get the data out of the camera properly, due to hardware constraints in the camera. The supplier (Emergentvision) provided code to rearrange the data, but it only works on the CPU, at slower rates.
And it has to be possible to rearrange the images on the GPU at a high rate; otherwise there would never be a market for high-speed cameras that stream directly into CUDA-capable GPU buffers.

There has to be a way to rearrange these images correctly; otherwise they would have adjusted the hardware to not send the images in a scrambled format. The only reason I can think of why they did not is that the rearrangement is trivial for someone who knows CUDA far better than I do, and that a higher frame rate can be achieved this way.

Below are the maximum numbers of blocks and threads on the GPU, which would explain why I am only using a fraction of its capabilities.

> Max threads per block: 1024
> Max threads in X dimension: 1024
> Max threads in Y dimension: 1024
> Max threads in Z dimension: 64
> Max number of blocks in X dimension: 2147483647
> Max number of blocks in Y dimension: 65535
> Max number of blocks in Z dimension: 65535

It is not going to help you now, but it may be time for some pushback on the good folks who design the camera hardware.

War story: I used to be involved in the design of x86 processors. On one such project, one of the hardware engineers (more senior than me) declared a particular issue “too difficult and resource intensive to solve in hardware”, pushing to have it addressed in microcode. I happened to be the guy responsible for the microcode.

Despite lengthy and intense efforts on my part, the microcoded solution had poor performance. At design review time, the VP in charge of the project inquired about the lackluster performance of this portion of the processor design, and I explained why things were slow using microcode. VP decided that the hardware engineer should take another look at the hardware. Within less than 24 hours he came back with a triumphant “I found a very simple solution!”

This taught me a valuable lesson: one has to push back on the HW folks, provided one has some leverage; otherwise the old “we’ll address this later in software” tactic looks way too appealing.

BTW, the code snippets in your post are still not properly marked up as code, which means neither I nor various other forum participants will feel inclined to look at them. Use the </> button (pre-formatted text) to mark them up, not the " button (block quote).

Thanks for letting me know; I have adjusted the markup of the code snippets now.

But yes, I do think it is time to ask the camera manufacturer about this issue first.
It is mostly an indexing issue after all, because I also have to unpack the pixels from a packed format into 16-bit data points. (The camera also has the option to send the data in 16-bit chunks, but at a lower frame rate.)
And that process should be far more computationally intensive, yet it takes roughly 5% of the time the rearrangement kernel takes.
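
(For reference, unpacking of that kind can be sketched as follows. This is a hypothetical sketch assuming a Mono12p-style packing with two pixels per three bytes; the camera’s actual format may differ.)

// Hypothetical 12-bit unpack kernel (assumes Mono12p-style packing:
// two pixels in three bytes, low pixel first; actual format may differ).
__global__ void unpack12(const uint8_t* __restrict__ packed,
                         uint16_t* __restrict__ out, size_t n_pixels) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;     // pixel-pair index
    if (i < n_pixels / 2) {
        const uint8_t* p = packed + 3 * i;
        out[2 * i]     = (uint16_t)(p[0] | ((p[1] & 0x0F) << 8)); // low 12 bits
        out[2 * i + 1] = (uint16_t)((p[1] >> 4) | (p[2] << 4));   // high 12 bits
    }
}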

In your kernel code there are two issues which lead to poor performance: not enough parallelism and a poor memory access pattern.

For coalesced memory access, adjacent threads should access adjacent memory locations.
=> use int index = blockIdx.x * blockDim.x + threadIdx.x; for the x-dimension of the image, not y-dimension.

For a greater degree of parallelism, launch more blocks. For example, use 1 thread per pixel.

Below is an untested kernel written in the browser to give you an idea.

__global__ void kernel(void* arr1, void* arr2, void* vector1, size_t y_height, size_t x_width) {
    uint16_t* intArray = (uint16_t*)arr1;
    uint16_t* intArray2 = (uint16_t*)arr2;
    uint16_t* vector11 = (uint16_t*)vector1;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    
    for(int outputRow = blockIdx.y; outputRow < y_height; outputRow += gridDim.y){

        int v11_idx = vector11[outputRow];
        int src_base_idx = x_width * v11_idx;
        int dst_base_idx = x_width * outputRow;
        for (int column = tid; column < x_width; column += stride) {
            intArray2[dst_base_idx + column] = intArray[src_base_idx + column];
        }
    }
}

//calling the kernel
dim3 blocksize(128,1,1);
int numBlocksX = (x_width + 128 - 1) / 128;
int numBlocksY = y_height;
dim3 gridsize(numBlocksX, numBlocksY, 1);
kernel<<<gridsize, blocksize>>>(...)
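
Since you ultimately want cufftReal input for the FFT, the same pattern can also fuse the uint16_t-to-cufftReal conversion into the rearranging copy (equally untested):

// Same grid/block configuration as above, but writing cufftReal directly,
// fusing the uint16_t -> float conversion into the rearranging copy.
__global__ void kernelToReal(void* arr1, cufftReal* out, void* vector1,
                             size_t y_height, size_t x_width) {
    uint16_t* intArray = (uint16_t*)arr1;
    uint16_t* vector11 = (uint16_t*)vector1;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int outputRow = blockIdx.y; outputRow < y_height; outputRow += gridDim.y) {
        int src_base_idx = x_width * vector11[outputRow];
        int dst_base_idx = x_width * outputRow;
        for (int column = tid; column < x_width; column += stride) {
            out[dst_base_idx + column] = (cufftReal)intArray[src_base_idx + column];
        }
    }
}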

There also exists cub::DeviceMemcpy::Batched to perform copies of multiple memory ranges given pairs of src and dst pointers. cub::DeviceMemcpy — cub 2.5 documentation

With the help of thrust’s transform iterator and constant iterator it should be straightforward to set this up.
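
A rough sketch of what that could look like (untested; the functor names are mine):

#include <cstdint>
#include <cub/device/device_memcpy.cuh>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Functors producing the m-th source/destination row pointer on the fly.
struct SrcRow {
    const uint16_t* src; const uint16_t* rowmap; size_t width;
    __host__ __device__ const void* operator()(int m) const {
        return src + (size_t)rowmap[m] * width;   // only ever evaluated on the device
    }
};
struct DstRow {
    uint16_t* dst; size_t width;
    __host__ __device__ void* operator()(int m) const {
        return dst + (size_t)m * width;
    }
};

// d_src, d_dst, d_rowmap are device pointers; width/height in pixels.
void rearrange_batched(const uint16_t* d_src, uint16_t* d_dst,
                       const uint16_t* d_rowmap, size_t width, size_t height) {
    auto srcs  = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                 SrcRow{d_src, d_rowmap, width});
    auto dsts  = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                 DstRow{d_dst, width});
    auto sizes = thrust::make_constant_iterator(width * sizeof(uint16_t)); // bytes per row

    // usual two-phase CUB pattern: size query, then the actual batched copy
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceMemcpy::Batched(d_temp, temp_bytes, srcs, dsts, sizes, height);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceMemcpy::Batched(d_temp, temp_bytes, srcs, dsts, sizes, height);
    cudaFree(d_temp);
}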

cuFFTDx allows the FFT to be fused with other operations, including custom ways of reading and writing the data: https://developer.nvidia.com/cufft#section-cufftdx

Always think in warps of 32 threads. They should access memory that is contiguous (and aligned to 128-byte boundaries). Even the index vector (vector11) should be accessed this way. You can use shared memory or shuffle operations to cache data.
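
For example, in the kernel above the per-row index load could be done once per warp and broadcast with a shuffle (untested sketch; note that the per-thread load of the same address is usually already served as a broadcast from cache, so measure whether this helps):

// Variant of the copy kernel above where only lane 0 of each warp loads the
// row index from the index vector and broadcasts it to the other 31 lanes.
__global__ void kernel_shfl(void* arr1, void* arr2, void* vector1,
                            size_t y_height, size_t x_width) {
    uint16_t* intArray = (uint16_t*)arr1;
    uint16_t* intArray2 = (uint16_t*)arr2;
    uint16_t* vector11 = (uint16_t*)vector1;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int outputRow = blockIdx.y; outputRow < y_height; outputRow += gridDim.y) {
        int v11_idx = 0;
        if ((threadIdx.x & 31) == 0) {
            v11_idx = vector11[outputRow];                 // one load per warp...
        }
        v11_idx = __shfl_sync(0xffffffff, v11_idx, 0);     // ...broadcast to all lanes
        int src_base_idx = x_width * v11_idx;
        int dst_base_idx = x_width * outputRow;
        for (int column = tid; column < x_width; column += stride) {
            intArray2[dst_base_idx + column] = intArray[src_base_idx + column];
        }
    }
}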