Hi guys,
I am currently writing CUDA code that rearranges a large 1D array.
This is part of a bigger program in which we stream data from a camera straight into the GPU, process it there, and then save the processed data. This step is necessary because the camera produces 8 GB per second, which makes saving the raw data not a favourable option.
The issue is that the raw data (from the camera) is scrambled, and the rows have to be rearranged according to a certain pattern before it can be used.
Currently I have three different versions that do so (actual code included below):
- A CUDA kernel that can handle about 1 gigabyte per second of image data:
__global__ void rearrangetosave(void* arr1, void* arr2, void* vector1, size_t y_height, size_t x_width) {
    uint16_t* intArray  = (uint16_t*)arr1;    // scrambled source image
    uint16_t* intArray2 = (uint16_t*)arr2;    // rearranged destination image
    uint16_t* vector11  = (uint16_t*)vector1; // row lookup table: source row for each destination row
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // grid-stride loop over rows: each thread copies whole rows, one element at a time
    for (int i = index; i < y_height; i += stride) {
        int v11_idx = vector11[i];
        int src_base_idx = x_width * v11_idx;
        int dst_base_idx = x_width * i;
        for (int j = 0; j < x_width; j++) {
            intArray2[dst_base_idx + j] = intArray[src_base_idx + j];
        }
    }
}
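For reference, this is roughly how a kernel like this gets launched; the block size and buffer names below are placeholders rather than my exact setup:

// placeholder launch: d_src, d_dst and d_rowmap are device buffers allocated with cudaMalloc,
// and d_rowmap already holds the row-index vector; 1984 x 1984 is the image size we use
int blockSize = 256;                                 // assumed block size
int numBlocks = (1984 + blockSize - 1) / blockSize;  // roughly one thread per destination row
rearrangetosave<<<numBlocks, blockSize>>>(d_src, d_dst, d_rowmap, 1984, 1984);
cudaDeviceSynchronize();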
- This version uses one cudaMemcpy per row and can handle about 1.4 gigabytes per second of image data:
for (int m = 0; m < microscopeheight; m++) {
    int start = microscopewidth * vectoryoumade[m];  // first element of the source row
    int startArranged = m * microscopewidth;         // first element of the destination row
    // scrambled_device_data is the scrambled source buffer on the GPU (placeholder name)
    cudaMemcpy((uint16_t*)unpacked_triggeron + startArranged,
               (uint16_t*)scrambled_device_data + start,
               microscopewidth * sizeof(uint16_t),
               cudaMemcpyDeviceToDevice);
}
- A loop that uses std::copy on the CPU and can handle about 5.6 gigabytes of image data per second:
for (int m = 0; m < microscopeheight; m++) {
    int start = microscopewidth * vectoryoumade[m];  // first element of the source row
    int end = start + microscopewidth;               // one past the last element of the source row
    int startArranged = m * microscopewidth;         // first element of the destination row
    std::copy(arraytorearrange.begin() + start, arraytorearrange.begin() + end, rearrangedarray.begin() + startArranged);
}
The problem I am facing is that we would prefer to keep the data analysis on the GPU, as the next step is a Fourier transform. However, our current GPU versions are too slow at the rearranging step.
My question is: how can I make CUDA copy an entire row inside the kernel, instead of going data point by data point? Or how can I multithread the process more efficiently and increase the speed that way?
Currently, I can only multithread one of the two loops without running into trouble.
Additionally, this is how the row-index vector is produced; the images used are 1984 by 1984 pixels.
std::vector<uint16_t> vectoryoumade;
std::vector<uint16_t> vectoryoumade2;
int my_array[]  = { 2,18,6,22,3,19,7,23,10,26,14,30,11,27,15,31 };
int my_array2[] = { 0,16,4,20,1,17,5,21,8,24,12,28,9,25,13,29 };
// build the row lookup table in blocks of 32 rows
for (int N = 0; N < microscopeheight / 32; N++) {
    for (int i = 0; i < 16; ++i) {
        vectoryoumade.push_back(N * 32 + my_array[i]);
        vectoryoumade2.push_back(N * 32 + my_array2[i]);
    }
}
std::reverse(vectoryoumade2.begin(), vectoryoumade2.end());
vectoryoumade.insert(vectoryoumade.end(), vectoryoumade2.begin(), vectoryoumade2.end());
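Since the kernel reads the row-index vector from device memory, it has to be uploaded to the GPU first. A minimal sketch of that step, with d_rowmap as a placeholder name for the device pointer:

// placeholder sketch: upload the row lookup table to the GPU so the kernel can read it
uint16_t* d_rowmap = nullptr;
size_t rowmap_bytes = vectoryoumade.size() * sizeof(uint16_t);
cudaMalloc((void**)&d_rowmap, rowmap_bytes);
cudaMemcpy(d_rowmap, vectoryoumade.data(), rowmap_bytes, cudaMemcpyHostToDevice);
// d_rowmap is then passed as the vector1 argument of rearrangetosave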