I need to initialize an array with the maximum integer value. I tried doing this with cudaMemset, but it was too slow. I read somewhere that, for large arrays, a CUDA kernel that initializes each element is faster. However, it is still too slow.
I'm trying to initialize an int array with around 400,000 elements, and the kernel takes about 2 ms (on a notebook with a GeForce 330M).
__global__
void kernel(bounding_box* b_box, int* depthArray, int value){
    unsigned int width = b_box->getWidth();
    unsigned int height = b_box->getHeight();
    unsigned int bx = blockIdx.x;
    unsigned int by = blockIdx.y;
    int v = value;
    // matrix indexes
    unsigned int line = by * blockDim.y + threadIdx.y;
    unsigned int column = bx * blockDim.x + threadIdx.x;
    // skip threads that fall outside the bounding box
    if(column >= width || line >= height){
        return;
    }
    int index = line * width + column;
    // each thread owns a contiguous run of MAX_DEPTH ints
    int* ptr = depthArray + (index * MAX_DEPTH);
    for(int i = 0; i < MAX_DEPTH; ++i){
        ptr[i] = v;
    }
}
Each thread sets 10 positions (MAX_DEPTH = 10).
Am I missing something? Is there an approach that I'm not considering?
Your kernel is extremely inefficient because its memory accesses cannot be coalesced. Rearrange the loop so that consecutive threads access consecutive elements of depthArray.
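To make that concrete: in the posted kernel, on loop iteration i, thread t writes element t * MAX_DEPTH + i, so neighbouring threads are MAX_DEPTH ints apart and their writes cannot be combined. Below is a minimal sketch of one way to rearrange it, assuming depthArray can simply be treated as a flat array of width * height * MAX_DEPTH ints. The names fillKernel and fillOnDevice, the block size of 256 and the cap of 1024 blocks are illustrative choices, not taken from the original code.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <climits>
#include <vector>

// Grid-stride fill: on every loop iteration, thread k of a warp writes
// element (base + k), so neighbouring threads touch neighbouring ints.
__global__ void fillKernel(int* data, size_t count, int value)
{
    size_t idx    = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = idx; i < count; i += stride) {
        data[i] = value;
    }
}

// Hypothetical host helper: fills a device buffer with `value` and copies it
// back so the result can be inspected. Returns an empty vector on any error.
std::vector<int> fillOnDevice(size_t count, int value)
{
    std::vector<int> host(count);
    if (count == 0) return host;

    int* dev = nullptr;
    if (cudaMalloc(&dev, count * sizeof(int)) != cudaSuccess) return {};

    const int threads = 256;  // assumed block size, tune for your GPU
    const size_t wanted = (count + threads - 1) / threads;
    const int blocks = (int)std::min<size_t>(wanted, 1024);  // the loop covers the rest

    fillKernel<<<blocks, threads>>>(dev, count, value);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying
        err = cudaMemcpy(host.data(), dev, count * sizeof(int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    if (err != cudaSuccess) return {};
    return host;
}
```

For the array in the question, count would be width * height * MAX_DEPTH and value would be INT_MAX. The copy back to the host is only there to check the result, so it should be left out when timing the kernel itself.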
I think the threads are now accessing consecutive memory locations.
I must correct the number I gave in my first post: it's not around 2 ms, but actually around 5 ms! The modification above got that number down to around 3 ms, which is already a good boost.