Int Array initialization

I need to initialize an array with the maximum integer value. I tried doing this with cudaMemset, but it was too slow. I read somewhere that for large arrays a CUDA kernel that initializes each element is faster. However, it is still too slow.
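A side note on the cudaMemset attempt: cudaMemset fills memory byte by byte, like std::memset on the host, so it can only produce int values whose four bytes are all identical. The maximum int (0x7FFFFFFF) is not such a value. The host-only sketch below illustrates this with std::memset as a stand-in; the function names are made up for the illustration and it assumes a 4-byte int.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstring>
#include <vector>

// The value the thread wants to store in every element.
static const int kMaxInt = INT_MAX;

// Host-side stand-in for cudaMemset: every BYTE of the buffer gets byte_value.
std::vector<int> fill_bytewise(std::size_t n, int byte_value) {
    assert(byte_value >= 0 && byte_value <= 0xFF);
    std::vector<int> a(n);
    std::memset(a.data(), byte_value, n * sizeof(int));
    return a;
}

// True only if all bytes of `value` are equal, i.e. a byte-wise fill
// could produce it.
bool memset_can_produce(int value) {
    unsigned char bytes[sizeof(int)];
    std::memcpy(bytes, &value, sizeof(int));
    for (std::size_t i = 1; i < sizeof(int); ++i) {
        if (bytes[i] != bytes[0]) return false;
    }
    return true;
}
```

Filling with the byte 0x7F gives 0x7F7F7F7F in every element, not the maximum int, so an element-wise fill (a kernel, for example) is needed for that value anyway.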

I’m trying to initialize an int array with around 400 000 elements, and the kernel is taking about 2 ms (on a notebook with a GeForce 330M).


__global__ void kernel(bounding_box* b_box, int* depthArray, int value){
    unsigned int width = b_box->getWidth();
    unsigned int height = b_box->getHeight();
    unsigned int bx = blockIdx.x;
    unsigned int by = blockIdx.y;
    int v = value;

    // matrix indexes
    unsigned int line = by * blockDim.y + threadIdx.y;
    unsigned int column = bx * blockDim.x + threadIdx.x;

    // threads outside the bounding box have nothing to do
    if(column >= width || line >= height){
        return;
    }

    int index = line * width + column;
    int* ptr = depthArray + (index * MAX_DEPTH);

    // each thread fills its own run of MAX_DEPTH consecutive elements
    for(int i = 0; i < MAX_DEPTH; ++i){
        ptr[i] = v;
    }
}

Each thread sets about 10 positions (MAX_DEPTH = 10)

Am I missing something? Is there some approach that I’m not considering?



Your kernel is extremely inefficient as the memory accesses cannot be coalesced. Rearrange the loop so that consecutive threads access consecutive elements of depthArray.

I thought that was only a problem when reading memory!?

It applies to writing as well. Reading is just more common.
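To make the access pattern concrete, here is a small host-only sketch of the index arithmetic. The function names are invented for the illustration. It compares which element thread tid touches at loop step i when each thread owns a run of MAX_DEPTH elements, versus a rearranged loop in which neighbouring threads touch neighbouring elements.

```cpp
#include <cassert>
#include <cstddef>

// Element written by thread `tid` at loop step `i` when each thread owns a
// contiguous run of `max_depth` elements (the layout of the first kernel).
std::size_t per_thread_run_index(std::size_t tid, std::size_t i,
                                 std::size_t max_depth) {
    assert(max_depth > 0);
    return tid * max_depth + i;
}

// Element written by thread `tid` at loop step `i` when the loop is
// rearranged so that neighbouring threads write neighbouring elements.
std::size_t interleaved_index(std::size_t tid, std::size_t i,
                              std::size_t num_threads) {
    assert(num_threads > 0);
    return i * num_threads + tid;
}
```

With MAX_DEPTH = 10, threads 0 and 1 write 10 elements apart at every step in the first layout, but write adjacent elements in the second. Since every element receives the same value, it does not matter which thread writes which element, only that all of them get covered.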

I changed my kernel so that each thread sets only one value:


__global__ void kernel(bounding_box* b_box, int* depthArray, int value){
    int index = blockDim.x * blockIdx.x + threadIdx.x;

    // threads past the end of the array have nothing to do
    if(index >= ((b_box->getSize()) * MAX_DEPTH))
        return;

    depthArray[index] = value;
}


The kernel invocation is done using the following dimensions:

    int totalSize = h_bound_box->getSize() * MAX_DEPTH;
    dim3 dimBlock(512, 1);
    // round up in integer arithmetic so the last partial block is launched
    dim3 dimGrid((totalSize + dimBlock.x - 1) / dimBlock.x, 1);
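A note on the grid size: if the block count comes from an integer division, the fraction is already gone before any later cast to float or call to ceil can see it, so the last partial block is never launched. The host-only sketch below (function names are made up for the illustration) compares that pattern with integer round-up division.

```cpp
#include <cassert>
#include <cmath>

// Mirrors the pattern ceil((float)(total / block)): the integer division
// truncates first, so ceil() receives a whole number and changes nothing.
int blocks_truncated(int total, int block) {
    assert(block > 0);
    return static_cast<int>(std::ceil(static_cast<float>(total / block)));
}

// Integer round-up division: the smallest count with count * block >= total.
int blocks_rounded_up(int total, int block) {
    assert(block > 0 && total >= 0);
    return (total + block - 1) / block;
}
```

For 400 000 elements and 512 threads per block, the truncated form gives 781 blocks, which cover only 399 872 elements and leave the last 128 untouched. The rounded-up form gives 782 blocks, and the bounds check in the kernel discards the surplus threads.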

I think now the threads are accessing consecutive memory locations.

I must correct the number I gave in the first post: it’s not around 2 ms, but actually around 5 ms! The modification above got that number down to around 3 ms, which was already a good boost!
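To put those timings in perspective, they can be converted into an effective write bandwidth. This is only a rough sketch: it assumes the array is the 400 000 four-byte ints mentioned in the first post, and the function name is invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in MB/s (1 MB = 1e6 bytes) for writing `elements`
// values of `bytes_per_element` bytes each in `seconds`.
double effective_bandwidth_mb_per_s(std::size_t elements,
                                    std::size_t bytes_per_element,
                                    double seconds) {
    assert(seconds > 0.0);
    const double bytes = static_cast<double>(elements) *
                         static_cast<double>(bytes_per_element);
    return bytes / seconds / 1e6;
}
```

Under that assumption, 5 ms corresponds to about 320 MB/s and 3 ms to about 533 MB/s. Comparing these figures with the memory bandwidth listed for the card gives an idea of how much headroom is left.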

But still, I’m finding it too slow!

Can anyone give me some advice?

Thank you,


Oops, wrong forum. Replaced by a post in the CUDA development forum.