Int Array initialization

thyandrecardoso · December 30, 2010, 7:01pm

I need to initialize an array with the max integer value. I tried doing this with cudaMemset, but it was too slow. I read somewhere that for large arrays a CUDA kernel that initializes each element is faster. However, it is still too slow.

I’m trying to initialize an int array with around 400 000 elements, and the kernel takes about 2 ms (on a notebook with a GeForce 330M).

__global__
void kernel(bounding_box* b_box, int* depthArray, int value){
    unsigned int width = b_box->getWidth();
    unsigned int height = b_box->getHeight();
    unsigned int bx = blockIdx.x;
    unsigned int by = blockIdx.y;
    int v = value;

    // matrix indexes
    unsigned int line = by * blockDim.y + threadIdx.y;
    unsigned int column = bx * blockDim.x + threadIdx.x;

    if(column >= width || line >= height){
        return;
    }

    int index = line * b_box->getWidth() + column;
    int* ptr = (int*)(depthArray + (index * MAX_DEPTH));

    for(int i = 0; i < MAX_DEPTH; ++i){
        ptr[i] = v;
    }
}

Each thread sets 10 positions (MAX_DEPTH = 10).

Am I missing something? Is there some approach that I’m not considering?

Thanks,

André

tera · December 30, 2010, 8:25pm

Your kernel is extremely inefficient because its memory accesses cannot be coalesced. Rearrange the loop so that consecutive threads access consecutive elements of depthArray.
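One way to read this advice is a grid-stride fill, sketched below. It is not code from the thread: the kernel name fillInt, the helper fillAndCountMismatches, the launch size, and the array length are all assumptions chosen for illustration. In every loop iteration, adjacent threads write adjacent ints, which is the access pattern that can be coalesced.

```cuda
#include <climits>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the thread): grid-stride fill.
// In each iteration thread t writes element i, thread t+1 writes i+1, and so
// on, so neighbouring threads always touch neighbouring ints.
__global__ void fillInt(int* data, size_t n, int value) {
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] = value;
    }
}

// Hypothetical helper: fills n ints on the device, copies them back, and
// returns how many do not hold `value` (or -1 if a CUDA call fails).
long fillAndCountMismatches(size_t n, int value) {
    int* d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return -1;

    const int threads = 256;  // assumed block size
    const int blocks = 64;    // assumed grid size; the loop covers the rest
    fillInt<<<blocks, threads>>>(d_data, n, value);

    // The blocking copy also surfaces any error raised by the kernel.
    std::vector<int> h_data(n);
    cudaError_t err = cudaMemcpy(h_data.data(), d_data, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    if (err != cudaSuccess) return -1;

    long mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (h_data[i] != value) ++mismatches;
    }
    return mismatches;
}

int main() {
    const size_t n = 4000000;  // assumed size, for illustration only
    printf("mismatches: %ld\n", fillAndCountMismatches(n, INT_MAX));
    return 0;
}
```

With this layout the array is treated as one flat buffer, so the per-pixel MAX_DEPTH loop from the original kernel disappears; that only works because every element receives the same value.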

thyandrecardoso · December 30, 2010, 8:52pm

I thought that was only a problem when reading memory!?

tera · December 31, 2010, 1:18am

It applies to writing as well. Reading is just more common.

thyandrecardoso · January 10, 2011, 2:25pm

I changed my kernel so that each thread sets only one value:

__global__
void kernel(bounding_box* b_box, int* depthArray, int value){
    int index = blockDim.x * blockIdx.x + threadIdx.x;

    if(index >= ((b_box->getSize()) * MAX_DEPTH))
        return;

    depthArray[index] = value;
    return;
}

The kernel invocation is done using the following dimensions:

    int totalSize = h_bound_box->getSize() * MAX_DEPTH;
    dim3 dimBlock(512, 1);
    // round up with integer arithmetic so the last partial block is launched
    dim3 dimGrid((totalSize + dimBlock.x - 1) / dimBlock.x, 1);

I think now the threads are accessing consecutive memory locations.

I must correct the number I gave in the first post: it’s not around 2 ms, but actually around 5 ms! The modification above got that number down to around 3 ms, which is already a good boost.

But still, I’m finding it too slow!

Can anyone give me some advice?

Thank you,

André
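The thread does not say how the 5 ms and 3 ms figures were measured. A minimal sketch of timing only the kernel with CUDA events follows; the kernel name setOnePerThread, the helper timeFillMs, and the array length are assumptions, and the kernel takes the element count directly instead of a bounding_box pointer. A warm-up launch is included so that one-time startup cost is not counted.

```cuda
#include <climits>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the one-value-per-thread kernel in the post.
__global__ void setOnePerThread(int* data, int n, int value) {
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < n) data[index] = value;
}

// Hypothetical helper: returns the kernel time in milliseconds as measured
// by CUDA events, or -1 if a CUDA call fails.
float timeFillMs(int n) {
    int* d_data = nullptr;
    if (cudaMalloc(&d_data, (size_t)n * sizeof(int)) != cudaSuccess) return -1.0f;

    const int threads = 512;
    const int blocks = (n + threads - 1) / threads;  // integer round-up

    // Warm-up launch, not timed.
    setOnePerThread<<<blocks, threads>>>(d_data, n, INT_MAX);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    setOnePerThread<<<blocks, threads>>>(d_data, n, INT_MAX);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and stop event finish

    float ms = -1.0f;
    if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) ms = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return ms;
}

int main() {
    const int n = 4000000;  // assumed size, for illustration only
    float ms = timeFillMs(n);
    if (ms < 0.0f) {
        printf("CUDA error\n");
        return 1;
    }
    // bytes written / seconds, expressed in GB/s
    double gbPerSec = (double)n * sizeof(int) / (ms * 1e6);
    printf("kernel time: %.3f ms, effective write bandwidth: %.2f GB/s\n", ms, gbPerSec);
    return 0;
}
```

Comparing the printed bandwidth with the device's peak memory bandwidth gives a rough idea of how much room is left before the fill becomes memory-bound.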

wlangdon · March 18, 2011, 11:36am

Oops, wrong forum. Replaced by a post to the CUDA development forum.

Bill
