So if I have a 1000x1000 array on the GPU, how do I use the thread indices to access each element? Currently I access an element with "arrayName[threadIdx.x + blockIdx.x]"; however, with an array this large I can't launch that many threads. What is the workaround for this? Do I have to divide it into sub-matrices?
// Host side: a square grid of 192-thread blocks, enough to cover DATA_W*DATA_H elements
const dim3 dimBlock(192);
int dim = ceil(sqrt((float)(DATA_W*DATA_H)/192.0f));
const dim3 dimGrid(dim, dim);

// Device side: flatten the 2D grid into one global index, then split it into (x, y)
const int idx = (blockIdx.y*blockDim.x*gridDim.x) + blockIdx.x*blockDim.x + threadIdx.x;
const int y = idx / DATA_W;
const int x = idx - y*DATA_W;
...
result[y * DATA_W + x] = (float)sum;
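To make that concrete, here is a minimal sketch of a complete kernel using that indexing scheme. Since the grid is rounded up to cover DATA_W*DATA_H elements, the last block usually has spare threads, so the bounds check matters. The kernel name, the device pointers, and the doubling operation are placeholders, not anything from the code above:

__global__ void scaleKernel(float *result, const float *input, int dataW, int dataH)
{
    // Flatten the 2D grid of 1D blocks into one global element index.
    const int idx = (blockIdx.y * blockDim.x * gridDim.x)
                  + blockIdx.x * blockDim.x + threadIdx.x;

    // The grid was rounded up, so some threads land past the end; skip them.
    if (idx >= dataW * dataH)
        return;

    const int y = idx / dataW;
    const int x = idx - y * dataW;

    // Placeholder work: double each element.
    result[y * dataW + x] = 2.0f * input[y * dataW + x];
}

Launched with the dimGrid/dimBlock above as scaleKernel<<<dimGrid, dimBlock>>>(d_result, d_input, DATA_W, DATA_H), this touches every element exactly once, no matter how DATA_W*DATA_H relates to the block size.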
Maybe not the most intuitive way to work with something that would fit in a 2D array, but I just prefer to work with 1D arrays and do the slicing myself. I find it is easier to get coalescing that way too, since my width does not need to be a multiple of 16.
Maybe there's something really stupid, performance-wise, in my way of doing it (please let me know!), but it works.
Technically I am using a 1D array, but I like to think of it in 2D form; I don't really know why. The code you provided should help out quite a bit. Thank you very much.
So previously, if I wanted to access the column and row of a certain data set, I could just use array[blockIdx.x] and array[threadIdx.x] and it would yield the correct answer. Now that an offset is necessary, I'm just a bit confused about how to recreate this.
for (int i = 0; i < nColumns; i++) {
    // accumulate the product of the two columns, one row at a time
    tempvar = covariance + (cpuA[blockIdx.x + i*nColumns] * cpuA[threadIdx.x + i*nColumns]);
    covariance = tempvar;
}
gpuB = covariance/(nColumns-1);
const int x = blockDim.x * blockIdx.x + threadIdx.x;
That is my previous code, if that clarifies things.
I'm not so sure how to answer this.
I guess a column is array[y*width + x] where x is fixed and y goes from 0…height-1, and a row is array[y*width + x] with y fixed and x going from 0…width-1.
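In code, that looks something like this (host-side loops for illustration; array, width, height, and the fixed indices xFixed/yFixed are assumed, not from the code above):

// Sum one column: x is fixed, y walks from 0 to height-1.
int xFixed = 0;   // placeholder: whichever column you want
float columnSum = 0.0f;
for (int y = 0; y < height; y++)
    columnSum += array[y * width + xFixed];

// Sum one row: y is fixed, x walks from 0 to width-1.
int yFixed = 0;   // placeholder: whichever row you want
float rowSum = 0.0f;
for (int x = 0; x < width; x++)
    rowSum += array[yFixed * width + x];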
I find that this no longer works beyond 1581x1581 elements (floats). The numbers that I input for the divisors make no difference. Am I missing something here?
//Create Arrays for CPU
float *cpuA; //freed
float *cpuB; //freed
float *cpuC; //freed
//Create Arrays for the GPU
float *gpuA; //freed
float *gpuB; //freed
float *gpuC; //freed
//Create Vectors for various functions
float *vector; //freed
float *vector2; //freed
float *meanVectorGPU; //freed
float *meanVectorCPU; //freed
//Create i, j for various loops
int i, j;
//Declare sizes for the arrays
int nRows = 2048;
int nColumns = 2048;
//Used for the kernel call, so that the thread count does not exceed 512
dim3 threads2(nColumns);
dim3 grid2(nColumns);
dim3 threads(nRows,nColumns);
dim3 grid(nRows,nColumns);
const dim3 dimBlock(1);
float divisor = ceil((float)nRows*(float)nColumns/256.0f)+1;
int dim = ceil(sqrt((float)(nColumns*nRows)/divisor));
const dim3 dimGrid(dim, dim);
//Create the items for the timer
unsigned int timer = 0;
unsigned int elapsed = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));
//Initialize cutil
CUT_DEVICE_INIT(); // <-- this line
That is the line that did it, which confuses me. Granted, to my knowledge it isn't doing anything anymore; it was for something that used to be in my code. Still, it seems weird.
I feel really dumb. I didn't even need to go through the trouble of using a dim3. All I needed to do was call the kernel with <<<blocks, 512>>>, where blocks equals ceil(numberColumns/512)*nColumns. So, for example, if I have a 1024x1024 matrix, blocks = 2*nColumns, or 2048. The ceiling just ensures that I don't lose a block to incorrect rounding.
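For anyone who finds this later, here is a minimal sketch of that launch pattern under the same assumptions (row-major storage of width nColumns; the kernel name and its body are placeholders). Integer ceiling division does the rounding without going through floats:

__global__ void myKernel(const float *in, float *out, int nRows, int nColumns)
{
    // Each column owns a consecutive run of gridDim.x / nColumns blocks.
    const int blocksPerColumn = gridDim.x / nColumns;
    const int col = blockIdx.x / blocksPerColumn;
    const int row = (blockIdx.x % blocksPerColumn) * blockDim.x + threadIdx.x;

    // The per-column block count was rounded up; skip the overhanging threads.
    if (row >= nRows)
        return;

    out[row * nColumns + col] = in[row * nColumns + col]; // placeholder body
}

// Host side: round the per-column block count up with integer arithmetic.
const int threadsPerBlock = 512;
const int blocksPerColumn = (nRows + threadsPerBlock - 1) / threadsPerBlock;
const int blocks = blocksPerColumn * nColumns;
myKernel<<<blocks, threadsPerBlock>>>(gpuA, gpuB, nRows, nColumns);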