Quick Thread Question Regarding Calling a kernel

So if I have an array on the GPU with 1000x1000 elements, how do I use threadIdx to reach each element in the array? Currently, when I access an element I use arrayName[threadIdx.x + blockIdx.x]. However, with an array this large, I can't have that many threads operating. What is the workaround? Do I have to divide it into sub-matrices?


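This is how I do it:

// Host side: 1D blocks arranged in a roughly square 2D grid
const dim3 dimBlock(192);
int dim = ceil(sqrt((float)(DATA_W*DATA_H)/192.0f));
const dim3 dimGrid(dim,dim);

// Kernel side: flatten (blockIdx.y, blockIdx.x, threadIdx.x) into one linear index
const int idx = (blockIdx.y*blockDim.x*gridDim.x)+blockIdx.x*blockDim.x+threadIdx.x;
const int y = idx/DATA_W;
const int x = idx-y*DATA_W;

// The grid is rounded up, so skip threads that fall past the end of the array
if (idx < DATA_W*DATA_H)
	result[y * DATA_W + x] = (float)sum;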

Maybe not the most intuitive way to work with something that would fit in a 2d array. I just prefer to work with 1d arrays and slice it myself. I find it is easier to get coalescing that way too since my width does not need to be a multiple of 16.

Maybe there's something really stupid, performance-wise, in my way of doing it (please let me know!), but it works.
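For anyone who wants to try the flattened-index approach end to end, here is a minimal sketch rather than the poster's actual program. The names fillLinear and runFillLinear are hypothetical, and the value written (the element's own index) is only a placeholder so the result can be checked on the host. It assumes a CUDA-capable device is present.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// One thread per element: flatten (blockIdx.y, blockIdx.x, threadIdx.x)
// into a single linear index, then split it into row y and column x.
__global__ void fillLinear(float *result, int width, int height)
{
    const int idx = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
    if (idx >= width * height) return;   // grid is rounded up, skip surplus threads
    const int y = idx / width;
    const int x = idx - y * width;
    result[y * width + x] = (float)idx;  // placeholder payload for the demo
}

// Hypothetical helper: returns true if every element received its own index.
bool runFillLinear(int width, int height, int threadsPerBlock)
{
    const int n = width * height;
    const size_t bytes = n * sizeof(float);
    float *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemset(d, 0xFF, bytes);  // all-ones bytes read back as NaN, so unwritten slots fail the check

    // Enough blocks to cover n elements, arranged as a dim x dim grid.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    const int dim = (int)std::ceil(std::sqrt((double)blocks));
    fillLinear<<<dim3(dim, dim), dim3(threadsPerBlock)>>>(d, width, height);

    std::vector<float> h(n);
    const cudaError_t err = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h[i] != (float)i) return false;
    return true;
}
```

Calling runFillLinear(1000, 1000, 192) launches a 73x73 grid, which is slightly more threads than elements. That surplus is why the bounds check in the kernel matters.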

Technically I am using a 1d array, but I like to think of it in 2d form. I don’t really know why. The code you provided should help out quite a bit. Thank you much.

This might sound really noobish, but why use 192?

Oh right, forget about that 192, that was just the value I was testing with at that moment. Set it to whatever you like!

Ok, awesome. I didn't see how 192 lined up with any memory patterns, so I was a bit confused there. So far your method seems to work.

Thanks :)


Ok, one quick question regarding its use.

So previously if I wanted to access the column and row of a certain data set, I could just use array[blockIdx.x] and array[threadIdx.x] and it would yield the correct answer. With the necessary offset, I’m just a bit confused as to how to recreate this.

for(int i = 0; i < nColumns; i++){
	tempvar = covariance + ((cpuA[blockIdx.x+i*nColumns])*(cpuA[threadIdx.x+i*nColumns]));
	covariance = tempvar;
}

gpuB = covariance/(nColumns-1);
const int x = blockDim.x * blockIdx.x + threadIdx.x;

That is my previous code, if that clarifies things.

I'm not sure how to answer this.
I guess a column is array[y*width+x] where x is constant and y goes from 0…height-1, and a row is array[y*width+x] with y fixed and x going from 0…width-1.

Not sure if this is what you're looking for.
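To make the row/column addressing concrete, here is a small sketch that walks down one column per thread using array[y*width+x]. It is an illustration only, not the covariance kernel from the question. The names columnSums and runColumnSums are made up for this example, and the block size of 256 is an arbitrary choice.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per column: x is fixed for the thread, y walks down the rows.
// Neighbouring threads read neighbouring addresses on every iteration.
__global__ void columnSums(const float *a, float *sums, int width, int height)
{
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= width) return;              // last block may be only partly used
    float sum = 0.0f;
    for (int y = 0; y < height; ++y)
        sum += a[y * width + x];         // element at row y, column x
    sums[x] = sum;
}

// Hypothetical helper: returns the per-column sums, or an empty vector on a CUDA error.
std::vector<float> runColumnSums(const std::vector<float> &a, int width, int height)
{
    float *dA = nullptr, *dSums = nullptr;
    std::vector<float> sums(width);
    bool ok = cudaMalloc(&dA, a.size() * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dSums, width * sizeof(float)) == cudaSuccess
           && cudaMemcpy(dA, a.data(), a.size() * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        const int blocks = (width + threads - 1) / threads;
        columnSums<<<blocks, threads>>>(dA, dSums, width, height);
        ok = cudaMemcpy(sums.data(), dSums, width * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);     // cudaFree on a null pointer is a no-op
    cudaFree(dSums);
    if (!ok) sums.clear();
    return sums;
}
```

Swapping the roles (one thread per row, looping over x) uses the same array[y*width+x] expression with y fixed, but then neighbouring threads read addresses that are width elements apart, which is generally worse for coalescing.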

I figured it out after screwing around with some stuff for a while :-P

I find that this is no longer working after 1581x1581 elements (floats). The numbers that I input for the divisors make no difference. Am I missing something here?

The problem must be somewhere else; the code I pasted it from uses 3000x3000.

Mmk, I’ll look into it.

int sum = 0;

//Create Arrays for CPU
float *cpuA;												//freed
float *cpuB;												//freed
float *cpuC;												//freed

//Create Arrays for the GPU
float *gpuA;												//freed
float *gpuB;												//freed
float *gpuC;												//freed

//Create Vectors for various functions
float *vector;												//freed
float *vector2;												//freed
float *meanVectorGPU;										//freed
float *meanVectorCPU;										//freed
//Create i, j for various loops
int i, j;

//Declare sizes for the arrays
int nRows = 2048;
int nColumns = 2048;

//Used for call to kernel, so that threads does not exceed 512
dim3 threads2(nColumns);
dim3 grid2(nColumns);
dim3 threads(nRows,nColumns);
dim3 grid(nRows,nColumns);
const dim3 dimBlock(1);
float divisor = ceil((float)nRows*(float)nColumns/256.0f)+1;
int dim = ceil(sqrt((float)(nColumns*nRows)/divisor));
const dim3 dimGrid(dim, dim);
//Create the items for the timer
unsigned int timer = 0;
unsigned int elapsed = 0;

//Initialize cutil

That is the line that did it, which confuses me. Granted, to my knowledge it isn't doing anything anymore; it was for something that used to be in my code. Still, it seems weird.

After taking that line out, I am able to run any size matrix (within memory constraints, of course). :-D

I feel really dumb. I didn't even need to go through the trouble of using a dim3. All I needed to do was call the kernel with <<<blocks, 512>>>, where blocks equals ceil(nColumns / 512) * nColumns. So for example, if I have a 1024x1024 matrix, blocks = 2 * nColumns, or 2048. The ceiling just ensures that I don't lose a block because of incorrect rounding.
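A rough sketch of that 1D launch, for reference. One caveat on the formula: if nColumns / 512 is evaluated as integer division, it truncates before ceil ever sees it, so this version rounds up in integer arithmetic over the total element count instead (a variation, not the exact formula above). The names blocksFor, scaleByTwo and runScaleByTwo are hypothetical, and the doubling is a stand-in for real work.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Integer rounding-up division: the smallest block count that covers n elements.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// One thread per element of the flattened matrix.
__global__ void scaleByTwo(float *data, int n)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;      // guard the partly filled last block
}

// Hypothetical helper: returns true if every element was doubled exactly once.
bool runScaleByTwo(int nRows, int nColumns)
{
    const int n = nRows * nColumns;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h(n, 1.5f);

    float *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    scaleByTwo<<<blocksFor(n, 512), 512>>>(d, n);

    const cudaError_t err = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h[i] != 3.0f) return false;
    return true;
}
```

For a 1024x1024 matrix, blocksFor gives 2048 blocks, matching the figure above. Note that devices below compute capability 3.0 cap the x-dimension of the grid at 65535 blocks, so very large matrices on such hardware still need a 2D grid or a loop inside the kernel.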