So if I have a 1000x1000 array on the GPU, how do I use the thread indices to access each element? Currently I access an element with "arrayName[threadIdx.x + blockIdx.x]"; however, with an array this large I can't launch that many threads. What is the workaround for this? Do I have to divide it into sub-matrices?
// Host side: a square grid of 192-thread blocks, enough to cover DATA_W*DATA_H elements
const dim3 dimBlock(192);
int dim = ceil(sqrt((float)(DATA_W*DATA_H)/192.0f));
const dim3 dimGrid(dim, dim);

// Device side: flatten the 2D grid into one global index, then split it into (x, y)
const int idx = (blockIdx.y*blockDim.x*gridDim.x) + blockIdx.x*blockDim.x + threadIdx.x;
const int y = idx / DATA_W;
const int x = idx - y*DATA_W;
...
result[y * DATA_W + x] = (float)sum;
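To make that concrete, here is a minimal sketch of a complete kernel using that indexing scheme. Since the grid is rounded up to cover DATA_W*DATA_H elements, the last block usually has spare threads, so the bounds check matters. The kernel name, the device pointers, and the doubling operation are placeholders, not anything from the code above:

__global__ void scaleKernel(float *result, const float *input, int dataW, int dataH)
{
    // Flatten the 2D grid of 1D blocks into one global element index.
    const int idx = (blockIdx.y * blockDim.x * gridDim.x)
                  + blockIdx.x * blockDim.x + threadIdx.x;

    // The grid was rounded up, so some threads land past the end; skip them.
    if (idx >= dataW * dataH)
        return;

    const int y = idx / dataW;
    const int x = idx - y * dataW;

    // Placeholder work: double each element.
    result[y * dataW + x] = 2.0f * input[y * dataW + x];
}

Launched with the dimGrid/dimBlock above as scaleKernel<<<dimGrid, dimBlock>>>(d_result, d_input, DATA_W, DATA_H), this touches every element exactly once, no matter how DATA_W*DATA_H relates to the block size.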
Maybe not the most intuitive way to work with something that would fit in a 2D array, but I just prefer to work with 1D arrays and do the slicing myself. I find it is easier to get coalescing that way too, since my width does not need to be a multiple of 16.
Maybe there's something really stupid, performance-wise, in my way of doing it (please let me know!), but it works.
Technically I am using a 1D array, but I like to think of it in 2D form; I don't really know why. The code you provided should help out quite a bit. Thank you very much.
So previously, if I wanted to access the column and row of a certain data set, I could just use array[blockIdx.x] and array[threadIdx.x] and it would yield the correct answer. Now that an offset is necessary, I'm just a bit confused about how to recreate this.
for (int i = 0; i < nColumns; i++) {
    // accumulate the product of the two columns, one row at a time
    tempvar = covariance + (cpuA[blockIdx.x + i*nColumns] * cpuA[threadIdx.x + i*nColumns]);
    covariance = tempvar;
}
gpuB = covariance/(nColumns-1);
const int x = blockDim.x * blockIdx.x + threadIdx.x;
That is my previous code, if that clarifies things.
I'm not so sure how to answer this.
I guess a column is array[y*width + x] where x is fixed and y goes from 0…height-1, and a row is array[y*width + x] with y fixed and x going from 0…width-1.
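In code, that looks something like this (host-side loops for illustration; array, width, height, and the fixed indices xFixed/yFixed are assumed, not from the code above):

// Sum one column: x is fixed, y walks from 0 to height-1.
int xFixed = 0;   // placeholder: whichever column you want
float columnSum = 0.0f;
for (int y = 0; y < height; y++)
    columnSum += array[y * width + xFixed];

// Sum one row: y is fixed, x walks from 0 to width-1.
int yFixed = 0;   // placeholder: whichever row you want
float rowSum = 0.0f;
for (int x = 0; x < width; x++)
    rowSum += array[yFixed * width + x];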
I find that this no longer works beyond 1581x1581 elements (floats). The numbers that I input for the divisors make no difference. Am I missing something here?
//Create Arrays for CPU
float *cpuA; //freed
float *cpuB; //freed
float *cpuC; //freed
//Create Arrays for the GPU
float *gpuA; //freed
float *gpuB; //freed
float *gpuC; //freed
//Create Vectors for various functions
float *vector; //freed
float *vector2; //freed
float *meanVectorGPU; //freed
float *meanVectorCPU; //freed
//Create i, j for various loops
int i, j;
//Declare sizes for the arrays
int nRows = 2048;
int nColumns = 2048;
//Used for the kernel call, so that the thread count does not exceed 512
dim3 threads2(nColumns);
dim3 grid2(nColumns);
dim3 threads(nRows,nColumns);
dim3 grid(nRows,nColumns);
const dim3 dimBlock(1);
float divisor = ceil((float)nRows*(float)nColumns/256.0f)+1;
int dim = ceil(sqrt((float)(nColumns*nRows)/divisor));
const dim3 dimGrid(dim, dim);
//Create the items for the timer
unsigned int timer = 0;
unsigned int elapsed = 0;
CUT_SAFE_CALL(cutCreateTimer(&timer));
CUT_SAFE_CALL(cutStartTimer(timer));
//Initialize cutil
CUT_DEVICE_INIT(); // <-- this line
That is the line that did it, which confuses me. Granted, to my knowledge it isn't doing anything anymore; it was for something that used to be in my code. Still, it seems weird.
I feel really dumb. I didn't even need to go through the trouble of using a dim3. All I needed to do was call the kernel with <<<blocks, 512>>>, where blocks equals ceil(numberColumns/512)*nColumns. So, for example, if I have a 1024x1024 matrix, blocks = 2*nColumns, or 2048. The ceiling just ensures that I don't lose a block to incorrect rounding.
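For anyone who finds this later, here is a minimal sketch of that launch pattern under the same assumptions (row-major storage of width nColumns; the kernel name and its body are placeholders). Integer ceiling division does the rounding without going through floats:

__global__ void myKernel(const float *in, float *out, int nRows, int nColumns)
{
    // Each column owns a consecutive run of gridDim.x / nColumns blocks.
    const int blocksPerColumn = gridDim.x / nColumns;
    const int col = blockIdx.x / blocksPerColumn;
    const int row = (blockIdx.x % blocksPerColumn) * blockDim.x + threadIdx.x;

    // The per-column block count was rounded up; skip the overhanging threads.
    if (row >= nRows)
        return;

    out[row * nColumns + col] = in[row * nColumns + col]; // placeholder body
}

// Host side: round the per-column block count up with integer arithmetic.
const int threadsPerBlock = 512;
const int blocksPerColumn = (nRows + threadsPerBlock - 1) / threadsPerBlock;
const int blocks = blocksPerColumn * nColumns;
myKernel<<<blocks, threadsPerBlock>>>(gpuA, gpuB, nRows, nColumns);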