Hello!
I’m new to GPU programming and parallelization. My previous experience is purely in C++ without using GPU resources, and now that I’m required to learn C (a downgrade from C++) and adopt the CUDA model, I’m stuck with a lot of questions about coding against the CUDA API.
I’ve started by playing around with some plain examples from the book and some sample source code that NVIDIA offers on its website.
The general samples work fine (and are quite straightforward), but when I started adjusting and modifying them for the sake of experiment, I ran into many tedious details that are left unexplained.
The next stage that confused me was kernel coding.
Specifically:
I wanted to develop a test example to compare the performance of GPU vs. CPU on some simple structures (a 1D array and 2D arrays in my case). After coding the example in plain C++, I started coding the same example in C/CUDA.
Briefly, the algorithm I’m trying to implement is this: I have a one-dimensional array of ints that I want to add to each “row” of a two-dimensional array of ints, producing a two-dimensional output array.
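For reference, the calculation described above can be sketched on the host side roughly as follows. This is only a minimal sketch: the function name `sum_rows_cpu` and the parameters `rows` and `cols` are hypothetical, and it assumes the two-dimensional array is stored row by row in one flat buffer.

```cuda
// Host-side reference (plain C, also valid inside a .cu file).
// Assumptions: `matrix` and `out` each hold rows * cols ints in row-major
// order, and `vec` holds cols ints. All names here are illustrative.
void sum_rows_cpu(const int* matrix, const int* vec, int* out,
                  int rows, int cols)
{
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            // Element (r, c) of the flat buffer lives at r * cols + c.
            out[r * cols + c] = matrix[r * cols + c] + vec[c];
        }
    }
}
```

A reference like this can double as a correctness check for the GPU version by comparing the two output buffers element by element.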
My simple kernel is:
__global__ void sum_action (int* test, int* target, int* out)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x; // basic index
    while (index < UN) // UN are the dimensions of my two-dimensional array, defined as global values
    {
        out** [index] = test** [index] + target* [index]; // sum the elements in test with the elements
        // of the target array and store them in the out array
        // obviously a wrong approach that I'm struggling with
        index += blockDim.x * gridDim.x; // increment the index
    }
}
The kernel above gives me an “expected an expression” error on the calculation line at compile time.
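The message is consistent with the `*` sitting directly in front of `[index]`: the compiler reads `out* [index]` as a multiplication whose right-hand operand is missing. A pointer to a flat buffer can be subscripted directly, so one possible way to write the line is sketched below. The assumptions are that `test` is the two-dimensional array stored row by row, that `target` is the one-dimensional array, and that the sizes are passed as the hypothetical parameters `rows` and `cols` rather than read from globals. The names `sum_action_flat` and `run_sum_action_flat` and the 128 x 128 launch configuration are also just illustrative.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Grid-stride kernel over a flat, row-major rows x cols buffer.
__global__ void sum_action_flat(const int* test, const int* target, int* out,
                                int rows, int cols)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int total = rows * cols;
    while (index < total) {
        // index % cols is the column, which selects the element of the 1D array.
        out[index] = test[index] + target[index % cols];
        index += blockDim.x * gridDim.x;
    }
}

// Hypothetical host wrapper: copies inputs over, launches, copies the result back.
// Returns 0 on success and -1 if any CUDA call reports an error.
int run_sum_action_flat(const int* h_test, const int* h_target, int* h_out,
                        int rows, int cols)
{
    size_t mat_bytes = (size_t)rows * cols * sizeof(int);
    size_t vec_bytes = (size_t)cols * sizeof(int);
    int *d_test = NULL, *d_target = NULL, *d_out = NULL;

    int ok = cudaMalloc((void**)&d_test, mat_bytes) == cudaSuccess
          && cudaMalloc((void**)&d_target, vec_bytes) == cudaSuccess
          && cudaMalloc((void**)&d_out, mat_bytes) == cudaSuccess
          && cudaMemcpy(d_test, h_test, mat_bytes, cudaMemcpyHostToDevice) == cudaSuccess
          && cudaMemcpy(d_target, h_target, vec_bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sum_action_flat<<<128, 128>>>(d_test, d_target, d_out, rows, cols);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out, d_out, mat_bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // cudaFree accepts a null pointer, so this is safe on the failure path too.
    cudaFree(d_test);
    cudaFree(d_target);
    cudaFree(d_out);
    return ok ? 0 : -1;
}
```

The while loop keeps the grid-stride pattern from the original kernel, so the same launch configuration covers arrays larger than the number of threads.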
Any help with this issue would be appreciated. If you have other suggestions for implementing the same calculation, I will gladly consider them. If I’m doing something wrong, please let me know.
P.S. One more detail that might shed some light on the issue: I use cudaMalloc for the 2D array, so I assume that accessing the data in the array requires a pitch (the width, which in my case is the N global variable). Please correct me if I’m wrong. That would mean the approach I took, using a pointer to an array, is perhaps not the best solution.
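On the pitch question: plain `cudaMalloc` returns a linear buffer without padding, so the row stride is simply the width in elements. A separate pitch value only appears when the buffer is allocated with `cudaMallocPitch`, which may pad each row and reports the pitch in bytes. The following is a rough sketch of that variant, assuming the same hypothetical `rows` and `cols` parameters as before. The names `sum_action_pitched` and `run_sum_action_pitched` are illustrative, not taken from any sample.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Same calculation, but test and out are pitched allocations.
// The pitches are in bytes, so row addresses are computed through char*.
__global__ void sum_action_pitched(const int* test, size_t test_pitch,
                                   const int* target,
                                   int* out, size_t out_pitch,
                                   int rows, int cols)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int total = rows * cols;
    while (index < total) {
        int r = index / cols;
        int c = index % cols;
        const int* test_row = (const int*)((const char*)test + r * test_pitch);
        int* out_row = (int*)((char*)out + r * out_pitch);
        out_row[c] = test_row[c] + target[c];
        index += blockDim.x * gridDim.x;
    }
}

// Hypothetical host wrapper. Returns 0 on success and -1 on any CUDA error.
int run_sum_action_pitched(const int* h_test, const int* h_target, int* h_out,
                           int rows, int cols)
{
    size_t row_bytes = (size_t)cols * sizeof(int);
    size_t test_pitch = 0, out_pitch = 0;
    int *d_test = NULL, *d_target = NULL, *d_out = NULL;

    int ok = cudaMallocPitch((void**)&d_test, &test_pitch, row_bytes, rows) == cudaSuccess
          && cudaMallocPitch((void**)&d_out, &out_pitch, row_bytes, rows) == cudaSuccess
          && cudaMalloc((void**)&d_target, row_bytes) == cudaSuccess
          // Host rows are tightly packed, so the host-side pitch is row_bytes.
          && cudaMemcpy2D(d_test, test_pitch, h_test, row_bytes,
                          row_bytes, rows, cudaMemcpyHostToDevice) == cudaSuccess
          && cudaMemcpy(d_target, h_target, row_bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sum_action_pitched<<<128, 128>>>(d_test, test_pitch, d_target,
                                         d_out, out_pitch, rows, cols);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy2D(h_out, row_bytes, d_out, out_pitch,
                          row_bytes, rows, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_test);
    cudaFree(d_target);
    cudaFree(d_out);
    return ok ? 0 : -1;
}
```

Whether the pitched layout is actually faster than the flat one for this calculation would need to be measured; for small arrays the flat `cudaMalloc` version is probably simpler to reason about.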