simple question


I'm new to GPU programming and parallelization. My previous knowledge is purely C++ coding without using GPU resources, and now that I'm required to learn C (a step down from C++) and adopt the CUDA model, I'm stuck with a lot of questions about coding against the CUDA API.

Most importantly, I've started to play around with some plain examples from the book and some of the sample source code that NVIDIA offers on its website.
The general samples work fine (and are quite straightforward), but when I started adjusting and modifying them for the sake of experiment, I encountered many tedious details that are left unexplained.

The next stage that got me confused was the kernel coding:

Particularly speaking:
I wanted to develop a test example to compare the power of the GPU vs. the CPU on simple sample structures (a 1D array and a 2D array in my case). After coding the example in plain C++, I started coding the same example in C/CUDA.
Briefly, the algorithm I'm trying to implement: I have a one-dimensional array of ints that I'm trying to sum with each "row" of a two-dimensional array of ints, producing an output two-dimensional array.

My simple kernel is:

__global__ void sum_action (int* test, int* target, int* out)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x; // basic index
    while (index < U*N) // U and N are the dimensions of my two-dimensional array, defined as global values
    {
        out**[index] = test**[index] + target*[index]; // sum the elements of test with the elements
                                                       // of target and store them in the out array
                                                       // obviously a wrong approach that I'm struggling with
        index += blockDim.x * gridDim.x; // increment the index
    }
}

The kernel above gives me an "expected an expression" error on the calculation line at compile time.

Any help regarding this issue would be appreciated. If you have any other suggestions on implementing the same calculation, I will gladly consider them. If I'm doing something wrong, please let me know.

P.S. One more detail that might shed some light on the issue: I use cudaMalloc for the 2D array, which I assume means that access to the data in the array requires a pitch (being the width, in my case the global variable N). (Please correct me if I'm wrong.) So the approach I took, using a pointer to an array, is perhaps not the best solution.

After tweaking the kernel I got it to compile, but unfortunately it returned wrong results, which got me even more confused.

My NEW (revised) simple kernel:

__global__ void sum_action (int* test, int* target, int* out)
{
    int tidx = threadIdx.x;
    int index = blockIdx.x * blockDim.x + tidx;

    while (index < U * N) // U and N are the width and height of the 2D array
    {
        out[index * U + N] = test[index * U + N] + target[tidx + N];
        index += blockDim.x * gridDim.x;
    }
}

Please tell me what I'm doing wrong here.

Just out of curiosity, what’s your kernel invocation’s thread and block dimensions? Since each thread will execute the kernel, depending on your dimensions you might get an answer, but it won’t be the one you’re expecting. As for the pitch, did you use cudaMalloc or cudaMallocPitch? Pitch (aka stride) is added to the memory in order to ensure it conforms to certain memory alignment requirements, so that your global memory accesses can be coalesced.

So my kernel invocation dimensions are <<<N,512>>>,
N being the size (width = 32) of the array, running 512 threads per block.

Kinda got past the dead point.
Limiting the number of threads per block to 32 runs the code exactly how I need it.
Now the missing link to data parallelization is setting the needed number of blocks. The speed comparison GPU vs. CPU turned out to be 1025 ms vs. 293 ms respectively, running on the same array dimensions.
What am I doing wrong?

One thing I had to change was my kernel implementation:
I was kinda misled by the while statement, so I chose to use a for loop over the defined height:

__global__ void sum_action (int *test, int *target, int *out)
{
    int tidx = threadIdx.x;
    int index = blockIdx.x * blockDim.x + tidx;

    for (int c = 0; c < U; c++)
        out[index * N + c] = test[index * N + c] + target[c];
}

It seems to work, but my code ran almost 3.5× slower than the ordinary CPU code.

Minor correction: the kernel only works on the first element (row) of the 2D array; after that it has no effect.

After several attempts at solving the problem, I've figured out a way to actually compute the needed algorithm.

OK, and here are the results:

GPU: ~92 ms
CPU: ~290 ms