simple question

Hello!

I'm new to GPU programming and parallelization. My previous knowledge is purely C++ coding without using any GPU resources; now that I'm required to learn C (a step down from C++) and adopt the CUDA model, I'm stuck with a lot of questions about coding against the CUDA API.

Most importantly, I've started to play around with some plain examples from the book and some of the sample source code that NVIDIA offers on its website.
The general samples work fine (and are quite straightforward), but when I started adjusting and modifying them for the sake of experiment, I ran into many tedious details that are left unexplained.

The next stage that got me confused was the kernel coding.

Particularly speaking:
I wanted to develop a test example to compare the power of the GPU vs. the CPU on some simple structures (a 1D array and a 2D array in my case). After coding the example in plain C++, I started coding the same example in C/CUDA.
Briefly, the algorithm I'm trying to implement: I have a one-dimensional array of ints that I'm trying to sum with each "row" of a two-dimensional array of ints, producing a two-dimensional output array.
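For reference, a minimal CPU version of what I mean (the names vec, mat, out are just illustrative here, with U rows and N columns):

void sum_rows_cpu(const int *vec, const int *mat, int *out, int U, int N)
{
    // add the one-dimensional array to every row of the U x N matrix
    for (int r = 0; r < U; r++)
        for (int c = 0; c < N; c++)
            out[r * N + c] = mat[r * N + c] + vec[c];
}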

My simple kernel is:

__global__ void sum_action(int* test, int* target, int* out)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x; // basic index
    while (index < U * N) // U and N are the dimensions of my two-dimensional array, defined as global values
    {
        out** [index] = test** [index] + target* [index]; // sum the elements of test with the elements
                                                          // of the target array and store them in out
                                                          // obviously a wrong approach that I'm struggling with
        index += blockDim.x * gridDim.x; // increment the index
    }
}

This kernel gives me an "expected an expression" error on the calculation line at compile time.

Any help regarding this issue would be appreciated. If you have any other suggestions on implementing the same calculation, I will gladly consider them. If I'm doing something wrong, please let me know.

P.S. One more detail that might shed some light on the issue: I use cudaMalloc on a 2D array, which I assume means that accessing the data requires a pitch (being the width, in my case the global variable N). (Please correct me if I'm wrong.) So the approach I took, using a pointer to an array, is perhaps not the best solution.
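For context, the allocation I'm talking about looks roughly like this sketch (names like d_mat, h_mat are placeholders; the 2D array is just one flat U*N block allocated with plain cudaMalloc, so no pitch is involved):

int *d_mat, *d_vec, *d_out;
size_t matBytes = (size_t)U * N * sizeof(int);
size_t vecBytes = (size_t)N * sizeof(int);

cudaMalloc((void**)&d_mat, matBytes);  // flat U*N buffer, no pitch
cudaMalloc((void**)&d_vec, vecBytes);
cudaMalloc((void**)&d_out, matBytes);

cudaMemcpy(d_mat, h_mat, matBytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_vec, h_vec, vecBytes, cudaMemcpyHostToDevice);
// element (r, c) then lives at d_mat[r * N + c]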

After tweaking the kernel I got it to compile, but unfortunately it returned wrong results, which got me even more confused.

My new (revised) simple kernel:

__global__ void sum_action(int* test, int* target, int* out)
{
    int tidx = threadIdx.x;
    int index = blockIdx.x * blockDim.x + tidx;

    while (index < U * N) // width and height of the 2D array
    {
        out[index*U + N] = test[index*U + N] + target[tidx + N];
        index += blockDim.x * gridDim.x;
    }
}

Please tell me what I'm doing wrong here.
Regards.

Just out of curiosity, what’s your kernel invocation’s thread and block dimensions? Since each thread will execute the kernel, depending on your dimensions you might get an answer, but it won’t be the one you’re expecting. As for the pitch, did you use cudaMalloc or cudaMallocPitch? Pitch (aka stride) is added to the memory in order to ensure it conforms to certain memory alignment requirements, so that your global memory accesses can be coalesced.
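If you do switch to the pitched allocation, a minimal sketch looks like this (names are placeholders; the pitch comes back in bytes, chosen by the runtime):

size_t pitch;
int *d_mat;
cudaMallocPitch((void**)&d_mat, &pitch, N * sizeof(int), U);   // width in bytes, height in rows
cudaMemcpy2D(d_mat, pitch, h_mat, N * sizeof(int),
             N * sizeof(int), U, cudaMemcpyHostToDevice);
// inside a kernel, row r starts at: (int*)((char*)d_mat + r * pitch)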

OK,
so my kernel invocation dimensions are <<<N, 512>>>,
N being the size (width = 32) of the array, running 512 threads per block.

I kind of got past the dead point.
Limiting the number of threads per block to 32 runs the code exactly how I need it.
Now the missing link to data parallelization is to set the needed number of blocks. The speed comparison, GPU vs. CPU, turned out to be 1025 ms vs. 293 ms respectively, running on the same array dimensions.
What am I doing wrong?
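The usual pattern I've seen for picking the block count is to round up so there is one thread per output element, roughly this sketch (pointer names are placeholders; it assumes the kernel strides over all U * N elements):

int threadsPerBlock = 256;                                     // e.g. a multiple of the warp size (32)
int blocks = (U * N + threadsPerBlock - 1) / threadsPerBlock;  // round up
sum_action<<<blocks, threadsPerBlock>>>(dev_test, dev_target, dev_out);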

One thing I had to change was my kernel implementation:
I was kind of misled by the while statement, so I chose to use a for loop over the defined height:

__global__ void sum_action(int *test, int *target, int *out)
{
    int tidx = threadIdx.x;
    int index = blockIdx.x * blockDim.x + tidx;

    for (int c = 0; c < U; c++)
    {
        out[index * N + c] = test[index * N + c] + target[c];
    }
}

It seems to work, but my code ran almost 3.5× slower than the ordinary CPU code.

Minor correction: the kernel only works on the first element (row) of the 2D array; after that it has no effect.

After several attempts at solving the problem, I've figured out a way to actually compute the needed algorithm.
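A sketch of one way to structure such a kernel, with one thread per output element via a grid-stride loop (the names and exact form here are illustrative, not necessarily the version behind the timings below):

__global__ void sum_rows(const int *vec, const int *mat, int *out, int U, int N)
{
    // grid-stride loop over all U * N output elements
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < U * N;
         i += blockDim.x * gridDim.x)
    {
        int col = i % N;               // column within the current row
        out[i] = mat[i] + vec[col];    // add the 1D array to this row element
    }
}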

OK, and here are the results:

GPU: ~92 ms
CPU: ~290 ms