Need help in GPU programming

zainub · January 19, 2015, 5:05pm

I am doing research on compressive sensing on the GPU. I am confused about how to map the compressive sensing calculations onto the GPU, that is, which part of the code I should turn into a kernel. Could you guide me in this situation?

A flow chart of my code is

Input : an image, sensing matrix
Output: image

for 1:length(image)
    // send each row at a time
    1-compress the signal (a row)
    2-scaling
    3-iterations
        1-prune weights as their parameters go to zero
        2-compute new weights
        3-learn correlation structure in blocks
        4-check stopping condition
    end
    4-reconstruct original signal (a row)
end

Now please help me decide whether a compressed signal (row) should be the input of my kernel, or whether I should make the iteration steps separate kernels.

I shall be very thankful for your response.

little_jimmy · January 20, 2015, 4:53am

you have provided a flow chart, which is great
at the same time, you have to admit that it is still very vague - instead of a black-box, it is now a brown black-box

it would be difficult to determine how to effectively break up this problem, if it is unclear what each component does, and what its inputs are
perhaps note a one-line/ paragraph summary of each line of the flow chart

for example:

how big is a matrix?
are the rows and calculation on rows independent of each other?
what is an iteration?
learn correlation structure in blocks? ???

zainub · January 20, 2015, 5:47am

A flow chart of my code is

Input : an image (512,512), sensing matrix
Output: image (512,512)

for 1:length(image)
    // send each row at a time (1,512)
    1-compress the signal (a row) (y=phi*x) (1,256)
    2-scaling
    3-loop until error becomes less than 0.00000008
        1-compute new weights (after calculations, estimate the signal)
        2-check correlation structure in blocks
        3-check stopping condition
    end
    4-reconstruct original signal (1,512)
end

little_jimmy · January 20, 2015, 6:32am

"
3-loop until error become less than 0.00000008"

is error determined per row, or across multiple rows?

“2-check correlation structure in blocks”

???

"3-check stopping condition "

???

“4-reconstruct original signal (1,512)”

???

i suppose the main question is: are calculations done on rows dependent on input from the row only, or other rows as well; can you process rows independently of each other, or are there dependencies between rows to be met

if not you would likely have 1 kernel, with kernel dimensions according to the number of rows (grid dimension), and the width of a row (block dimension)
otherwise, the dependencies would likely determine the proper division of labour

zainub · January 22, 2015, 9:19am

Actually, I want to make each row one thread. All rows work independently. There are some variables which have fixed values. Should I put those in shared memory?
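On the question of fixed values: shared memory is private to each block and has to be filled again by every block, so small read-only parameters are often passed as kernel arguments or placed in `__constant__` memory instead. The sketch below is one possible arrangement, not something taken from the thread; the names `c_scale`, `scaleRows` and `scaleImage` are hypothetical. Constant memory is limited to 64 KB, so a 256 x 512 float matrix such as phi would not fit there and would stay in global memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical fixed parameter; the name and its use are illustrative only.
__constant__ float c_scale;

// One block per image row; threads stride across the elements of that row.
__global__ void scaleRows(float* image, int rowWidth)
{
    float* row = image + blockIdx.x * rowWidth;
    for (int col = threadIdx.x; col < rowWidth; col += blockDim.x)
        row[col] *= c_scale;  // every thread reads the same constant
}

// Host helper: upload the constant once, then launch one block per row.
cudaError_t scaleImage(float* d_image, int nRows, int rowWidth, float scale)
{
    cudaError_t err = cudaMemcpyToSymbol(c_scale, &scale, sizeof(float));
    if (err != cudaSuccess) return err;
    scaleRows<<<nRows, 256>>>(d_image, rowWidth);
    err = cudaGetLastError();
    return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

A call such as `scaleImage(d_image, 512, 512, 0.5f)` would multiply every element of a 512 x 512 device image by 0.5.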

zainub · January 22, 2015, 9:34am

In the first step, scaling, I want to take the standard deviation of the whole row, then scale the whole row by a constant number. After this, the following equation executes in a loop:
phiB (256,512)=phiB(256,512)+Phi(256,64)*sigma(64,64)*Phi’(64,256);
Now tell me, if I make a single row one thread, can these calculations execute?

little_jimmy · January 22, 2015, 10:18am

i would think that, a block per row, scheduled to run concurrently, would be faster than a thread per row, scheduled to run concurrently

hence,

dim3 dB(row_width, 1, 1);
dim3 dG(nr_of_rows, 1, 1);

should be faster than

dim3 dB(1, 1, 1);
dim3 dG(nr_of_rows, 1, 1);

or

dim3 dB(nr_of_rows, 1, 1);
dim3 dG(1, 1, 1);

i struggle to interpret (256, 512) - is that (row_element, col_element)…?

zainub · January 23, 2015, 6:01am

Now I want to run a block per row. The computation for a row has dimensions
result(256,1)=phi(256,512)*u(1,512)';
How can this be done?

zainub · January 23, 2015, 6:03am

little_jimmy · January 23, 2015, 7:59am

“an image (512 512),sensing matrix”

which i interpret as 512 rows; 512 elements per row

so where do (256, 1) and (256,512) come from?

“in first step scaling, I want to take standard deviation of whole row. then scale whole row with a constant number.”

here, i propose using a thread per row element, and a thread block for each row (dB(row_elements, 1, 1); dG(nr_of_rows, 1, 1))
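A block-per-row layout for the scaling step could look roughly like the sketch below. It assumes a row width of 512 floats (one thread per element) and row-major storage. The thread does not say how the standard deviation feeds into the scale factor, so this version simply stores the per-row standard deviation and multiplies the row by a caller-supplied factor. The names `blockSum`, `rowStdAndScale` and `launchRowStdAndScale` are hypothetical, and the N-1 normalisation is an assumption.

```cuda
#include <cuda_runtime.h>

#define ROW_WIDTH 512  // elements per image row; also threads per block (power of two)

// Sum the ROW_WIDTH values held in smem; every thread receives the total.
__device__ float blockSum(float* smem)
{
    __syncthreads();  // make every thread's write to smem visible
    for (int stride = ROW_WIDTH / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            smem[threadIdx.x] += smem[threadIdx.x + stride];
        __syncthreads();
    }
    float total = smem[0];
    __syncthreads();  // all threads have read the total before smem is reused
    return total;
}

// One block per row, one thread per element.
__global__ void rowStdAndScale(float* image, float* rowStd, float scale)
{
    __shared__ float smem[ROW_WIDTH];
    float* row = image + blockIdx.x * ROW_WIDTH;
    float x = row[threadIdx.x];

    smem[threadIdx.x] = x;
    float mean = blockSum(smem) / ROW_WIDTH;

    smem[threadIdx.x] = (x - mean) * (x - mean);
    float var = blockSum(smem) / (ROW_WIDTH - 1);  // assumption: N-1 normalisation

    if (threadIdx.x == 0)
        rowStd[blockIdx.x] = sqrtf(var);
    row[threadIdx.x] = x * scale;
}

// Host helper: d_image holds nRows * ROW_WIDTH floats, d_rowStd holds nRows floats.
cudaError_t launchRowStdAndScale(float* d_image, float* d_rowStd, int nRows, float scale)
{
    rowStdAndScale<<<nRows, ROW_WIDTH>>>(d_image, d_rowStd, scale);
    cudaError_t err = cudaGetLastError();
    return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

Because every block touches only its own row, the 512 rows need no synchronisation between blocks.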

“after this , in a loop ,following equation executes
phiB (256,512)=phiB(256,512)+Phi(256,64)*sigma(64,64)*Phi’(64,256);”

can you write this loop out in a sentence, pseudo code or code, even if serial platform code, such that it is easier to follow
this seems to be more of a filter operation, which may require different kernel dimensions to adapt to the filter

zainub · January 26, 2015, 5:26pm

An image x has size 512,512.
I want a matrix multiplication of a matrix phi of size 256,512 with the transpose of each row of the image, of size 1,512…
This looks like
y(256,1)= phi(256,512) * image(1,512)';
Note that the matrix phi is constant throughout the code.
Now my question is: how many threads should I use per block if I want to make each row a block?

Might a kernel look like this?

int const i = blockDim.x * blockIdx.x + threadIdx.x;

//N=256;
//a->phi;
//b->y;
__shared__ int temp[N];
int sum=0;
for ( int i = 0; i < N; i++ )
    temp[threadIdx.x] = phi[i][threadIdx.x] * y[threadIdx.x];
__syncthreads();
sum += temp[i];
*c[i] = sum;
}

zainub · January 26, 2015, 5:36pm

Actually, I want one block to run the whole kernel for one row at a time, because each row of the image matrix is independent of the other rows, and each row should execute the same kernel. If all rows could run the whole kernel in parallel, that would be much better.

little_jimmy · January 27, 2015, 5:58am

“An image x has size 512,512.
I want a matrix multiplication of a matrix phi of size 256,512 with the transpose of each row of the image, of size 1,512…
This looks like
y(256,1)= phi(256,512) * image(1,512)';
Note that the matrix phi is constant throughout the code.”

what a big difference 1 paragraph can make

dim3 dB(512, 1, 1);
dim3 dG(256, 512, 1);

__global__ void k(type* image, type* phi, type* product)
{
    // blockIdx.x selects the row of phi, blockIdx.y the row of the image
    int i = (blockDim.x * blockIdx.x) + threadIdx.x;
    int j = (blockDim.x * blockIdx.y) + threadIdx.x;

    __shared__ type smem[/* blockDim.x */];

    smem[threadIdx.x] = image[j] * phi[i];
    __syncthreads();

    {
        // reduction scan of smem
    }

    if (threadIdx.x == 0)
    {
        product[(gridDim.x * blockIdx.y) + blockIdx.x] = smem[0];
    }
}

just check
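With the reduction filled in and `float` chosen for `type`, the kernel above might be completed as in the sketch below. It assumes row-major storage for both `image` and `phi`, and a block size equal to the row length of 512. The host helper `compressRows` and the macros `N_COLS` and `M_ROWS` are names introduced here for illustration only.

```cuda
#include <cuda_runtime.h>

#define N_COLS 512  // elements per image row and per phi row (threads per block)
#define M_ROWS 256  // rows of phi, i.e. length of each compressed signal

// Grid is (M_ROWS, nImageRows). Block (m, r) computes dot(phi row m, image row r).
__global__ void dotproduct(const float* image, const float* phi, float* product)
{
    __shared__ float smem[N_COLS];

    int i = blockDim.x * blockIdx.x + threadIdx.x;  // element of phi row blockIdx.x
    int j = blockDim.x * blockIdx.y + threadIdx.x;  // element of image row blockIdx.y

    smem[threadIdx.x] = image[j] * phi[i];
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            smem[threadIdx.x] += smem[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        product[gridDim.x * blockIdx.y + blockIdx.x] = smem[0];
}

// Host helper: all pointers are device pointers.
cudaError_t compressRows(const float* d_image, const float* d_phi,
                         float* d_product, int nImageRows)
{
    dim3 dB(N_COLS, 1, 1);
    dim3 dG(M_ROWS, nImageRows, 1);
    dotproduct<<<dG, dB>>>(d_image, d_phi, d_product);
    cudaError_t err = cudaGetLastError();
    return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

With this indexing, element m of the compressed signal for image row r ends up at `product[r * 256 + m]`.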

zainub · January 27, 2015, 6:12am

Where can I write
dim3 dB(512, 1, 1);
dim3 dG(256, 512, 1);
because dim3 is used for grid dimensions…
Should the kernel call look like
dotproduct<<<1,512>>>(d_image,d_phi,d_pro);

little_jimmy · January 27, 2015, 6:21am

{
dim3 dB(512, 1, 1);
dim3 dG(256, 512, 1);

dotproduct<<<dG,dB>>>(d_image,d_phi,d_pro);
}

you can also push each image row in shared memory for data reuse; i shall see if i can post an example later
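No such example appears later in the thread, so the following is only one way the shared-memory idea might be sketched, not the poster's own code. Each block loads one image row into shared memory once and reuses it for all 256 rows of phi; each warp takes a subset of the phi rows and sums its partial products with warp shuffles. It assumes CUDA 9 or later (for `__shfl_down_sync`), row-major storage, and a block size that is a multiple of 32. The names `compressRowsShared` and `launchCompressRowsShared` are hypothetical.

```cuda
#include <cuda_runtime.h>

#define N_COLS 512  // elements per image row and per phi row
#define M_ROWS 256  // rows of phi

// One block per image row; the row is cached in shared memory and reused M_ROWS times.
__global__ void compressRowsShared(const float* image, const float* phi, float* product)
{
    __shared__ float x[N_COLS];
    const float* row = image + blockIdx.x * N_COLS;
    for (int k = threadIdx.x; k < N_COLS; k += blockDim.x)
        x[k] = row[k];
    __syncthreads();

    int lane   = threadIdx.x % 32;
    int warp   = threadIdx.x / 32;
    int nWarps = blockDim.x / 32;

    // Each warp computes the dot products for phi rows warp, warp + nWarps, ...
    for (int m = warp; m < M_ROWS; m += nWarps) {
        float partial = 0.0f;
        for (int k = lane; k < N_COLS; k += 32)
            partial += phi[m * N_COLS + k] * x[k];

        // Warp-level reduction; lane 0 ends up with the full sum.
        for (int offset = 16; offset > 0; offset >>= 1)
            partial += __shfl_down_sync(0xffffffffu, partial, offset);

        if (lane == 0)
            product[blockIdx.x * M_ROWS + m] = partial;
    }
}

// Host helper: all pointers are device pointers.
cudaError_t launchCompressRowsShared(const float* d_image, const float* d_phi,
                                     float* d_product, int nImageRows)
{
    compressRowsShared<<<nImageRows, 512>>>(d_image, d_phi, d_product);
    cudaError_t err = cudaGetLastError();
    return err != cudaSuccess ? err : cudaDeviceSynchronize();
}
```

The output layout matches the 256 x 512 block version: the result for image row r and phi row m is written to `product[r * 256 + m]`.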

zainub · January 27, 2015, 6:55am

Sir, what is
__shared__ type smem[/* blockDim.x */];

smem[threadIdx.x] = image[j] * phi[i];

{
    // reduction scan of smem
}
?

little_jimmy · January 27, 2015, 7:48am

__shared__ type smem[/* blockDim.x */];

smem[threadIdx.x] = image[j] * phi[i];

declaration and assignment of shared memory of type type

e.g. if image and phi are of type float, it would be

__shared__ float smem[512];

a reduction is like a sum scan where the sum of an array is accumulated over the elements
a reduction however only minds the overall (last) sum, not each element’s accumulation
refer to the samples if you are uncertain as to how to program one
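To make the scan versus reduction distinction concrete, here is a small single-block inclusive sum scan, written as a sketch and not taken from the CUDA samples. It uses the Hillis-Steele scheme with two shared buffers and a deliberately tiny array; `SCAN_N`, `inclusiveScan` and `scanOnDevice` are names chosen for this illustration.

```cuda
#include <cuda_runtime.h>

#define SCAN_N 8  // tiny array so the result is easy to inspect; equals blockDim.x

// Inclusive sum scan within one block (Hillis-Steele, double-buffered).
__global__ void inclusiveScan(const int* in, int* out)
{
    __shared__ int buf[2][SCAN_N];
    int t = threadIdx.x;
    int cur = 0;
    buf[cur][t] = in[t];
    __syncthreads();

    for (int offset = 1; offset < SCAN_N; offset <<= 1) {
        int nxt = 1 - cur;
        // Read from the current buffer, write to the other one, then swap.
        buf[nxt][t] = (t >= offset) ? buf[cur][t] + buf[cur][t - offset]
                                    : buf[cur][t];
        __syncthreads();
        cur = nxt;
    }
    out[t] = buf[cur][t];
}

// Host helper: copies SCAN_N ints to the device, scans them, copies the result back.
cudaError_t scanOnDevice(const int* h_in, int* h_out)
{
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_in, SCAN_N * sizeof(int));
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_out, SCAN_N * sizeof(int));
    if (err != cudaSuccess) { cudaFree(d_in); return err; }
    cudaMemcpy(d_in, h_in, SCAN_N * sizeof(int), cudaMemcpyHostToDevice);
    inclusiveScan<<<1, SCAN_N>>>(d_in, d_out);
    err = cudaMemcpy(h_out, d_out, SCAN_N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

For the input 1, 2, 3, 4, 5, 6, 7, 8 the scan yields 1, 3, 6, 10, 15, 21, 28, 36. A reduction over the same input would return only the final value, 36, which is why it can skip storing the intermediate sums.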

zainub · January 27, 2015, 9:26am

Sir, I cannot understand how to keep the resulting array as (256,1) after multiplication… and which commands should I add in the reduction scan of smem?

little_jimmy · January 27, 2015, 10:07am

"and which commands I should add in reduction scan of smem??? "

(programming) scans, reductions, histograms are parallel programming fundamentals

here is an elementary example of a reduction

https://devtalk.nvidia.com/default/topic/806598/cuda-programming-and-performance/-sum-of-all-pixels-in-an-image/

if you installed cuda, you should have the accompanying samples
refer to the scan and reduction samples
(seemingly, they are stored under 6_advanced; which is baffling - if a scan/ reduction is ‘advanced’, all hope is lost)

“I can not understand how to keep the resulting array as (256,1) after multiplication”

the result is stored in product, which in turn is a 1 dimensional array of 256 x 512 elements
the result arrays of (256, 1) are stored back to back in this array
given its stipulated dimensions, the kernel has 256 x 512 blocks; each block will write a single result value (a matrix row-column multiplication product) to product
that is parallel processing at its purest, if you think about it

zainub · January 28, 2015, 5:38am

In all examples of matrix multiplication, either both matrices have the same size (N*N) or the user can define the sizes in main of the .c files…
But I am using mxGPU functions, and my code starts with mexFunction rather than main.
I have tried every technique to give the sizes myself but can't succeed.
If I do
result(256,512)=phi(256,512)*image(512,512)
this doesn't give the proper result.
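One possible reason for the mismatch, assuming the image is stored as a matrix X whose rows x_r are the signals to compress: the per-row products stack into a product with the transposed image, whereas phi*image multiplies phi with the columns of the image. Written out with the sizes used in this thread:

```latex
\begin{aligned}
y_r &= \Phi\, x_r^{\top}, \qquad
  \Phi \in \mathbb{R}^{256\times 512},\;
  x_r \in \mathbb{R}^{1\times 512},\;
  y_r \in \mathbb{R}^{256\times 1},\\
Y &= \begin{bmatrix} y_1 & y_2 & \cdots & y_{512} \end{bmatrix}
   = \Phi\, X^{\top} \in \mathbb{R}^{256\times 512},\\
\Phi X &= \begin{bmatrix} \Phi\, X_{:,1} & \cdots & \Phi\, X_{:,512} \end{bmatrix}
   \neq \Phi\, X^{\top} \quad \text{in general.}
\end{aligned}
```

A second thing worth checking, since the data arrives through MATLAB: MATLAB stores arrays in column-major order, while the kernels in this thread index phi and image as row-major, so the linear indices may need to be adapted or the inputs transposed before the call.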
