# Need help with GPU programming

I am doing research on compressive sensing on the GPU. I am confused about how to map the compressive-sensing calculations onto the GPU, i.e. which part of the code I should turn into a kernel. Could you guide me in this situation?

A flow chart of my code is:

```
Input:  an image, sensing matrix
Output: image

for 1:length(image)
    // send each row at a time
    1. compress the signal (a row)
    2. scaling
    3. iterations:
        1. prune weights as their parameters go to zero
        2. compute new weights
        3. learn correlation structure in blocks
        4. check stopping condition
       end
    4. reconstruct original signal (a row)
end
```

Now help me decide: should a compressed signal (row) be the input to my kernel, or should I make the iteration steps separate kernels?

I shall be very thankful for your response.

you have provided a flow chart, which is great
at the same time, you have to admit that it is still very vague - instead of a black box, it is now a brown black box

it would be difficult to determine how to effectively break up this problem if it is unclear what each component does, and what its inputs are
perhaps add a one-line or one-paragraph summary of each step of the flow chart

for example:

how big is the matrix?
are the rows, and the calculations on rows, independent of each other?
what is an iteration?
what does “learn correlation structure in blocks” mean?

A flow chart of my code is:

```
Input:  an image (512, 512), sensing matrix
Output: image (512, 512)

for 1:length(image)
    // send each row at a time (1, 512)
    1. compress the signal (a row): y = phi * x   (1, 256)
    2. scaling
    3. loop until error becomes less than 0.00000008
        1. compute new weights (after calculations, estimate the signal)
        2. check correlation structure in blocks
        3. check stopping condition
       end
    4. reconstruct original signal (1, 512)
end
```

"
3-loop until error become less than 0.00000008"

is error determined per row, or across multiple rows?

“2-check correlation structure in blocks”

???

"3-check stopping condition "

???

“4-reconstruct original signal (1,512)”

???

i suppose the main question is: are the calculations on a row dependent on input from that row only, or from other rows as well? can you process rows independently of each other, or are there dependencies between rows to be met?

if there are no dependencies, you would likely have 1 kernel, with kernel dimensions according to the number of rows (grid dimension) and the width of a row (block dimension)
otherwise, the dependencies would likely determine the proper division of labour
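as a minimal sketch of that layout (the names `process_row`, `row_width` and `nr_of_rows` are placeholders for illustration, not from the code above) - one thread block per row, one thread per row element:

```cuda
// hypothetical sketch: one thread block per image row,
// one thread per element within the row
__global__ void process_row(float* image, int row_width)
{
    int row = blockIdx.x;                 // grid dimension indexes the rows
    int col = threadIdx.x;                // block dimension indexes the elements
    float value = image[row * row_width + col];

    // ... per-row, per-element work goes here ...

    image[row * row_width + col] = value;
}

void launch(float* d_image, int nr_of_rows, int row_width)
{
    dim3 dB(row_width, 1, 1);             // block dimension: width of a row
    dim3 dG(nr_of_rows, 1, 1);            // grid dimension: number of rows
    process_row<<<dG, dB>>>(d_image, row_width);
}
```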

actually, I want to make each row one thread. all rows work independently. there are some variables which have fixed values. should I put those in shared memory?

in the first step, scaling, I want to take the standard deviation of the whole row, then scale the whole row by a constant number. after this, in a loop, the following equation executes:
phiB(256,512) = phiB(256,512) + Phi(256,64) * sigma(64,64) * Phi’(64,256);
now tell me: if I make a single row a thread, can this calculation execute?

i would think that a block per row, scheduled to run concurrently, would be faster than a thread per row, scheduled to run concurrently

hence,

```
dim3 dB(row_width, 1, 1);
dim3 dG(nr_of_rows, 1, 1);
```

should be faster than

```
dim3 dB(1, 1, 1);
dim3 dG(nr_of_rows, 1, 1);
```

or

```
dim3 dB(nr_of_rows, 1, 1);
dim3 dG(1, 1, 1);
```

i struggle to interpret (256, 512) - is that (row_element, col_element)…?

now I want to run a block per row. the row calculation has dimensions
result(256,1) = phi(256,512) * u(1,512)’;
how is that possible?
how can it be possible?

“an image (512 512),sensing matrix”

which i interpret as 512 rows; 512 elements per row

so where do (256, 1) and (256,512) come from?

“in the first step, scaling, I want to take the standard deviation of the whole row, then scale the whole row by a constant number.”

here, i propose using a thread per row element, and a thread block for each row: (dB(row_elements, 1, 1); dG(nr_of_rows, 1, 1))
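such a scaling step might be sketched as follows (a hypothetical illustration, assuming float data and a row width of 512; the kernel name and the scale constant are made up). each block computes its row's standard deviation with shared-memory reductions, then scales the row:

```cuda
#define ROW_WIDTH 512

// hypothetical sketch: one block per row computes the row's mean and
// standard deviation via shared-memory reductions, then scales the row
__global__ void scale_row(float* image, float scale_constant)
{
    __shared__ float smem[ROW_WIDTH];
    float* row = &image[blockIdx.x * ROW_WIDTH];
    int tid = threadIdx.x;

    // reduction 1: sum of the row, to obtain the mean
    smem[tid] = row[tid];
    __syncthreads();
    for (int s = ROW_WIDTH / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    float mean = smem[0] / ROW_WIDTH;
    __syncthreads();

    // reduction 2: sum of squared deviations, for the standard deviation
    float d = row[tid] - mean;
    smem[tid] = d * d;
    __syncthreads();
    for (int s = ROW_WIDTH / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    float stddev = sqrtf(smem[0] / ROW_WIDTH);

    // scale the whole row
    row[tid] = (row[tid] / stddev) * scale_constant;
}
```

launched with dB(ROW_WIDTH, 1, 1) and dG(nr_of_rows, 1, 1), as proposed above.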

“after this, in a loop, the following equation executes:
phiB(256,512) = phiB(256,512) + Phi(256,64) * sigma(64,64) * Phi’(64,256);”

can you write this loop out as a sentence, pseudo code, or code - even serial platform code - such that it is easier to follow
this seems to be more of a filter operation, which may require different kernel dimensions to adapt to the filter

an image x has size (512, 512)
I want a matrix multiplication of a matrix phi of size (256, 512) with the transpose of each row of the image, of size (1, 512)…
this seems like
y(256,1) = phi(256,512) * image(1,512)’;
note that matrix phi is constant throughout the code.
now my question is: how many threads should I use per block if I want to make a row a block?

might a kernel look like this?

```
__global__ void dotproduct(int* a, int* b, int* c)
{
    // N = 256; a -> phi; b -> y;
    int const i = blockDim.x * blockIdx.x + threadIdx.x;

    __shared__ int temp[256];    // N = 256
    int sum = 0;
    for (int j = 0; j < 256; j++)
        sum += temp[j];
    c[i] = sum;
}
```

actually, I want the kernel to process a whole row with one block at a time, because each row is independent of the other rows of the image matrix, and each row should execute the same kernel. if all rows can run the kernel in parallel, that is much better.

“an image x has size (512, 512)
I want a matrix multiplication of a matrix phi of size (256, 512) with the transpose of each row of the image, of size (1, 512)…
this seems like
y(256,1) = phi(256,512) * image(1,512)’;
note that matrix phi is constant throughout the code.”

what a big difference 1 paragraph can make

```
dim3 dB(512, 1, 1);
dim3 dG(256, 512, 1);

kernel k(type* image, type* phi, type* product)
{
    i = (blockDim.x * blockIdx.x) + threadIdx.x;
    j = (blockDim.x * blockIdx.y) + threadIdx.x;

    shared type smem[//blockDim.x//];

    {
        // reduction scan of smem
    }

    {
        product[(gridDim.x * blockIdx.y) + blockIdx.x] = smem[0];
    }
}
```

just to check:

where should I write
dim3 dB(512, 1, 1);
dim3 dG(256, 512, 1);
? because dim3 is used for the grid dimension…
should the kernel call look like
dotproduct<<<1,512>>>(d_image,d_phi,d_pro);
?

```
{
    dim3 dB(512, 1, 1);
    dim3 dG(256, 512, 1);

    dotproduct<<<dG, dB>>>(d_image, d_phi, d_pro);
}
```

you can also stage each image row in shared memory for data reuse; i shall see if i can post an example later
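one possible shape of such a data-reuse kernel, as a sketch (assuming float data, the (512, 512) image and (256, 512) phi above; the kernel name `dotproduct_reuse` is made up): one block per image row stages its row in shared memory once and reuses it for all 256 dot products with the rows of phi:

```cuda
// hypothetical sketch: one block per image row; the row is staged in
// shared memory once, then reused for all 256 dot products with phi
__global__ void dotproduct_reuse(const float* image, const float* phi,
                                 float* product)
{
    __shared__ float row[512];
    __shared__ float partial[512];
    int tid = threadIdx.x;

    row[tid] = image[blockIdx.x * 512 + tid];   // stage this image row once
    __syncthreads();

    for (int r = 0; r < 256; ++r) {
        // element-wise products of phi row r and the cached image row
        partial[tid] = phi[r * 512 + tid] * row[tid];
        __syncthreads();

        // tree reduction: partial[0] ends up holding the dot product
        for (int s = 256; s > 0; s >>= 1) {
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            product[blockIdx.x * 256 + r] = partial[0];
        __syncthreads();
    }
}
```

launched with dB(512, 1, 1) and dG(512, 1, 1), each block then produces one (256, 1) result vector.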

sir what is
shared type smem[//blockDim.x//];

{
// reduction scan of smem
}
?

shared type smem[//blockDim.x//];

declaration and assignment of shared memory of type type

e.g. if image and phi are of type float, it would be

shared float smem[512];

a reduction is like a sum scan where the sum of an array is accumulated over the elements
a reduction however only minds the overall (last) sum, not each element’s accumulation
refer to the samples if you are uncertain as to how to program one
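for illustration, a minimal in-block sum reduction might look like this sketch (assuming 512 float elements and a single block; the kernel name is made up):

```cuda
// hypothetical sketch of a shared-memory tree reduction:
// after the loop, smem[0] holds the sum of all 512 input elements
__global__ void reduce_example(const float* in, float* out)
{
    __shared__ float smem[512];
    int tid = threadIdx.x;

    smem[tid] = in[tid];          // stage one element per thread
    __syncthreads();

    // halve the active threads each step; each adds its partner's value
    for (int s = 512 / 2; s > 0; s >>= 1) {
        if (tid < s)
            smem[tid] += smem[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        *out = smem[0];           // overall sum
}
```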

sir, I cannot understand how to keep the resulting array as (256,1) after the multiplication… and which commands should I add in the reduction scan of smem???

"and which commands I should add in reduction scan of smem??? "

(programming) scans, reductions and histograms are parallel programming fundamentals

here is an elementary example of a reduction

https://devtalk.nvidia.com/default/topic/806598/cuda-programming-and-performance/-sum-of-all-pixels-in-an-image/

if you installed cuda, you should have the accompanying samples
refer to the scan and reduction samples
(seemingly, they are stored under 6_advanced; which is baffling - if a scan/ reduction is ‘advanced’, all hope is lost)

“I can not understand how to keep the resulting array as (256,1) after multiplication”

the result is stored in product, which in turn is a 1 dimensional array of 256 x 512 elements
the result arrays of (256, 1) are stored back to back in this array
given its stipulated dimensions, the kernel has 256 x 512 blocks; each block will write a single result value (a matrix row-column multiplication product) to product
that is parallel processing at its purest, if you think about it
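filled in, the sketched kernel might look like the following (a hedged completion, assuming float data; the reduction body is one common pattern, not necessarily the fastest). each of the 256 x 512 blocks forms the element-wise products of one phi row and one image row in shared memory, reduces them, and writes one element of product:

```cuda
__global__ void dotproduct(const float* image, const float* phi, float* product)
{
    __shared__ float smem[512];
    int tid = threadIdx.x;

    // blockIdx.x selects a row of phi (256 of them),
    // blockIdx.y selects a row of the image (512 of them)
    smem[tid] = phi[blockIdx.x * blockDim.x + tid]
              * image[blockIdx.y * blockDim.x + tid];
    __syncthreads();

    // reduction of smem: smem[0] ends up holding the dot product
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            smem[tid] += smem[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        product[(gridDim.x * blockIdx.y) + blockIdx.x] = smem[0];
}

// launched as:
//   dim3 dB(512, 1, 1);
//   dim3 dG(256, 512, 1);
//   dotproduct<<<dG, dB>>>(d_image, d_phi, d_product);
```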

In all examples of matrix multiplication, either both matrices have the same size, like (N, N), or the user can define the sizes in the main of the .c files…
but I am using mxGPU functions, and my code starts with mexFunction rather than main.
I have tried every technique to give the sizes myself but cannot succeed.
if I do it like
result(256,512) = phi(256,512) * image(512,512)
this can’t give the proper result.
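a minimal mexFunction skeleton that queries the sizes through the mxGPU API might look like the sketch below (an illustration, assuming single-precision inputs with prhs[0] = image and prhs[1] = phi; the variable names are made up, and note that MATLAB stores arrays column-major, so a "row" of the image is strided in memory):

```cuda
#include "mex.h"
#include "gpu/mxGPUArray.h"

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
    mxInitGPU();

    // wrap the inputs passed from MATLAB
    const mxGPUArray* image = mxGPUCreateFromMxArray(prhs[0]);
    const mxGPUArray* phi   = mxGPUCreateFromMxArray(prhs[1]);

    // query the sizes instead of defining them in main()
    const mwSize* image_dims = mxGPUGetDimensions(image);   // e.g. 512 x 512
    const mwSize* phi_dims   = mxGPUGetDimensions(phi);     // e.g. 256 x 512

    // allocate the (256, 512) result on the GPU
    mwSize out_dims[2] = { phi_dims[0], image_dims[0] };
    mxGPUArray* product = mxGPUCreateGPUArray(2, out_dims, mxSINGLE_CLASS,
                                              mxREAL, MX_GPU_DO_NOT_INITIALIZE);

    const float* d_image   = (const float*)mxGPUGetDataReadOnly(image);
    const float* d_phi     = (const float*)mxGPUGetDataReadOnly(phi);
    float*       d_product = (float*)mxGPUGetData(product);

    // ... launch the kernel here with the dG/dB configuration discussed above ...

    plhs[0] = mxGPUCreateMxArrayOnGPU(product);
    mxGPUDestroyGPUArray(image);
    mxGPUDestroyGPUArray(phi);
    mxGPUDestroyGPUArray(product);
}
```

compiled with mexcuda, this replaces the sizes a main() would define with sizes read from the incoming mxGPUArray objects.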