I am doing research on compressive sensing on GPU. I am confused how to tackle the calculations on GPU with compressive sensing means which part of code I make kernel. Would you like to guide me in this situation?

A flow chart of my code is

Input : an image ,sensing matrix

Output: image

for 1:length(image)

// send each row at a time

1-compressed the signal(a row)

2-scaling

3-iterations

1-prune weights as their parameters go to zero
2-compute new weights
3-learn correlation structure in blocks
4-check stopping condition
end

4-reconstruct original signal ( a row)

end

Now help me in deciding that whether a compressed signal (row) is input of my kernel or I make the iteration steps the separate kernels?

you have provided a flow chart, which is great
at the same time, you have to admit that it is still very vague - instead of a black-box, it is now a brown black-box

it would be difficult to determine how to effectively break up this problem, if it is unclear what each component does, and what its inputs are
perhaps note a one-line/ paragraph summary of each line of the flow chart

for example:

how big is a matrix?
are the rows and calculation on rows independent of each other?
what is an iteration?
learn correlation structure in blocks? ???

is error determined per row, or across multiple rows?

“2-check correlation structure in blocks”

???

"3-check stopping condition "

???

“4-reconstruct original signal (1,512)”

???

i suppose the main question is: are calculations done on rows dependent on input from the row only, or other rows as well; can you process rows independently of each other, or are there dependencies between rows to be met

if not you would likely have 1 kernel, with kernel dimensions according to the number of rows (grid dimension), and the width of a row (block dimension)
otherwise, the dependencies would likely determine the proper division of labour

actualy I want to make a row as one thread. all rows work independently. there are some variables which have fix value. should I make those shared memory ?

in first step scaling, I wat to take standard deviation of whole row.then scale whole row with a constant number. after this , in a loop ,following equation executes
phiB (256,512)=phiB(256,512)+Phi(256,64)*sigma(64,64)*Phi’(64,256);
now tell me if I make single row a thread then this calculations can execute?

which i interpret as 512 rows; 512 elements per row

so where do (256, 1) and (256,512) come from?

“in first step scaling, I wat to take standard deviation of whole row.then scale whole row with a constant number.”

here, i propose using a thread block per row, and a thread block for each row (dB(row_elements, 1, 1); dG(nr_of_rows, 1, 1)

“after this , in a loop ,following equation executes
phiB (256,512)=phiB(256,512)+Phi(256,64)*sigma(64,64)*Phi’(64,256);”

can you write this loop out in a sentence, pseudo code or code, even if serial platform code, such that it is easier to follow
this seems to be more of a filter operation, which may require different kernel dimensions to adapt to the filter

an image x have size 512,512
I want a matrix multiplication of a matrix phi size 256,512 with transpose of each row of image of size 1,512…
this seems like
y(256,1)= phi(256,512) * image(1,512)’;
note that matrix phi is constant throughout the code.
now my question is that how many number of threads I should use for a block if I want to make a row ,a block ?

a kernel my be look like this?

int const i = blockDim.x * blockIdx.x + threadIdx.x;

//N=256;
//a->phi;
//b->y; shared int temp[N];
int sum=0;
for ( int i = 0; i < N; i++ )
temp[threadIdx.x] = phi[i][threadIdx.x] * y[threadIdx.x];
__syncthreads();
sum += temp[i];
*c[i] = sum;
}

actually I want to process a whole kernel with one block at a time… because each row is independent of other row of image matrix. and each row should execute the same kernel. if it can happen that each row can access whole kernel parallel, it is much better.

“an image x have size 512,512
I want a matrix multiplication of a matrix phi size 256,512 with transpose of each row of image of size 1,512…
this seems like
y(256,1)= phi(256,512) * image(1,512)’;
note that matrix phi is constant throughout the code.”

where I can write
dim3 dB(512, 1, 1);
dim3 dG(256, 512, 1);
because dim3 is used for grid dimention…
should kernel call like
dotproduct<<<1,512>>>(d_image,d_phi,d_pro);

declaration and assignment of shared memory of type type

e.g. if image and phi are of type float, it would be

shared float smem[512];

a reduction is like a sum scan where the sum of an array is accumulated over the elements
a reduction however only minds the overall (last) sum, not each element’s accumulation
refer to the samples if you are uncertain as to how to program one

if you installed cuda, you should have the accompanying samples
refer to the scan and reduction samples
(seemingly, they are stored under 6_advanced; which is baffling - if a scan/ reduction is ‘advanced’, all hope is lost)

“I can not understand how to keep the resulting array as (256,1) after multiplication”

the result is stored in product, which in turn is a 1 dimensional array of 256 x 512 elements
the result arrays of (256, 1) are stored back to back in this array
given its stipulated dimensions, the kernel has 256 x 512 blocks; each block will write a single result value (a matrix row-column multiplication product) to product
that is parallel processing at its purest, if you think about it

In all examples of matrix multiplication,either dealing with same sizes of both matrics like (N*N) or user can define sizes in main of .c files…
but i am using mxGPU functions.and my code starts with mwxFunction rather than main.
I use every technique to give sizes by myself but cant succeed.
if i do like
result(256,512)=phi(256,512)*image(512,512)
this can’t give the proper result.