How can I configure this problem? Is it too big to fit in shared memory?

Finally I found a good way to use CUDA, but I am not sure if it can be done efficiently.

What I want to do is:

I have 4 dynamic float arrays, all of them of size m (m is usually around 1000).

Like this:

unsigned int mem_size = sizeof( float) * m;

float* timestamp = (float*) malloc( mem_size);
float* measurement = (float*) malloc( mem_size);
float* oX = (float*) malloc( mem_size);
float* oY = (float*) malloc( mem_size);

And I have 2 more dynamic float arrays of size n (n can be from hundreds to thousands).
Like this:

mem_size = sizeof( float) * n;

float* x = (float*) malloc( mem_size);
float* y = (float*) malloc( mem_size);

What I want to do is create n threads.
I want to pass all 4 of those float arrays to each of these threads. These arrays are read-only.
Each thread makes some calculations using those 4 float arrays (it needs all m*4 values) for its x[thread_id], y[thread_id] pair and should write 3 output values into 3 output arrays.

mem_size = sizeof( float) * n;
float* out1 = (float*) malloc( mem_size);
float* out2 = (float*) malloc( mem_size);
float* out3 = (float*) malloc( mem_size);

out1[thread_id]=calculatedValue1;
out2[thread_id]=calculatedValue2;
out3[thread_id]=calculatedValue3;
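
Putting it together, each thread would do roughly this (only a structural sketch; the real per-element math is omitted and the kernel name and signature are just my guesses for now):

__global__ void testKernel( const float* timestamp, const float* measurement,
                            const float* oX, const float* oY,
                            const float* x, const float* y,
                            float* out1, float* out2, float* out3,
                            int m, int n)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thread_id >= n)
        return;

    float xi = x[thread_id];
    float yi = y[thread_id];
    float calculatedValue1 = 0.0f;
    float calculatedValue2 = 0.0f;
    float calculatedValue3 = 0.0f;

    // every thread walks over all m entries of the four read-only arrays
    for (int j = 0; j < m; ++j)
    {
        // placeholder: combine timestamp[j], measurement[j], oX[j], oY[j]
        // with (xi, yi) in whatever way the real calculation needs
        float t = timestamp[j] * xi + measurement[j] * yi + oX[j] - oY[j];
        calculatedValue1 += t;
        calculatedValue2 += t * t;
        calculatedValue3 = fmaxf( calculatedValue3, t);
    }

    out1[thread_id] = calculatedValue1;
    out2[thread_id] = calculatedValue2;
    out3[thread_id] = calculatedValue3;
}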

It looked fairly suitable for CUDA, so I rushed into coding (my first CUDA program).
Then questions arose…

Since all threads need all 4 input float arrays for their calculations, I thought I should put these arrays into shared memory for fast access. But m is around 1000, which means 4 * 1000 * sizeof(float) bytes, which is nearly 16 KB.

In the CUDA Programming Guide it says:
“The amount of shared memory available per multiprocessor is 16 KB”

If m were bigger than 1000, then it wouldn't be possible to put those arrays into shared memory, right?

If I had to leave those 4 arrays in device memory, would the performance penalty be too big? Because then I would have to read 4 * m * n * iteration_count values from device memory (and iteration_count is big), which seems like way too much.
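
I also wrote a tiny host-side check (just a sketch using the CUDA runtime API, nothing from my real code) to confirm the shared memory limit on my card:

#include <stdio.h>
#include <cuda_runtime.h>

int main( void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0);                // properties of device 0

    unsigned int m = 1000;
    unsigned long needed = 4UL * m * sizeof( float);   // the four read-only arrays

    printf( "shared memory per block: %lu bytes\n", (unsigned long) prop.sharedMemPerBlock);
    printf( "needed for the 4 arrays : %lu bytes\n", needed);

    if (needed > prop.sharedMemPerBlock)
        printf( "-> they will not fit, keep them in device memory\n");

    return 0;
}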

Any other ideas on how I can do these operations more efficiently in CUDA?

Thanks.

Akif,

For now I called my kernel like this:

unsigned int mem_size = sizeof( float) * m;

// define grid and block size
int numThreadsPerBlock = 512;

// Compute number of blocks needed based on array size 
// and desired block size
int numBlocks = n / numThreadsPerBlock + (n % numThreadsPerBlock == 0 ? 0 : 1);

// setup execution parameters
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);


// execute the kernel
testKernel<<< dimGrid, dimBlock,4*mem_size>>>( d_measurement, d_timestamp......... );

And in my kernel I wrote:

extern __shared__ float sdata[];
float* measurement = (float*) sdata;
float* timestamp = (float*) &measurement[m];
float* oX = (float*) &timestamp[m];
float* oY = (float*) &oX[m];

Then I think I have to copy from device memory into shared memory with loops.
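
Something like this is what I have in mind for the copy (a rough sketch; d_measurement, d_timestamp, d_oX and d_oY are the kernel's global-memory parameters, and measurement, timestamp, oX, oY are the shared-memory pointers from above):

// each thread of the block loads a strided subset of the m elements
for (int j = threadIdx.x; j < m; j += blockDim.x)
{
    measurement[j] = d_measurement[j];
    timestamp[j]   = d_timestamp[j];
    oX[j]          = d_oX[j];
    oY[j]          = d_oY[j];
}
__syncthreads();   // wait until the whole block sees the shared copies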
Am I on the right path?

Hi,

I think yes, you are on the right path.
If your program needs more than 16 KB of vector data in total, then you have to load from global memory every time. However, if you align the accesses correctly you will get close to the quoted peak memory bandwidth, which is higher than on any other system I know.
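
Just to illustrate what I mean by aligned access (a generic sketch, not your actual kernel):

// Coalesced: thread i of a half-warp touches element i, so its 16 reads
// and writes are combined into single memory transactions (G8x rules).
__global__ void copyCoalesced( const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: the stride of 2 breaks coalescing on G8x hardware, so each
// thread's read becomes a separate transaction and bandwidth collapses.
__global__ void copyStrided( const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i < n)
        out[i] = in[2 * i];
}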

Is it possible to break the dependencies in the data, so that the work for the first X elements is done by one block, the next X by another block, and so on?
Store the intermediate values somewhere and then do a reduction over them (see the sketch below).
That way you could run one block per multiprocessor, which on a GTX gives you 16 KB times 12 of shared memory.
However, your problem still exists if you need more than 16 KB times the number of multiprocessors per run.
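
A rough sketch of the idea (all names are invented here, and I replaced your per-element math with a plain sum):

// Pass 1: block b sums its slice of the m values into partial[b].
// Launch with blockDim.x a power of two and blockDim.x * sizeof(float)
// bytes of dynamic shared memory.
__global__ void partialSum( const float* data, int m, float* partial)
{
    extern __shared__ float sbuf[];

    int slice = (m + gridDim.x - 1) / gridDim.x;
    int begin = blockIdx.x * slice;
    int end   = min( begin + slice, m);

    float acc = 0.0f;
    for (int j = begin + threadIdx.x; j < end; j += blockDim.x)
        acc += data[j];                    // placeholder for the real work
    sbuf[threadIdx.x] = acc;
    __syncthreads();

    // tree reduction inside the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            sbuf[threadIdx.x] += sbuf[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = sbuf[0];
}

// Pass 2: numBlocks is small, so a single thread just adds the partials up.
__global__ void finalSum( const float* partial, int numBlocks, float* out)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        float total = 0.0f;
        for (int b = 0; b < numBlocks; ++b)
            total += partial[b];
        *out = total;
    }
}

You would launch partialSum with one block per slice, then finalSum with a single block to combine the partials.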

So I would optimize the code's global memory access alignment; with perfectly aligned accesses you should get at least 85% of the memory bandwidth. The next generation comes at the end of the year and will double your performance :-).

Thank you for your reply.

Yes, my problem can actually be divided in half, but I never thought of it like that before, and I think it would be hard to realize (at least for me). Right now I have finished the code and I am getting
“too many resources requested for launch” in debug mode
and totally wrong results in release mode??

Isn't this strange? Shouldn't I also get “too many resources requested for launch” in release mode as well if I am exceeding the shared memory limit??

Yes, you should. Perhaps you are requesting too much global memory, which goes to the host in emu mode? E.g. you have 2 GB of main memory: you request 1 GB for the host and 1 GB for the device, and in emulation mode this won't work.

In any case, check the cubin file to be sure, by compiling with -cubin instead of -c.

Johannes

When I reduce the number of threads per block from 512 to 384 it runs perfectly. But it is still interesting why it does not give the correct error message in release mode.

Are you using CUT_CHECK_ERROR? That macro compiles to nothing in release mode.
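
If you want the check in release builds as well, you can do it by hand after the launch, for example (a sketch with the runtime API, using your launch line from above):

// launch as before, then query the error state explicitly so it also
// shows up in release builds
testKernel<<< dimGrid, dimBlock, 4 * mem_size >>>( /* your arguments */ );

cudaError_t err = cudaGetLastError();          // catches launch errors such as
if (err != cudaSuccess)                        // "too many resources requested"
    printf( "launch failed: %s\n", cudaGetErrorString( err));

err = cudaThreadSynchronize();                 // waits and catches errors that
if (err != cudaSuccess)                        // happen while the kernel runs
    printf( "execution failed: %s\n", cudaGetErrorString( err));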

What next generation? Can you tell us more about it?

Yes, I was using that macro. I didn't know it was ignored in release mode. Thank you.