How can I configure this problem? Is it too big to fit in shared memory?

Finally I found a good way to use CUDA, but I am not sure if it can be done efficiently.

What I want to do is:

I have 4 dynamic float arrays, all of them of size m (m is usually around 1000).

Like this:

unsigned int mem_size = sizeof( float) * m;

float* timestamp = (float*) malloc( mem_size);
float* measurement = (float*) malloc( mem_size);
float* oX = (float*) malloc( mem_size);
float* oY = (float*) malloc( mem_size);

And I have 2 more dynamic float arrays of size n (n can be from hundreds to thousands).
Like this:

mem_size = sizeof( float) * n;

float* x = (float*) malloc( mem_size);
float* y = (float*) malloc( mem_size);

What I want to do is create n threads.
I want to pass all 4 of those float arrays to each of these threads. These arrays are read-only.
Each thread makes some calculations using those 4 float arrays (it needs all m*4 values) for its x[thread_id], y[thread_id] pair and should write 3 output values into 3 output arrays.

mem_size = sizeof( float) * n;
float* out1 = (float*) malloc( mem_size);
float* out2 = (float*) malloc( mem_size);
float* out3 = (float*) malloc( mem_size);

out1[thread_id]=calculatedValue1;
out2[thread_id]=calculatedValue2;
out3[thread_id]=calculatedValue3;
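
Putting it together, each thread would do roughly this (only a structural sketch; the real per-element math is omitted and the kernel name and signature are just my guesses for now):

__global__ void testKernel( const float* timestamp, const float* measurement,
                            const float* oX, const float* oY,
                            const float* x, const float* y,
                            float* out1, float* out2, float* out3,
                            int m, int n)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thread_id >= n)
        return;

    float xi = x[thread_id];
    float yi = y[thread_id];
    float calculatedValue1 = 0.0f;
    float calculatedValue2 = 0.0f;
    float calculatedValue3 = 0.0f;

    // every thread walks over all m entries of the four read-only arrays
    for (int j = 0; j < m; ++j)
    {
        // placeholder: combine timestamp[j], measurement[j], oX[j], oY[j]
        // with (xi, yi) in whatever way the real calculation needs
        float t = timestamp[j] * xi + measurement[j] * yi + oX[j] - oY[j];
        calculatedValue1 += t;
        calculatedValue2 += t * t;
        calculatedValue3 = fmaxf( calculatedValue3, t);
    }

    out1[thread_id] = calculatedValue1;
    out2[thread_id] = calculatedValue2;
    out3[thread_id] = calculatedValue3;
}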

It looked fairly suitable for CUDA, so I rushed into coding (my first CUDA program).
Then questions arose…

Since all threads need all 4 input float arrays for their calculations, I thought I should put these arrays into shared memory for fast access. But m is around 1000, which means 4 * 1000 * sizeof(float) bytes, which is nearly 16 KB.

In the CUDA Programming Guide it says:
“The amount of shared memory available per multiprocessor is 16 KB”

If m were bigger than 1000, then it wouldn't be possible to put those arrays into shared memory, right?

If I had to leave those 4 arrays in device memory, would the performance penalty be too big? Because then I would have to read 4 * m * n * iteration_count values from device memory (and iteration_count is big), which seems like way too much.
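
I also wrote a tiny host-side check (just a sketch using the CUDA runtime API, nothing from my real code) to confirm the shared memory limit on my card:

#include <stdio.h>
#include <cuda_runtime.h>

int main( void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0);                // properties of device 0

    unsigned int m = 1000;
    unsigned long needed = 4UL * m * sizeof( float);   // the four read-only arrays

    printf( "shared memory per block: %lu bytes\n", (unsigned long) prop.sharedMemPerBlock);
    printf( "needed for the 4 arrays : %lu bytes\n", needed);

    if (needed > prop.sharedMemPerBlock)
        printf( "-> they will not fit, keep them in device memory\n");

    return 0;
}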

Any other ideas on how I can do these operations more efficiently in CUDA?

Thanks.

Akif,

For now I called my kernel like this:

unsigned int mem_size = sizeof( float) * m;

// define grid and block size
int numThreadsPerBlock = 512;

// Compute number of blocks needed based on array size 
// and desired block size
int numBlocks = n / numThreadsPerBlock + (n % numThreadsPerBlock == 0 ? 0 : 1);

// setup execution parameters
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);


// execute the kernel
testKernel<<< dimGrid, dimBlock,4*mem_size>>>( d_measurement, d_timestamp......... );

And in my kernel I wrote:

extern __shared__ float sdata[];
float* measurement = (float*) sdata;
float* timestamp = (float*) &measurement[m];
float* oX = (float*) &timestamp[m];
float* oY = (float*) &oX[m];

Then I think I have to copy from device memory into shared memory with loops.
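
Something like this is what I have in mind for the copy (a rough sketch; d_measurement, d_timestamp, d_oX and d_oY are the kernel's global-memory parameters, and measurement, timestamp, oX, oY are the shared-memory pointers from above):

// each thread of the block loads a strided subset of the m elements
for (int j = threadIdx.x; j < m; j += blockDim.x)
{
    measurement[j] = d_measurement[j];
    timestamp[j]   = d_timestamp[j];
    oX[j]          = d_oX[j];
    oY[j]          = d_oY[j];
}
__syncthreads();   // wait until the whole block sees the shared copies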
Am I on the right path?

Hi,

I think yes, you are on the right path.
If your program needs more than 16 KB of vector data in total, then you have to load from global memory every time. However, if you align the accesses correctly you will get close to the quoted peak memory bandwidth, which is higher than on any other system I know.
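
Just to illustrate what I mean by aligned access (a generic sketch, not your actual kernel):

// Coalesced: thread i of a half-warp touches element i, so its 16 reads
// and writes are combined into single memory transactions (G8x rules).
__global__ void copyCoalesced( const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: the stride of 2 breaks coalescing on G8x hardware, so each
// thread's read becomes a separate transaction and bandwidth collapses.
__global__ void copyStrided( const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i < n)
        out[i] = in[2 * i];
}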

Is it possible to break the dependencies in the data, so that the work for the first X elements is done by one block, the next X by another block, and so on?
Store the intermediate values somewhere and then do a reduction over them (see the sketch below).
That way you could run one block per multiprocessor, which on a GTX gives you 16 KB times 12 of shared memory.
However, your problem still exists if you need more than 16 KB times the number of multiprocessors per run.
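
A rough sketch of the idea (all names are invented here, and I replaced your per-element math with a plain sum):

// Pass 1: block b sums its slice of the m values into partial[b].
// Launch with blockDim.x a power of two and blockDim.x * sizeof(float)
// bytes of dynamic shared memory.
__global__ void partialSum( const float* data, int m, float* partial)
{
    extern __shared__ float sbuf[];

    int slice = (m + gridDim.x - 1) / gridDim.x;
    int begin = blockIdx.x * slice;
    int end   = min( begin + slice, m);

    float acc = 0.0f;
    for (int j = begin + threadIdx.x; j < end; j += blockDim.x)
        acc += data[j];                    // placeholder for the real work
    sbuf[threadIdx.x] = acc;
    __syncthreads();

    // tree reduction inside the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            sbuf[threadIdx.x] += sbuf[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = sbuf[0];
}

// Pass 2: numBlocks is small, so a single thread just adds the partials up.
__global__ void finalSum( const float* partial, int numBlocks, float* out)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
    {
        float total = 0.0f;
        for (int b = 0; b < numBlocks; ++b)
            total += partial[b];
        *out = total;
    }
}

You would launch partialSum with one block per slice, then finalSum with a single block to combine the partials.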

So I would optimize the code's global memory access alignment; with perfectly aligned accesses you should get at least 85% of the memory bandwidth. The next generation comes at the end of the year and will double your performance :-).

Thank you for your reply.

Yes, my problem can actually be divided in half, but I never thought of it like that before, and I think it would be hard to realize (at least for me). Right now I have finished the code and I am getting
“too many resources requested for launch” in debug mode
and totally wrong results in release mode??

Isn't this strange? Shouldn't I also get “too many resources requested for launch” in release mode as well if I am exceeding the shared memory limit??

Yes, you should. Perhaps you are requesting too much global memory, which goes to the host in emu mode? E.g. you have 2 GB of main memory: you request 1 GB for the host and 1 GB for the device, and in emulation mode this won't work.

In any case, check the cubin file to be sure, by compiling with -cubin instead of -c.

Johannes

When I reduce the number of threads per block from 512 to 384 it runs perfectly. But it is still interesting why it does not give the correct error message in release mode.

Are you using CUT_CHECK_ERROR? That macro compiles to nothing in release mode.
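
If you want the check in release builds as well, you can do it by hand after the launch, for example (a sketch with the runtime API, using your launch line from above):

// launch as before, then query the error state explicitly so it also
// shows up in release builds
testKernel<<< dimGrid, dimBlock, 4 * mem_size >>>( /* your arguments */ );

cudaError_t err = cudaGetLastError();          // catches launch errors such as
if (err != cudaSuccess)                        // "too many resources requested"
    printf( "launch failed: %s\n", cudaGetErrorString( err));

err = cudaThreadSynchronize();                 // waits and catches errors that
if (err != cudaSuccess)                        // happen while the kernel runs
    printf( "execution failed: %s\n", cudaGetErrorString( err));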

What next generation? Can you tell us more about it?

Yes, I was using that macro. I didn't know it was ignored in release mode. Thank you.