I've finally found a good use case for CUDA, but I'm not sure whether it can be done efficiently.
What I want to do is:
I have 4 dynamic float arrays, all of size m (m is usually around 1000), like this:
unsigned int mem_size = sizeof(float) * m;
float* timestamp = (float*)malloc(mem_size);
float* measurement = (float*)malloc(mem_size);
float* oX = (float*)malloc(mem_size);
float* oY = (float*)malloc(mem_size);
And I have 2 more dynamic float arrays of size n (n can range from hundreds to thousands), like this:
mem_size = sizeof(float) * n;
float* x = (float*)malloc(mem_size);
float* y = (float*)malloc(mem_size);
What I want to do is launch n threads and pass all 4 of those float arrays to each thread; the arrays are read-only.
Each thread will perform some calculations using those 4 arrays (every thread needs all 4*m values) together with its own x[thread_id], y[thread_id] pair, and should write 3 output values into 3 output arrays (see the kernel sketch below):
mem_size = sizeof(float) * n;
float* out1 = (float*)malloc(mem_size);
float* out2 = (float*)malloc(mem_size);
float* out3 = (float*)malloc(mem_size);
out1[thread_id] = calculatedValue1;
out2[thread_id] = calculatedValue2;
out3[thread_id] = calculatedValue3;
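To make the structure concrete, here is a rough sketch of the kernel I have in mind; the loop body is only placeholder math standing in for my real calculation:

// Rough sketch of the kernel structure (placeholder math, not my real calculation).
__global__ void testKernel(const float* measurement, const float* timestamp,
                           const float* oX, const float* oY, int m,
                           const float* x, const float* y, int n,
                           float* out1, float* out2, float* out3)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thread_id >= n) return; // guard the partially filled last block

    float v1 = 0.0f, v2 = 0.0f, v3 = 0.0f;
    for (int i = 0; i < m; ++i) {
        // placeholder: every thread touches all m entries of all 4 arrays
        float dx = x[thread_id] - oX[i];
        float dy = y[thread_id] - oY[i];
        float d = dx * dx + dy * dy;
        v1 += measurement[i] * d;
        v2 += timestamp[i] * d;
        v3 += d;
    }
    out1[thread_id] = v1;
    out2[thread_id] = v2;
    out3[thread_id] = v3;
}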
It looked fairly well suited to CUDA, so I rushed into coding. (This is my first CUDA program.)
Then the questions arose…
Since all threads need all 4 input float arrays for their calculations, I thought I should put those arrays into shared memory for fast access. But m is around 1000, which means they take 4*1000*sizeof(float) = 16000 bytes, which is nearly 16 KB.
In the CUDA Programming Guide it says:
“The amount of shared memory available per multiprocessor is 16 KB”
If m were bigger than 1000, it wouldn't be possible to fit those arrays into shared memory, right?
If I had to leave those 4 arrays in device memory, would the performance penalty be too big? Each thread would read 4*m floats from device memory, so the total is 4*m*n reads per kernel launch, multiplied by the iteration count (which is big) = TOO BIG. For m = n = 1000 that is already 4 million device-memory reads per launch.
Any other ideas on how I can do these operations more efficiently in CUDA?
Thanks.
Akif,
For now, I call my kernel like this:
unsigned int mem_size = sizeof(float) * m;
// define grid and block size
int numThreadsPerBlock = 512;
// compute the number of blocks needed based on the array size
// and the desired block size
int numBlocks = n / numThreadsPerBlock + (n % numThreadsPerBlock == 0 ? 0 : 1);
// set up execution parameters
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
// execute the kernel
testKernel<<< dimGrid, dimBlock, 4 * mem_size >>>(d_measurement, d_timestamp, ...);
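(For reference, d_measurement, d_timestamp, etc. are the device-side copies of my host arrays. I set them up roughly like this; a sketch, with error checking left out:)

// Sketch: allocate device buffers and copy the host input arrays over
// (error checking omitted; mem_size is sizeof(float) * m as above).
float *d_timestamp, *d_measurement, *d_oX, *d_oY;
cudaMalloc((void**)&d_timestamp, mem_size);
cudaMalloc((void**)&d_measurement, mem_size);
cudaMalloc((void**)&d_oX, mem_size);
cudaMalloc((void**)&d_oY, mem_size);
cudaMemcpy(d_timestamp, timestamp, mem_size, cudaMemcpyHostToDevice);
cudaMemcpy(d_measurement, measurement, mem_size, cudaMemcpyHostToDevice);
cudaMemcpy(d_oX, oX, mem_size, cudaMemcpyHostToDevice);
cudaMemcpy(d_oY, oY, mem_size, cudaMemcpyHostToDevice);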
And in my kernel I wrote:
extern __shared__ float sdata[];
float* measurement = (float*)sdata;
float* timestamp = (float*)&measurement[m];
float* oX = (float*)&timestamp[m];
float* oY = (float*)&oX[m];
Then I think I have to copy from device memory into shared memory with a loop.
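Something like this, I think; a sketch where each thread of the block loads a strided share of the elements and then the block synchronizes (it assumes the global-memory pointers d_measurement, d_timestamp, d_oX, d_oY are passed in as kernel arguments):

// Cooperative load: the threads of a block stride through the m elements
// so the whole block fills shared memory together.
for (int i = threadIdx.x; i < m; i += blockDim.x) {
    measurement[i] = d_measurement[i];
    timestamp[i] = d_timestamp[i];
    oX[i] = d_oX[i];
    oY[i] = d_oY[i];
}
__syncthreads(); // make the shared copies visible to all threads in the block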
Am I on the right path?