I am writing a program that applies a Hanning window to some data. However, the windowing step is extremely slow, and I think this has to do with my misuse of the memory on the card. I have a 1D array of, for example, 100 chunks of data that are each 800 data points long, for a total 1D array length of 100*800. I would like to apply the Hanning window to each of the 100 data chunks. To do this, I have done the following:
1.) Used cudaMalloc to allocate a block of GPU memory that holds 100*800 elements
2.) Used cudaMemcpy to copy my data from computer memory to the memory allocated in step 1
3.) Called the Hanning window function using the following launch configuration: <<<1, 10>>>
4.) Called __syncthreads
5.) Copied the windowed data from the GPU back to computer memory to view it
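For reference, the steps above can be sketched roughly as follows. This is only a minimal sketch under assumptions, not my exact code: the names `applyHann` and `hannOnDevice` are hypothetical, the data is assumed to be `float`, and each chunk is assumed to hold more than one sample. The host-side wait uses `cudaDeviceSynchronize()`, since `__syncthreads()` can only be called from inside a kernel.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical kernel: multiplies every sample by the Hann coefficient for its
// position within its chunk. The grid-stride loop keeps it correct for any
// launch configuration, including the <<<1, 10>>> launch described above.
// Assumes chunkLen > 1.
__global__ void applyHann(float *data, int numChunks, int chunkLen)
{
    const float twoPi = 6.28318530717958647692f;
    int total  = numChunks * chunkLen;
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < total; i += stride) {
        int n = i % chunkLen;  // index inside the current chunk
        float w = 0.5f * (1.0f - cosf(twoPi * n / (chunkLen - 1)));
        data[i] *= w;          // one global read and one global write per sample
    }
}

// Hypothetical host wrapper mirroring steps 1 to 5; windows the data in place
// and returns the first CUDA error encountered.
cudaError_t hannOnDevice(float *host, int numChunks, int chunkLen,
                         int blocks, int threads)
{
    size_t bytes = (size_t)numChunks * chunkLen * sizeof(float);
    float *dev = NULL;

    cudaError_t err = cudaMalloc((void **)&dev, bytes);               // step 1
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);       // step 2
    if (err == cudaSuccess) {
        applyHann<<<blocks, threads>>>(dev, numChunks, chunkLen);     // step 3
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();                                // step 4 (host-side wait)
    if (err == cudaSuccess)
        err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);   // step 5

    cudaFree(dev);
    return err;
}
```

With the numbers from this question, the call would be `hannOnDevice(data, 100, 800, 1, 10)`, which leaves only 10 threads to loop over all 100*800 samples.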
This program takes around 800 ms to complete. I noticed in the manual that a read from global memory takes hundreds of clock cycles, and I think that is what my program is doing. My question is, how do I change the memory that I allocated from global to shared? Is it possible to break up the memory created in step 1 into multiple shared memory modules? How is this done?
I will be happy to provide code examples if necessary. Thank you in advance for your time and help.