Converting Global to Shared memory

Hello everyone,

I am writing a program that applies a Hanning window to some data. However, the windowing is extremely slow, and I think it has to do with my misuse of the memory on the card. I have a 1D array of, for example, 100 chunks of data that are 800 datapoints long, for a total 1D array length of 100*800. I would like to apply the Hanning window to each of the 100 data chunks. To do this, I have done the following:

1.) Used cudaMalloc to create a GPU memory chunk that is 100*800
2.) Used cudaMemcpy to copy my data that is in computer memory to the memory created in step 1
3.) Called the Hanning window kernel using the following configuration: <<<1, 10>>>
4.) Called __syncthreads()
5.) Copied the windowed GPU data back to computer memory and viewed it
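
Roughly, the host side looks like this (a simplified sketch rather than my exact code; h_data and the constant names are just placeholders):

const int NUM_CHUNKS = 100;
const int CHUNK_LEN  = 800;
const size_t bytes   = NUM_CHUNKS * CHUNK_LEN * sizeof(float);

float *d_data;
cudaMalloc((void**)&d_data, bytes);                          // step 1
cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // step 2
PreProcessGPU<<<1, 10>>>(d_data, NUM_CHUNKS, CHUNK_LEN);     // step 3 (__syncthreads() is inside the kernel, step 4)
cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // step 5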

This program takes around 800ms to complete, and I noticed in the manual that it takes hundreds of clock cycles to read from global memory, which I think my program is doing. My question is, how do I change the memory that I allocated from global to shared? Is it possible to break up the memory created in step 1 into multiple shared memory modules? How is this done?

I will be happy to provide code examples if necessary. Thank you in advance for your time and help.

Austin McElroy

So you’re using one block and ten threads. This means you’re only using one of the multiprocessors on the card and don’t have nearly enough threads in flight for the card to hide memory latency…

Thanks for responding. I have tried using 32, 64, etc. threads and this doesn’t seem to help speed things up. Changing the number of blocks to anything but 1 causes errors in the output data, so I selected 1 block because it gives the correct output, and 10 threads because I saw no difference between 10 and 64. 256 threads with 1 block slows things down by a factor of 2.

Here is the part of the code that is slowing things down, starting with the function call, keeping in mind that 100 is the number of arrays, and 800 is the number of data points in each array:

for(int i = 0; i < NumberOfArrays; i++)
{
    PreProcessGPU<<<1, 64>>>(&ptrGPUDataSet[i*800], 100, 800);
}

Next is the function that is being executed on the video card:

__global__ void PreProcessGPU(float *ptrGPUDataSet, short int NumberOfArrays, short int LengthOfDataSets)
{
    int counter0;
    float PI = 3.14159;

    for(counter0 = 0; counter0 < LengthOfDataSets; counter0++)
    {
        ptrGPUDataSet[counter0] = ptrGPUDataSet[counter0]*(.5 - .5*cos(2*PI*counter0/LengthOfDataSets));
        __syncthreads();
    }
}

Thanks,

Austin McElroy

Edit: Didn’t put the code inside a code tag, fixed it. Also cleaned up some code.

The reason it doesn’t get any faster when you use more threads is that your program is not written in a parallel way. You have every thread doing the exact same thing. So if you add more threads, you simply have more threads doing all of the work.

I suggest looking at some of the examples.

But that is what I want: to perform the same Hanning operation on each chunk of 800 datapoints. I am not a CS or CE by training; do you have any multi-threading references or websites that might help me see what I am doing wrong? I have spent a lot of time with the SDK, but it is not obvious what errors I am making.

Thanks,

Austin McElroy

As eelsen pointed out, all of your threads are doing redundant work - they read the same data elements, perform the same operations, and write the same result to the same location. So, you get the same value computed and written 10 times (since you launch 10 threads per block). This gives you no benefit, since doing all of this once would accomplish the same outcome.

What you want is each thread to compute a different element of your output. As far as I can tell, the iterations in your for-loop are independent, as are the 100 chunks of data. So, there is potential for very significant speedup (if I had to guess, at least several hundred times):

  1. You should launch a thread to do the computation for each iteration (that’s 800 threads right there, doing 1/800th of what your current threads do), rather than having a loop inside your CUDA kernel.

  2. Next, instead of looping in your CPU code numberOfArrays times (100), launch threads to do all that work simultaneously.

So, in your CUDA code, you’ll have to set the variable counter0 with some expression in terms of block and thread indices within the launch grid.

The catch is that a thread block can only have up to 512 threads. So, you’ll have to break up the threads processing one chunk into two or more blocks. Assume that a threadblock is 200 threads. Then 4 consecutive threadblocks will work on one chunk. So, you’ll need just one launch of a CUDA kernel, with configuration something like <<<400,200>>>.
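
For example, the index computation could look something like this (just a rough sketch of the idea, not tested; the kernel name is for illustration only):

__global__ void HannWindowAll(float *data, int lengthOfDataSets)
{
    // Global element index: 400 blocks x 200 threads = 80,000 threads,
    // one per element of the 100*800 array.
    int counter0 = blockIdx.x * blockDim.x + threadIdx.x;

    // Position of this element within its own 800-point chunk,
    // which is what the window function depends on.
    int posInChunk = counter0 % lengthOfDataSets;

    float PI = 3.14159f;
    data[counter0] = data[counter0] * (0.5f - 0.5f*cosf(2.0f*PI*posInChunk/lengthOfDataSets));
}

launched once as HannWindowAll<<<400, 200>>>(ptrGPUDataSet, 800);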

Paulius

Try something like the following. TOT_THREADS is the number of threads per block times the number of blocks you specify in the <<<BLOCK_NUM, THREAD_NUM>>> launch configuration. The exact same code is run in parallel threads, but each thread has a different ID. So you need to get that ID and then use it to determine which data to process. I hope this helps.

__global__ void PreProcessGPU(float *ptrGPUDataSet, short int NumberOfArrays, short int LengthOfDataSets)
{
    int counter0;
    float PI = 3.14159;
    int tx = threadIdx.x;
    int bx = blockIdx.x;

    // Each thread starts at its own global index (THREAD_NUM threads per block,
    // so block bx begins at THREAD_NUM*bx) and strides by the total number of
    // threads in the grid, so no two threads ever touch the same element.
    for(counter0 = THREAD_NUM*bx + tx; counter0 < LengthOfDataSets; counter0 += TOT_THREADS)
    {
        ptrGPUDataSet[counter0] = ptrGPUDataSet[counter0]*(.5 - .5*cos(2*PI*counter0/LengthOfDataSets));
        // No __syncthreads() needed here: threads never read data written by other threads.
    }
}
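
You would then keep your host loop and launch with the same BLOCK_NUM and THREAD_NUM, something like this (a sketch; the constants here are chosen arbitrarily):

#define BLOCK_NUM   2
#define THREAD_NUM  64
#define TOT_THREADS (BLOCK_NUM * THREAD_NUM)

for(int i = 0; i < NumberOfArrays; i++)
{
    PreProcessGPU<<<BLOCK_NUM, THREAD_NUM>>>(&ptrGPUDataSet[i*800], 100, 800);
}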

Oh my gosh, thank you so much!!! I understand now the difference between what I was doing before and the correct way. The examples are also very understandable. By setting up a dim3 variable for both blocks and threads, I can scale blocks and threads programmatically. Then each thread computes one calculation, i.e. one thread computes a single data point’s Hanning window value. I then use blockIdx.x, blockIdx.y, threadIdx.x, and threadIdx.y to figure out exactly where I am in my data array.
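
Something like this is what I mean (a rough sketch, not my exact code; one block row per chunk):

__global__ void HannWindow2D(float *data, int lengthOfDataSets)
{
    int point = blockIdx.x * blockDim.x + threadIdx.x;  // 0..799 within the chunk
    int chunk = blockIdx.y;                             // 0..99, which chunk this block works on
    if (point < lengthOfDataSets)
    {
        float PI = 3.14159f;
        int idx = chunk * lengthOfDataSets + point;
        data[idx] = data[idx] * (0.5f - 0.5f*cosf(2.0f*PI*point/lengthOfDataSets));
    }
}

// Host side:
// dim3 threads(200);
// dim3 blocks(800/200, 100);   // 4 x 100 blocks
// HannWindow2D<<<blocks, threads>>>(ptrGPUDataSet, 800);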

Thanks guys!

Austin McElroy

If you don’t mind, please post the timing results after code modifications.

Paulius

Sure, Paulius. After redoing the code so that the loop iterations are handled by independent threads, the timing dropped from 800ms to 21-23ms. I am calling this from a LabVIEW DLL, so there might be some overhead from memory copies and such. The Pentium D 2.8GHz computer that I am comparing the NVIDIA card against also runs at roughly 21-23ms. One odd thing is that the CPU usage is the same in both cases: the GPU version and the LabVIEW version I am comparing it against are both eating 100% of the 2.8GHz CPU. The memory copies shouldn’t be causing that. Any ideas?

Good, the speedup from the fix is about what I thought it should be. As far as CPU usage goes, a launch of a single CUDA kernel should be asynchronous (from the CPU point of view). Subsequent launches may block. We know this has caused some confusion, so CUDA 1.1 will address the asynchronicity issues explicitly in the API.
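
For example, if you want to time just the kernel, you have to explicitly wait for the GPU to finish, otherwise you only measure the launch (a sketch; startTimer/stopTimer stand in for whatever timing calls you already use):

startTimer();
PreProcessGPU<<<BLOCK_NUM, THREAD_NUM>>>(d_data, 100, 800);
cudaThreadSynchronize();   // block until the kernel has actually finished
stopTimer();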

Paulius

Is the Call Library Node on your diagram set to run in the UI thread or in any thread (aka is the top of the node Orange or Yellow)? If it’s running in the UI thread, that’s one possibility. When you drop it on the diagram that’s the (safe) default.

As long as your DLL is thread-safe, it should be configured to use any thread. That will also help LabVIEW use the multiple cores on the system to execute your program. You won’t be serializing the DLL calls in one thread on one CPU core.

If you want, you can still serialize by wiring the error input/output on the library node and subsequent VIs.

I don’t know too much about LabVIEW, but one warning here: CUDA contexts can’t be shared between threads, so whichever thread initializes the device needs to also make all the kernel invocations.