Can this be sped up?

Hi. I am working on adapting my program to CUDA; it is currently written in C using MPI. I have extracted the core logic part of my program and put it all in one function that is called millions of times throughout the execution of my program. I have already created the kernel to execute this code, but it takes about 4 times longer to do this with CUDA than with my CPU. Here is the code:

__global__ static void CudaCheckKernel(float *d_mid1, int *d_check, int length, int maperiod1, int maperiod2, int maperiod3, int maperiod4)


	const int	 tid = blockIdx.x * blockDim.x + threadIdx.x;

	const int threadN = gridDim.x * blockDim.x;

	int BuyOK=0,SellOK=0,count,count2,bar;

	float ma1=0,ma2=0,ma3=0,ma4=0,ma5=0,ma6=0,ma7=0,ma8=0;


	for (count2=tid;count2<=length;count2+=threadN)




				//Calculates 8 moving averages

		for (count=0;count<maperiod1;count++)







		for (count=1;count<maperiod1+1;count++)






		for (count=2;count<maperiod1+2;count++)







		for (count=0;count<maperiod2;count++)







		for (count=1;count<maperiod2+1;count++)






		for (count=2;count<maperiod2+2;count++)






		for (count=0;count<maperiod3;count++)






		for (count=0;count<maperiod4;count++)





				//Compares second derivatives

		if (((ma1-ma2)-(ma2-ma3))>0 && ((ma4-ma5)-(ma5-ma6))>0 && ma7>ma8)


		if (((ma1-ma2)-(ma2-ma3))<0 && ((ma4-ma5)-(ma5-ma6))<0 && ma7<ma8)



				//Records results in the d_check array


		if (BuyOK==1 && SellOK==0)


		if (BuyOK==0 && SellOK==1)




Can this kind of code be accelerated by CUDA? If it can be, can I please get some hints? The input and output arrays are kept as small as they can be, and they are always nearly the same size which can be anywhere from 1MB to 300MB.

I really don’t know what you are doing, ma2, ma3, ma5, ma6 seem to calculate the exactly same thing as the ma1 or ma4 the next two threads calculate, just without memory coalescing.
All together I’d say you are at least loading data and calculating about 32x as much as a sane implementation would (well, assuming the period lengths are quite long), unless your CPU code is just as horrible I’d say 4 times slower is a quite good result, given the code.
Use shared memory to load commonly used data (coalesced!) and to share results.

There’s a ton of things you can do to speed this up. The most important is probably to change your algorithm! A moving average can be done iteratively. The moving average for time t is just the moving average from (t-1) plus a correction factor equal to

next, you’re doing all kinds of reads from global memory in the loops… you can likely do all the loads at once and dump them to shared memory for faster access.

Next, when writing the results, you’re diverging your branches by having three branches that write the result to global memory.
You can improve that write speed (both latency and bandwidth) by a factor of 3 by computing the answer first (with branches), then writing the answer with a SINGLE write to global memory.

But there’s probably even more optimizations to do… there’s stuff you can do with the incremental average computation that’d give you many factors speedup on CPU compute too.

I wouldn’t trust my CUDA card to make BUY or SELL decisions on anything ;)

Thanks a lot for your suggestions!

Each of the MAs is diffferent. ma2 and ma5 are offset from ma1 and ma4 by one, respectively, and ma3 and ma6 are offset by two from ma1 and ma4, respectively. ma7 and ma8 are MAs with different periods than all the others. The periods of the MAs can be quite long, like 2000 or possibly more, and the data input and the check output arrays are dynamically sized to be as small as possible while still being big enough to calculate all the MAs.

I am going to try and convert to incrementally calculating the MAs because I see how that can really speed this up. The problem is, while investigating this slowness last night I discovered that it is the actual kernel invocation that takes so long. My CPU can calculate all the results in 110 milliseconds, while it takes the GPU around 550 ms. If I comment out the kernel<<<>>> line (and do nothing else) the GPU completes in about 13 ms. If I uncomment the kernel<<<>>> line and comment out the entire body of the kernel() function, the GPU takes about 350 ms to complete.

So I am thinking that the bulk of my delay is coming either from the kernel invocation itself (initializing the device maybe?) or the copying of data for the kernel to work on. But, as far as I know, the input data array is staying in global memory so it should not have to move it at all. Can anyone think of why the kernel<<<>>> call would take so much time here?

Ok, so I have done some more testing and discovered that the cudaMalloc is what is taking so long. It takes about 300 ms just to malloc a one dimension array of floats with 2000 elements. The way my code is written, I have an mpicc compiled main file that calls the function CudaCheck() from the nvcc compiled file, and CudaCheck() calls the kernel CudaCheckKernel(). Every time CudaCheck() is called I malloc the data on the device and free it at the end of CudaCheck(). Obviously this is a huge waste of time. Is there a way to malloc all of my input data in global mem and keep it there so that I don’t have to call cudaMalloc every time I call CudaCheck()? I’m afraid I may have to move more of code to CUDA because I still have the loops that call the code I originally posted in my MPI code. I could get much better performance from the GPU if I moved all those loops onto the GPU I think.

There is nothing magical about cudaMalloc that requires it to be compiled by nvcc. Just #include “cuda_runtime.h” in your main file and cudaMalloc the array once at the beginning.

That is very good to know! So what does have to be compiled by nvcc? I am using mpicc to compile my main program, and it would be awesome if I could call the kernel<<<>>> from my mpi code.

The only things that need to be compiled by nvcc are the kernels, kernel<<< >>> calls, texture bind calls, and cudaMemcpyToSymbol with the symbol referenced by variable (by string, it may work from anywhere, I haven’t tried it). If you want to be able to do all these things from straight c and only compile kernels with nvcc, then you need to use the driver API.