CUDA with Microcontroller: running a CUDA kernel on data sent by a microcontroller


I am developing an application, in which:

1- I have to sense the data using a sensor and a microcontroller.

2- The microcontroller then communicates with the PC via the serial port and delivers the sensed data to the PC.

3- As the data is received by the PC, I want to run a CUDA kernel to compute the FFT in real time.

4- I know serial programming in C; I have been using functions like inport and outport, along with the port address, to get data from the serial port.

I am wondering how I can do this in CUDA, in real time.

Any pointers in this regard will be highly appreciated.
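inport/outport with a raw port address is DOS-era direct hardware access. On a modern OS the serial port is opened as a device file instead. The sketch below assumes a POSIX system (Linux or macOS). `open_serial` is a hypothetical helper name, and the device path (for example /dev/ttyS0 or /dev/ttyUSB0) depends on your machine. It configures the port for raw 8N1 input:

```c
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* Open a serial device (path is system-specific, e.g. "/dev/ttyS0") in raw
 * 8N1 mode at the given speed, such as B115200.
 * Returns a file descriptor, or -1 on failure (including when the path is
 * not a terminal device). */
int open_serial(const char *path, speed_t speed)
{
    int fd = open(path, O_RDWR | O_NOCTTY);
    if (fd < 0)
        return -1;

    struct termios tio;
    if (tcgetattr(fd, &tio) != 0) {   /* fails (ENOTTY) if not a serial port */
        close(fd);
        return -1;
    }

    /* Raw mode: no line editing, echo, signals or CR/LF translation. */
    tio.c_iflag &= ~(IGNBRK | BRKINT | PARMRK | ISTRIP | INLCR | IGNCR | ICRNL | IXON);
    tio.c_oflag &= ~OPOST;
    tio.c_lflag &= ~(ECHO | ECHONL | ICANON | ISIG | IEXTEN);
    /* 8 data bits, no parity, 1 stop bit; enable receiver, ignore modem lines. */
    tio.c_cflag &= ~(CSIZE | PARENB | CSTOPB);
    tio.c_cflag |= CS8 | CREAD | CLOCAL;
    /* read() blocks until at least one byte is available. */
    tio.c_cc[VMIN] = 1;
    tio.c_cc[VTIME] = 0;

    if (cfsetispeed(&tio, speed) != 0 || cfsetospeed(&tio, speed) != 0 ||
        tcsetattr(fd, TCSANOW, &tio) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After that, plain `read(fd, buf, len)` calls replace the inport polling loop. On Windows the rough equivalent is `CreateFile` on the COM port, followed by `SetCommState` and `ReadFile`.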


What is the data rate of the serial link you are using? A standard PC UART tops out at 115200 baud (roughly 11.5 kB/s with the usual 8N1 framing). That is such a modest data rate that just about anything should be able to compute the FFT in “real time”.
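The arithmetic behind "modest" is easy to check. This is a minimal sketch assuming 8N1 framing, where 1 start bit, 8 data bits and 1 stop bit make 10 bit times per byte. For illustration only, it also assumes 16-bit samples and a 1024-point FFT frame; these values are hypothetical and do not come from the thread:

```c
/* Payload rate of an asynchronous serial link: each byte costs
 * bits_per_frame bit times (10 for 8N1). */
double uart_bytes_per_sec(double baud, int bits_per_frame)
{
    return baud / bits_per_frame;
}

/* Seconds needed to receive one FFT frame of n_samples over that link. */
double frame_fill_seconds(int n_samples, int bytes_per_sample,
                          double baud, int bits_per_frame)
{
    return (double)n_samples * bytes_per_sample /
           uart_bytes_per_sec(baud, bits_per_frame);
}
```

At 115200 baud this gives 11520 bytes/s, so a 1024-sample frame of 16-bit samples takes about 0.18 s just to arrive. Any CPU or GPU FFT has that long to finish before the next frame is ready.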

Thanks for your reply.

Yes, the data rate is around 14 kbps. Could you please tell me how I should proceed? Basically, I want to know whether the GPU can read the data directly. The alternative would be to first get the data into main memory, use cudaMemcpy (host to device) to copy it into global memory, and then perform the FFT.

You’ll be wanting cudaMemcpy. I’m quite surprised a CPU can’t handle 14 kB/s in real time.

FFTW3 should be able to transform 14 kB of data in under 1 millisecond on just about any IA-32 or x86_64 CPU made this century. With that little data, CUDA will be slower.

It depends on how large an FFT he wants to do. An FFT takes O(N log N) time, while the data transfer takes O(N). So there is some (possibly very large) value of N beyond which performing the FFT takes longer than copying the data. But yes, in this case I would bet that you are right.

Well, is it really worth doing on a GPU, then? I think it is only worthwhile when computing the FFT of very large data sets… Can anyone verify my understanding, please?

Thanks for the valuable information. I just want to understand one thing. Suppose I perform the FFT on the data from the serial port, first on the CPU and then on the GPU using CUDA. Will I really not get any speedup? I don’t understand why not, because the FFT is a computationally expensive task, and CUDA should accelerate it.

Hope this is not a silly question :)

Getting back to this older post: after acquiring data at the serial port, can I compute the FFT using the CUFFT library? At present I am not looking for much speedup; I just want to know whether CUFFT can help with this problem. The problem is:

1- Acquire data at the serial port using the microcontroller.

2- Apply the FFT to the incoming data using CUFFT.

3- Output the results back to the serial port.
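Steps 1 and 3 are ordinary file-descriptor I/O on the serial device, whatever computes the FFT in step 2. One detail matters: `read()` on a serial port usually returns only the bytes that have arrived so far, so collecting one full FFT frame takes a loop. This is a minimal sketch assuming a POSIX file descriptor (for example, one opened and configured with termios). `read_full` and `write_full` are hypothetical helper names, and the frame size is up to the application.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Read exactly len bytes (one frame). Returns 0 on success, -1 on error
 * or if the link closes before the frame is complete. */
int read_full(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    while (len > 0) {
        ssize_t r = read(fd, p, len);
        if (r < 0 && errno == EINTR)
            continue;                 /* interrupted by a signal: retry */
        if (r <= 0)
            return -1;
        p += r;
        len -= (size_t)r;
    }
    return 0;
}

/* Write exactly len bytes (e.g. results going back out); write() can also
 * be partial. Returns 0 on success, -1 on error. */
int write_full(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0 && errno == EINTR)
            continue;
        if (w <= 0)
            return -1;
        p += w;
        len -= (size_t)w;
    }
    return 0;
}
```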

If you are receiving data and transmitting results over a standard UART, it probably won’t matter what you use to compute the FFT, because the serial link will be the bottleneck. FFTW will probably turn out to be simpler and faster than CUFFT in an application like that. It is extremely fast on CPUs with SSE/SSE2 support (which basically means any mainstream CPU manufactured in the last decade).

Given that you are sending the FFT results back to the microcontroller, another option is to drop the PC entirely and move to a microcontroller with DSP features. (Wikipedia tells me that some vendors call these “digital signal controllers”.) Many of these come with optimized FFT libraries that should easily keep up with your data rate.


But suppose I am NOT sending the data back to the microcontroller? (Let’s not discuss for now what I am going to do with the processed data on my PC.) I think I should then get a speedup compared to FFTW. (The thing is, I HAVE to do the processing on the GPU, and thus MUST use CUDA or an optimized library for GPUs.)

That just doesn’t follow. 10 seconds with Google produces this, which seems to disprove it. There are even links to code you could use or adapt to test your hypothesis on your own hardware. It seems to be a couple of years old, so CUFFT performance may have improved relative to FFTW since it was written, but I doubt the conclusions would be very different.
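To test the hypothesis on your own hardware you need a trustworthy timer. This is a minimal sketch using the POSIX monotonic clock, which never jumps backwards the way a time-of-day clock can. `now_seconds` is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Monotonic time in seconds; only differences between calls are meaningful. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

Time many repetitions of each version between two `now_seconds()` calls and divide by the count. For the GPU version, time the whole path, including the host-to-device and device-to-host copies. Make sure the timed region ends with a synchronizing call (a device-to-host `cudaMemcpy` or `cudaDeviceSynchronize()`), because cuFFT executions are asynchronous with respect to the host.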

So I take it this entire thread is some sort of gedankenexperiment, where you have committed yourself to a GPU-based project (school, I guess) and are now looking for a problem to work on?

I am assuming the same thing, and here is my advice. If this is for a college class, first talk to the professor. Tell them that you do not think CUDA will give you a speedup. You could even implement it anyway, then compare the CUDA version to a CPU version and explain why the CUDA version is slower. Part of parallel computing is determining when something can or cannot be usefully parallelized. Also, you can easily change the code later if you find a faster way to transfer the data than a serial port.

If you need to write a paper on this, you still could. There are plenty of things you can write about to show that you still learned something. But most importantly, talk to your professor. If this is for a work project, you probably shouldn’t be doing it.

But to answer your question: you will first send the signal through the serial port and store the data in host (CPU) memory. Then you will have to use cudaMemcpy to copy the data onto the GPU.

Thanks! I will keep these things in mind in the future. Thanks again!

So that’s it? What a complete waste of everyone’s time…

Oh Avid… that’s too harsh for a student… Chillll…

btw, Kiran: a chain is only as strong as its weakest link… There is no point in strengthening other parts of the chain until you solve the weak-link problem…

The weak link in your case is the serial port. It’s very slow… and cannot deliver data fast enough to need a GPU to crunch it…