Noob question

Hi all,

Just a question,

Do all device-executed functions have to be void? If so, why is this?

Is there a way to work around this? At the moment I have a function that I wish to execute on the device, but it requires another function call within it. Ideally I'd like both functions to execute on the device, but one of them has an unsigned char return type.

I suppose it is possible to change the return type, but it would be much easier and more programmer friendly not to :P


I think you can have other return types in device functions. They get inlined anyway.

__device__ functions can have return types. However, __global__ functions (call initiated by the host, but run on the device) must have a void return type. To return values from __global__ functions, you have to write the data to global memory and cudaMemcpy() it back to the host.
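
To make that concrete, here is a minimal sketch, assuming the qualifiers meant above are __device__ and __global__. The names addRoundByte, xorKernel and xorOnDevice are made up for illustration, and error checking of the CUDA calls is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: a __device__ function is free to return a value.
__device__ unsigned char addRoundByte(unsigned char value, unsigned char key)
{
    return value ^ key;
}

// A __global__ kernel must return void, so each result is written
// to global memory instead of being returned.
__global__ void xorKernel(const unsigned char *in, unsigned char *out,
                          unsigned char key, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = addRoundByte(in[i], key);
    }
}

// Host wrapper: copy the input over, launch, then cudaMemcpy() the results back.
std::vector<unsigned char> xorOnDevice(const std::vector<unsigned char> &input,
                                       unsigned char key)
{
    int n = static_cast<int>(input.size());
    std::vector<unsigned char> output(n);
    if (n == 0) return output;

    unsigned char *dIn = nullptr;
    unsigned char *dOut = nullptr;
    cudaMalloc(&dIn, n);
    cudaMalloc(&dOut, n);
    cudaMemcpy(dIn, input.data(), n, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up so every byte is covered
    xorKernel<<<blocks, threads>>>(dIn, dOut, key, n);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(output.data(), dOut, n, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return output;
}
```

So the unsigned char function can keep its return type as long as it is only called from device code; only the kernel entry point has to be void.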

Ahhh ok cool thanks for that. Makes a lot of sense really.

So what about functions like printf, fprintf, etc.? Can they be run on the device?

thanks and sorry if these are really stupid questions :s

No, (screen) IO is not supported: a GPU has no direct access to drives, streams, etc. Of course, you can use OpenGL calls to visualize your data.

OK thanks for clearing that up for me.

Another question :) … I have got an implementation of the AES encryption process that I have hacked over to work on the device (although I haven't set up anything with regards to thread blocks etc. just yet).

Both the compile time and the time to complete the encryption process have increased. I expected the process to run a lot faster on the device, but the same code takes significantly longer to run. Does this have something to do with the fact that it is expensive to work with integers on the device?

AES does not use any expensive integer operations.
The real question is how (and for which purpose) AES is implemented and what kind of data parallelism you exploit (i.e. are you encrypting many streams of data or just one stream, in what mode, etc.).

IMO you really should read some basic information in the Programming Manual before asking. Otherwise you won’t be able to understand the answers :)

ECB encoding/decoding you can do really fast on the device, as you can encrypt every block in parallel. But this is also quite insecure. For CBC, which is usually employed, you can do encoding in parallel but not decoding.

Don’t forget you also need to send all data that you want to encrypt/decrypt over the bus. All in all the device is not going to be a big help for a cheap algo like AES (unless you just want to bruteforce keys, this also has a large parallelism factor).

Actually, decrypting can be done in parallel, not encrypting. When encrypting, the next plaintext block is XORed with the previous ciphertext block, so to encrypt the i-th block you need the (i-1)-th block to be already encrypted.

The idea behind this is to allow random access for CBC encrypted data without decrypting all the data.
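
A rough sketch of why this works, under loud assumptions: toyCipher below is a stand-in that just XORs with the key (it is NOT AES and not secure), and all the names are made up for illustration. Encryption has to walk the blocks in order, while decryption of block i only needs ciphertext blocks i and i-1, so each thread can take one block. The input length is assumed to be a multiple of 16 bytes, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_BYTES 16

// Toy stand-in for the block cipher (NOT AES): XOR with the key.
// XOR is its own inverse, so the same routine "encrypts" and "decrypts".
__host__ __device__ void toyCipher(const unsigned char *in,
                                   const unsigned char *key,
                                   unsigned char *out)
{
    for (int b = 0; b < BLOCK_BYTES; ++b) out[b] = in[b] ^ key[b];
}

// CBC encryption on the host: block i needs ciphertext block i-1,
// so this loop cannot be split across threads.
std::vector<unsigned char> cbcEncryptHost(const std::vector<unsigned char> &plain,
                                          const unsigned char *key,
                                          const unsigned char *iv)
{
    std::vector<unsigned char> cipher(plain.size());
    const unsigned char *prev = iv;
    for (size_t off = 0; off + BLOCK_BYTES <= plain.size(); off += BLOCK_BYTES) {
        unsigned char mixed[BLOCK_BYTES];
        for (int b = 0; b < BLOCK_BYTES; ++b) mixed[b] = plain[off + b] ^ prev[b];
        toyCipher(mixed, key, &cipher[off]);
        prev = &cipher[off];
    }
    return cipher;
}

// CBC decryption: one thread per 16-byte block, P_i = D(C_i) XOR C_(i-1),
// with the IV standing in for C_(-1). All inputs are already available.
__global__ void cbcDecryptKernel(const unsigned char *cipher,
                                 const unsigned char *key,
                                 const unsigned char *iv,
                                 unsigned char *plain, int numBlocks)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numBlocks) return;
    const unsigned char *cur = cipher + i * BLOCK_BYTES;
    const unsigned char *prev = (i == 0) ? iv : cur - BLOCK_BYTES;
    unsigned char tmp[BLOCK_BYTES];
    toyCipher(cur, key, tmp);
    for (int b = 0; b < BLOCK_BYTES; ++b) plain[i * BLOCK_BYTES + b] = tmp[b] ^ prev[b];
}

// Host wrapper around the kernel.
std::vector<unsigned char> cbcDecryptDevice(const std::vector<unsigned char> &cipher,
                                            const unsigned char *key,
                                            const unsigned char *iv)
{
    std::vector<unsigned char> plain(cipher.size());
    int numBlocks = static_cast<int>(cipher.size() / BLOCK_BYTES);
    if (numBlocks == 0) return plain;
    size_t bytes = static_cast<size_t>(numBlocks) * BLOCK_BYTES;

    unsigned char *dCipher = nullptr, *dPlain = nullptr, *dKey = nullptr, *dIv = nullptr;
    cudaMalloc(&dCipher, bytes);
    cudaMalloc(&dPlain, bytes);
    cudaMalloc(&dKey, BLOCK_BYTES);
    cudaMalloc(&dIv, BLOCK_BYTES);
    cudaMemcpy(dCipher, cipher.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dKey, key, BLOCK_BYTES, cudaMemcpyHostToDevice);
    cudaMemcpy(dIv, iv, BLOCK_BYTES, cudaMemcpyHostToDevice);

    int threads = 64;
    int grid = (numBlocks + threads - 1) / threads;
    cbcDecryptKernel<<<grid, threads>>>(dCipher, dKey, dIv, dPlain, numBlocks);

    cudaMemcpy(plain.data(), dPlain, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dCipher);
    cudaFree(dPlain);
    cudaFree(dKey);
    cudaFree(dIv);
    return plain;
}
```

Swapping the toy routine for a real AES block decrypt would not change the structure: the per-thread work gets heavier, but the dependency pattern stays the same.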

It’s not clear how many threads or blocks you are using. If the answer is 1 thread, and 1 block, then the performance of the GPU will be very poor. The ALUs on the graphics card are clocked at probably half the rate of your CPU, PCI-Express adds overhead, and the effective memory latency is much higher (though the GPU has much more memory bandwidth when you coordinate the reads between threads).

The performance isn’t even linearly related to the number of threads with blocks that small, so you can’t extrapolate the running time from a one thread test up to N threads with multiplication. If you want a rough benchmark to start, you should try 64 threads x 64 blocks. That probably won’t be optimal, but it will usually get you to within an order of magnitude of the best case.
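
For what it's worth, a 64 x 64 launch might look something like the sketch below. The kernel and wrapper names are invented for illustration, the per-byte XOR is only a placeholder for real work, and error checking is omitted. The grid-stride loop is one common way to let a fixed 4096 threads cover an input of any length.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder workload: XOR every byte with a key byte.
// Each thread starts at its global index and strides by the total
// number of launched threads, so any n is covered by a fixed grid.
__global__ void xorStrideKernel(unsigned char *data, unsigned char key, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] = data[i] ^ key;
    }
}

// Host wrapper using the suggested starting point: 64 blocks of 64 threads.
std::vector<unsigned char> xorWith64x64(const std::vector<unsigned char> &input,
                                        unsigned char key)
{
    int n = static_cast<int>(input.size());
    std::vector<unsigned char> output(n);
    if (n == 0) return output;

    unsigned char *dData = nullptr;
    cudaMalloc(&dData, n);
    cudaMemcpy(dData, input.data(), n, cudaMemcpyHostToDevice);

    const int blocks = 64;   // number of thread blocks in the grid
    const int threads = 64;  // threads per block
    xorStrideKernel<<<blocks, threads>>>(dData, key, n);

    cudaMemcpy(output.data(), dData, n, cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return output;
}
```

From there you can vary the two numbers and time each run to see how far the starting point is from the best case on your card.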

thanks again guys for your posts. I have read the programming manual but just needed a few things clearing up.

I'm using CUDA as part of my final year project at uni, so I don't need the AES to actually encrypt anything important; I just want to try and display the power of the card over the CPU. But I am the first person at the uni to have done any GPU computing, so I'm kind of out on my own. I'm sort of learning as I go, so that's why I was saying sorry about the questions I was asking.

Apologies for the incorrect integer operations I was talking about. That was me trying to read that in a hurry before a lecture ;) totally read that wrong ha!

OK, your information was really helpful, thanks so much. Just one more thing… does all my execution configuration code go in the kernel files? I have currently implemented the algorithm in a .cu file, and I'm getting the impression that the kernel file will be where I define my execution model.

Thanks again so much, bit of a steep learning curve here ;)

I think it would be good for you to take a look at the SDK to understand what some of those programs do. And maybe I’m getting it wrong, but why do you take such a complex application to show the enormous computation power of the GPU? Why not do a matrix-matrix multiplication? It comes close to the Rijndael algorithm, which also works on matrices, and gives a good impression of the power of the GPU.

Yes, I have looked at the programs, but I was just looking for a bit of clarification on my thoughts.

I chose AES because it was an algorithm I was already familiar with, and I had discussed it with my tutor and we agreed that it would be a good idea. It wasn’t impossibly hard to implement anyway, so all I had to do was add the CUDA extensions to allow it to run on the device.

We thought it would be a good algorithm to show the parallel capabilities of the device on a basic level, with an algorithm that is familiar to a lot of people already.

Well, AES isn’t new to CUDA; I saw CUDA-targeted implementations more than half a year ago.

Also, I really do not understand how AES (or any other encryption algorithm) can be a good example for parallel processing. As mentioned, only ECB mode may be processed in parallel, but that’s really bad from a security point of view. Modern encryption algorithms are pretty fast on CPUs, so you won’t gain much from offloading them to GPUs, especially considering upload/download to the card.

BTW, I’ve seen some paper in CUDAZone regarding AES on GPU. Numbers given there are somewhat misleading (because they don’t count upload/download) but it still may be interesting for you.

Well, you're probably right, but the whole project is just a little discovery into what can be done.

Plus I have also read the GPU Gems 3 chapter on AES, and there are details of various ideas on how the process can be made more parallel, so I'm just playing with some of those concepts.

You can bang on about how impractical the project is if you want, but like I said, I’m not looking for a mission-critical piece of software. I aim to just see what can be done. I don't want to send any of this encrypted data anywhere. It's just a test using the Rijndael methodology.