How do I know if CUDA runs on the GPU?

So I wrote some CUDA programs during the last few weeks and it’s all fine and dandy. I used the emulator for all this fun, since I had no CUDA-capable card. Now I just got an 8800 GT two days ago and thought, time to run the programs, lean back, and brag about how fast this stuff is…

…OK, that was the theory. After running several different programs, it turned out that they are all slower than the plain C++ version. Not by much, just a couple of milliseconds, but still.

Now I guess that for some reason these functions are actually running in emulation mode, since I could not explain it otherwise. I also noticed that the CPU is running at 100% during execution, which is another indicator that it runs on the CPU instead of the GPU.

Now how can I make sure and check that it’s actually running on the graphics card?

And my current example, which should be pretty fast with CUDA, is this file:

test file to compare c++ vs cuda

It basically calculates the similarity between two mass spectra several thousand times, just to see how much faster CUDA is than C++.

thanks in advance.


Remove the “-deviceemu” option while compiling and your code will run on the GPU.

Well, this is already done and does not help. As I said, I somehow doubt it’s executed on my GPU; I suspect it’s executed on the CPU.

Specifically, I specified two build targets,

device = run on device

emulate = emulate the code

and both have the same execution time. So something is off.



Now I removed all references to emulation and still no progress. Does anybody have an example which works with a SELF-written makefile? Basically I hope to see a simple example which compares C/C++ vs CUDA.

edit 2:

This is an example of the runtime I get with the given code:

count	cuda	c++

150000	155.590744	156.657154

which says that CUDA is 1 second faster.

thx again

A good way to check if it is running on GPU is to use the CUDA (Visual) Profiler.

If you want to be convinced it runs on the GPU, add a “printf” inside your kernel, which causes a compilation error when not in emulation mode.
There are many problems which can make your code run slowly on the GPU, yet I can’t tell what it is, as your attachment is inaccessible for some reason.
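Besides the profiler and the printf trick, you can also ask the runtime directly: a launch that never actually runs on the device shows up as an error from the CUDA runtime API. A minimal sketch, assuming the runtime API; the helper name is made up for illustration:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical helper: call this right after a kernel launch.
// If the launch never reached the device, cudaGetLastError reports it;
// if the kernel itself failed, the synchronize call reports it.
void checkLastKernel(const char* label) {
    cudaError_t err = cudaGetLastError();   // error from the launch itself
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: launch failed: %s\n", label, cudaGetErrorString(err));
        return;
    }
    err = cudaDeviceSynchronize();          // wait for the kernel to finish
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: execution failed: %s\n", label, cudaGetErrorString(err));
    } else {
        printf("%s: kernel ran on the device without errors\n", label);
    }
}
```

If this prints the success line, the kernel really executed on the GPU; a silent fallback or misconfigured launch would surface as an error string instead.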

The 100% CPU usage is likely a red herring… if you start a kernel and then immediately try to do a memory read to fetch results, the memory read will spin, using 100% CPU, waiting for the GPU to finish. This is documented, and done on purpose to reduce latency,
and can be avoided if it matters (see the API guide for the async functions).
Many of the example solutions behave this way.

I mention this because it really confused me for a few hours once; I also thought the CPU was somehow being used because of the 100% CPU peg in Task Manager.

Your source code link didn’t work, so I can’t make specific comments. In general there is no reason why any particular bit of code should be faster on the GPU. Here are a few general reasons why a GPU kernel might be slower than a CPU one.

  1. Memory accesses are not coalesced (this drops memory performance to a mere ~2 GiB/s)
  2. Problem size is not large enough making the kernel launch overhead dominate the calculation.
  3. Including the first CUDA call in the timing (the first CUDA call initializes the driver which takes a long time)
  4. Algorithmic differences: i.e. an O(N) algorithm on the CPU might turn into an O(N^2) one on the GPU due to the data-parallel requirements. This can happen easily if you have every thread loop over every element you process.

You can attach files to posts on the forums. Just zip the .cu file or rename it to .txt so the forum will accept it.

OK, sorry about the link, I had some server problems last night. I will include a very simple example, which is still slower on the GPU than on the CPU. I read the existing docs, but well, time to read them again.


It multiplies, adds, subtracts, and divides two arrays of length n.

and here is the code.

  1. test file


#include "arrayCuda.h"
#include "array.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

//time the c++ execution time
double timeCpp(int count){
	//initialize timer
	clock_t start = clock();
	//calculate needed time
	clock_t end = clock();

	float* a = (float*) malloc(sizeof(float)*count);
	float* b = (float*) malloc(sizeof(float)*count);
	float* c = (float*) malloc(sizeof(float)*count);

	for(int i = 0; i < count; i++){
		a[i] = i;
		b[i] = i;
	}

	//execute the function
	add(a, b, c, count);

	free(a);
	free(b);
	free(c);

	return ((double)end - start) / CLOCKS_PER_SEC;
}


//time the cuda execution time
double timeCuda(int count){
	//initialize timer
	clock_t start = clock();
	//calculate needed time
	clock_t end = clock();

	float* a = (float*) malloc(sizeof(float)*count);
	float* b = (float*) malloc(sizeof(float)*count);
	float* c = (float*) malloc(sizeof(float)*count);

	for(int i = 0; i < count; i++){
		a[i] = i;
		b[i] = i;
	}

	//execute the function
	addCuda(a, b, c, count);

	free(a);
	free(b);
	free(c);

	return ((double)end - start) / CLOCKS_PER_SEC;
}


// main routine that executes on the host
int main(void) {
	//128 MiB worth of floats per array
	int size = 1024 * 1024 * 128 / sizeof(float);

	printf("%i\t%f\t%f\n", size, timeCuda(size), timeCpp(size));

	return 0;
}



  1. cpp version


#include "array.h"

/*
 * adds two arrays
 */
void add(float* a, float* b, float* result, int arraySize){
	for(int i = 0; i < arraySize; i++){
		result[i] = a[i] + b[i];
	}
}


  1. cuda version


#include "arrayCuda.h"

__global__ void add(float *result, float *a, float *b, int size) {
	//calculate the index of the current element
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	//assign the result for the given calculation
	if (idx < size) {
		result[idx] = a[idx] + b[idx];
	}
}

/*
 * adds two arrays
 */
void addCuda(float* a, float* b, float* result, int arraySize) {
	//size of the needed arrays in bytes
	const int floatSize = arraySize * sizeof(float);

	//device variables
	float* deviceResult;
	float* deviceA;
	float* deviceB;

	//allocate memory on the device
	cudaMalloc((void **) &deviceResult, floatSize);
	cudaMalloc((void **) &deviceA, floatSize);
	cudaMalloc((void **) &deviceB, floatSize);

	//copy to the device
	cudaMemcpy(deviceA, a, floatSize, cudaMemcpyHostToDevice);
	cudaMemcpy(deviceB, b, floatSize, cudaMemcpyHostToDevice);
	cudaMemcpy(deviceResult, result, floatSize, cudaMemcpyHostToDevice);

	//calculate the block sizes
	int block_size = 256;
	int n_blocks = arraySize/block_size + (arraySize%block_size == 0 ? 0 : 1);

	//do the operation
	add<<< n_blocks, block_size >>>(deviceResult, deviceA, deviceB, arraySize);

	//copy the result back into host memory
	cudaMemcpy(result, deviceResult, floatSize, cudaMemcpyDeviceToHost);

	//free up the memory
	cudaFree(deviceResult);
	cudaFree(deviceA);
	cudaFree(deviceB);
}






So I know it’s a very simple procedure, but am I missing something somewhere?

      clock_t start = clock();

      //calculate needed time

       clock_t end = clock();

You seem to be timing how long it takes the clock() function to return???

Also, I think clock() is not very accurate; there are a lot of examples in the SDK of how to do good timing. Also, if you were timing the right way, you would also be timing cudaMalloc & cudaFree. And you would be timing the first execution, which also includes other overhead. Check the SDK examples for how to do proper timing.

Yes, since this is going to give me the execution time of the code between these two statements. Basically it’s a simple poor man’s profiler.

end - start = needed time for execution

But there is no code between these statements…

Well, I know, but right now I care less about the accuracy and more about the general functionality.

Basically, if I time a C function with this and if I time a CUDA function, I can expect 3 results:

C is significantly slower

CUDA is significantly slower

both are roughly the same

I don’t care about the kernel overhead since I’m interested in the total time of the program, not parts of it.

It’s like you drive a truck and a sports car on a race track and you want to measure which is faster. You take the time at the start and at the end when they arrive. The time they take for certain parts is not important, since the car could be twice as fast as the truck on the first half, but then it runs out of gas and has to be pushed to the finish while the truck passes it and arrives first.

I could measure the time of the program with strace, but since OS X has no strace, I use this little timer function.

I will now give the Visual Profiler a shot and see where the problem is.

(If this sounds arrogant, forgive me; it’s a language barrier thing.)

damn don’t write and drink,

let me check it again…


OK, I changed the code, but the only difference is that C++ is now 2x faster than CUDA with the array operations.

OK, I tried this, and well, compiling fails, so I seem to be doing something right.

thanks for this tip.

So now I know that it’s executed on the GPU; now I just need to figure out why it is so slow…

Well, if you need to run something only once, and it takes so little time that the overhead of CUDA initialization is significant compared to the runtime, then it is not a program that you want to offload to the GPU. You want to use CUDA when:

  • you do something once that takes a long time
  • you do something often, where the total time is long but the individual times are short.

In the first case, it does not matter if you also time the cudaMalloc & first kernel launch overhead. In the second case, when you time only 1 run (when you would normally have a lot of them) which includes the cudaMalloc times & first-kernel overhead, you get too pessimistic a picture. Just as an example, I have a kernel that takes 150 microseconds to run; the first time I run this kernel, I believe it takes something like 20 milliseconds because of the initialization overhead.

So 1000 runs will in reality take: 150e-6*999 + 20e-3 = 0.16985 seconds.
If I only timed the first kernel, I would think it takes: 1000 * 20e-3 = 20 seconds.

Why don’t you just copy & paste some code from the SDK, where the timing is done right? Then you can really say whether what you are doing is faster on the CPU or on the GPU.

Thanks again, I’m working on this right now. My overall idea with CUDA is to provide a way to use its power from Java --> JNI --> CUDA, so I’m still getting my feet wet with this.

Right now I’m just trying to figure out what I could use it for and what not.

And I was just trying to make sure you get accurate info :biggrin:

Personally I use it from matlab to accelerate specific operations or sometimes almost complete programs (use matlab to load stuff, process completely in CUDA and use matlab again for the display of results).

I am not sure if it works the same with JNI, but matlab keeps the connection with CUDA open, so when calling CUDA thousands of times in a row from matlab I do not need to cudaMalloc every time; I can just keep the pointers to GPU memory that I cudaMalloc-ed the first time. And since the kernel code is still on the GPU, it can run immediately without the first-run overhead.
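That allocate-once pattern should carry over to JNI: keep the device pointers alive between calls and only free them at teardown. A rough sketch, with struct and function names made up for illustration (on the Java side the returned pointer would be held as an opaque long):

```cuda
#include <cuda_runtime.h>

// Hypothetical persistent state kept across many JNI calls, so that
// cudaMalloc/cudaFree happen once instead of once per invocation.
struct GpuBuffers {
    float *a, *b, *result;
    int    capacity;   // number of elements each buffer can hold
};

GpuBuffers* gpuInit(int maxElements) {
    GpuBuffers* buf = new GpuBuffers;
    buf->capacity = maxElements;
    cudaMalloc((void**)&buf->a,      maxElements * sizeof(float));
    cudaMalloc((void**)&buf->b,      maxElements * sizeof(float));
    cudaMalloc((void**)&buf->result, maxElements * sizeof(float));
    return buf;   // caller keeps this handle and passes it back on each call
}

void gpuRelease(GpuBuffers* buf) {
    cudaFree(buf->a);
    cudaFree(buf->b);
    cudaFree(buf->result);
    delete buf;
}
```

Each per-call entry point would then only copy fresh input into the existing buffers and launch the kernel, amortizing the allocation and first-run overhead across all calls.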

I just need to make sure that I cudaFree the data when it’s not needed anymore (but mostly that happens automatically, since when you detach from the CUDA context, GPU memory is automatically freed).

Nice. Yeah, we are not using matlab; we basically want to write a chemical structure generator in the long run.

And I finally found an application where it works. Arrays are just too small, but in matrix operations, well, CUDA shines nicely.

thanks for all of your help!

Well, I’d like to emphasise E.D. Riedijk’s statements a bit more, as I’ve been in this field for a while now and have seen the same mistakes over and over again when reviewing scientific papers:

When you are doing “microbenchmarks”, whatever that means, timing only the kernel launch in a loop (leaving out the first couple of runs) is perfectly legitimate. It is perfectly fine to draw conclusions like “my GPU can do this and that kernel at a mind-boggling GFLOP/s rate of a trillion gazillion flops per second compared to these few flop/s for a similar kernel on the CPU”. The first important point is that only performance comparisons against a trusted, extremely optimised CPU implementation are trustworthy. Anything that relies on the compiler and has not seen the same amount of tuning as the GPU implementation (talking SSE, Altivec, etc. here) is simply worthless.

And the second point is that anything that remotely looks like an “application speedup measurement” MUST include all necessary transfers. Otherwise, people that read your reports and papers and use your cheesy novel algorithm in a larger setting will not be able to reproduce your results.

Yes, I agree with this.

As it turned out, I’m back to step one: my matrix example had an error, and after all, the CPU is still 2x faster than CUDA.

Guess I need to do some more research.