Performance problem (+ basic understanding of the parallel GPU configuration)

Hi,

I just want to describe our problem in a few words:
We want to run a kernel in parallel on CUDA. The kernel runs fine, but much too slowly, even slower than on our CPU. We use a GeForce GTX 560 Ti to run the kernel, so there should be plenty of potential to be faster than our AMD dual-core CPU.
We use the following configuration to play with the maximum number of threads our GPU can manage.

cuLaunchKernel(	process,		//Kernel to launch
		1,		//gridDimX  - Width of grid in blocks
		1,		//gridDimY  - Height of grid in blocks
		1,		//gridDimZ  - Depth of grid in blocks
		1,		//blockDimX - X dimension of each thread block (e.g. WORK_SIZE) // Total number of active threads
		1,		//blockDimY - Y dimension of each thread block
		1,		//blockDimZ - Z dimension of each thread block
		0,		//sharedMemBytes - Dynamic shared-memory size per thread block in bytes
		0,		//hStream - Stream to launch in (0 = default stream)
		kernelParams,	//kernelParams - Array of pointers to the kernel arguments (name assumed)
		0);		//extra - Extra launch options (not used)

So in our case we raised blockDimX up to 512 to run our application in parallel.
In this forum there are often descriptions involving warps, blocks, and grids. What does that mean in our case?
How many threads are we able to run in parallel with our GeForce GTX 560 Ti?

We also timed our application on the CPU (2 s) and on the GPU (152 s), so we think it is not running in parallel. Could this problem be caused by our configuration of gridDim and blockDim?
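As a side note on the timing: here is a minimal sketch (an assumption, not our actual code) of how only the kernel could be timed with CUDA driver-API events, so that memory copies and setup are not included in the measurement:

	// Sketch: time only the kernel with CUDA events (driver API); error checking omitted.
	CUevent start, stop;
	cuEventCreate(&start, CU_EVENT_DEFAULT);
	cuEventCreate(&stop,  CU_EVENT_DEFAULT);

	cuEventRecord(start, 0);                 // record on the default stream
	// ... the cuLaunchKernel call from above goes here ...
	cuEventRecord(stop, 0);
	cuEventSynchronize(stop);                // wait until the kernel has finished

	float ms = 0.0f;
	cuEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
	printf("kernel time: %.3f ms\n", ms);    // needs <stdio.h>

	cuEventDestroy(start);
	cuEventDestroy(stop);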

Thanks for any reply.

If I understood the comments correctly, you are launching only 1 block with 1 thread. No parallelism.

No, by default we use:

cuLaunchKernel(	process,		//Kernel to launch
		393,		//gridDimX  - Width of grid in blocks
		393,		//gridDimY  - Height of grid in blocks
		1,		//gridDimZ  - Depth of grid in blocks
		64,		//blockDimX - X dimension of each thread block (e.g. WORK_SIZE)
		1,		//blockDimY - Y dimension of each thread block
		1,		//blockDimZ - Z dimension of each thread block
		0,		//sharedMemBytes - Dynamic shared-memory size per thread block in bytes
		0,		//hStream - Stream to launch in (0 = default stream)
		kernelParams,	//kernelParams - Array of pointers to the kernel arguments (name assumed)
		0);		//extra - Extra launch options (not used)
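(That is 393 × 393 = 154,449 blocks of 64 threads each, so 9,884,736 threads in total for this launch.)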

Then you need to give more information here. The parallelism is put in explicitly by the programmer: in the kernel you have the thread id, and you set each thread to do a part of the problem, usually one thread per one (or two) elements of the matrix.
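For example, a minimal sketch of such a kernel (the array name, element type and layout are assumptions, not your actual code); each thread derives its own (x, y) position from blockIdx/threadIdx and works only on that element:

	// Hypothetical kernel: one thread per element of a width x height array.
	// 'data', 'width' and 'height' are assumed names, not the poster's real code.
	extern "C" __global__ void process(float *data, int width, int height)
	{
	    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column handled by this thread
	    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row handled by this thread

	    if (x < width && y < height)                     // guard threads outside the array
	        data[y * width + x] *= 2.0f;                 // each thread touches only its element
	}

The important part is the index calculation: without it, every thread would do the same work instead of splitting the problem between them.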

We have a data array that matches gridDimX 393 and gridDimY 393. Every single thread takes a specified part of this array and does its own work on it (with the same kernel, of course).
In our example (the step above) we use 64 threads running in parallel at the same time (we tested it with up to 512 threads in parallel). I don't know exactly, but shouldn't the GeForce GTX 560 Ti run more than these 512 threads in parallel? This card has about 8 SMs, and I think each SM should run these 512 threads in parallel? Correct?
The second confusing problem is that there is no significant difference in runtime when we raise the number of parallel threads. In our example there is almost no difference between 64 and 512 threads in parallel.
What could we be doing wrong?

Some codes show no difference when you change the number of threads per block. Unless you give a little more detail it is difficult to say. I am not even sure you are running the kernels in the correct way.

What do you mean by "I am not even sure you are running kernels in the correct way"?
Can you explain?

Until I see the code I cannot say what the problem is.