PyCUDA: threads set up to run in parallel behave as if sequential

Hi !

Newbie here !

OK, so I have a small issue running a PyCUDA program in which I use 3D blocks to implement the following loop:

for (ix=0;ix<40;ix++)
 for (iy=0;iy<56;iy++)

I imagined that when I moved this to CUDA I would get 17920 parallel threads that could compute my function in the time it takes to compute one somefunctionhere(), since each block is independent of the others. The problem is that if the function takes 0.1 seconds for one iteration, the whole block takes roughly 9 seconds to complete, i.e. some parts are running in parallel but not all.

The following is how I set my grid and block, and how I set up the loop indices. Please tell me what I am doing wrong here.

const int ix = blockIdx.x * blockDim.x + threadIdx.x; // 40
const int iy = blockIdx.y * blockDim.y + threadIdx.y; // 56
const int iz = blockIdx.z * blockDim.z + threadIdx.z; // 8

mymain(drv.Out(dest), drv.In(a), drv.In(RF), block=(10,7,2), grid=(4,8,4))

Also, when I try to put the full dimensions directly into the block and use a grid of 1, it gives me an out of resource error.

I am using a 1050 Ti, by the way.

For further clarification, this is what I am asking:

somefunctionhere() = > 1 module

and i am running 17920 instances of this module

so the total number of clock cycles to run 1 module = clock cycles to run 17920 modules

Is this theory correct? And if so, why does my total time increase (almost linearly) as I go from 1 to 17920 (parallel) modules?

CUDA doesn’t necessarily run an arbitrarily large number of threads in parallel. A GPU has a certain amount of parallel processing capability. Once that is saturated, exposing additional work (more threads) doesn’t necessarily result in speed up - things just take longer.

The fact that I have to explain this over and over again perplexes me. The underlying thesis in your question is that one can spin up an arbitrarily large number of threads and the GPU will process each thread at exactly the same rate, concurrently, in parallel. When spelled out that way, the thesis is absurd. There is no processor on the planet (nor will there ever be one) that behaves that way.

At some point, when you create enough threads, the machine becomes saturated.

Now, you’re suggesting that one pass through “somefunction” takes 0.1s. If so, then a purely serial implementation of your workload would take 17920 * 0.1s = 1792s, or approximately 30 minutes. Your GPU is processing that workload in 9s. Seems like an interesting speed-up to me.
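To make that arithmetic concrete, here is a small Python sketch that uses only the numbers quoted in this thread (0.1 s per call, 17920 calls, roughly 9 s observed on the GPU). The variable names are just for illustration.

```python
# Numbers taken from the question; the names are illustrative only.
n_threads = 40 * 56 * 8          # 17920 work items
t_single = 0.1                   # seconds for one call, as reported
t_gpu = 9.0                      # approximate observed GPU time, as reported

t_serial = n_threads * t_single  # time if every call ran back to back
speedup = t_serial / t_gpu       # how much faster the GPU run is than that

print(f"serial estimate: {t_serial:.0f} s ({t_serial / 60:.1f} min)")
print(f"speedup over serial: {speedup:.0f}x")
```

So the reported 9 s already corresponds to a speed-up of roughly 199x over a serial run, even though it is far from the 17920x that the question assumed.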

If you were wanting or expecting more, the right approach is to learn about GPU performance optimization, learn how to use a profiler, and get started on “analysis-driven optimization”. You can find plenty of material on this topic with a bit of google searching.

Thanks for the detailed reply, I will definitely look at the material you pointed out. One last thing: I understand that the GPU can’t truly run 17920 parallel threads at a given time, but shouldn’t the GPU at least run threads in parallel until it reaches its saturation point, which I assume is SMs * 2048 = 12.3K? Or does the parallelism break down well before this limit is reached?

Assuming you haven’t run into any other occupancy constraints, yes, the instantaneous thread capacity of the GPU is 2048 * (number of SMs).
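As a rough sketch of how that capacity plays out for the launch in the question, the following Python estimates how many blocks can be resident at once. It assumes a GTX 1050 Ti has 6 SMs and that its compute capability 6.1 limits are 64 warps (2048 threads) and 32 blocks per SM, plus 1024 threads per block. It ignores register and shared-memory limits, which depend on kernel code that was not shown. The function and constant names are made up for illustration.

```python
import math

# Assumed hardware limits (GTX 1050 Ti, compute capability 6.1).
NUM_SMS = 6
MAX_WARPS_PER_SM = 64        # 64 warps * 32 threads = 2048 threads
MAX_BLOCKS_PER_SM = 32
MAX_THREADS_PER_BLOCK = 1024
WARP_SIZE = 32

def launch_estimate(block, grid):
    """Estimate resident blocks and waves, ignoring register/shared-memory limits."""
    threads_per_block = math.prod(block)
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        raise ValueError(
            f"{threads_per_block} threads per block exceeds {MAX_THREADS_PER_BLOCK}"
        )
    # Threads are scheduled in whole warps, so a partial warp still takes a full slot.
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    blocks_per_sm = min(MAX_WARPS_PER_SM // warps_per_block, MAX_BLOCKS_PER_SM)
    resident_blocks = blocks_per_sm * NUM_SMS
    total_blocks = math.prod(grid)
    # Number of rounds needed if blocks ran in lock-step batches (a simplification).
    waves = math.ceil(total_blocks / resident_blocks)
    return threads_per_block, resident_blocks, total_blocks, waves

print(launch_estimate(block=(10, 7, 2), grid=(4, 8, 4)))
```

Under these assumptions, a block of 140 threads occupies 5 warps, so at most 12 blocks fit per SM and 72 of the 128 blocks can be resident at once; the grid therefore needs at least two waves. A single block of (40, 56, 8) = 17920 threads is far above the per-block limit and cannot be launched at all, which may be related to the error reported when the grid was set to 1.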

However, this doesn’t mean that 12K threads run with exactly the same speed and efficiency of 1 thread. If your problem is memory bound, for example, then all of this is pretty meaningless, and the question that would need to be investigated would be the memory usage of the algorithm as well as the available memory bandwidth of the GPU. A single thread might run without any actual constraints due to memory bandwidth. But when you run 12K threads, available memory bandwidth could enter the picture.
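A toy model can illustrate the memory-bound case. The sketch below assumes, purely for illustration, that every thread moves a fixed number of bytes, that the GPU sustains a fixed bandwidth, and that the compute part of all threads overlaps perfectly. All three figures are made-up placeholders, not measurements of the 1050 Ti or of the code in question.

```python
def estimated_time(n_threads, bytes_per_thread, bandwidth_bytes_per_s, t_compute):
    """Lower bound on run time: the larger of compute time and memory-transfer time."""
    t_memory = n_threads * bytes_per_thread / bandwidth_bytes_per_s
    return max(t_compute, t_memory)

# Hypothetical placeholder values, chosen only to show the trend.
BYTES_PER_THREAD = 10e6      # 10 MB of memory traffic per thread
BANDWIDTH = 100e9            # 100 GB/s sustained bandwidth
T_COMPUTE = 0.1              # seconds of compute, assumed fully overlapped

for n in (1, 100, 10000, 17920):
    print(n, estimated_time(n, BYTES_PER_THREAD, BANDWIDTH, T_COMPUTE))
```

With few threads the compute time dominates and adding threads costs nothing extra; past a certain count the memory term takes over and the total time grows linearly with the number of threads, which is the kind of behavior a profiler would help confirm or rule out.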

That’s just one possibility. It’s not possible to describe the performance limiters of a code you haven’t shown. Some of your thinking is flawed, in my opinion. If you learn how to do profiler analysis of your code, you’ll learn a lot about these things.