Calculate Thread index with Grid dim and Block dim

Hello,

I am a complete newbie with CUDA and I will soon start my first tests using pycuda on a Jetson Nano.

I need your help to clarify an important point: how to calculate the thread index.

My tests will be very simple. Mainly, I will do calculations on a monochrome image (that is to say, a matrix), probably stored as a one-dimensional array (maybe as a two-dimensional matrix).

From what I have understood (maybe):

The Jetson Nano has 128 CUDA cores in a single GPU.

Maximum number of threads per multiprocessor: 2048

That is to say, the maximum number of threads will be 2048.

Then, I have to choose:
Number of blocks (one or two dimensions)
Number of threads per block (the warp size is 32, so I must choose 32, 64, 128, etc. threads per block)

As I will first work on a one-dimensional array (image), I have to calculate idx.

If I choose 1 block and N threads per block:
unsigned int idx = threadIdx.x;

If I choose M blocks in one dimension (M,1) and N threads per block:
unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

If I want to work with a two-dimensional grid:
unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int idy = threadIdx.y + blockIdx.y * blockDim.y;

Are those things correct or not?

Many thanks for your help.

Alain

Mostly correct.
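The three index formulas in the question can be checked without a GPU. Below is a minimal pure-Python sketch that loops over every block and thread and applies the same arithmetic a kernel would. The helper names (`global_indices_1d`, `global_indices_2d`, `flatten_2d`) are made up for illustration and are not part of CUDA or pycuda; the row-major layout in `flatten_2d` is an assumption about how the image is stored.

```python
def global_indices_1d(grid_dim_x, block_dim_x):
    """Emulate idx = threadIdx.x + blockIdx.x * blockDim.x for every thread."""
    return [
        thread_idx_x + block_idx_x * block_dim_x
        for block_idx_x in range(grid_dim_x)
        for thread_idx_x in range(block_dim_x)
    ]


def global_indices_2d(grid_dim, block_dim):
    """Emulate (idx, idy) for a 2D grid of 2D blocks; dims are (x, y) tuples."""
    grid_x, grid_y = grid_dim
    block_x, block_y = block_dim
    return [
        (tx + bx * block_x, ty + by * block_y)
        for by in range(grid_y)
        for bx in range(grid_x)
        for ty in range(block_y)
        for tx in range(block_x)
    ]


def flatten_2d(idx, idy, width):
    """Row-major offset of pixel (idx, idy) in an image stored as a 1D array.

    Assumption: the image is stored row by row, `width` pixels per row.
    """
    return idy * width + idx
```

With 3 blocks of 4 threads, `global_indices_1d(3, 4)` yields each of the indices 0 to 11 exactly once, which is the property a kernel relies on when each thread handles one array element. With a single block, `blockIdx.x` is 0, so the second formula reduces to the first.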

Number of threads per block (warp size is 32 so I should choose 32, 64, 128 etc. threads per block)

Yes

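You can choose 2048 threads in total, but you can also launch more than 2048 threads, and it will work fine. The 2048 figure is the number of threads that can be resident on one multiprocessor at a time, not a limit on the size of the grid.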
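To illustrate launching more threads than 2048, here is a small pure-Python sketch of one common pattern: pick a block size, round the number of blocks up so there is at least one thread per element, and guard the array access so surplus threads in the last block do nothing. The function names are hypothetical, and the loops only emulate on the CPU what the GPU would do in parallel.

```python
def blocks_needed(n_elements, threads_per_block):
    """Ceiling division: smallest block count giving one thread per element."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def emulate_launch(n_elements, threads_per_block):
    """Emulate a 1D launch and count how often each element is visited.

    Returns (number of blocks, per-element visit counts, idle thread count).
    """
    grid_dim_x = blocks_needed(n_elements, threads_per_block)
    visits = [0] * n_elements
    idle_threads = 0
    for block_idx_x in range(grid_dim_x):
        for thread_idx_x in range(threads_per_block):
            idx = thread_idx_x + block_idx_x * threads_per_block
            if idx < n_elements:
                # Same guard a kernel would use: if (idx < n) { ... }
                visits[idx] += 1
            else:
                # Thread falls past the end of the array and does no work.
                idle_threads += 1
    return grid_dim_x, visits, idle_threads
```

For a 1000-element array with 256 threads per block, this gives 4 blocks (1024 threads), every element visited once, and 24 idle threads in the last block. A 640 x 480 image (307200 pixels) at 256 threads per block needs 1200 blocks, far more than 2048 threads in total.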

Hello Robert,

many thanks for your reply.

CUDA is really interesting. I am used to thinking in “CPU” terms, and “GPU” thinking is quite different, but it will bring me many new opportunities.

I think I will be back very often to ask newbie questions!

Have a nice day.

Alain

I’m back!

My first pycuda program works great. Champagne!

Profit!

If we want to run the Nano as efficiently as possible, should we aim to run 2048 threads exactly (i.e. 2 blocks of 1024 threads each)?

Run timing tests to see how the performance varies. They should provide you with an answer. Preferably the test should be with your program or one similar to it because the answer will likely depend on the nature of your problem and what its limiting factors are.
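A timing test along those lines could be organised as in the sketch below. It is only a harness: `run_kernel` is a hypothetical callable that you would write yourself to launch your kernel with the given number of blocks and threads per block, and it must wait for the GPU to finish before returning, since kernel launches are asynchronous. The `fake_kernel` stand-in is CPU-only and exists just so the harness can be run as is.

```python
import time


def time_configurations(run_kernel, n_elements, block_sizes, repeats=5):
    """Time run_kernel(blocks, threads_per_block) for each candidate block size.

    Returns a dict mapping threads_per_block to the best wall-clock time (s).
    """
    results = {}
    for threads_per_block in block_sizes:
        # Enough blocks for one thread per element (ceiling division).
        blocks = (n_elements + threads_per_block - 1) // threads_per_block
        run_kernel(blocks, threads_per_block)  # warm-up run, not timed
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run_kernel(blocks, threads_per_block)
            best = min(best, time.perf_counter() - start)
        results[threads_per_block] = best
    return results


def fake_kernel(blocks, threads_per_block):
    """CPU stand-in for a real launch; replace with your own pycuda call."""
    total = 0
    for i in range(blocks * threads_per_block):
        total += i
    return total
```

A call such as `time_configurations(fake_kernel, 10000, [32, 64, 128, 256, 512, 1024])` returns one timing per block size. With a real kernel, comparing those numbers on your own image sizes is what tells you which configuration suits the Nano best.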

Okay, I’ll do some tests.

I assumed 1024 was best since that was what the official sample code used; however, the specs say 2048 threads is possible.