Experimentally determining the number of concurrent threads


I need to determine the number of concurrent threads on a GPU.
I did some calculations and some experiments, and there are discrepancies between the calculated and measured results.
The purpose of this is to determine the largest problem size that can be computed in a single (instantaneous) kernel call, without any block/thread replacement.

The number of concurrent threads on my GPU (930M, compute capability 5.0), if I understand it correctly, is
SMCount * warpsPerSM * threadsPerWarp = 3 * 64 * 32 = 6144

I suppose the algorithm should take the same time for any number of threads lower than 6144.

I put together an extremely simple benchmark:

#include "cuda_runtime.h"
#include <cmath>
#include <fstream>
#include <string>

__global__ void dummyCall()
{
	for (int i = 0; i < 1000; i++)
		; // empty busy loop
}

void getTimes()
{
	const unsigned replicationsCount = 20;
	std::ofstream dataFile("times.csv"); // output file name chosen arbitrarily

	for (int i = 1; i < 7000; i++) // i = threadCount
	{
		float timesSum = 0;
		for (unsigned replications = 0; replications < replicationsCount; replications++)
		{
			cudaEvent_t start, stop;
			cudaEventCreate(&start);
			cudaEventCreate(&stop);

			cudaEventRecord(start);
			dummyCall<<<(int)ceil((float)i / 1024), 1024>>>(); // 1024 is the number of threads per block according to cuda occupancy
			cudaEventRecord(stop);
			cudaEventSynchronize(stop);

			float ms = 0;
			cudaEventElapsedTime(&ms, start, stop);
			timesSum += ms;

			cudaEventDestroy(start);
			cudaEventDestroy(stop);
		}
		dataFile << i << ';' << timesSum / replicationsCount << std::endl;
	}
}

int main()
{
	getTimes();
	return 0;
}

The results are not as I expected.
I am getting the same execution time for threadCounts <= 3072; from then on, the execution time doubles.

Is there anything I am missing or could someone explain this phenomenon to me? I would be very thankful.

Thanks, have a nice day.

For all compute capabilities between 3.0 and 7.0 inclusive, the maximum thread-carrying capacity is 2048 * (number of SMs); Turing (cc 7.5) lowered this multiplier to 1024. Whether or not a particular code can actually reach the theoretical maximum of 2048 threads per SM is a matter of occupancy, which you can google and read about.

Your dummyCall kernel will be optimized down to an empty kernel if you are compiling non-debug (release) code. This will possibly give you results different from what you expect.
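A common workaround (a sketch, not the poster's code; the `out` parameter is a hypothetical device pointer you would allocate with cudaMalloc) is to give the loop an observable side effect so the compiler cannot delete it:

```cuda
__global__ void dummyCall(int *out)
{
    int acc = 0;
    for (int i = 0; i < 1000; i++)
        acc += i;                 // real arithmetic, not an empty loop

    // Writing the result to global memory is an observable side effect,
    // so the compiler cannot eliminate the loop in release builds.
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *out = acc;
}
```

Marking the loop variable `volatile` inside the kernel is another frequently used trick to the same end.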

For grid-sizing of arbitrary kernels, you’re better off using the CUDA occupancy API. That is its purpose. Again, google is your friend.
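For reference, a minimal sketch of that API applied to the dummyCall kernel above; the call shown is `cudaOccupancyMaxPotentialBlockSize` from the CUDA runtime, and error checking is omitted for brevity:

```cuda
#include "cuda_runtime.h"
#include <cstdio>

__global__ void dummyCall() { /* ... */ }

int main()
{
    int minGridSize = 0, blockSize = 0;

    // Asks the runtime for the block size that maximizes occupancy for this
    // kernel, and the minimum grid size needed to saturate the device.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, dummyCall, 0, 0);

    printf("suggested block size: %d, minimum grid size: %d\n",
           blockSize, minGridSize);

    // To cover N threads of work, you would then size the grid as:
    //   int gridSize = (N + blockSize - 1) / blockSize;
    //   dummyCall<<<gridSize, blockSize>>>();
    return 0;
}
```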

So my formula is totally wrong.
I found it somewhere on Stack Overflow and it kind of made sense to me, so I stuck with it.
Now that I understand it a bit more, I checked the CUDA Programming Guide myself and confirmed what you said.

As for the rest of your post, thank you very much. It is helpful.

I’m not sure why you think so. Don’t both methods produce the same number?

3 * 2048 = 6144

3 * 64 * 32 = 6144


Because, according to the CUDA documentation, my formula would yield for cc 2.1: SMCount * 48 (warps/SM) * 32 (threads/warp), which is not equal to what you said, because 48 * 32 = 1536 != 2048, if I understand correctly.

Actually, once again, it is you who are correct and I who am mistaken. I’ve revised my previous statement; the correct data is here:


cc2.x has a maximum of 1536 threads per SM, not 2048. (I had forgotten, it’s been a while since I used Fermi devices)