I’m running on a Tesla T10 device, and I think I remember seeing that there is a maximum number of threads that can be invoked for a kernel call. Is this true? I am basically trying to figure out how hard I can push the Tesla T10. I know the maximum dimensions of a grid, thread block, etc., but I think that 24,000 blocks at 256 threads per block = 6,144,000 total threads will always fail. Is this correct?
EDIT: I’ve tested and so far 2,304,000 threads still works. Somewhere between there and 6,144,000 it breaks, and I wonder what the cause is.
No it isn’t correct. You can theoretically have 512 * 65535 * 65535 = 2,198,956,147,200 threads per kernel launch (although the actual per-block limit can be somewhat less depending on kernel register and shared memory usage).
Hmmm, well, that’s the only thing I can see at this moment that would make it so that my kernel fails to launch. I will keep experimenting to see if I can find more variables that play into my kernel’s demise. Thanks for the info, avidday.
This program takes about 1-1.5min to pre-load data, then it launches the kernel. Without the recent modifications I have made, I can run the full data set using a different kernel scheme (same algorithm, though), and it will run fine. The difference is that now it is a much more bite-size problem whereas before I was making each thread do too much work. So, the only difference I can see is the increase in number of threads.
Ok, another shot in the dark: Is your execution path data dependent? If you vary the number of threads through varying the amount of data you read, there might be something in the data that triggers an unusual execution path. E.g., unbalanced __syncthreads() calls.
Of course, this is as much speculation as my previous post.
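For what an unbalanced __syncthreads() looks like, here is a deliberately broken sketch (kernel and variable names are made up for the example; this is not the poster's code):

```cuda
/* Illustration of an unbalanced __syncthreads(): if the branch is
 * data dependent, some threads in a block reach the barrier while
 * others never do, which can hang the block or corrupt results. */
__global__ void broken(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (in[i] > 0.0f) {            /* data-dependent branch...        */
        tile[threadIdx.x] = in[i];
        __syncthreads();           /* ...so only SOME threads sync!   */
    }
    out[i] = tile[threadIdx.x];    /* may read stale/uninitialized data */
}
```

The fix is to hoist the barrier out of the conditional so every thread in the block executes it unconditionally.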
That’s a good thought, but I don’t vary the amount of data by limiting the number of threads. I hard code the number of threads I want to run per block. I have a .txt file that tells me which data set to read from. When my data set is reading from 1500 sets of text files, I invoke a kernel with about 3 million threads. The next set, 1750, takes about 3.6 million threads. For 2000, 4.1 million threads. As of a few minutes ago, this data set would not run. There was always a fail from one of my cudaMemcpys from Host to Device, or so it seems. I really do think that my GPU device has an attitude with me on certain days.
Just now ran my full data set. It worked on the first try. Usually it fails on the first try. I don’t get it.
This error goes back to something I posted a few weeks ago. Depending on how large my data set is, it often fails the first time I run the program. Then, the second time it works perfectly. For example, when I have my matrix set to take in 2000 sets of data (so, the output matrices are 2000 x 2000 in size), then the program either fails during the first of my many cudaMemcpy() calls from Host to Device, or I get an “unspecified kernel launch failure”. Currently, that 2000 data set is running right now. It wouldn’t work for me a few minutes ago at all.
There is no way that sort of intermittent problem has anything to do with execution arguments (unless you are really close to the watchdog timer and additional work is pushing it over the edge sometimes).
If I were to guess, I would say you have some out of bounds memory access which is randomly hosing something in the GPU and leaving it in a parlous state, or you have a hardware problem.
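One way to narrow that down is to check the return code of every runtime call. Because kernel launches are asynchronous, a kernel that dies from an out-of-bounds access often surfaces as a failure in the next runtime call, e.g. a cudaMemcpy that is itself perfectly fine, which matches the symptoms described above. A minimal checking sketch, using the standard CUDA runtime API of that era (cudaThreadSynchronize predates cudaDeviceSynchronize):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Wrap every runtime call so the first real error is reported at the
 * line where it is detected, not several calls later. */
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err));                         \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

/* Usage sketch (myKernel, grid, block, d_data are placeholders):
 *
 *   myKernel<<<grid, block>>>(d_data);
 *   CHECK(cudaGetLastError());       // catches a bad launch configuration
 *   CHECK(cudaThreadSynchronize());  // catches errors during execution
 *   CHECK(cudaMemcpy(h, d, n, cudaMemcpyDeviceToHost));
 */
```

If the error moves to the cudaThreadSynchronize() after a particular kernel, that kernel is the one scribbling out of bounds.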