I’m running on a Tesla T10 device, and I think I remember seeing that there is a maximum number of threads that can be invoked for a kernel call. Is this true? I am basically trying to figure out how hard I can push the Tesla T10. I know the maximum dimensions of a grid, thread block, etc., but I think that 24,000 blocks at 256 threads per block = 6,144,000 total threads will always fail. Is this correct?
EDIT: I’ve tested and so far 2,304,000 threads still works. Somewhere between there and 6,144,000 it breaks, and I wonder what the cause is.
No it isn’t correct. You can theoretically have 512 * 65535 * 65535 = 2,198,956,147,200 threads per kernel launch (although the actual per-block limit can be somewhat less depending on kernel register and shared memory usage).
Hmmm, well, that’s the only thing I can see at this moment that would make it so that my kernel fails to launch. I will keep experimenting to see if I can find more variables that play into my kernel’s demise. Thanks for the info, avidday.
This program takes about 1-1.5min to pre-load data, then it launches the kernel. Without the recent modifications I have made, I can run the full data set using a different kernel scheme (same algorithm, though), and it will run fine. The difference is that now it is a much more bite-size problem whereas before I was making each thread do too much work. So, the only difference I can see is the increase in number of threads.
Ok, another shot in the dark: Is your execution path data dependent? If you vary the number of threads through varying the amount of data you read, there might be something in the data that triggers an unusual execution path. E.g., unbalanced __syncthreads() calls.
Of course, this is as much speculation as my previous post.
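For what an unbalanced __syncthreads() looks like, here is a deliberately broken sketch (kernel and variable names are made up for the example; this is not the poster's code):

```cuda
/* Illustration of an unbalanced __syncthreads(): if the branch is
 * data dependent, some threads in a block reach the barrier while
 * others never do, which can hang the block or corrupt results. */
__global__ void broken(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (in[i] > 0.0f) {            /* data-dependent branch...        */
        tile[threadIdx.x] = in[i];
        __syncthreads();           /* ...so only SOME threads sync!   */
    }
    out[i] = tile[threadIdx.x];    /* may read stale/uninitialized data */
}
```

The fix is to hoist the barrier out of the conditional so every thread in the block executes it unconditionally.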
That’s a good thought, but I don’t vary the amount of data by limiting the number of threads. I hard code the number of threads I want to run per block. I have a .txt file that tells me which data set to read from. When my data set is reading from 1500 sets of text files, I invoke a kernel with about 3 million threads. The next set, 1750, takes about 3.6 million threads. For 2000, 4.1 million threads. As of a few minutes ago, this data set would not run. There was always a fail from one of my cudaMemcpys from Host to Device, or so it seems. I really do think that my GPU device has an attitude with me on certain days.
Just now ran my full data set. It worked on the first try. Usually it fails on the first try. I don’t get it.
This error goes back to something I posted a few weeks ago. Depending on how large my data set is, it often fails the first time I run the program. Then, the second time it works perfectly. For example, when I have my matrix set to take in 2000 sets of data (so, the output matrices are 2000 x 2000 in size), then the program either fails during the first of my many cudaMemcpy() calls from Host to Device, or I get an “unspecified kernel launch failure”. Currently, that 2000 data set is running right now. It wouldn’t work for me a few minutes ago at all.
There is no way that sort of intermittent problem has anything to do with execution arguments (unless you are really close to the watchdog timer and additional work is pushing it over the edge sometimes).
If I were to guess, I would say you have some out of bounds memory access which is randomly hosing something in the GPU and leaving it in a parlous state, or you have a hardware problem.
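One way to narrow that down is to check the return code of every runtime call. Because kernel launches are asynchronous, a kernel that dies from an out-of-bounds access often surfaces as a failure in the next runtime call, e.g. a cudaMemcpy that is itself perfectly fine, which matches the symptoms described above. A minimal checking sketch, using the standard CUDA runtime API of that era (cudaThreadSynchronize predates cudaDeviceSynchronize):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Wrap every runtime call so the first real error is reported at the
 * line where it is detected, not several calls later. */
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err));                         \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

/* Usage sketch (myKernel, grid, block, d_data are placeholders):
 *
 *   myKernel<<<grid, block>>>(d_data);
 *   CHECK(cudaGetLastError());       // catches a bad launch configuration
 *   CHECK(cudaThreadSynchronize());  // catches errors during execution
 *   CHECK(cudaMemcpy(h, d, n, cudaMemcpyDeviceToHost));
 */
```

If the error moves to the cudaThreadSynchronize() after a particular kernel, that kernel is the one scribbling out of bounds.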