Concurrent kernels and textures

I would like to know where more information can be found w.r.t. the maximum number of textures that can be used when trying to run kernels concurrently.

This presentation, http://on-demand.gputechconf.com/gtc-express/2011/presentations/StreamsAndConcurrencyWebinar.pdf, mentions an 8-texture limit in the context of concurrent kernel execution.

First of all, this dates back to 2011. Is that still accurate with CUDA 6.0 and Kepler/Maxwell?

I can interpret that in a few ways.

Is it that the sum of the textures used by all N kernels that want to run concurrently cannot be more than 8? (Say kernel A uses 6 textures and kernel B uses 2 textures: A and B could run concurrently; but kernel C using 7 textures and kernel D using 3 textures could not.)

Or that each of the N kernels cannot, on its own, use more than 8 textures? (Say kernel A uses 7 textures and kernel B uses 7 textures: they could run concurrently.)

Or is it the total number of textures currently bound within a context? (Say kernels A, B, and C each use 3 textures: no two of them could run concurrently because the total number of bound textures is 9.)

My situation is that I run N streams. Those N streams do the same overall task, and that task uses 3 kernels in a host loop. 2 textures are shared by all 3 kernels. Each stream independently calls cudaBindTexture. There are also a number of non-shared textures: 6 for kernel A, 4 for kernel B, and the same 4 for kernel C. Again, each stream calls cudaBindTexture independently for all textures.
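To make the structure concrete, here is a minimal sketch of that kind of multi-stream, texture-reference setup. Every name in it (NUM_STREAMS, texShared0, kernelA, d_out, the loop counts) is hypothetical and not from my actual code; it only illustrates the pattern: a global (legacy) texture reference bound with the runtime binding API, and the same kernel launched into each stream from a host loop.

```cpp
#include <cuda_runtime.h>

// Legacy texture reference: a global, file-scope object (the API that
// cudaBindTexture/cudaBindTexture2D work with; removed in CUDA 12).
texture<float, 2, cudaReadModeElementType> texShared0;

// Trivial stand-in for one of the three kernels; it only reads the texture.
__global__ void kernelA(float *out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < 2000)
        out[x] = tex2D(texShared0, x + 0.5f, 0.5f);
}

int main()
{
    const int NUM_STREAMS = 4;                     // hypothetical stream count
    cudaStream_t streams[NUM_STREAMS];
    for (int i = 0; i < NUM_STREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    // One pitched 2D allocation, roughly the 2000x3 float size mentioned later.
    size_t pitch;
    float *d_tex;
    cudaMallocPitch((void**)&d_tex, &pitch, 2000 * sizeof(float), 3);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaBindTexture2D(NULL, texShared0, d_tex, desc, 2000, 3, pitch);

    float *d_out;
    cudaMalloc((void**)&d_out, 2000 * sizeof(float));

    // Host loop: each iteration pushes the (sometimes tiny) kernel launches
    // into every stream, which is where concurrency is hoped for.
    dim3 block(128);
    for (int step = 0; step < 10; ++step)
        for (int i = 0; i < NUM_STREAMS; ++i)
            kernelA<<<dim3(1), block, 0, streams[i]>>>(d_out);  // grid can be 1 block

    cudaDeviceSynchronize();
    return 0;
}
```

The real code binds the textures per stream and launches three different kernels, but the launch structure is the same as above.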

The profiler tells me kernels are not running concurrently, and wall time would seem to agree. I’m trying to figure out if it is due to the number of textures used. This is on a K20.

Also, a nice option would be for “something” to report why kernels were not able to run concurrently.

Thanks!

There are some compute-capability constraints/limitations/specifications pertaining to textures, but none really seem relevant to your particular dilemma.

Kernel concurrency with textures should resolve the same way as kernel concurrency without textures, just with the textures as an additional factor/consideration.
Thus, the fact that your kernels fail to run concurrently may well have nothing to do with your textures.

Factors that I know of that can prevent concurrency are memory usage (shared and also local), kernel grid/block dimensions, and apparently also texture sizes; a small runtime-query sketch for some of these follows after the questions below.
Therefore:
How much shared memory are you using per kernel?
What are your kernels' grid/block dimensions?
How big are the textures, respectively?
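If it helps, here is a small sketch (kernelA is just a placeholder for one of your own kernels) that queries some of those numbers straight from the runtime: cudaGetDeviceProperties reports whether the device supports concurrent kernels and the shared-memory-per-block limit, and cudaFuncGetAttributes reports each kernel's static shared memory, register, and local memory usage.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA() { }   // placeholder: substitute one of the real kernels

int main()
{
    // Device-wide properties relevant to concurrency.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("concurrentKernels supported: %d\n", prop.concurrentKernels);
    printf("shared memory per block:     %zu bytes\n", prop.sharedMemPerBlock);

    // Per-kernel resource usage as compiled.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernelA);
    printf("kernelA: %zu B static shared, %d registers, %zu B local per thread\n",
           attr.sharedSizeBytes, attr.numRegs, attr.localSizeBytes);
    return 0;
}
```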

The grid size varies over the course of the execution, but it can be as small as 1 block. This is in fact the reason that pushed me towards investigating concurrent kernels. I want these very inefficient kernel launches to run concurrently with the corresponding inefficient kernel launches from the other streams.

Block size is 128.

A tiny amount (<100 bytes usually) of shared memory per kernel is used.

The textures that I have listed in my original message are mostly 2D textures of floats, around 2000x3 in size.

“The grid size varies over the course of the execution, but it can be as small as 1 block.”

What is the maximum grid size of any kernel running in each stream?

“The textures that I have listed in my original message are mostly 2D textures of float of size around 2000x3.”

Individually, or collectively? What is the maximum total texture memory size that any kernel in a stream accesses at a time?

Well, I have found my mistake!

I use a set of 3 integer variables (with one copy of each variable in device memory and another copy in host memory, so 6 integers total) to drive the simulation. It turns out that I had not allocated the host copies of those variables with cudaMallocHost, and that screwed everything up: asynchronous copies involving pageable host memory kept the streams from overlapping.
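For anyone who hits the same thing, here is a minimal sketch of the fix (the variable names are hypothetical, not from my code): the host-side control integers have to be allocated with cudaMallocHost so that cudaMemcpyAsync can actually overlap with other work; with plain malloc'd (pageable) host memory the copies are not fully asynchronous and can serialize otherwise independent streams.

```cpp
#include <cuda_runtime.h>

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int *d_ctrl;
    cudaMalloc((void**)&d_ctrl, 3 * sizeof(int));

    // Page-locked (pinned) host copy of the three control integers.
    // With malloc'd pageable memory instead, cudaMemcpyAsync cannot be
    // fully asynchronous and the transfer can serialize otherwise
    // independent streams.
    int *h_ctrl;
    cudaMallocHost((void**)&h_ctrl, 3 * sizeof(int));
    h_ctrl[0] = h_ctrl[1] = h_ctrl[2] = 0;

    cudaMemcpyAsync(d_ctrl, h_ctrl, 3 * sizeof(int),
                    cudaMemcpyHostToDevice, stream);   // truly asynchronous now

    cudaStreamSynchronize(stream);
    cudaFreeHost(h_ctrl);
    cudaFree(d_ctrl);
    cudaStreamDestroy(stream);
    return 0;
}
```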

On the bright side, I can’t complain that this wasn’t documented! :)

W.r.t. my initial query, I can attest that even if a kernel uses 8 textures, it can run concurrently with other kernels that use >0 textures.

Thanks for your help little_jimmy!