Should I use low gridDim/blockDim launch config when the GPU is shared between users?

I’m writing some library code in CUDA. In my use case, many users share one GPU (a professional GPU like a Quadro/Tesla/A30/…), or one user launches many processes that use the GPU. They’re used to running CUDA from C++ programs.

For most kernels, there are diminishing returns when increasing gridDim/blockDim. For example, going from 80/128 → 160/256 can quadruple performance, while 160/256 → 160/512 or 320/256 gives either a small gain (10–20%) or even degrades performance.

But assuming that increasing gridDim/blockDim does improve performance, should I always launch kernels with as many resources as possible, or with only just enough? I worry that with the first option, performance will degrade when too many processes use the GPU at the same time.
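For what it's worth, one common pattern that decouples the launch configuration from the problem size is a grid-stride loop: the kernel stays correct under any gridDim/blockDim, so you can tune the launch configuration freely and even lower it on a contended GPU. A minimal sketch (the kernel name and the example configuration are just illustrative):

```cuda
__global__ void scale(float *x, int n, float a)
{
    // Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
    // so any grid/block configuration covers the whole array.
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= a;
}

// Launch with whatever configuration benchmarks best; correctness is unaffected:
// scale<<<160, 256>>>(d_x, n, 2.0f);
```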

In your experience, what would you do in this case? I also wonder how Thrust solves this resource problem when there are multiple users/processes.
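Regarding how a library might pick launch parameters without hard-coding them: CUDA exposes an occupancy API that lets code ask the runtime for a good block size for a given kernel on the current device. A hedged sketch (the kernel and wrapper names are placeholders, not Thrust internals):

```cuda
__global__ void myKernel(float *x, int n) { /* ... */ }

void launchMyKernel(float *d_x, int n)
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the block size that maximizes occupancy for this
    // kernel on the current device (0 bytes of dynamic shared memory,
    // no block-size limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;
    myKernel<<<gridSize, blockSize>>>(d_x, n);
}
```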
Thank you.

I doubt it is going to make any difference with respect to multiple users.

Where there are multiple independent users, kernel activity is serialized between users: the GPU performs a context switch from user A to user B when it switches from running kernel(s) launched by user A to kernel(s) launched by user B.

Oh, that is really important good news. By “users”, do you mean Linux login users, or different processes? What about kernels launched from two threads of one process?

For example, do I need to worry about processes thrashing each other's L1 cache and reducing overall performance?

How did you mean it?

I mean processes. Well both, actually.

Let’s say many scientists share a machine, and each scientist launches multiple processes that use the GPU differently.

In this case, can I assume that, from a process’s point of view, it has access to all GPU resources (aside from global memory) and can act as if it’s the only one using the GPU? For example, if process 1 is using shared memory + L1 cache and then gets context-switched out in favor of process 2, what happens to process 1’s shared-memory and L1-cache state?

With respect to processes from separate “Linux login users” or “scientists”, my statements apply.

With respect to processes from the same user, my statements apply, unless MPS is in use by that user.

My statements don’t apply to threads belonging to the same process.

For the multi-process case without MPS, you can expect that each process runs in its own “space”, and any data from one space is not visible to another space.

The context-switching mechanism is not specified in detail by NVIDIA anywhere that I am aware of. The only reasonable expectation I have is that context switching does not introduce the possibility of defective behavior: your code should still run and give a reasonable result.
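As an aside on the MPS exception above: if a single user does want kernels from multiple processes to run concurrently on the GPU rather than being context-switched, enabling MPS looks roughly like this (a hedged sketch; exact device selection, permissions, and environment variables depend on your system, so check NVIDIA's MPS documentation):

```
# Start the MPS control daemon for the chosen device
export CUDA_VISIBLE_DEVICES=0
nvidia-cuda-mps-control -d

# CUDA processes started afterwards funnel their work through the shared
# MPS server, so kernels from different processes can overlap.

# Shut the daemon down when done
echo quit | nvidia-cuda-mps-control
```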

Okay, thanks! Then I can safely assume that a process can act as if it’s the only one using the GPU, and use as many blocks/threads per kernel launch as needed.