I wrote a small test program for a CUDA device in which I can specify the configuration of the grid within some limits (number of threads, size of blocks, dimensions of the grid, etc.).
If possible, I would also like to look at some GPU metrics other than the resulting runtime of the test, for example the number of context switches, cache hits/misses, or similar values.
Does anybody know where to get information about such things?
As you implement your cache usage yourself, there can be no hardware counters for that. The usual problem with context switches is that caches get invalidated and that caches and registers have to be saved to some slower memory. However, this does not occur here, as all concurrent threads/blocks have their own separate registers (which can of course limit your occupancy).
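To see how registers and block size limit occupancy for a given configuration, something like the following might help. It is only a minimal sketch using the occupancy query from the CUDA runtime API, so it assumes a toolkit recent enough to provide it; `testKernel`, `theoreticalOccupancy` and `registersPerThread` are hypothetical names, not anything from your program.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace it with the kernel from your test program.
__global__ void testKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Theoretical occupancy (resident threads / max threads per SM) for one block size.
// Returns a value in [0, 1], or -1.0f if a CUDA call fails.
float theoreticalOccupancy(int blockSize, size_t dynamicSmemBytes)
{
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1.0f;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1.0f;

    // How many blocks of this size can be resident on one multiprocessor,
    // given the kernel's register and shared memory usage.
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, testKernel, blockSize, dynamicSmemBytes) != cudaSuccess)
        return -1.0f;

    return (float)(blocksPerSM * blockSize) /
           (float)prop.maxThreadsPerMultiProcessor;
}

// Registers per thread that the compiler assigned to the kernel, or -1 on error.
int registersPerThread()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, testKernel) != cudaSuccess) return -1;
    return attr.numRegs;
}
```

Calling `theoreticalOccupancy(blockSize, 0)` for each block size you test (e.g. 64, 128, 256, 512) shows which configurations are limited. Note that this is only the theoretical value, not a measured hardware counter.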
Context switches in the sense of switching between warps would be nice to know about. However, I have several benchmarks whose performance does not change whether I run them with over a million threads or just a few thousand. So warp switching is pretty fast and shouldn't be an issue as long as there are enough warps.
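If you want to check this on your own device, a rough sketch of such a benchmark could look like the one below. It is not my actual benchmark code: `strideKernel` and `timeLaunchMs` are hypothetical names, and the kernel uses a grid-stride loop so that the total amount of work stays fixed while the number of threads varies. Timing is done with CUDA events.

```cuda
#include <cuda_runtime.h>

// Hypothetical benchmark kernel: a grid-stride loop, so the total work
// (n elements) stays the same no matter how many threads are launched.
__global__ void strideKernel(float *data, int n)
{
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = data[i] * 1.0001f + 1.0f;
}

// Times one launch over n elements with numBlocks x blockSize threads.
// Returns elapsed milliseconds, or -1.0f if a CUDA call fails.
float timeLaunchMs(int n, int numBlocks, int blockSize)
{
    float *d = nullptr;
    size_t bytes = (size_t)n * sizeof(float);
    if (cudaMalloc(&d, bytes) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    strideKernel<<<numBlocks, blockSize>>>(d, n);   // warm-up launch, not timed
    cudaEventRecord(start);
    strideKernel<<<numBlocks, blockSize>>>(d, n);   // timed launch
    cudaEventRecord(stop);

    float ms = -1.0f;
    bool ok = cudaEventSynchronize(stop) == cudaSuccess &&
              cudaGetLastError() == cudaSuccess &&
              cudaEventElapsedTime(&ms, start, stop) == cudaSuccess;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ok ? ms : -1.0f;
}
```

For example, comparing `timeLaunchMs(1 << 24, 16, 256)` (a few thousand threads) with `timeLaunchMs(1 << 24, 4096, 256)` (about a million threads) gives an idea of how much the thread count matters for the same workload on your hardware.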