Pascal & Turing string pattern matching benchmarks differ noticeably

Hi! I am evaluating an optimization technique called partial evaluation for GPU applications, applied to a naive multiple-pattern string matching program. I have two GPUs available for benchmarking: a Tesla T4 (Turing) and a GTX 1070 (Pascal). I have four implementations of string pattern matching, each using a different memory space to store the patterns to match: global, constant, shared, and one using partial evaluation to “store” the patterns. While the partially evaluated version behaves similarly on both GPUs, the other implementations perform differently (sorry for the links, but I am not allowed to post images):

The kernels are identical up to the choice of memory space; e.g., the kernel with the patterns in global memory is:

__global__ void match_multy(char* patterns, int* p_sizes, int p_number, char* text, long text_size, char* result_buf) {

    long t_id = blockIdx.x * (long)blockDim.x + threadIdx.x; // global thread index: one thread per starting position in the text

    if(t_id < text_size){
        int p_offset = 0;
        int matched = 1;
        
        result_buf[t_id] = 0;

        for(int i = 0; i < p_number; i++) {//for each pattern
            matched = 1;
            if(t_id < text_size - p_sizes[i] + 1) {
                for(int j = 0; j < p_sizes[i]; j++) {
                    if(text[t_id + j] != patterns[j + p_offset]) {
                        matched = -1;
                        break;
                    }
                }
            
                if(matched == 1) {
                    result_buf[t_id] = i+1; // 0 stands for mismatch
                }
            }
            p_offset += p_sizes[i];
        }             
    }
}

i.e. all the patterns are concatenated into a single one-dimensional char array, with p_sizes holding the length of each pattern.
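For example, for two hypothetical patterns "cat" and "dog" the host-side buffers would be laid out like this (illustrative names only, not the actual host code):

// Hypothetical example: two patterns concatenated into one buffer, no separators.
const char h_patterns[] = { 'c','a','t',  'd','o','g' };
const int  h_p_sizes[]  = { 3, 3 };   // length of each pattern
const int  p_number     = 2;          // number of patterns
// In the kernel, p_offset advances by p_sizes[i] to find where pattern i starts.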

The aim of partial evaluation here is to embed the patterns into the code and reduce the number of memory transactions, as shown here:
https://imgbb.com/
https://imgbb.com/
Partial evaluation also unrolls the loops, much like #pragma unroll would.
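
To make that concrete, for the two hypothetical patterns above the residual (specialized) kernel would look roughly like this (a simplified sketch of what the partial evaluator produces, not the actual generated code):

__global__ void match_multy_specialized(char* text, long text_size, char* result_buf) {

    long t_id = blockIdx.x * (long)blockDim.x + threadIdx.x;

    if(t_id < text_size) {
        result_buf[t_id] = 0;

        // pattern 0: "cat" -- the characters are immediate operands, so there are no
        // memory reads for the patterns, and the inner loop over j is fully unrolled
        if(t_id < text_size - 3 + 1) {
            if(text[t_id] == 'c' && text[t_id + 1] == 'a' && text[t_id + 2] == 't')
                result_buf[t_id] = 1;
        }

        // pattern 1: "dog"
        if(t_id < text_size - 3 + 1) {
            if(text[t_id] == 'd' && text[t_id + 1] == 'o' && text[t_id + 2] == 'g')
                result_buf[t_id] = 2;
        }
    }
}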

The execution time includes only the kernel run time; it was captured by running the applications several times under nvprof and averaging the results.

My assumption as to why the partially evaluated version is the fastest is the following:
As can be seen from the kernel, the access pattern for the string patterns is not optimal: the global memory version fetches redundant data and has trouble with coalescing, the constant memory version cannot broadcast because threads diverge on mismatches, and the shared memory version should suffer from bank conflicts (a sketch of its pattern staging is below). Since partial evaluation avoids this transaction overhead for the patterns altogether, it is the fastest.
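
For context, the shared memory variant stages the patterns roughly like this before the matching loop (a simplified sketch; the names total_p_size and s_patterns are illustrative, the real kernel otherwise matches the global memory one):

__global__ void match_multy_shared(char* patterns, int* p_sizes, int p_number, int total_p_size,
                                   char* text, long text_size, char* result_buf) {
    // total_p_size = sum of all pattern lengths; launched with that many bytes of dynamic shared memory
    extern __shared__ char s_patterns[];
    for(int k = threadIdx.x; k < total_p_size; k += blockDim.x)
        s_patterns[k] = patterns[k];
    __syncthreads();

    long t_id = blockIdx.x * (long)blockDim.x + threadIdx.x;

    if(t_id < text_size) {
        int p_offset = 0;
        result_buf[t_id] = 0;

        for(int i = 0; i < p_number; i++) {
            if(t_id < text_size - p_sizes[i] + 1) {
                int matched = 1;
                for(int j = 0; j < p_sizes[i]; j++) {
                    // 4 consecutive chars of a pattern map to the same 4-byte bank,
                    // which is where I expected the conflicts to come from
                    if(text[t_id + j] != s_patterns[j + p_offset]) { matched = -1; break; }
                }
                if(matched == 1) result_buf[t_id] = i + 1;
            }
            p_offset += p_sizes[i];
        }
    }
}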

The assumption seems to be validated for the GTX 1070 by the following memory statistics charts from nvvp:

As can be seen, the partially evaluated version has the lowest number of memory access instructions and the lowest amount of data actually transferred between the L1 and L2 caches.

However, when profiling on the Tesla T4 (with Nsight Compute for now, since nvvp simply does not work with cc 7.x and Nsight Compute does not work with the GTX 1070), the memory charts look different:

So now I am ready to post my questions:

  1. Does the assumption above make sense and seem correct?
  2. Is the difference in the number of memory access instructions (and in the amount of data actually transferred) between the shared and constant memory versions caused by bank conflicts in shared memory? (Surprisingly, Nsight Compute reports 0 bank conflicts, although 4 consecutive chars of a pattern reside in the same bank.)
  3. Why could the amount of data actually transferred reported by Nsight Compute for the Tesla T4 differ from the amount reported by nvvp for the GTX 1070?
  4. Why does the global memory version perform so differently on the GTX 1070 and the Tesla T4, and, more generally, why can the relation between the plots be so different for the two cards while the optimized version has the same running time? Some clarification and subquestions for this last question:
    • The applications were compiled without any extra flags and with ECC enabled, hence global memory accesses should be cached in L2 on the GTX 1070 and in L1 on the Tesla T4. The patterns and the subject string are read only and hence should be accessed through the read-only data cache on both cards (which should be the L1 cache, if I am not mistaken).
    • However, I don't know how to verify whether the strings are accessed via the read-only cache or not. https://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/analysis/report/cudaexperiments/kernellevel/memorystatisticsglobal.htm says "To verify which data path is used for individual memory accesses inspect the Source View page. Global read only memory accesses will show up as LDG instructions; global memory reads through the data cache are represented as LD instructions." but I am confused about whether LDG or LD is the one that means the data goes through the read-only cache (a sketch of what I would try in order to check this is below).
    • If LDG does mean the read-only cache, then both the Tesla T4 and the GTX 1070 access the strings through L1, and due to the Turing architecture the Tesla's L1 outperforms Pascal's. It would then appear that the Tesla T4's L1 cache is comparable to constant and shared memory performance given this bad access pattern. However, that does not explain why constant and shared memory have the same performance on the Tesla; furthermore, if it were all about L1 cache performance, the partially evaluated version on the Tesla should have been faster than on the GTX 1070 (since it still reads the subject string from global memory), yet they have equal execution times on the plots.
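
The check I had in mind (not sure it is the right approach) is to mark the input pointers const __restrict__ and/or use __ldg() explicitly, then look at the generated code for LDG vs LD on these accesses; a sketch of the variant I would compile for this check:

// Hypothetical variant used only to inspect the data path: const __restrict__ hints that
// the buffers are read-only, and __ldg() forces the read-only (texture/L1) path explicitly.
__global__ void match_multy_ro(const char* __restrict__ patterns, const int* __restrict__ p_sizes,
                               int p_number, const char* __restrict__ text, long text_size,
                               char* result_buf) {
    long t_id = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if(t_id < text_size) {
        result_buf[t_id] = 0;
        int p_offset = 0;
        for(int i = 0; i < p_number; i++) {
            int len = __ldg(&p_sizes[i]);   // explicit read-only load
            if(t_id < text_size - len + 1) {
                int matched = 1;
                for(int j = 0; j < len; j++)
                    if(__ldg(&text[t_id + j]) != __ldg(&patterns[j + p_offset])) { matched = -1; break; }
                if(matched == 1) result_buf[t_id] = i + 1;
            }
            p_offset += len;
        }
    }
}

and then inspect the SASS (e.g. with cuobjdump -sass, or via the Source View in the profiler) to see whether these loads come out as LDG or LD.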