Perf experiments to understand what the hardware can do

Multiplying larger and larger arrays of FP64 values takes the same time up to some point, after which the time increases. However, the pattern isn’t what I expect. First the results and then the code. The first number is the size of the array (the number of FP64 values) and the last number is the time in seconds for 1,000,000 executions. This is on a 4090. While looking, consider a few questions. If there are only 16,384 CUDA cores, then why does the time stay the same from below 16,384 FP64 values until well after that? Only at 262,144 multiplies does it take significantly longer. After that, the time doesn’t quite double (1.7X to 1.8X) on some steps, for reasons I don’t understand. Once you’ve saturated the device, doubling the work should be at least 2X slower. Finally, going from 2,097,152 to 4,194,304 multiplies takes 4.5 times as long. Why?

8192: t1 * t2 took 2.081
16384: t1 * t2 took 2.095
32768: t1 * t2 took 2.066
65536: t1 * t2 took 2.057
131072: t1 * t2 took 2.209    Q1: Why still about 2 seconds with way over the number of CUDA cores?
262144: t1 * t2 took 2.991
524288: t1 * t2 took 5.989    2X slower, which makes sense
1048576: t1 * t2 took 10.388    Only 1.7X slower, which is a surprise given it is twice the work
2097152: t1 * t2 took 18.95    Q2: 1.8X slower, but why ONLY 1.8X?
4194304: t1 * t2 took 86.161    Q3: 4.5X slower for twice the work. What is going on here?
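To make the step-to-step comparison easier to check, here is a small sketch that recomputes the slowdown ratios and the time per individual multiply call from the numbers above. The `timings` dict and the `summarize` helper are names made up for this illustration; only the values come from the run. One possible reading (an assumption, not something measured here) is that the flat ~2 microseconds per call at small sizes reflects a fixed per-call overhead rather than arithmetic time.

```python
# Timings copied from the results above:
# number of FP64 elements -> seconds for 1,000,000 multiplies.
timings = {
    8192: 2.081,
    16384: 2.095,
    32768: 2.066,
    65536: 2.057,
    131072: 2.209,
    262144: 2.991,
    524288: 5.989,
    1048576: 10.388,
    2097152: 18.95,
    4194304: 86.161,
}
ITERATIONS = 1_000_000


def summarize(timings, iterations=ITERATIONS):
    """Return (size, microseconds per call, ratio to previous size) rows."""
    rows = []
    prev_secs = None
    for n, secs in sorted(timings.items()):
        per_call_us = secs / iterations * 1e6
        ratio = None if prev_secs is None else secs / prev_secs
        rows.append((n, per_call_us, ratio))
        prev_secs = secs
    return rows


rows = summarize(timings)
for n, per_call_us, ratio in rows:
    ratio_text = "    -" if ratio is None else f"{ratio:5.2f}"
    print(f"{n:>8}: {per_call_us:7.3f} us per call, {ratio_text}x the previous size")
```

The ratios for the last four doublings come out at roughly 2.00, 1.73, 1.82 and 4.55, matching the annotations in the results.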

import torch
import time
from datetime import datetime
from datetime import timedelta

with torch.cuda.device(0):
    dim1 = 256
    dim2 = 16
    while dim2 <= 16384:
        t1 = 1 + torch.rand((dim1,dim2), device='cuda', dtype=torch.float64)/10000
        t2 = 1 + torch.rand((dim1,dim2), device='cuda', dtype=torch.float64)/10000
        i = 0
        tm0 = datetime.now()  # start of the timed loop
        while i < 1000000:
            t1 = t1 * t2
            #torch.cuda.synchronize()  # MULT is dependent on previous result
            i += 1
        print(f"{dim1*dim2}: t1 * t2 took {round(timedelta.total_seconds(datetime.now() - tm0), 3)}")
        dim2 *= 2

I tried the code above and expected 32,768 multiplies to take twice as long as 16,384, given the actual number of CUDA cores.
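One way to look at the last two data points is in terms of bytes moved rather than cores. The sketch below is a rough back-of-the-envelope calculation, not a measurement. It assumes each `t1 * t2` reads two FP64 arrays and writes one (three arrays of 8-byte values) and ignores any caching or allocator effects. The helper names are hypothetical.

```python
BYTES_PER_FP64 = 8
ITERATIONS = 1_000_000


def working_set_bytes(n):
    """Bytes touched by one t1 * t2, assuming two reads and one write of n FP64 values."""
    return 3 * n * BYTES_PER_FP64


def effective_bandwidth_gbs(n, secs, iterations=ITERATIONS):
    """Implied traffic in GB/s if every call really moved the full working set."""
    return working_set_bytes(n) * iterations / secs / 1e9


# The two largest sizes from the results above: (elements, seconds).
for n, secs in [(2097152, 18.95), (4194304, 86.161)]:
    mib = working_set_bytes(n) / 2**20
    gbs = effective_bandwidth_gbs(n, secs)
    print(f"{n:>8}: working set {mib:5.1f} MiB, implied {gbs:7.1f} GB/s")
```

Under these assumptions the working set grows from 48 MiB to 96 MiB between the last two sizes, and the implied traffic drops from about 2,656 GB/s to about 1,168 GB/s. If the 4090's L2 cache is 72 MB and its memory bandwidth is around 1 TB/s (spec figures quoted from memory, so treat them as assumptions), that would be consistent with the largest case no longer fitting in cache, which might explain the 4.5X jump. It does not by itself explain the 1.7X and 1.8X steps.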