cuBLAS GEMM 2.5 times slower on 4090 than on 3090?

The following code multiplies a 1024x4096 matrix by a 4096x64 matrix. On a 3090 it runs at 43 loops/s (meaning it completes the whole U=600 inner for-loop 43 times per second), but on a 4090 it runs at only 17 loops/s. If I use half instead of float and cublasHgemm, the 3090 gets 90 loops/s and the 4090 gets 185 loops/s.

I compiled it with “nvcc -lcublas -lcurand -arch=native test.cu -o test”

I rented the GPUs in the cloud. The 3090 had “Driver Version: 535.54.03, CUDA Version: 12.2”.
The 4090 had “Driver Version: 535.98, CUDA Version: 12.2”.

(Note that in this code the inner for-loop could be combined into one big matrix multiplication, but the actual code this is based on does further calculation in each iteration to change m2, which is why the code is structured like this.)

Why is the 4090 2.5 times slower with fp32 than the 3090?

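#include <chrono>
#include <iostream>
#include <cuda_runtime.h>
#include <curand.h>
#include "cublas_v2.h"

int main(void){
  cublasHandle_t cublasH;
  cublasCreate(&cublasH);
  const float alpha = 1.0f;  // scales the product m1*m2
  const float beta = 1.0f;   // accumulates into the existing contents of m3
  float *m1, *m2, *m3;
  const int U = 600;  // number of (m2, m3) slices processed per loop
  // m1 is a single 1024x4096 matrix; m2 and m3 each hold U slices
  // (4096x64 and 1024x64 respectively).
  cudaMalloc(&m1, sizeof(float)*1024*4096);
  cudaMalloc(&m2, sizeof(float)*4096*64*U);
  cudaMalloc(&m3, sizeof(float)*1024*64*U);

  // Fill all three buffers with uniform random values.
  curandGenerator_t prng;
  curandCreateGenerator(&prng, CURAND_RNG_PSEUDO_XORWOW);
  curandGenerateUniform(prng, m1, 1024 * 4096);
  curandGenerateUniform(prng, m2, 4096*64*U);
  curandGenerateUniform(prng, m3, 1024*64*U);

  // Take 50 timing samples; the first one also includes cuBLAS warm-up.
  for(int j=0; j < 50; j++){
    auto t1 = std::chrono::high_resolution_clock::now();
    int repeat = 50;
    for(int i = 0; i < repeat; i++){
      float *w1=m1,*w2=m2,*w3=m3;
      // One "loop": U GEMMs, each (1024x4096) * (4096x64) -> 1024x64.
      for(int m=0; m<U; m++){
        cublasSgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_N, 1024, 64, 4096, &alpha,
          w1, 1024, w2, 4096, &beta, w3, 1024);
        w2 += 4096*64;  // advance to the next m2 slice
        w3 += 1024*64;  // advance to the next m3 slice
      }
    }
    // GEMM launches are asynchronous, so wait for them before stopping the clock.
    cudaDeviceSynchronize();
    auto t2 = std::chrono::high_resolution_clock::now();
    double t = std::chrono::duration<double>(t2-t1).count();
    std::cout << "Time " << t << " loops/s " << repeat/t << std::endl;
  }

  // Release GPU resources.
  curandDestroyGenerator(prng);
  cudaFree(m1);
  cudaFree(m2);
  cudaFree(m3);
  cublasDestroy(cublasH);
  return 0;
}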