Unexpected CUDA processing time dependency on thread count

When launching a kernel, the number of threads per block should ideally be a multiple of the warp size, which yields more efficient use of resources and lower processing times. However, there seems to be another factor that periodically decreases the processing time: in my measurements, the processing time steps down every 32 threads per block, while an additional speed-up occurs at every multiple of 11 threads per block. What are the reasons that could lie behind this behaviour?
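To make the warp granularity concrete, here is a small plain-Python sketch of the arithmetic (my own illustration, not part of the kernel; `WARP_SIZE = 32` holds for NVIDIA GPUs, including the GT 730):

```python
import math

WARP_SIZE = 32  # warp size on NVIDIA GPUs, including the GT 730

def warps_per_block(threads_per_block):
    # The hardware schedules whole warps, so a block of T threads
    # occupies ceil(T / 32) warps even if the last warp is mostly idle.
    return math.ceil(threads_per_block / WARP_SIZE)

print(warps_per_block(32))  # 1 warp, fully utilized
print(warps_per_block(33))  # 2 warps; the second has one active thread
```

This is why I expected steps only at multiples of 32, which makes the every-11-threads pattern surprising.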

The GPU in question is a GeForce GT 730, running the kernel function attached at the bottom of this post. For timing purposes, it is invoked in a loop using:

kernel_generate_image[(16,16),(1,i+1)](px, 32)

where px = np.zeros([1024,1024])
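For context, the thread counts implied by this launch configuration can be worked out with a plain-Python sketch (my own illustration; the helper name `launch_stats` is hypothetical, the grid and image sizes are the ones used above):

```python
import math

def launch_stats(T, grid=(16, 16), image=(1024, 1024)):
    # Block shape is (1, T), so each block holds T threads.
    total_threads = grid[0] * grid[1] * T
    # Grid-stride loop: worst-case trips through the inner j-loop per thread.
    stride_y = grid[1] * T
    trips = math.ceil(image[1] / stride_y)
    return total_threads, trips

print(launch_stats(1))   # (256, 64): 256 threads, 64 inner-loop trips each
print(launch_stats(32))  # (8192, 2): 8192 threads, 2 trips each
```

So as `i` grows, both the total thread count and the per-thread workload change, which is relevant to the timing curve described above.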

from math import sin, pi

from numba import cuda

@cuda.jit
def kernel_generate_image(image, T):

    # Calculate the thread's absolute position within the grid
    x = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    y = cuda.threadIdx.y + cuda.blockIdx.y * cuda.blockDim.y

    # Set stride equal to the total number of threads in either direction
    stride_x = cuda.gridDim.x * cuda.blockDim.x
    stride_y = cuda.gridDim.y * cuda.blockDim.y

    # Grid-stride loops so the whole image is covered regardless of launch size
    for i in range(x, image.shape[0], stride_x):
        for j in range(y, image.shape[1], stride_y):
            image[i, j] = sin(i*2*pi/T + 1) * sin(j*2*pi/T + 1) * 0.25