What's not clear from the article, and not indicated in the code samples, is which device is actually being used in each case. The PC is clearly using the CUDA device, but is there any optimization at all on the M1 machine? At the very least it should use the tensorflow-metal plugin (and OctoML is claiming a 3x performance boost over that via dedicated Metal optimization). If the test is just unoptimized M1 CPU against RTX + CUDA, then the results are entirely explainable (the reverse would be like running the current best-performing Metal-optimized TF on the Mac against plain x86 CPU on the PC).
Not judging the author's integrity, btw, just curious about the software side, as the new Apple hardware is very much a moving target when it comes to ML optimization in TF and PyTorch.