The kernel size (e.g. 3x3 vs 7x7) is itself a hyper-parameter, and it directly affects the other hyper-parameters you select for training. If you start with a larger kernel, you intend to capture larger spatial features first, so fewer layers are needed to cover the same region of the input and the depth of the network (number of layers) can be reduced. Remember, though, that this comes at the cost of losing small but important features at the very beginning.
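
To make the depth trade-off concrete, here is a minimal sketch in plain Python that computes how much of the input a stack of convolution layers can "see" (its receptive field) and how many weights the stack needs. The helper names `receptive_field` and `conv_weights` are made up for this illustration, and the weight count assumes every layer has the same number of input and output channels and no bias.

```python
def receptive_field(kernel_sizes, strides=None):
    """Side length (in input pixels) seen by one output unit of a conv stack."""
    if strides is None:
        strides = [1] * len(kernel_sizes)  # assumption: stride 1 everywhere
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump  # each layer widens the view by (k - 1) * current jump
        jump *= s             # stride makes later layers step further in input space
    return rf


def conv_weights(kernel_sizes, channels):
    """Weight count, assuming `channels` in and out for every layer, no bias."""
    return sum(k * k * channels * channels for k in kernel_sizes)


if __name__ == "__main__":
    for name, stack in [("one 7x7", [7]), ("three 3x3", [3, 3, 3])]:
        print(name, "receptive field:", receptive_field(stack),
              "weights (64 channels):", conv_weights(stack, 64))
```

Under these assumptions, one 7x7 layer and three stacked 3x3 layers cover the same 7x7 region of the input, but the 3x3 stack is deeper, uses fewer weights, and still looks at fine detail in its first layer. That is the trade-off described above: the larger kernel saves depth but starts out at a coarser scale.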