Why use many small convolutional kernels, like 3x3, instead of a few large ones?

The VGGNet paper explains this in great detail. There are two main reasons for this: To begin, you can acquire the same receptive field and capture more spatial context by utilising multiple smaller kernels rather than a few large ones, but smaller kernels require fewer parameters and calculations. Second, because you’ll be employing more filters with smaller kernels, you’ll be able to apply more activation functions, resulting in a more discriminative mapping function learned by your CNN.