Luke Byrne

In this study we systematically investigate a range of nonlinear functions in the fully-connected feed-forward portions of Vision Transformer (ViT) models. We limit our investigation to GLU-type nonlinear functions, as the GLU-type function SwiGLU is the current standard in state-of-the-art language and vision models. We identify 21 candidate layer functions with 1, 2, or 3 weight matrices, using the activations Sigmoid, Tanh, and Sin. ViT-Tiny models are implemented with these functions and benchmarked on the image classification datasets CIFAR10, CIFAR100, and SVHN. Through these experiments we identify several previously uninvestigated functions that consistently outperform SwiGLU on all benchmarks. We call the most performant of these functions SinGLU. We further benchmark SwiGLU and SinGLU on the image classification dataset ImageNet64, and again SinGLU performs better. We note that periodic functions such as Sin are not common in neural networks. We perform a numerical investigation into sinusoidally activated neurons and suggest that their viability in Vision Transformers may be due to the loss-landscape smoothing of Layer Normalization, Multi-Headed Self-Attention (MHSA), and modern data augmentations such as label smoothing. However, our experiments on the CIFAR100 dataset indicate that layer nonlinearity remains a hyperparameter, with approximately piecewise-linear functions performing better than more complex functions when the problem space is not densely sampled.
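For reference, the sketch below shows the general GLU-type feed-forward block, (act(xW) ⊙ xV)W2, with the gating activation left as a parameter: SiLU/Swish recovers SwiGLU, and substituting sin gives a SinGLU-style variant. This is a minimal illustration assuming the standard two-branch GLU formulation; the class name, hidden width, and omitted biases are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GLUFeedForward(nn.Module):
    """GLU-type feed-forward block: (act(x W) * (x V)) W2.

    `activation` is the gating nonlinearity: F.silu gives SwiGLU,
    torch.sin gives a SinGLU-style variant. Hidden size and naming
    are illustrative assumptions, not the paper's implementation.
    """

    def __init__(self, dim, hidden_dim, activation=torch.sin):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim, bias=False)   # gated branch
        self.v = nn.Linear(dim, hidden_dim, bias=False)   # linear branch
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # output projection
        self.activation = activation

    def forward(self, x):
        return self.w2(self.activation(self.w(x)) * self.v(x))


if __name__ == "__main__":
    x = torch.randn(8, 16, 192)                 # (batch, tokens, dim); 192 = ViT-Tiny width
    swiglu = GLUFeedForward(192, 384, F.silu)   # Swish-gated GLU (SwiGLU)
    singlu = GLUFeedForward(192, 384, torch.sin)  # sin-gated GLU (SinGLU-style)
    print(swiglu(x).shape, singlu(x).shape)     # both -> torch.Size([8, 16, 192])
```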