Note. Maybe the rising gradient norms are an artifact of the single-sequence input: the predicted normal distribution tries to match the target exactly with zero variance, which is not representable under the current parameterization. I should add a few more sequences and monitor the norms again.
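A minimal sketch of this hypothesis (a toy setup, not the actual model): a Gaussian head fit to a single fixed target, with the mean held a small residual `eps` away from the target and only the log-variance trained. As `log_var` heads toward minus infinity, the `exp(-log_var)` factor in the mean's gradient blows up, so the gradient norm rises even though the loss keeps decreasing.

```python
import torch

eps = 1e-3  # fixed residual between the predicted mean and the target
mu = torch.tensor([eps], requires_grad=True)       # mean, held fixed
log_var = torch.zeros(1, requires_grad=True)       # log-variance, trained
target = torch.zeros(1)

opt = torch.optim.SGD([log_var], lr=0.5)
norms = []
for _ in range(200):
    # Gaussian NLL up to a constant: 0.5 * (log_var + (x - mu)^2 / var)
    nll = 0.5 * (log_var + (target - mu) ** 2 * torch.exp(-log_var))
    g_mu, g_lv = torch.autograd.grad(nll.sum(), [mu, log_var])
    # total gradient norm over both parameters
    norms.append(torch.sqrt(g_mu.pow(2).sum() + g_lv.pow(2).sum()).item())
    log_var.grad = g_lv
    opt.step()
    opt.zero_grad()

print(norms[0], norms[-1], log_var.item())
```

The log-variance settles near `2 * ln(eps)` (where the NLL gradient vanishes), at which point the mean's gradient has grown from about `eps` to about `1/eps` — a rising-norm signature with no actual divergence in the loss.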

Training on a single batch

It seems that training goes fine and the magnitude of the gradient norms stays contained.