Note. Maybe the rising gradient norms are an artifact of the single-sequence input: the normal distribution tries to match the target exactly with zero variance, which is not reachable under the current parameterization. I should add a few more sequences and monitor the norms again.
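The suspected failure mode can be reproduced in a toy setting. The sketch below (plain Python, illustrative only, not the actual training code) fits the mean and log-variance of a Gaussian by gradient descent on the NLL. With a single target the log-variance has no stationary point (it keeps drifting toward zero variance) and the gradient norm never decays; with a few targets it settles at the sample variance and the norm goes down.

```python
import math

def nll_grads(mu, s, targets):
    """Gradients of the average Gaussian NLL
        L = mean_t 0.5 * (s + (t - mu)**2 * exp(-s))
    with respect to the mean mu and the log-variance s."""
    n = len(targets)
    d_mu = sum(-(t - mu) * math.exp(-s) for t in targets) / n
    d_s = sum(0.5 * (1.0 - (t - mu) ** 2 * math.exp(-s)) for t in targets) / n
    return d_mu, d_s

def fit(targets, steps=300, lr=0.02):
    """Plain gradient descent from (mu, s) = (0, 0); also records the
    per-step gradient norm for monitoring."""
    mu, s = 0.0, 0.0
    norms = []
    for _ in range(steps):
        d_mu, d_s = nll_grads(mu, s, targets)
        norms.append(math.hypot(d_mu, d_s))
        mu -= lr * d_mu
        s -= lr * d_s
    return mu, s, norms

# One target: mu matches it almost exactly, so d_s stays near 0.5 and s
# keeps falling toward -inf (variance -> 0); the norm never decays.
mu1, s1, norms1 = fit([1.0])

# Four targets: s converges toward the log of the sample variance and
# the gradient norm shrinks.
mu4, s4, norms4 = fit([0.5, 1.0, 1.5, 2.0])
```

This matches the note's hypothesis: the blow-up is a property of fitting a distribution to a single sequence, not of the model per se, so adding a few more sequences should stabilize the norms.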
Training on a single batch
Training seems to go fine, and the magnitude of the gradient norms stays contained.