\(D(S,L) = -\sum_j L_{j}\log (\sigma (z_j))\)
The softmax cross-entropy loss (Equation 3.0 above) is then minimized using a gradient descent optimizer with a learning rate of 0.5. Training ran for 1000 epochs with a batch size of 100, and the weights of the network were initialized to zeros.
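As a concrete illustration of this setup, the sketch below implements the same softmax cross-entropy loss and plain gradient descent updates in NumPy, with the learning rate, epoch count, batch size, and zero initialization stated above. The single-layer model shape, the synthetic data, and all variable names are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of the described training loop, assuming a single softmax
# layer (weights W, bias b). Dataset and dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_features, n_classes = 784, 10            # assumed dimensions
n_samples, batch_size, epochs, lr = 1000, 100, 1000, 0.5

# Synthetic stand-in data; in practice this would be the real dataset.
X = rng.normal(size=(n_samples, n_features))
labels = rng.integers(0, n_classes, size=n_samples)
L = np.eye(n_classes)[labels]              # one-hot labels L_j

# Weights initialized to zeros, as stated in the text.
W = np.zeros((n_features, n_classes))
b = np.zeros(n_classes)

def softmax(z):
    # sigma(z_j) from Equation 3.0, computed row-wise with numerical stabilization.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for epoch in range(epochs):
    # Shuffle, then iterate over mini-batches of size 100.
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        Xb, Lb = X[idx], L[idx]

        S = softmax(Xb @ W + b)
        # Cross-entropy D(S, L) = -sum_j L_j log(sigma(z_j)), averaged over the batch.
        loss = -np.sum(Lb * np.log(S + 1e-12)) / len(idx)

        # The gradient of the loss w.r.t. the logits is (S - L); take one
        # plain gradient descent step with learning rate 0.5.
        grad_z = (S - Lb) / len(idx)
        W -= lr * (Xb.T @ grad_z)
        b -= lr * grad_z.sum(axis=0)
```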