where W: width. H: height. T: number of sampled frames in video
During learning, we want to use the posterior probabilities from teacher vision network