Depression is a serious mental health disorder that causes mood changes characterized by persistent hopelessness and deep sadness. In advanced stages it can predispose patients to suicidal thoughts, underscoring the need for more accurate diagnostic methods. Traditional diagnosis relies on semi-structured interviews and complementary questionnaires. Combining these methods with careful data analysis that incorporates audiovisual and textual characteristics can yield valuable clues about the presence of depression in individuals. This study therefore proposes a multimodal Ensemble Stacking Deep Neural Network that analyzes facial expression features, audio signals, and textual transcriptions to detect depression automatically. The model was evaluated on the multimodal Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) dataset. Incorporating substantial volumes of data into the analysis, we achieved a degree of separability greater than 0.9. The results demonstrate both the effectiveness of the method and its superiority over other reference approaches.
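To make the fusion strategy concrete, the sketch below illustrates generic stacking of per-modality learners: level-0 models each score one modality, and a level-1 meta-learner combines their out-of-fold predictions. It is a minimal illustration only; the feature dimensions, model choices, and synthetic data are assumptions for demonstration and do not reproduce the paper's architecture or the DAIC-WOZ features.

```python
# Illustrative stacking sketch (hypothetical data and models, not the
# paper's actual Ensemble Stacking Deep Neural Network).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 400
# Stand-ins for per-modality feature vectors (facial, audio, text).
X_face = rng.normal(size=(n, 16))
X_audio = rng.normal(size=(n, 12))
X_text = rng.normal(size=(n, 20))
# Synthetic label that depends on all three modalities.
y = (X_face[:, 0] + X_audio[:, 0] + X_text[:, 0] > 0).astype(int)

def base_probs(X, y):
    """Out-of-fold positive-class probabilities from a small per-modality net."""
    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
    return cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

# Level-0: one learner per modality; level-1: meta-learner on stacked outputs.
meta_X = np.column_stack([base_probs(X, y) for X in (X_face, X_audio, X_text)])
Xtr, Xte, ytr, yte = train_test_split(meta_X, y, random_state=0)
meta = LogisticRegression().fit(Xtr, ytr)
print(round(meta.score(Xte, yte), 2))
```

Using out-of-fold predictions for the meta-learner's training features is the standard way to keep the level-1 model from overfitting to the base learners' training-set outputs.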