Note. For the biofeedback, the current respiratory sinus arrhythmia, measured through photoplethysmography using the device's integrated camera, was represented as a blue line. A dynamic expanding and contracting circle visually represented the paced breathing rhythm at a frequency of 0.1 Hz. This frequency was further depicted by gray sinusoidal waves in the background, behind the measured heart rate oscillations.
 
The experimenter provided approximately 15 minutes of instructions to participants on how to use the app and how the biofeedback worked. Participants were coached on how to engage in relaxed slow-paced breathing. They were instructed to practice at least 5 minutes daily for the next 4 weeks, with the option to practice for longer periods of time if desired. Participants were also informed that more practice would likely lead to greater benefits.
Three participants encountered technical difficulties while running the application on their devices. To ensure their participation in the intervention, they were provided with an alternative mobile HRVB system ('Qiu' by Biosign®, D-85570, Ottenhofen, Germany).

2.4         Outcome measures

2.4.1   Premenstrual Assessment Form (short form)

The short form of the Premenstrual Assessment Form (PAF-20) is a retrospective instrument that assesses PMS symptoms during the last premenstrual phase (Allen et al., 1991). It was derived from the 20 most frequently endorsed items of the long-form PAF, which includes almost 100 items (Halbreich et al., 1982). Each item represents one premenstrual symptom, and participants indicate how strongly they experienced it during the last cycle on a 6-point Likert scale from 1 (not at all/no change) to 6 (extreme change). The German translation of the PAF-20 shows good internal consistency and reliability and loads on two factors, corresponding to a psychological and a physiological subscale (Blaser et al., 2023b).
The 10-item version (PAF-10) was constructed using the items with the highest factor loadings and shows a very high correlation with the PAF-20 (Blaser et al., 2023b). To assess the fluctuations of symptoms throughout the cycle and approximate a prospective assessment, the participants filled out the PAF-10 once a week with altered instructions, asking for a report of the 10 symptoms during the last week.

2.4.2   Beck Depression Inventory

The Beck Depression Inventory II (BDI-II) is a widely used questionnaire that assesses the severity of depressive symptoms. It consists of 21 items, each containing four statements about depressive symptoms ranging from 0 (normal) to 3 (most severe). The total maximum score is 63. The BDI-II has good psychometric properties, including high internal consistency, test-retest reliability, and concurrent and discriminant validity. Additionally, the questionnaire has been translated into multiple languages and is widely used in clinical and research settings to assess depression severity, monitor treatment progress, and evaluate outcomes. Previous studies have also shown that the BDI-II has good discrimination between patients with varying degrees of depression and accurately reflects changes in depression intensity over time (Beck et al., 1988; Richter et al., 1998).
The Fast Screen version of the Beck Depression Inventory (BDI-FS) was developed as a short form to allow for parsimonious screenings, e.g., in research settings. It includes seven items that were selected based on the DSM-IV criteria for depression, clinical importance, and factor loadings (Beck et al., 2000).

2.4.3   Depression Anxiety Stress Scale

The German version of the Depression Anxiety and Stress Scale (DASS), developed by Henry and Crawford (2005) and based on the original version by Lovibond and Lovibond (1995), was employed for data collection. The DASS-21, a shortened version of the scale, consists of 21 statements that assess three distinct subscales: depression, anxiety, and stress.
Participants were asked to rate the extent to which each statement applied to them during the designated period using Likert scales ranging from 0 to 3. Higher scores on the DASS-21 indicate elevated levels of depressive symptoms, anxiety, and stress.
The internal consistency of the DASS-21 was found to be satisfactory, with a Cronbach's α coefficient of 0.89 (Bibi et al., 2020). The DASS-21 was selected as an outcome measure in this study based on its consistent effects in biofeedback interventions, as demonstrated in prior research (Goessl et al., 2017).
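For reference, Cronbach's α for a scale with k items is k / (k - 1) multiplied by (1 - sum of the item variances / variance of the total score). The following Python sketch illustrates this computation on made-up ratings. The function name and the data are illustrative assumptions; they are not taken from this study or from Bibi et al. (2020).

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])  # number of items
    # Sample variance of each item across respondents
    item_vars = [variance([r[i] for r in responses]) for i in range(k)]
    # Sample variance of the total (sum) score
    total_var = variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings on a 0-3 scale: 5 respondents x 4 items
ratings = [
    [0, 1, 0, 1],
    [1, 1, 2, 1],
    [2, 2, 2, 3],
    [3, 2, 3, 3],
    [1, 0, 1, 1],
]
alpha = cronbach_alpha(ratings)
```

Values close to 1 indicate that the items vary together, which is what a high internal consistency coefficient summarizes.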

2.4.4   Vagally mediated heart rate variability

Resting vmHRV was determined using the BioSign software and hardware ("HRV-Scanner"; Biosign®, D-85570, Ottenhofen, Germany). Participants had been sitting down for at least 15 minutes before the measurement, which was taken in a sitting position. Participants were instructed to sit comfortably, place their feet side by side on the floor, and close their eyes; they were told that they did not have to pay attention to anything in particular. Following the recommendations by Laborde et al. (2017), the measurement had a duration of 5 minutes.
HRV was measured by a one-lead electrocardiogram (ECG) through two surface sensors attached to the right and left wrists of the participant. The device worked with a sampling rate of 500 Hz and a 16-bit resolution. Artifacts and abnormal beats were filtered in a two-step process following the software documentation (BioSign GmbH, 2023). First, the HRV Scanner software automatically marked areas of the heart rate curve that included implausible changes in heart rate (through the division of the heart rate curve into small segments and a subsequent scan of each segment). This process was based on an algorithm patented by the BioSign company that identifies outliers in a Poincaré plot, where each RR interval is plotted against the previous RR interval.
Working with these recognized areas of possible disturbances, in the second step, the R-spike recognition was manually assessed and corrected, and artifacts (e.g., due to movement) were removed. After the two-step process, the data quality was excellent, with less than 0.1% artifacts per measurement on average.
Participants additionally conducted short resting vmHRV measurements through the app using PPG, as described above, once a week. These measurements lasted one minute, and participants were instructed to take these measurements each week on the same day, at the same time, and in the same place, ensuring they would not be disturbed. They were also instructed to sit comfortably and close their eyes during the measurement, similar to the way they were during the ECG measurement in the laboratory.
We used the root mean square of successive differences (RMSSD) as a measure of vagally mediated heart rate variability. RMSSD was chosen because it reflects parasympathetic output and is relatively robust to influences of breathing rate (Chapleau & Sabharwal, 2011).
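RMSSD is the square root of the mean of the squared differences between successive RR intervals. A minimal Python sketch of this definition follows. It assumes an artifact-free series of RR intervals in milliseconds; the function name and the example values are hypothetical, and the code does not reproduce the BioSign software's processing.

```python
import math


def rmssd(rr_ms):
    """Root mean square of successive differences of RR intervals (ms)."""
    if len(rr_ms) < 2:
        raise ValueError("at least two RR intervals are required")
    # Differences between each RR interval and the one before it
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical artifact-free RR intervals in milliseconds
rr = [800, 810, 790, 805, 795]
value = rmssd(rr)
```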

2.4.5   Attentional network test revised

We employed the ANT-R, developed by Fan et al. (2009), as a measure of attentional control. This task is a reaction time task and was designed as a combination of the Eriksen flanker task (Eriksen & Eriksen, 1974) and the Posner cueing task (Posner, 1980).
During the ANT-R, participants were presented with a grey background and a black horizontal arrow. Their task was to indicate the direction of the arrow by pressing the corresponding button with their left or right index finger.
The ANT-R task consists of a total of 288 trials, divided into two identical runs of 144 trials each. The duration of the entire test is approximately 30 minutes. Previous studies have demonstrated good split-half reliability in the Executive (r = .74) and Orienting network scores (r = .70) (Greene et al., 2008). To reduce participant burden, only one run was completed per session.
The task was administered using the Presentation® software (Neurobehavioral Systems, Inc.) on a 24-inch screen positioned 80 cm away from the participants. Before the main task, participants completed 6 practice trials with feedback and 32 practice trials without feedback. Written and visual instructions were provided prior to the practice trials.
During the main task, participants were required to achieve a minimum accuracy of 80%. On average, participants reached an accuracy rate of 95% in the main task block.
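As an illustration of how such trial data can be summarized, the Python sketch below computes the accuracy of a block, checks it against the 80% criterion, and derives the conventional flanker (conflict) effect as the mean reaction time difference between incongruent and congruent trials. This is a simplified sketch under stated assumptions: the trial records, field names, and function names are hypothetical, and the difference score is shown only for illustration; it is not how this study analyzed the reaction time data.

```python
from statistics import mean

MIN_ACCURACY = 0.80  # accuracy criterion stated in the text


def accuracy(trials):
    """Proportion of correct responses."""
    return sum(t["correct"] for t in trials) / len(trials)


def flanker_effect(trials):
    """Mean RT difference (incongruent - congruent) on correct trials, in ms."""
    def mean_rt(cond):
        return mean(t["rt"] for t in trials
                    if t["flanker"] == cond and t["correct"])
    return mean_rt("incongruent") - mean_rt("congruent")


# Hypothetical trial records (field names are illustrative)
trials = [
    {"flanker": "congruent", "rt": 520, "correct": True},
    {"flanker": "congruent", "rt": 540, "correct": True},
    {"flanker": "incongruent", "rt": 650, "correct": True},
    {"flanker": "incongruent", "rt": 630, "correct": True},
    {"flanker": "incongruent", "rt": 700, "correct": False},
]
acc = accuracy(trials)                  # 4 of 5 trials correct
meets_criterion = acc >= MIN_ACCURACY   # compare against the 80% criterion
effect = flanker_effect(trials)         # error trial is excluded from the RT means
```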

2.5         Statistical analysis

All analyses were conducted using R (version 4.2.2). A linear mixed model was calculated for each target variable, with data points clustered per participant by introducing participant intercepts as random effects. When applicable, items were also included as random effects. The fixed effects included in the model were TIME point, TREATMENT, the TREATMENT * TIME interaction, and control variables (AGE, GENDER, BMI, RMSSD), along with exploratory three-way interactions involving potential mediators of the main TREATMENT * TIME effect.
A model selection process was applied to each analysis, with predictors being consecutively added to the model. Likelihood Ratio Testing compared the goodness of fit of each model to the next simpler one, and predictors were retained if they improved the model's fit.
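The logic of a likelihood ratio test between two nested models can be sketched in a few lines of Python. The test statistic is twice the difference in log-likelihoods and is compared against a chi-square distribution whose degrees of freedom equal the number of added parameters. The log-likelihood values, the function names, and the .05 threshold below are assumptions made for illustration; the analyses in this study were run in R, and this sketch does not reproduce that code.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting values: df = 1 (a = 0.5) or df = 2 (a = 1)
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(loglik_simple, loglik_complex, df_diff):
    """Return the LRT statistic and p-value for two nested models."""
    stat = 2.0 * (loglik_complex - loglik_simple)
    return stat, chi2_sf(stat, df_diff)


# Hypothetical log-likelihoods of two nested models differing by one parameter
stat, p = likelihood_ratio_test(-1520.4, -1516.1, df_diff=1)
keep_predictor = p < 0.05  # assumed retention threshold, for illustration only
```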
The main hypotheses were tested with TREATMENT * TIME interactions in the respective model. A hypothesis was considered supported if the interaction term was retained in the final model, was a significant predictor, and showed an effect in the expected direction. Post-hoc comparisons and plots of the interaction effects were used to verify the expected direction of the effects.
As this study was a randomized controlled trial (RCT) with a waiting-list control group, the control group also received the intervention after the waiting period, and the analyses included these data. The data points collected from the control group after their own intervention were therefore classified as post-treatment data of the intervention condition.
The same procedure was applied for reaction time data, involving three-way interactions instead of two-way interactions. These three-way interactions included TREATMENT, TIME, and FLANKER or CUE condition for the Executive and Orienting Network performance, respectively.
 

3               Results

3.1         Descriptive Statistics

Out of the 29 participants initially included in the study, 27 attended at least the first session and were included in the analyses. A further 3 participants dropped out after the first and before the last session, resulting in 24 participants who completed the study entirely. Table 1 provides a description of the groups that underwent only the intervention and those who completed both the waitlist and intervention protocols.

3.2         Premenstrual Symptoms

Out of 21 PAF-20 post values recorded for the T5 measurement, 9 were replaced with the follow-up measurements. This was necessary when there was either no premenstrual phase during the intervention period or the premenstrual phase occurred during the first two weeks of the intervention phase.
The final model for predicting premenstrual symptoms incorporated the TREATMENT and TIME variables along with their interaction. Furthermore, it included the SCALE of the PAF-20 questionnaire to which each symptom belonged (psychological vs. physiological symptoms) and its interaction with the TREATMENT * TIME interaction (see Table 2). The final model was:
Value ~ treatment*time + scale + treatment:time:scale + (1|vpn) + (1|item).
Post-hoc Tukey testing of the two-way interaction TREATMENT * TIME revealed a significant improvement in the intervention group, d = -0.30, t(1258) = -5.89, p < .001, whereas there was no significant pre-post difference in the waitlist group, d = 0.10, t(1252) = 1.35, p = .18 (see Figure 3). When SCALE was included in the interaction, the improvement in the intervention group was larger for psychological scale items (d = -0.42) than for physiological scale items (d = -0.19), with both improvements being significant. Detailed post-hoc testing results for the TIME * TREATMENT * SCALE interaction can be found in Table 3.
 
 
Figure 3. Course of premenstrual symptoms