3.1.1 Dataset preparation
The first step in any machine learning process is to define the desired dataset. These datasets are sometimes called big data [77]. Big data is commonly characterized by seven traits: volume, velocity, variety, veracity, variability, value, and visualization. Contrary to popular belief, any dataset with at least one of these seven traits qualifies as big data, so high-volume datasets are not always required for ML models [78]. However, the more appropriate data are provided, the more accurate the resulting model tends to be [79]. Each dataset consists of rows and columns: the rows are called samples, and each column represents either a feature or a target value. Features are also called dimensions. Figure 3 shows an overview of a dataset. A dataset may contain unknown (missing) or outlier values for a variety of reasons, for example when a large volume of data is collected. Therefore, to prevent modeling errors, it is necessary to perform a preliminary inspection of the dataset and identify unusual or missing values. Various machine learning algorithms for missing data imputation and outlier detection have been proposed so far [80]. Input data can be divided into labeled and unlabeled data. Labeled data contain one or more target values, whereas unlabeled data contain none. For example, in a labeled dataset, fermentation parameters are the features and productivity is the target value. With labeled datasets, the aim is usually to find the relationship between the features and the target and to make predictions on new data [77]. With unlabeled datasets, the goal is to find hidden relationships, cluster the samples, or detect outliers.
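As an illustration of the preliminary inspection described above, the following minimal Python sketch counts missing values per column, flags outliers with the interquartile range (IQR) rule, and separates features from the target in a labeled dataset. The column names (temperature_c, ph, productivity), all numeric values, the helper function names, and the choice of the IQR rule with a factor of 1.5 are assumptions made for this example; they are not taken from the cited works, which may rely on other imputation or outlier detection methods.

```python
import math
import statistics

# Hypothetical toy dataset: each row is a sample. "temperature_c" and "ph"
# are features and "productivity" is the target. All values are made up.
ROWS = [
    {"temperature_c": 30.0, "ph": 6.8, "productivity": 1.9},
    {"temperature_c": 31.0, "ph": 7.0, "productivity": 2.1},
    {"temperature_c": 29.5, "ph": None, "productivity": 1.8},  # missing pH
    {"temperature_c": 30.5, "ph": 6.9, "productivity": 2.0},
    {"temperature_c": 95.0, "ph": 7.1, "productivity": 2.2},   # suspicious value
    {"temperature_c": 30.2, "ph": 7.0, "productivity": 2.0},
    {"temperature_c": 29.8, "ph": 6.9, "productivity": 1.9},
    {"temperature_c": 30.8, "ph": 7.0, "productivity": 2.1},
]
FEATURES = ["temperature_c", "ph"]
TARGET = "productivity"


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(rows, columns):
    """Return the number of missing entries in each column."""
    return {c: sum(is_missing(r[c]) for r in rows) for c in columns}


def iqr_outliers(rows, column, k=1.5):
    """Return indices of rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [r[column] for r in rows if not is_missing(r[column])]
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [
        i for i, r in enumerate(rows)
        if not is_missing(r[column]) and not (low <= r[column] <= high)
    ]


def split_features_target(rows, features, target):
    """Separate a labeled dataset into a feature matrix X and a target vector y."""
    X = [[r[f] for f in features] for r in rows]
    y = [r[target] for r in rows]
    return X, y


if __name__ == "__main__":
    print("missing per column:", count_missing(ROWS, FEATURES))
    print("temperature outlier rows:", iqr_outliers(ROWS, "temperature_c"))
    X, y = split_features_target(ROWS, FEATURES, TARGET)
    print("samples:", len(X), "features per sample:", len(X[0]))
```

In this toy example the check reports one missing pH entry and flags the sample with a temperature of 95.0 as an outlier. For an unlabeled dataset the split step would be skipped, since there is no target column.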