3.1.1 Dataset preparation
The first step in any machine learning process is to define the desired
dataset. Such datasets are sometimes referred to as big data [77]. Big
data is commonly characterized by seven traits: volume, velocity,
variety, veracity, variability, value, and visualization. Contrary to
popular belief, any dataset with at least one of these traits can be
regarded as big data, so high-volume datasets are not always required
for ML models [78].
However, the more appropriate data are provided, the more accurate the
resulting model becomes [79]. Each dataset consists of rows and
columns: rows are called samples, and each column represents either a
feature or a target value. Features are also called dimensions.
Figure 3 shows an overview of a dataset. Each dataset may
contain missing values or outliers for a variety of reasons, such as
the large volume of the data. Therefore, to prevent modeling errors, a
preliminary inspection of the dataset is needed to identify unusual or
missing values. Various machine learning algorithms for missing-data
imputation and outlier detection have been proposed [80]. Input data can
be divided into labeled and unlabeled data. Labeled data contain one or
more target values, whereas unlabeled data contain none. For example, in
a labeled dataset, fermentation parameters are the features and
productivity is the target value. With labeled datasets, the aim is
usually to learn the relationship between the features and the target
and to make predictions on new data [77]. With unlabeled datasets, the
goal is to find hidden relationships, cluster the samples, or detect
outliers.
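
The dataset structure and the preliminary inspection described above can be illustrated with a minimal Python sketch. It assumes a small hypothetical fermentation dataset: the column names (temperature_c, ph, productivity) and all values are invented for illustration and do not come from the cited works. Missing values are represented as None, and outliers are flagged with the interquartile-range (IQR) rule, which is only one simple heuristic and not necessarily the method used in [80].

```python
from statistics import quantiles

# Hypothetical labeled dataset: each dict is one sample (row).
# "temperature_c" and "ph" are features; "productivity" is the target.
# All names and values are made up for illustration.
FEATURES = ["temperature_c", "ph"]
TARGET = "productivity"

samples = [
    {"temperature_c": 30.0, "ph": 6.8, "productivity": 1.9},
    {"temperature_c": 31.0, "ph": 7.0, "productivity": 2.1},
    {"temperature_c": 32.0, "ph": None, "productivity": 2.2},  # missing value
    {"temperature_c": 30.5, "ph": 6.9, "productivity": 2.0},
    {"temperature_c": 31.5, "ph": 7.1, "productivity": 2.3},
    {"temperature_c": 75.0, "ph": 7.0, "productivity": 2.1},   # suspicious value
    {"temperature_c": 29.5, "ph": 6.7, "productivity": 1.8},
    {"temperature_c": 30.8, "ph": 7.2, "productivity": 2.2},
]


def find_missing(rows, columns):
    """Return (row_index, column) pairs whose value is None."""
    return [(i, c) for i, row in enumerate(rows) for c in columns if row[c] is None]


def find_iqr_outliers(rows, column, k=1.5):
    """Return row indices whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = [row[column] for row in rows if row[column] is not None]
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the observed values
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [
        i
        for i, row in enumerate(rows)
        if row[column] is not None and not (low <= row[column] <= high)
    ]


def split_features_target(rows, feature_columns, target_column):
    """Split labeled rows into a feature matrix X and a target vector y."""
    X = [[row[c] for c in feature_columns] for row in rows]
    y = [row[target_column] for row in rows]
    return X, y


missing = find_missing(samples, FEATURES + [TARGET])
outliers = {c: find_iqr_outliers(samples, c) for c in FEATURES}
X, y = split_features_target(samples, FEATURES, TARGET)

# Dropping the target column yields an unlabeled version of the same data.
unlabeled = [{c: row[c] for c in FEATURES} for row in samples]

print("missing:", missing)
print("outliers:", outliers)
```

In this sketch, the inspection reports the missing pH entry in the third sample and flags the temperature of 75.0 as an outlier. How such entries are then handled (removal, imputation, or correction) depends on the dataset and is left open here.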