A Hyper-Heuristic Approach for Unsupervised Land-Cover Classification

In the last decades, advances in remote sensing image acquisition systems have moved in lockstep with the need for applications that make use of this kind of data. Land-use/cover recognition (Pisani 2014, Bagan 2010, Capucim 2015, Banerjee 2015), target recognition (Dong 2015, Martorella 2011, Du 2011), image classification (Li 2015, Voisin 2014) and band selection in hyper-spectral images (Yang 2011, Yuan 2015) are among the most pursued applications, just to name a few. The large amount of high-resolution content made available by satellites also highlights the bottleneck of labeling data. Manual annotation depends on skilled operators and is prone to errors. Such shortcomings have further fostered research on semi-supervised and unsupervised techniques, which may work well in some remote sensing-oriented applications.

Considered a hallmark of pattern recognition research, the so-called \(k\)-means algorithm (MacQueen 1967) has been consistently enhanced in the last decades. Since it does not make use of labeled data and has a simple formulation, \(k\)-means is still one of the most used classification techniques to date. Roughly speaking, given a set of feature vectors (samples) extracted from a dataset, \(k\)-means tries to minimize the distance from each sample to its closest center (mean). Such process ends up clustering the data after some steps, being two samples from the same cluster more “

The aforementioned scenario makes the \(k\)-means algorithm a natural candidate for optimization techniques, mainly those based on nature-inspired and evolutionary mechanisms. Actually, not only \(k\)-means but a number of other techniques have used the framework of meta-heuristic-based optimization to cope with problems that can be modeled as the task of finding decision variables that maximize/minimize a certain fitness function. Chen et al. (Chen 2009), for instance, employed Genetic Algorithms (GAs) and neural networks to classify both land-use and landslide zones in eastern Taiwan, with the former used to compute the set of weights that combine some landslide incidence factors. Nakamura et al. (Nakamura 2014) dealt with the task of band selection in hyper-spectral imagery through nature-inspired techniques. In essence, the idea is to model the problem of finding the most important bands as a feature selection task. Without loss of generality, both problems are the very same one when the brightness of each pixel is used to represent it.

Very recently, Goel et al. (Goel 2015) tackled the problem of remote sensing image classification using nature-inspired techniques, namely Cuckoo Search and Artificial Bee Colony. Senthilnatha et al. (Senthilnatha 2014) used GAs, Particle Swarm Optimization and Firefly Algorithm for the automatic registration of multi-temporal remote sensing data. In short, the idea is to perform image registration while minimizing some criterion function (Mutual Information in that case). Artificial Immune Systems have been used to classify remote sensing data as well (Kheddam 2014), in which a multi-band image covering the northeastern part of Algiers was used for validation purposes.

Coming back to the \(k\)-means technique, Chandran and Nazeer (Chandran 2011) proposed to solve the problem of minimizing the distance from each dataset sample to its nearest centroid using the Harmony Search, which is a meta-heuristic optimization technique based on the way musicians create songs in order to obtain the best harmony. Forsati et al. (Forsati 2008) employed a similar approach, but in the context of web page clustering, while Lin et al. (Lin 2012) proposed a hybrid approach concerning the task of \(k\)-means clustering and Particle Swarm Optimization. Later on, Kuo et al. (Kuo 2013) integrated \(k\)-means and Artificial Immune Systems for dataset clustering, and Saida et al. (Saida 2014) employed the Cuckoo Search to optimize \(k\)-means aiming at classifying documents. Finally, a comprehensive study about the application of nature-inspired techniques to boost \(k\)-means was presented by Fong et al. (Fong 2014).

Although all the aforementioned works aimed at enhancing \(k\)-means using meta-heuristic techniques, little attention has been paid to the application of hyper-heuristic techniques for that purpose, and only a very few works have attempted to deal with \(k\)-means optimization in the context of land-use/cover classification. The term “

The remainder of this paper is organized as follows. Section \ref{s.theoretical} presents the theoretical background regarding the meta-heuristic optimization techniques addressed in this work. Sections \ref{s.proposed} and \ref{s.material} present the proposed approach and the experimental setup, respectively. Section \ref{s.experiments} discusses the experiments, and Section \ref{s.conclusions} states conclusions and future work.

\label{s.theoretical}

In this section, we briefly present the theoretical background regarding the meta-heuristic techniques employed in this paper, as well as some basics of optimization problems.

Let \({\cal S}=\{\textbf{x}_1,\textbf{x}_2,\ldots,\textbf{x}_m\}\) be a search space, where each possible solution \(\textbf{x}_i\in\Re^n\) is composed of \(n\) decision variables, and \(x_{i,j}\) stands for the \(j^{th}\) decision variable of agent \(i\). Additionally, let \(f:{\cal S}\rightarrow\Re\) be a function to be minimized/maximized^{1}. Roughly speaking, the main idea of any optimization problem is to solve the following equation:

\[\label{e.minimization} \textbf{x}^\ast = \displaystyle \arg\min_{\textbf{x}\in{\cal S}}f(\textbf{x}),\]

where \(\textbf{x}^\ast\) stands for the best solution so far. Without loss of generality, the optimization techniques differ in the way they attempt to solve the above equation. The same occurs when working with meta-heuristics, since each one is based on a different social mechanism or living being. Since the terminology among techniques might be quite different but with similar purposes, we generalize each possible solution to the name “

\label{ss.hs}

Harmony Search (HS) is a meta-heuristic algorithm inspired by the improvisation process of music players, since they often improvise the pitches of their instruments searching for a perfect state of harmony (Music-Inspired Harmon...). The main idea is to use a process similar to the one adopted by musicians when creating new songs, where each possible solution is modeled as a harmony (agent), and each musician corresponds to one decision variable.

In the context of HS, our search space \({\cal S}\) is called “

The memory consideration step models the process of creating songs, in which the musician can use his/her memories of good musical notes to create a new song. This process is controlled by the Harmony Memory Considering Rate (\(HMCR\)) parameter, which is the probability of choosing one value from the historic values stored in the harmony memory, with \((1-HMCR)\) being the probability of randomly choosing one feasible value, as follows:

\[\begin{aligned} \label{e.hmcr} x_{m+1,j} & = & \left\{ \begin{array}{ll} x_{A,j} & \mbox{ with probability $HMCR$} \\ \theta \in \bm{\Phi}_j & \mbox{ with probability (1-$HMCR$),} \end{array}\right.\end{aligned}\]

where \(A\sim {\cal U}(1,2,\ldots,m)\), and \(\bm{\Phi}=\{\bm{\Phi}_1,\bm{\Phi}_2,\ldots,\bm{\Phi}_n\}\) stands for the set of feasible values for each decision variable.

Further, every component \(j\) of the new harmony vector \(\textbf{x}_{m+1}\) is examined to determine whether it should be pitch-adjusted or not, a step controlled by the Pitch Adjusting Rate (\(PAR\)) parameter, as follows:

\[\begin{aligned} \label{e.par} x_{m+1,j} & = & \left\{ \begin{array}{ll} x_{m+1,j}\pm \varphi_j \varrho & \mbox{ with probability $PAR$} \\ x_{m+1,j} & \mbox{ with probability (1-$PAR$).} \end{array}\right.\end{aligned}\]

The pitch adjustment is often used to improve solutions and to escape from local optima. This mechanism concerns shifting the neighboring values of some decision variable in the harmony, where \(\varrho\) is an arbitrary distance bandwidth, and \(\varphi_j\sim {\cal U}(0,1)\).
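The improvisation step given by Equations \ref{e.hmcr} and \ref{e.par} can be sketched as follows. This is only a minimal illustration under stated assumptions: the feasible set \(\bm{\Phi}_j\) is taken to be a continuous interval, the function and parameter names (`improvise`, `bounds`, `bandwidth`) are hypothetical, and the final clamping to the interval is our own addition rather than part of the formulation above.

```python
import random


def improvise(memory, bounds, hmcr=0.7, par=0.3, bandwidth=0.1, rng=random):
    """Build one new harmony from the harmony memory.

    memory: list of m harmonies, each a list of n decision variables.
    bounds: list of n (lower, upper) pairs, assumed to describe Phi_j.
    """
    new_harmony = []
    for j, (lower, upper) in enumerate(bounds):
        if rng.random() < hmcr:
            # Memory consideration: copy variable j from a random stored harmony.
            value = rng.choice(memory)[j]
        else:
            # Otherwise draw a random feasible value for variable j.
            value = rng.uniform(lower, upper)
        if rng.random() < par:
            # Pitch adjustment: shift by +/- phi_j * bandwidth, phi_j ~ U(0, 1).
            value += rng.choice((-1.0, 1.0)) * rng.random() * bandwidth
        # Assumption: keep the adjusted value inside the feasible interval.
        new_harmony.append(min(max(value, lower), upper))
    return new_harmony
```

In a complete HS loop, the returned harmony would then be evaluated and compared against the harmonies already stored in memory.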

\label{ss.ihs}

The Improved Harmony Search (IHS) (Mahdavi 2007) differs from traditional HS by updating the \(PAR\) and \(\varrho\) values dynamically. The PAR updating formulation at time step \(t\) is given by:

\[\label{e.par_ihs} PAR^t = PAR_{min}+\frac{PAR_{max}-PAR_{min}}{T}t,\]

where \(T\) stands for the number of iterations, and \(PAR_{min}\) and \(PAR_{max}\) denote the minimum and maximum \(PAR\) values, respectively. In regard to the bandwidth value at time step \(t\), it is computed as follows:

\[\label{e.bandwidth_ihs} \varrho^t=\varrho_{max}\exp{\frac{\ln(\varrho_{min}/\varrho_{max})}{T}t},\]

where \(\varrho_{min}\) and \(\varrho_{max}\) stand for the minimum and maximum values of \(\varrho\), respectively.
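Both schedules are closed-form functions of the iteration counter, so they can be checked numerically. The sketch below transcribes Equations \ref{e.par_ihs} and \ref{e.bandwidth_ihs}; the function and argument names are hypothetical.

```python
import math


def ihs_par(t, total_iters, par_min, par_max):
    """Linearly increasing pitch adjusting rate at time step t."""
    return par_min + (par_max - par_min) / total_iters * t


def ihs_bandwidth(t, total_iters, bw_min, bw_max):
    """Exponentially decaying bandwidth at time step t."""
    return bw_max * math.exp(math.log(bw_min / bw_max) / total_iters * t)
```

At \(t=0\) the pair evaluates to \((PAR_{min},\varrho_{max})\), and at \(t=T\) to \((PAR_{max},\varrho_{min})\), i.e., the search moves from large, infrequent adjustments towards small, frequent ones.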

\label{ss.ghs}

The Global-best Harmony Search (GHS) (Omran 2008) employs the same modification proposed by IHS with respect to dynamic \(PAR\) values. However, it does not employ the concept of bandwidth, and Equation \ref{e.par} is replaced by: \[\label{e.par_ghs} x_{m+1,j} = x_{best,z},\] where \(z\sim U(1,2,\ldots,n)\), and \(best\) stands for the index of the best harmony.
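A minimal sketch of the GHS replacement for the pitch adjustment is given below. The function name is hypothetical, and we assume the rule is still triggered with probability \(PAR\), as in Equation \ref{e.par}.

```python
import random


def ghs_pitch_adjust(value, best, par, rng=random):
    """Replace `value` by a random component of the best harmony.

    best: list with the n decision variables of the best harmony.
    """
    if rng.random() < par:
        # z ~ U(1, ..., n): any component of the best harmony may be copied.
        return best[rng.randrange(len(best))]
    return value
```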

\label{ss.nghs}

The Novel Global Harmony Search (NGHS) (Zou 2010) differs from traditional HS in three aspects: (i) the \(HMCR\) and \(PAR\) parameters are excluded, and a mutation probability \(p_m\) is used instead; (ii) NGHS always replaces the worst harmony with the new one; and (iii) the improvisation step is also modified, as follows:

\[R=2x_{best,j}-x_{worst,j},\]

\[x_{m+1,j}=x_{worst,j}+\mu_j(R-x_{worst,j}),\]

where \(worst\) stands for the index of the worst harmony, and \(\mu_j\sim U(0,1)\). Further, another modification with respect to the mutation probability is performed in the new harmony:

\[\begin{aligned} x_{m+1,j} & = & \left\{ \begin{array}{ll} L_j+ \varpi_j(U_j-L_j) & \mbox{ if $\kappa_j\leq p_m$} \\ x_{m+1,j} & \mbox{ otherwise,} \end{array}\right.\end{aligned}\]

where \(\kappa_j,\varpi_j\sim U(0,1)\), and \(U_j\) and \(L_j\) stand for the upper and lower bounds of decision variable \(j\), respectively.
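The NGHS improvisation can be sketched as below. The names are hypothetical, and clamping the position update to \([L_j,U_j]\) is our own assumption, added only to keep the example within the feasible region.

```python
import random


def nghs_improvise(best, worst, bounds, p_m=0.05, rng=random):
    """Build a new harmony from the best and worst harmonies.

    bounds: list of n (L_j, U_j) pairs.
    """
    new_harmony = []
    for j, (lower, upper) in enumerate(bounds):
        # Position update: move from the worst harmony towards R.
        r = 2.0 * best[j] - worst[j]
        value = worst[j] + rng.random() * (r - worst[j])
        # Assumption: clamp to the bounds of decision variable j.
        value = min(max(value, lower), upper)
        # Mutation: with probability p_m, resample uniformly within the bounds.
        if rng.random() <= p_m:
            value = lower + rng.random() * (upper - lower)
        new_harmony.append(value)
    return new_harmony
```

Since \(R-x_{worst,j}=2(x_{best,j}-x_{worst,j})\), each component is drawn between the worst harmony and the point symmetric to it with respect to the best harmony.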

\label{ss.sghs}

The Self-adaptive Global best Harmony Search (SGHS) algorithm (Pan 2010) is a modification of GHS that employs a new improvisation scheme and self-adaptive parameters. First of all, Equation \ref{e.par_ghs} is rewritten as follows:

\[\label{e.par_sghs} x_{m+1,j} = x_{best,j},\]

and Equation \ref{e.hmcr} can be replaced by:

\[\begin{aligned} \label{e.hmcr_sghs} x_{m+1,j} & = & \left\{ \begin{array}{ll} x_{A,j}\pm \varphi_j \varrho & \mbox{ with probability $HMCR$} \\ \theta \in \bm{\Phi}_j & \mbox{ with probability (1-$HMCR$).} \end{array}\right.\end{aligned}\]

The main difference between SGHS and the other variants concerns the computation of the \(HMCR\) and \(PAR\) values, which are estimated based on the average of their recorded values after every \(LP\) (learning period) iterations. Every time a new agent is better than the worst one, the \(HMCR\) and \(PAR\) values are recorded to be used in the estimation of their new values, which follow a Gaussian distribution, i.e., \(HMCR\sim{\cal N}(HMCR_m,\sigma_{HMCR})\) and \(PAR\sim{\cal N}(PAR_m,\sigma_{PAR})\), where \(HMCR_m\) and \(PAR_m\) stand for the mean values of the \(HMCR\) and \(PAR\) parameters, respectively.
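The self-adaptive mechanism can be illustrated with the sketch below, which applies equally to \(HMCR\) and \(PAR\). The helper names are hypothetical, and clipping the sampled value to \([0,1]\) is an assumption of ours, motivated only by the fact that both parameters are probabilities.

```python
import random


def sample_parameter(mean, std, rng=random):
    """Draw a parameter value from N(mean, std), clipped to [0, 1]."""
    return min(max(rng.gauss(mean, std), 0.0), 1.0)


def relearn_mean(recorded, previous_mean):
    """New mean after a learning period: average of the recorded values.

    recorded: values stored whenever a new agent beat the worst one.
    Falls back to the previous mean if nothing was recorded.
    """
    if not recorded:
        return previous_mean
    return sum(recorded) / len(recorded)
```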

\label{s.pso}

Particle Swarm Optimization (PSO) is an algorithm modeled on swarm intelligence dynamics that finds a solution in a search space based on social behavior (Kennedy 2001). Each possible solution (agent) is modeled as a particle in the swarm that imitates its neighborhood based on the values of the fitness function found so far.

Each particle has a memory that stores its best solution, as well as the best solution of the entire swarm. Thus, taking this information into account, each particle has the ability to imitate others that obtained the best local and global solutions. This process simulates the social interaction between humans looking for the same objective, or bird flocks looking for food, for instance. This socio-cognitive mechanism can be summarized into three main principles: (i) evaluation, (ii) comparison, and (iii) imitation. Each particle can evaluate others in its neighborhood through some fitness function, can compare it with its own value and, finally, can decide whether it is a good choice to imitate them. PSO makes use of both velocity and position terms to perform optimization at time step \(t\), as follows:

\[\label{e.velocity_pso} \textbf{v}^{t+1}_i=w\textbf{v}^t_i+c_1\delta_1(\hat{\textbf{x}}_i-\textbf{x}^t_i)+c_2\delta_2(\hat{\textbf{g}}-\textbf{x}^t_i),\]

where \(\textbf{v}^t_i\) stands for the velocity of agent (particle) \(i\) at iteration \(t\), \(\hat{\textbf{x}}_i\) and \(\hat{\textbf{g}}\) denote the best solution found by particle \(i\) and by the entire swarm, respectively, \(w\) is the inertia weight, \(\delta_1,\delta_2\sim{\cal U}(0,1)\), and \(c_1\) and \(c_2\) are ad-hoc parameters. Next, the new position of each agent \(i\) is updated as follows:

\[\label{e.position_pso} \textbf{x}^{t+1}_i=\textbf{x}^t_i+\textbf{v}^{t+1}_i.\]
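One PSO iteration following Equations \ref{e.velocity_pso} and \ref{e.position_pso} can be sketched as below. The function name and default parameter values are hypothetical, and we assume \(\delta_1\) and \(\delta_2\) are drawn once per particle.

```python
import random


def pso_step(positions, velocities, personal_best, global_best,
             w=0.7, c1=1.5, c2=1.5, rng=random):
    """Update velocities and positions of all particles in place."""
    for i in range(len(positions)):
        # delta_1, delta_2 ~ U(0, 1), drawn once per particle (assumption).
        d1, d2 = rng.random(), rng.random()
        for j in range(len(positions[i])):
            velocities[i][j] = (
                w * velocities[i][j]
                + c1 * d1 * (personal_best[i][j] - positions[i][j])
                + c2 * d2 * (global_best[j] - positions[i][j])
            )
            # The new position uses the freshly computed velocity.
            positions[i][j] += velocities[i][j]
    return positions, velocities
```

After each step, the caller is expected to re-evaluate the fitness function and refresh `personal_best` and `global_best` accordingly.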

\label{s.ba}

Based on the behavior of bats, Yang and Gandomi (Yang 2012) proposed a new meta-heuristic optimization technique called Bat Algorithm (BA), which has been designed to behave as a band of bats tracking prey/food using their capability of echolocation. BA works under certain assumptions: (i) all bats use echolocation to sense distance, and they also “

At each time step \(t\), the frequency and velocity of each agent \(i\) are computed using Equations \ref{e.frequency_ba} and \ref{e.velocity_ba}, respectively:

\[\label{e.frequency_ba} q^t_i=q_{min}+(q_{max}-q_{min})\beta,\]

\[\label{e.velocity_ba} \textbf{v}^{t+1}_i=\textbf{v}^t_i+(\textbf{x}^t_i-\textbf{g})q^t_i,\]

where \(\beta\sim{\cal U}(0,1)\), and \(\textbf{g}\) stands for the best solution (bat) found so far (a similar rationale is also employed by Equation \ref{e.velocity_pso}).

The Bat Algorithm works with the definition of “

After computing the frequency and velocity using Equations \ref{e.frequency_ba} and \ref{e.velocity_ba}, we can start working with the movement of each agent, as follows:

\[\label{e.position_tmp_bat} \tilde{\textbf{x}}_i^{t+1}=\textbf{x}^t_i+\textbf{v}^{t+1}_i.\]

Further, we apply a random walk with probability \(r_i\) (also known as “

\[\begin{aligned} \label{e.random_walk_bat} \tilde{\textbf{x}}_i^{t+1} & = & \left\{ \begin{array}{ll} \tilde{\textbf{x}}_i^{t+1}+\epsilon\bar{A}^t & \mbox{ with probability $r_i^t$} \\ \tilde{\textbf{x}}_i^{t+1} & \mbox{ with probability (1-$r_i^t$),} \end{array}\right.\end{aligned}\]

where \(\bar{A}^t\) stands for the average of the loudness considering all agents at iteration \(t\), and \(\epsilon\in[-1,1]\). Finally, the new position of each agent is then computed as follows:

\[\begin{aligned} \label{e.position_bat} \textbf{x}_i^{t+1} & = & \left\{ \begin{array}{ll} \tilde{\textbf{x}}_i^{t+1} & \mbox{ if $f(\tilde{\textbf{x}}_i^{t+1})<f(\textbf{x}_i^{t})$ and $\xi<A_i^t$} \\ \textbf{x}_i^{t} & \mbox{ otherwise,} \end{array}\right.\end{aligned}\]

where \(\xi\sim{\cal U}(0,1)\).

Finally, we can also update the loudness \(A_i\) and pulse rate \(r_i\) at each iteration as follows:

\[\label{e.update_loudness} A_i^{t+1}=\alpha A_i^t,\]

\[\label{e.pulse_rate} r^{t+1}_i=r^0_i(1-e^{-\gamma t}),\]

where \(\alpha,\gamma\sim{\cal U}(0,1)\).
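The whole BA iteration can be sketched as follows. This is only an illustration under stated assumptions: the frequency is drawn uniformly from \([q_{min},q_{max}]\), a candidate is accepted only if it improves the fitness and a uniform draw falls below the loudness, and the loudness and pulse rate are updated at every iteration. The dictionary layout and all names are hypothetical.

```python
import math
import random


def bat_step(bats, best, fitness, t, q_min=0.0, q_max=2.0,
             alpha=0.9, gamma=0.9, rng=random):
    """Run one BA iteration in place.

    bats: list of dicts with keys "x" (position), "v" (velocity),
          "A" (loudness), "r" (pulse rate) and "r0" (initial pulse rate).
    best: position of the best bat found so far.
    """
    mean_loudness = sum(bat["A"] for bat in bats) / len(bats)
    for bat in bats:
        # Frequency and velocity updates.
        q = q_min + (q_max - q_min) * rng.random()
        bat["v"] = [v + (x - g) * q
                    for v, x, g in zip(bat["v"], bat["x"], best)]
        # Tentative new position.
        candidate = [x + v for x, v in zip(bat["x"], bat["v"])]
        # Random walk with probability r_i, scaled by the mean loudness.
        if rng.random() < bat["r"]:
            eps = rng.uniform(-1.0, 1.0)
            candidate = [c + eps * mean_loudness for c in candidate]
        # Accept only improving candidates, with probability tied to A_i.
        if fitness(candidate) < fitness(bat["x"]) and rng.random() < bat["A"]:
            bat["x"] = candidate
        # Loudness decays and the pulse rate grows towards r0.
        bat["A"] *= alpha
        bat["r"] = bat["r0"] * (1.0 - math.exp(-gamma * t))
    return bats
```

Because of the acceptance rule, the fitness of each bat never gets worse from one iteration to the next.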

\label{s.ffa}

The Firefly Algorithm (FFA) (Yang 2010) is a meta-heuristic optimization technique inspired by the flashing behavior of fireflies, which flash in order to attract possible mates. Let \(\textbf{x}_i^t\) be the position of firefly (agent) \(i\) at iteration \(t\). Roughly speaking, the idea of FFA is to move agents towards brighter ones, i.e., agents with lower fitness values, as follows: \[\begin{aligned} \label{e.position_ffa} \textbf{x}^{t+1}_i & = & \left\{ \begin{array}{ll} \textbf{x}^t_i +\psi e^{-\tau d_{i,j}^2}(\textbf{x}^t_j-\textbf{x}^t_i)+\eta^t\bm{\lambda}^t_i & \mbox{ if $f(\textbf{x}_j^t)<f(\textbf{x}_i^t)$} \\ \textbf{x}^t_i & \mbox{ otherwise,} \end{array}\right.\end{aligned}\] where \(\bm{\lambda}^t_i\) is an array usually sampled from a Gaussian distribution, \(\eta^t\) stands for the step size, which can be linearly decreased, \(d_{i,j}\) stands for the distance between fireflies \(i\) and \(j\), and \(\psi\) is the so-called “

\label{s.proposed}

Essentially, the problem of estimating the centroids of \(k\) clusters can be modeled as an optimization task, in which we aim at minimizing the distance from each dataset sample to its nearest centroid. Therefore, any fitness function that somehow encodes such behavior can be employed. In this work, we used the “

\[\label{e.addc} ADDC = \frac{1}{N}\sum_{i=1}^k\sum_{\forall \textbf{x}_j\in c_i}D(\textbf{c}_i,\textbf{x}_j),\]

where \(D(\textbf{c}_i,\textbf{x}_j)\) stands for the distance between centroid \(\textbf{c}_i\in\Re^n\) and sample \(\textbf{x}_j\) assigned to cluster \(c_i\), and \(N\) denotes the number of dataset samples.

Roughly speaking, given a problem with \(k\) clusters, each agent (e.g., harmonies, particles, bats or fireflies) encodes a possible solution in \(\Re^{k*n}\), as depicted in Figure \ref{f.problem_representation}. Therefore, after placing all agents at random positions, the \(k\)-means algorithm is executed once for each agent using those positions as the starting point. Soon after, the ADDC is computed over the final clustered dataset to be used as the fitness function for each meta-heuristic technique, which outputs the optimum/near-optimum possible solution with the best starting points for \(k\)-means.

[FIGURE 1]
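The fitness evaluation described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: we assume \(D\) is the Euclidean distance, that an agent is stored as a flat list of \(k\times n\) values, and that \(k\)-means runs for a fixed number of iterations. All function names are hypothetical.

```python
import math


def decode(agent, k, n):
    """Split a flat agent of length k*n into k centroids of dimension n."""
    return [agent[c * n:(c + 1) * n] for c in range(k)]


def nearest(sample, centroids):
    """Index of the centroid closest to `sample` (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(sample, centroids[c]))


def kmeans(samples, centroids, iterations=20):
    """Plain k-means started from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for sample in samples:
            groups[nearest(sample, centroids)].append(sample)
        # Empty clusters keep their previous centroid.
        centroids = [
            [sum(col) / len(group) for col in zip(*group)] if group else old
            for group, old in zip(groups, centroids)
        ]
    return centroids


def addc(samples, centroids):
    """Average distance from each sample to its nearest centroid."""
    total = sum(math.dist(s, centroids[nearest(s, centroids)])
                for s in samples)
    return total / len(samples)


def fitness(agent, samples, k):
    """Run k-means from the agent's centroids and return the final ADDC."""
    n = len(samples[0])
    return addc(samples, kmeans(samples, decode(agent, k, n)))
```

Any of the meta-heuristic techniques presented earlier can then minimize `fitness` over the agents, so that the best agent provides the starting centroids for \(k\)-means.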

The work proposed by Papa et al. (Papa 2015) goes beyond that point by combining different solutions obtained through distinct meta-heuristic techniques. Since each technique has its own weaknesses, the idea is to explore a higher level of optimization in order to improve each individual solution by means of the combination of all obtained solutions so far. Although such step can be performed by any optimization technique, we opted to employ Genetic Programming (GP) for two main reasons: (i) we did not use any meta-heuristic technique that has been employed during the first step of optimization in order to avoid biases, and (ii) GP provides a more powerful combination process as a hyper-heuristic technique, since it can apply a number of arithmetic operations for that purpose, instead of using movement-based equations to place agents from one position to another.

Genetic Programming (Koza 1992) is an evolutionary-based optimization algorithm that models each solution as an individual, which is usually represented as a tree composed of “

[FIGURE 2 a,b]

The work by Papa et al. (Papa 2015) employs the best result of each meta-heuristic technique (i.e., HS, IHS, GHS, NGHS and SGHS) to compose the set of terminal nodes. This means GP can use any technique from that set for combination purposes, and therefore can decide which one will compose the best individual right after the evolutionary-oriented optimization process. In the present work, we propose to employ not only HS-based meta-heuristic techniques to compose the set of terminal nodes, but also other techniques, such as Particle Swarm Optimization, Bat Algorithm and Firefly Algorithm. The main assumption here is that exploring different mechanisms may allow us to obtain more accurate results, since we can count on more diverse solutions. Such assumption has proven to be correct, since the results presented in this paper outperformed the ones discussed in the work by Papa et al. (Papa 2015).
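To make the combination step more concrete, the sketch below evaluates one GP individual whose terminal nodes are the best solutions found by each technique. It is only an illustration: the tree encoding as nested tuples, the element-wise application of the operators and the function set \(\{+,-,\times\}\) are assumptions of ours, and all names are hypothetical.

```python
import operator

# Hypothetical function set; the actual set of function nodes may differ.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, terminals):
    """Evaluate a GP individual.

    tree: either a terminal name (str) or a tuple (op, left, right).
    terminals: dict mapping a technique name to its best solution (list).
    """
    if isinstance(tree, str):
        return terminals[tree]
    op, left, right = tree
    a = evaluate(left, terminals)
    b = evaluate(right, terminals)
    # Function nodes combine two solutions element-wise.
    return [FUNCTIONS[op](x, y) for x, y in zip(a, b)]
```

For instance, the individual `("+", "HS", ("*", "PSO", "BA"))` combines the best solutions of three techniques into a single vector, which can then be evaluated by the same fitness function used in the first optimization step.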

\label{s.material}

\label{s.experiments}

\label{s.conclusions}

The authors are grateful to FAPESP grants #2013/20387-7 and #2014/16250-9, as well as CNPq grants #470571/2013-6 and #306166/2014-3.

For the sake of simplicity, we adopted a minimization problem.
