In the last decades, advances in remote sensing image acquisition systems have moved in lockstep with the need for applications that make use of such sort of data. Land-use/cover recognition (Pisani 2014, Bagan 2010, Capucim 2015, Banerjee 2015), target recognition (Dong 2015, Martorella 2011, Du 2011), image classification (Li 2015, Voisin 2014) and band selection in hyper-spectral images (Yang 2011, Yuan 2015) are among the most pursued applications, just to name a few. The large amount of high-resolution content available by satellites also highlights the bottleneck that takes place when labeling data. Such process is skilled-dependent, and it might be very prone to errors when dealing with manual annotation. Such shortcomings have fostered even more the research on semi-supervised and unsupervised techniques, which may work well in some remote sensing-oriented applications.

Considered a hallmark in the pattern recognition research field, the so-called \(k\)-means algorithm (MacQueen 1967) has been consistently enhanced in the last decades. Given it does not make use of labeled data and it has a simple formulation, \(k\)-means is still one of the most used classification techniques up to date. Roughly speaking, given a set of feature vectors (samples) extracted from a dataset, \(k\)-means tries to minimize the distance from each sample to its closest center (mean). Such process ends up clustering the data after some steps, being two samples from the same cluster more “connected" to its centroid than to any other in the dataset. As its main drawbacks, we can shed light the number of clusters required as an input, and the leaning of the naïve algorithm to get trapped from local optima, i.e., centroids that do not represent well the clusters.

The aforementioned scenario turns \(k\)-means algorithm more prone to be addressed by means of optimization techniques, mainly those based on nature- and evolutionary-oriented mechanisms. Actually, not only \(k\)-means but a number of other techniques have used the framework of meta-heuristic-based optimization to cope with problems that somehow can be modeled as a task of finding decision variables that maximize/minimize some certain fitness function. Chen et al. (Chen 2009), for instance, employed Genetic Algorithms (GAs) and neural networks to classify both land-use and landslide zones in eastern Taiwan, being the former used to compute the set of weights that combine some landslide incidence factors. Nakamura et al. (Nakamura 2014) dealt with the task of band selection in hyper-spectral imagery through nature-inspired techniques. Truly speaking, the idea is to model the problem of finding the most important bands as a feature selection task. Without loss of generality, both problems are the very same one when the brightness of each pixel is used to represent it.

Very recently, Goel et al. (Goel 2015) tackled the problem of remote sensing image classification using some nature-inspired techniques, say that Cuckoo Search and Artificial Bee Colony. Senthilnatha et al. (Senthilnatha 2014) used GAs, Particle Swarm Optimization and Firefly Algorithm for the automatic image registering of multi-temporal remote sensing data. In short, the idea is to perform image registration while minimizing some criterion function (Mutual Information in that case). The theory about Artificial Immune Systems has been used to classify remote sensing data as well (Kheddam 2014), in which a multi-band image covering the area of northeastern part of Algiers was used for validation purposes.

Coming back to the \(k\)-means technique, Chandran and Nazeer (Chandran 2011) proposed to solve the problem of minimizing the distance from each dataset sample to its nearest centroid using the Harmony Search, which is a meta-heuristic optimization technique based on the way musicians create songs in order to obtain the best harmony. Forsati et al. (Forsati 2008) employed a similar approach, but in the context of web page clustering, while Lin et al. (Lin 2012) proposed a hybrid approach concerning the task of \(k\)-means clustering and Particle Swarm Optimization. Later on, Kuo et al. (Kuo 2013) integrated \(k\)-means and Artificial Immune Systems for dataset clustering, and Saida et al. (Saida 2014) employed the Cuckoo Search to optimize \(k\)-means aiming at classifying documents. Finally, a comprehensive study about the application of nature-inspired techniques to boost \(k\)-means was presented by Fong et al. (Fong 2014).

Despite all aforementioned works aimed at enhancing \(k\)-means using meta-heuristic techniques, there is a little concern about the application of hyper-heuristic techniques for that purpose, as well as only a very few works attempted at dealing with \(k\)-means optimization in the context of land-use/cover classification. The term “hyper-heuristics" was coined to address new algorithms designed to solve general problems by combining known meta-heuristics, in such a way each technique may compensate the weaknesses of others (Ross 2005). In such context, Papa et al. (Papa 2015) were one of the first that focused on the application of hyper-heuristics to optimize \(k\)-means, being the proposed approach validated in the background of both satellite- and radar-based land-cover classification^{1}. That work employed Genetic Programming to combine five variations of the Harmony Search algorithm with promising results. In this paper, we extend the work by Papa et al. (Papa 2015) with a deeper experimental analysis, in which Particle Swarm Optimization, Bat Algorithm and Firefly Algorithm are also considered together with Harmony Search and its variants for combination purposes through Genetic Programming. The results obtained in this paper outperformed the previous work by Papa et al. (Papa 2015), thus emphasizing the benefits of the hyper-heuristic-based framework. To the best of our knowledge, that is the first time such approaches are combined with each other aiming at optimizing \(k\)-means.

The remainder of this paper is organized as follows. Section \ref{s.theoretical} presents the theoretical background regarding the meta-heuristic optimization techniques addressed in this work. Sections \ref{s.proposed} and \ref{s.material} present the proposed approach and the experimental setup, respectively. Section \ref{s.experiments} discusses the experiments, and Section \ref{s.conclusions} states conclusions and future works.

\label{s.theoretical}

In this section, we briefly present the theoretical background regarding the meta-heuristic techniques employed in this paper, as well as some basis related to optimization-based problems.

Let \({\cal S}=\{\textbf{x}_1,\textbf{x}_2,\ldots,\textbf{x}_m\}\) be a search space, where each possible solution \(\textbf{x}_i\in\Re^n\) is composed of \(n\) decision variables, and \(x_{i,j}\) stands for the \(j^{th}\) decision variable of agent \(i\). Additionally, let \(f:{\cal S}\rightarrow\Re\) be a function to be minimized/maximized^{2}. Roughly speaking, the main idea of any optimization problem is to solve the following equation:

\[\label{e.minimization} \textbf{x}^\ast = \displaystyle \min_{\textbf{x}\in{\cal S}}f(\textbf{x}),\]

where \(\textbf{x}^\ast\) stands for the best solution so far. Without loss of generality, the optimization techniques differ on the way they attempt at solving the above equation. The same occurs when working with meta-heuristics, since each one is based on a different social mechanism or living being. Since the terminology among techniques might be quite different but with similar purposes, we generalize each possible solution to the name “agent" (e.g., harmonies considering Harmony Search, particles for Particle Swarm Optimization, bats with respect to Bat Algorithm, and fireflies regarding Firefly Algorithm). In addition, since they usually adopt different variables for the same purpose, we decided to keep the same variables with similar rationale for all techniques^{3}. Next, we present the main basis of the meta-heuristic techniques used in this paper.

\label{ss.hs}

Harmony Search (HS) is a meta-heuristic algorithm inspired in the improvisation process of music players, since they often improvise the pitches of their instruments searching for a perfect state of harmony (Music-Inspired Harmon...). The main idea is to use a similar process to the one adopted by musicians when creating new songs, where each possible solution is modeled as a harmony (agent), and each musician corresponds to one decision variable.

In the context of HS, our search space \({\cal S}\) is called “Harmony Memory", since it comprises all possible solutions encoded on each agent. Harmony Search algorithm generates at each iteration a new agent \(\textbf{x}_{m+1}\) based on memory considerations, pitch adjustments, and randomization (music improvisation). Further, the new agent is evaluated in order to be accepted in the harmony memory: if \(\textbf{x}_{m+1}\) is better than the worst harmony, the latter is then replaced by the new agent. Roughly speaking, HS algorithm basically rules the process of creating and evaluating new harmonies until some convergence criterion is met.

The memory consideration step concerns with modeling the process of creating songs, in which the musician can use his/her memories of good musical notes to create a new song. This process is modeled by the Harmony Memory Considering Rate (\(HMCR\)) parameter, which is the probability of choosing one value from the historic values stored in the harmony memory, being \((1-HMCR)\) the probability of randomly choosing one feasible value, as follows:

\[\begin{aligned} \label{e.hmcr} \hat{x}_{m+1,j} & = & \left\{ \begin{array}{ll} x_{A,j} & \mbox{ with probability $HMCR$} \\ \theta \in \bm{\Phi}_j & \mbox{ with probability (1-$HMCR$),} \end{array}\right.\end{aligned}\]

where \(A\sim {\cal U}(1,2,\ldots,m)\), and \(\bm{\Phi}=\{\bm{\Phi}_1,\bm{\Phi}_2,\ldots,\bm{\Phi}_m\}\) stands for the set of feasible values for each decision variable.

Further, every component \(j\) of the new harmony vector \(\textbf{x}_{m+1}\) is examined to determine whether it should be pitch-adjusted or not, being such step controlled by the Pitch Adjusting Rate (PAR) variable, as follows:

\[\begin{aligned} \label{e.par} x_{m+1,j} & = & \left\{ \begin{array}{ll} x_{m+1,j}\pm \varphi_j \varrho & \mbox{ with probability $PAR$} \\ x_{m+1,j} & \mbox{ with probability (1-$PAR$).} \end{array}\right.\end{aligned}\]

The pitch adjustment is often used to improve solutions and to escape from local optima. This mechanism concerns shifting the neighboring values of some decision variable in the harmony, where \(\varrho\) is an arbitrary distance bandwidth, and \(\varphi_j\sim {\cal U}(0,1)\).

\label{ss.ihs}

The Improved Harmony Search (IHS) (Mahdavi 2007) differs from traditional HS by updating the \(PAR\) and \(\varrho\) values dynamically. The PAR updating formulation at time step \(t\) is given by:

\[\label{e.par_ihs} PAR^t = PAR_{min}+\frac{PAR_{max}-PAR_{min}}{T}t,\]

where \(T\) stands for the number of iterations, and \(PAR_{min}\) and \(PAR_{max}\) denote the minimum and maximum \(PAR\) values, respectively. In regard to the bandwidth value at time step \(t\), it is computed as follows:

\[\label{e.bandwidth_ihs} \varrho^t=\varrho_{max}\exp{\frac{\ln(\varrho_{min}/\varrho_{max})}{T}t},\]

where \(\varrho_{min}\) and \(\varrho_{max}\) stand for the minimum and maximum values of \(\varrho\), respectively.

\label{ss.ghs}

The Global-best Harmony Search (GHS) (Omran 2008) employs the same modification proposed by IHS with respect to dynamic \(PAR\) values. However, it does not employ the concept of bandwidth, being Equation \ref{e.par} replaced by: \[\label{e.par_ghs} x_{m+1,j} = x_{best,z},\] where \(z\sim U(1,2,\ldots,n)\), and \(best\) stands for the index of the best harmony.

\label{ss.nghs}

The Novel Global Harmony Search (NGHS) (Zou 2010) differs from traditional HS in three aspects: (i) the \(HMCR\) and \(PAR\) parameters are excluded, and a mutation probability \(p_m\) is then used; (ii) the NGHS always replaces the worst harmony with the new one, and (iii) the improvisation footsteps are also modified, as follows:

\[R=2x_{best,j}-x_{worst,j},\]

\[x_{m+1,j}=x_{worst,j}+\mu_j(R-x_{worst,j}),\]

where \(worst\) stands for the index of the worst harmony, and \(\mu_j\sim U(0,1)\). Further, another modification with respect to the mutation probability is performed in the new harmony:

\[\begin{aligned} x_{m+1,j} & = & \left\{ \begin{array}{ll} L_j+ \varpi_j(U_j-L_j) & \mbox{ if $\kappa_j\leq p_m$} \\ x_{m+1,j} & \mbox{ otherwise,} \end{array}\right.\end{aligned}\]

where \(\kappa_j,\varpi_j\sim U(0,1)\), and \(U_j\) and \(L_j\) stand for the upper and lower bounds of decision variable \(j\), respectively.

\label{ss.sghs}

The SGHS algorithm (Pan 2010) is a modification of GHS that employs a new improvisation scheme and self-adaptive parameters. First of all, Equation \ref{e.par_ghs} is rewritten as follows:

\[\label{e.par_sghs} x_{m+1,j} = x_{best,j},\]

and Equation \ref{e.hmcr} can be replaced by:

\[\begin{aligned} \label{e.hmcr_sghs} x_{m+1,j} & = & \left\{ \begin{array}{ll} x_{A,j}\pm \varphi_j \varrho & \mbox{ with probability $HMCR$} \\ \theta \in \bm{\Phi}_j & \mbox{ with probability (1-$HMCR$).} \end{array}\right.\end{aligned}\]

The main difference among SGHS and the other variants concerns with the computation of \(HMCR\) and \(PAR\) values, which are estimated based on the average of their recorded values after each \(LP\) (learning period) iterations. Every time a new agent is better than the worst one, the \(HMCR\) and \(PAR\) values are then recorded to be used in the estimation of their new values, which follow a Gaussian distribution, i.e., \(HMCR\sim{\cal N}(HMCR_m,\sigma_{HMCR})\) and \(PAR\sim{\cal N}(PAR_m,\sigma_{PAR})\), where \(HMCR_m\) and \(PAR_m\) stand for the mean values of \(HMCR\) and \(PAR\) parameters, respectively.

\label{s.pso}

Particle Swarm Optimization (PSO) is an algorithm modeled on swarm intelligence dynamics that finds a solution in a search space based on social behavior (Kennedy 2001). Each possible solution (agent) is modeled as a particle in the swarm that imitates its neighborhood based on the values of the fitness function found so far.

Each particle has a memory that stores its best solution, as well as the best solution of the entire swarm. Thus, taking this information into account, each particle has the ability to imitate others that obtain the best local and global maxima. This process simulates the social interaction between humans looking for the same objective, or bird flocks looking for food, for instance. This socio-cognitive mechanism can be summarized into three main principles: (i) evaluation, (ii) comparison, and (iii) imitation. Each particle can evaluate others in its neighborhood through some fitness function, can compare it with its own value and, finally, can decide whether it is a good choice to imitate them. PSO makes use of both velocity and position terms to perform optimization at time step \(t\), as follows:

\[\label{e.velocity_pso} \textbf{v}^{t+1}_i=w\textbf{v}^t_i+c_1\delta_1(\hat{\textbf{x}}_i-\textbf{x}_i)+c_2\delta_2(\hat{\textbf{g}}-\textbf{x}_i),\]

where \(\textbf{v}^t_i\) stands for the velocity of agent (particle) \(i\) at iteration \(t\), \(w\) is the inertia weight, \(\delta_1,\delta_2\sim{\cal U}(0,1)\), and \(c_1\) and \(c_2\) are ad-hoc variables. Next, the new position of each agent \(i\) is updated as follows:

\[\label{e.position_pso} \textbf{x}^{t+1}_i=\textbf{x}^t_i+\textbf{v}^{t+1}_i.\]

\label{s.ba}

Based on the behavior of bats, Yang e Gandomi (Yang 2012) proposed a new meta-heuristic optimization technique called Bat Algorithm (BA), which has been designed to behave as a band of bats tracking prey/foods using their capability of echolocation. BA works under certain assumptions: (i) all bats use echolocation to sense distance, and they also “know" the difference between food/prey and background barriers; and (ii) a bat (agent) \(i\) flies randomly with velocity \(\textbf{v}_i\) at position \(\textbf{x}_i\) and with a fixed frequency \(q_i\). For each agent \(i\), we also associate a loudness value \(A_i\).

At each time step \(t\), the frequency and velocity of each agent \(i\) are computed using Equations \ref{frequency_ba} and \ref{velocity_ba}, respectively:

\[\label{e.frequency_ba} q^t_i=q_{min}+(q_{min}-q_{max})\beta,\]

\[\label{e.velocity_ba} \textbf{v}^{t+1}_i=\textbf{v}^t_i+(\textbf{x}^t_i-\textbf{g})q_i,\]

where \(\beta\sim{\cal U}(0,1)\), and \(\textbf{g}\) stands for the best solution (bat) found so far (similar rationale is also employed by Equation \ref{velocity_pso}).

The Bat Algorithm works with the definition of “temporary position", i.e., at each iteration we maintain two structures to store the current position of each bat: its usual position \(\textbf{x}\) and the temporary position \(\tilde{\textbf{x}}\), which is used to check whether the new generated solution is better than the previous one.

After computing the frequency and velocity using Equations \ref{e.frequency_ba} and \ref{e.velocity_ba}, we can start working with the movement of each agent, as follows:

\[\label{e.position_tmp_bat} \tilde{\textbf{x}}_i^{t+1}=\textbf{x}^t_i+\textbf{v}^{t+1}_i.\]

Further, we apply a random walk with probability \(r_i\) (also known as “pulse rate"):

\[\begin{aligned} \label{e.random_walk_bat} \tilde{\textbf{x}}_i^{t+1} & = & \left\{ \begin{array}{ll} \tilde{\textbf{x}}_i^{t+1}+\epsilon\bar{A}^t & \mbox{ with probability $r_i^t$} \\ \tilde{\textbf{x}}_i^{t+1} & \mbox{ with probability (1-$r_i^t$),} \end{array}\right.\end{aligned}\]

where \(\bar{A}^t\) stands for the average of the loudness considering all agents at iteration \(t\), and \(\epsilon\in[-1,1]\). Finally, the new position of each agent is then computed as follows:

\[\begin{aligned} \label{e.position_bat} \textbf{x}_i^{t+1} & = & \left\{ \begin{array}{ll} \tilde{\textbf{x}}_i^{t+1} & \mbox{ if $f(\tilde{\textbf{x}}_i^{t+1})<f(\textbf{x}_i^{t+1})$ and $x^i<A_i^t$} \\ \textbf{x}^{t+1} & \mbox{ otherwise} \end{array}\right.\end{aligned}\]

where \(\xi\sim{\cal U}(0,1)\).

Finally, we can also update the loudness \(A_i\) and pulse rate \(r_i\) at each iteration as follows:

\[\label{e.update_loudness} A_i^{t+1}=\alpha A_i^t,\]

\[\label{e.pulse_rate} r^{t+1}_i=r^0_i(1-e^{-\gamma t}),\]

where \(\alpha,\gamma\sim{\cal U}(0,1)\).

\label{s.ffa}

The Firefly Algorithm (FFA) (Yang 2010) is a meta-heuristic optimization technique inspired by the flashing behavior of fireflies, which attempt at flashing in order to attract possible mates. Let \(\textbf{x}_i^t\) be the position of firefly (agent) \(i\) at iteration \(t\). Roughly speaking, the idea of FFA is to move agents towards brighter ones, i.e., agents with lower fitness functions, as follows: \[\begin{aligned} \label{e.position_ffa} \textbf{x}^{t+1}_i & = & \left\{ \begin{array}{ll} \textbf{x}^t_i +\psi e^{-\tau d_{i,j}^2}(\textbf{x}^t_j-\textbf{x}^t_i)+\eta^t\bm{\lambda}^t_i & \mbox{ if $f(\textbf{x}_i^t)<f(\textbf{x}_j^t)$} \\ \textbf{x}^t_i & \mbox{ otherwise,} \end{array}\right.\end{aligned}\] where \(\bm{\lambda}_i\) is an array usually sampled from a Gaussian distribution, \(\eta^t\) stands for the step size, which can be linearly decreased, \(d_{i,j}\) stands for the distance between fireflies \(i\) and \(j\), and \(\psi\) is the so-called “coefficient of attractiveness". Recall that FFA is reduced to a variant of PSO when \(\tau=0\) (Yang 2010). Additionally, \(\bm{\lambda}\) can also