Enhancing Regression Models for Complex Systems Using Evolutionary Techniques for Feature Engineering

\section{Introduction}\label{intro}

Analytical models, as closed-form representations of a solution, require specific knowledge about the individual contributions and their relationships, which makes them hard and time-consuming to build for complex systems. Complex systems comprise a large number of interacting variables, so the associations between their components are hard to extract and understand because of their non-linear behavior \cite{bar1997dynamics}. Limitations on input parameters are a further barrier associated with classical modeling of these kinds of problems.

By contrast, classical regression techniques, such as the least absolute shrinkage and selection operator (lasso), provide models with linearity, convexity and differentiability attributes, which are highly valued for describing system performance. However, the automatic generation of accurate models for complex systems remains a challenge that designers have not yet met using analytical approaches.
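
For reference, the lasso estimate is commonly written as follows. The notation is introduced here for illustration only and is not taken from the rest of the paper: $n$ training samples $(\mathbf{x}_i, y_i)$, an intercept $\beta_0$, a coefficient vector $\beta$ and a regularization weight $\lambda \geq 0$.
\[
  (\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \mathbf{x}_i^{\top} \beta \right)^2 + \lambda \, \lVert \beta \rVert_1
\]
The objective is convex in the coefficients, and the $\ell_1$ penalty drives some of them exactly to zero, which is what gives the technique its selection behavior.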

Metaheuristics, on the other hand, are higher-level procedures that make few assumptions about the optimization problem and provide sufficiently good solutions, even when based on fragmentary information \cite{Bianchi:2009:SMS:1541534.1541554,Blum:2003:MCO:937503.937505}. They are particularly useful for solving optimization problems that are noisy, irregular and change over time. Metaheuristics therefore appear to be a suitable approach for meeting the optimization requirements of complex systems.

Some metaheuristics, such as Genetic Programming (GP), perform Feature Engineering (FE), a particularly useful technique for selecting the set of features that best describes an optimization problem. These features are measurable properties or explanatory variables of a phenomenon. FE methods select adequate characteristics and avoid including irrelevant parameters that reduce the generalization of the problem \cite{ReidTurner19993}. Finding relevant features typically helps with prediction, but the correlations and combinations of representative variables that FE also provides may offer a more straightforward view of the problem and thus lead to better solutions.

Grammatical Evolution (GE) is an evolutionary computation technique based on GP. It is particularly useful for solving optimization problems and provides solutions that include non-linear terms, offering FE capabilities that remove the barriers of analytical modeling. One of the main characteristics of GE is that it can be used to perform Symbolic Regression (SR) \cite{GenProg}. In addition, because GE is an automatic method, no designer expertise is required to process a high volume of data. However, GE explores a vast space of solutions, which may need to be bounded for the algorithm to be efficient.

In this work we propose a novel methodology for the automatic inference of accurate models that combines the benefits of both classical and evolutionary strategies. First, SR performed by a GE algorithm finds the sets of features that best describe the system behavior. Then, a classical regression solves our optimization problem using this set of features and provides the model coefficients. As a result, our approach yields an accurate model that is linear, convex and differentiable and that uses the optimal set of features. This methodology can be applied to a broad set of optimization problems in complex systems. This paper presents a case study of its application to Cloud power modeling, a relevant present-day challenge.
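
The two stages can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: the GE search is replaced by an exhaustive search over a small hand-written pool of candidate features, the classical regression is plain least squares, and the inputs \texttt{u} and \texttt{t}, the feature pool, the size penalty and the synthetic data are all hypothetical.

```python
from itertools import combinations

# Hypothetical engineered features over two made-up inputs:
# u = CPU utilization in [0, 1], t = normalized temperature in [0, 1].
# In the methodology described above, a GE run would evolve such expressions.
CANDIDATE_FEATURES = {
    "u": lambda s: s["u"],
    "u*t": lambda s: s["u"] * s["t"],
    "t**2": lambda s: s["t"] ** 2,
}


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def design_row(sample, names):
    """Intercept followed by the selected feature values for one sample."""
    return [1.0] + [CANDIDATE_FEATURES[name](sample) for name in names]


def fit_coefficients(samples, targets, names):
    """Stage 2: ordinary least squares via the normal equations."""
    rows = [design_row(s, names) for s in samples]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def mean_squared_error(samples, targets, names, coeffs):
    """Average squared difference between model output and targets."""
    errors = [
        sum(c * v for c, v in zip(coeffs, design_row(s, names))) - y
        for s, y in zip(samples, targets)
    ]
    return sum(e * e for e in errors) / len(errors)


def select_and_fit(samples, targets, size_penalty=1e-6):
    """Stage 1 stand-in: score every feature subset and keep the best one."""
    best = None
    all_names = list(CANDIDATE_FEATURES)
    for size in range(1, len(all_names) + 1):
        for names in combinations(all_names, size):
            coeffs = fit_coefficients(samples, targets, names)
            score = mean_squared_error(samples, targets, names, coeffs)
            score += size_penalty * size  # prefer smaller feature sets on ties
            if best is None or score < best[0]:
                best = (score, names, coeffs)
    return best[1], best[2]


# Synthetic, noise-free data from a made-up law: p = 50 + 30*u*t + 10*t**2.
GRID = [0.0, 0.25, 0.5, 0.75, 1.0]
SAMPLES = [{"u": u, "t": t} for u in GRID for t in GRID]
TARGETS = [50 + 30 * s["u"] * s["t"] + 10 * s["t"] ** 2 for s in SAMPLES]

if __name__ == "__main__":
    names, coeffs = select_and_fit(SAMPLES, TARGETS)
    print(names, [round(c, 6) for c in coeffs])
```

The small size penalty is a design choice of this sketch: it breaks ties in favor of fewer features, loosely mirroring the need to bound the search space. The fitted model stays linear in its coefficients even though the selected features are non-linear in the raw inputs.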

\subsection{Motivation}

One of the big challenges in data centers is the power-efficient management of system resources. Data centers consume from 10 to 100 times more power per square foot than typical office buildings \cite{Scheihing:CreatingEnergyEfficient:07}, even consuming as much electricity as a city \cite{Markoff:Intel:02}. Consequently, careful management of the power consumption in these infrastructures is required to drive Green Cloud computing \cite{Buyya:EnergyEfficientManagement:10}.

Cloud computing addresses the problem of costly computing infrastructures by providing dynamic resource provisioning on a pay-as-you-go basis, and it is now considered a valid alternative to owned high performance computing (HPC) clusters. There are two main incentives for this emerging paradigm: first, the Cloud's utility-based usage model allows clients to pay per use, increasing user satisfaction; second, only a relatively low investment is required for the remote devices that access the Cloud resources \cite{Chen:ProfilingVMs:11}.

Besides economic incentives, the Cloud model also provides benefits from the environmental perspective, since the computing resources are managed by Cloud service providers but shared among all users, which increases their overall utilization \cite{Berl:EECC:2010}. This translates into a reduced carbon footprint per executed task, diminishing CO$_{2}$ emissions. Schneider Electric's report on virtualization and Cloud computing efficiency \cite{SchneiderReport} confirms that annual energy savings of about 17\% had been achieved by 2011 through virtualization technologies.

However, modern data centers are proliferating rapidly due to the current increase in applications offered through the Cloud. A single data center, which houses the computer systems and resources needed to offer these services, has a power consumption comparable to that of 25,000 households \cite{Kaplan_Forrest_Kindler_2008}. As a consequence, the contribution of data centers to the overall consumption of modern cities is increasing dramatically. Therefore, minimizing the energy consumption of these infrastructures is a major challenge in reducing both their environmental and economic impact.

Managing energy-efficient techniques and aggressive optimization policies requires a reliable prediction of the effects caused by the different procedures throughout the data center. Server heterogeneity and the diversity of data center configurations make it difficult to infer general models. In addition, the dependence of power on non-traditional factors that affect the consumption patterns of these facilities (such as static consumption and its dependence on temperature) must be considered in order to achieve accurate power models.

These power models facilitate the analysis of several architectures from the perspective of power consumption and allow designers to devise efficient techniques for energy optimization. Data center designers have been hampered by the lack of accurate power models for energy-efficient provisioning and real-time management of the computing facilities. Therefore, a fast and accurate method is required to predict overall power consumption.

The work proposed in this paper makes substantial contributions in the area of power modeling of Cloud servers taking into account these factors. We envision a powerful method for the automatic identification of fast and accurate power models that target high-end Cloud server architectures. Our methodology considers the main sources of power consumption as well as the architecture-dependent parameters that drive today’s most relevant optimization policies.

\subsection{Contributions}

Our work makes the following contributions:

\begin{itemize}
\item We propose a method for the automatic generation of fast and accurate models adapted to the behavior of complex systems.
\item The resulting models include combinations and correlations of variables thanks to the FE and SR performed by GE. The models therefore incorporate the optimal selection of representative features that best describe system performance.
\item Through the combination of GE and classical regression provided by our approach, the inferred models have linearity, convexity and differentiability properties.
\item As a case study, different power models have been built and tested for a high-end server architecture running several real applications that are commonly found in today's Cloud data centers, achieving low error when compared to real measurements.
\item Testing for different applications (web search engines, and both memory and CPU-intensive applications) shows an average error of 3.98\% in power estimation.
\end{itemize}
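
To make the linearity property concrete, one plausible form of the inferred model is a linear combination of the $K$ features selected by GE. This form is an assumption made here for illustration, and the symbols $\hat{y}$, $\beta_k$, $\phi_k$ and $K$ are not taken from the text:
\[
  \hat{y}(\mathbf{x}) = \beta_0 + \sum_{k=1}^{K} \beta_k \, \phi_k(\mathbf{x})
\]
Each $\phi_k$ may be a non-linear expression of the raw inputs $\mathbf{x}$, yet $\hat{y}$ remains linear in the coefficients $\beta_k$. Under this reading, the squared-error fitting problem is a quadratic function of the coefficients and is therefore convex and differentiable in them.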

The remainder of this paper is organized as follows. Section~\ref{relWork} reviews related work on this topic. Section~\ref{algorithm} describes the background algorithms used for model optimization. Section~\ref{probMod} presents the methodology. Section~\ref{caseStudy} provides a case study in which our modeling methodology is applied. Section~\ref{results} describes the experimental results in detail. Finally, Section~\ref{conclusions} draws the main conclusions.