The FlexGP Project
Genetic programming is a mature, robust multi-point search technique (inspired by evolution) that supports readable, flexibly specified learning representations, which can readily express linear or non-linear data relationships. It is well suited to parallelization and machine learning, and it has a strong record in real-world domains.
The project designs and implements different distributed cloud models of genetic programming specialized for machine learning. Each instantiation is abstracted and programmed in such a way that it can execute on a massively parallel platform: a private cloud, a public cloud, or an open-infrastructure volunteer compute network. Each is able to elastically solve diverse problems that require similarly diverse solution approaches, factored along conventional and new algorithm dimensions.
FlexGP is both the name of our project and the name of some (but not all) systems within the scope of the project.
Among our cloud-based systems is Cloud-FlexGP. The latest version, slickly and more inexpensively(!) executing on CSAIL's internal cloud and described in this publication, is launched and scaled elastically in a completely decentralized manner. Its specialty is large-scale ensemble regression. (Contact us for a pre-print about this method.) The first version, executing on EC2, is an island model supported by Java socket communication. Our paper on it, entitled "FlexGP: Genetic Programming on the Cloud," won the Best Paper award at the 2012 EvoApplications conference on Parallel Architectures and Distributed Infrastructures ("EvoPAR").
Another system is the EC-Star FlexGP platform, which is a massive-scale, hub-and-spoke, distributed rule-based GP classification system. It is documented in a paper presented at GPTP-2012. We are using EC-Star to solve supervised machine learning problems in the domain of medical informatics. Specifically, we are working on the prediction of arterial blood pressure in critically ill patients in ICUs.
The project has invested in cloud-scaling a number of bio-inspired algorithms (e.g. particle swarm optimization and the covariance matrix adaptation evolution strategy). This paper documents the dynamic and randomized topology and migration strategy of the latter algorithm. We are actively using both of these systems to develop scalable wind turbine layout optimization methods. Contact us for early release information on how cell development modeling solves this problem with these methods. More info on our wind research is here.
We have also developed a (less used) large-size, single-population GP which executes with Hadoop MapReduce. This system is named after the FlexEA library we developed for it. It is described here.
Our perspective: There's a lot of hype about the exponential growth of data. We take the growth as obvious and are interested (obviously) in the opportunities BigData presents for FlexGP.
The FlexGP project stresses scalability and, in the data realm, that implies the design of ML systems that can handle lots of data. In this context, we're avoiding the "BigData" buzzword intentionally. That's because we feel it's (just) a matter of scale, and we've been thinking about that all along. We feel that one important aspect of the data scaling situation in ML can be summarized as follows: Now we have too much data. How can we make sure we don't waste time looking at too much of it? Just because there are more compute resources to munch it, we shouldn't gorge! So, how can we determine when we have looked at enough of it?
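One textbook way to make "enough" concrete, offered here only as an illustration and not as the method FlexGP uses, is Hoeffding's inequality. For a quantity bounded in [0, 1] (say, a per-row error), it gives a sample size that guarantees the sample mean is within a tolerance of the true mean with a chosen probability. The function name and the parameters `epsilon` and `delta` below are hypothetical, and the bound assumes rows are drawn independently at random.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Rows needed so that the sample mean of a [0, 1]-bounded quantity is
    within `epsilon` of the true mean with probability at least 1 - delta.

    Follows from Hoeffding's inequality:
        P(|sample_mean - true_mean| >= epsilon) <= 2 * exp(-2 * n * epsilon**2)
    Assumption: rows are sampled independently at random.
    """
    if not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


# Example: tolerance of 0.01 at 95% confidence.
n_rows = hoeffding_sample_size(epsilon=0.01, delta=0.05)
print(n_rows)
```

The notable property is that the required number of rows does not depend on how large the full dataset is, which is one sense in which a learner can stop looking without having seen everything.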
These questions are essential because large quantities of data invert the old ML perspective. Before large-scale data, we worried about how we split training and test data because we didn't have enough. Now, we've got to ask: when can we shout "enough already!"? It is important to remember that ML is about generalization. We want to infer, from exemplars, properties that are accurate in unseen data similar to our exemplars. So, we have to be a bit cautious, given that when a dataset is infinite in a practical sense, it may present all truth. Working in this regime is kind of counter to ML's point.
We think this context implies we need to sample intelligently. Part of that intelligence involves estimating the properties of the data we've observed and calculating how we can proceed, with known certainty, to learn with out-of-sample reliability. On the cloud-based FlexGP platform, we are investigating how to determine when/how we can be confident that the rest of the data more or less follows the properties we've observed in our sampling. Another part of the intelligence is sampling efficiently. Here, on the cloud-based platform, we are investigating sampling ideas based on distributed scaling. We will try to keep ourselves honest by insisting we count how many times we touch the data and trying to minimize this without excessive loss of information. Our EC-Star platform has a scalable answer to massive-scale data: distributed random sampling with efficient sampling coordination, which introduces more and more data only to the promising solutions identified thus far.
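The idea of showing more data only to promising solutions, while counting every touch of the data, can be sketched in a few lines. This is a single-process toy under stated assumptions, not EC-Star's actual coordination scheme: the function `progressive_evaluation` and its parameters (`error_fn`, `batch_size`, `rounds`, `keep_fraction`) are hypothetical names chosen for illustration.

```python
import random


def progressive_evaluation(candidates, data, error_fn,
                           batch_size=50, rounds=4, keep_fraction=0.5, seed=0):
    """Evaluate candidates on random batches, round by round, and keep only
    the best-scoring fraction after each round.

    Returns (survivors, touches), where `touches` counts every single
    (candidate, row) evaluation performed.
    """
    rng = random.Random(seed)
    # Running totals per candidate index: [sum of errors, rows seen].
    stats = {i: [0.0, 0] for i in range(len(candidates))}
    alive = list(stats)
    touches = 0
    for _ in range(rounds):
        for i in alive:
            # Each surviving candidate draws its own random batch of rows.
            batch = rng.sample(data, min(batch_size, len(data)))
            for row in batch:
                stats[i][0] += error_fn(candidates[i], row)
                stats[i][1] += 1
                touches += 1
        # Rank survivors by mean error so far; drop the unpromising ones.
        alive.sort(key=lambda i: stats[i][0] / stats[i][1])
        alive = alive[:max(1, int(len(alive) * keep_fraction))]
    return [candidates[i] for i in alive], touches


# Toy usage: rows are (x, y) pairs with y = 2x + 1; candidates are guesses.
data = [(x, 2 * x + 1) for x in range(1000)]
candidates = [lambda x: 2 * x + 1, lambda x: x, lambda x: 3 * x, lambda x: -x]
squared_error = lambda f, row: (f(row[0]) - row[1]) ** 2
survivors, touches = progressive_evaluation(candidates, data, squared_error)
print(len(survivors), touches)
```

In this toy run, four candidates over four rounds cost 400 row evaluations, whereas scoring every candidate on all 1000 rows would cost 4000. The trade-off is that a good candidate can be discarded early on an unlucky batch, which is why the batch size and the fraction kept matter.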
Another aspect of our focus is investigating the effectiveness of massive data. In a limited data setting, one typically incorporates certain prior assumptions and insight about the data into the model (e.g. by choosing a parametric form). However, in the massive data setting, we can let the "data speak for itself" in order to discover truly data-driven knowledge and details beyond our insight. Thus, we are investigating novel and efficient representation and (nonparametric) modeling methods that ideally excel in the massive (time-series) data setting. In addition, we examine the question of the bias of the model versus the bias of the data at various scales of data.
You can never have too much data, but you can waste your time looking at too much of it. FlexGP systems exploit massive data by capitalizing on their computational scale and their scaling architectures (meaning, in FlexGP's case, our cloud or commercial-volunteer client resources). We aim to revel in the data, not drown in it!
Our Sponsors for this project include:
Project Keywords: cloud computing, volunteer computing, genetic programming, machine learning, MIMIC II medical database