Model Building for Data Analytics
Prerequisite – Life Cycle Phases of Data Analytics
Model Building:
In this phase, the data science team develops datasets for training, testing, and production purposes. These datasets enable data scientists to develop the analytical methods and train them, while holding aside some of the data for testing the model. In addition, the team builds and executes models based on the work done in the model planning phase. The team also considers whether its existing tools will suffice for running the models, or whether it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing).
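As a minimal sketch of this hold-out idea in Python, the data can be partitioned with scikit-learn's train_test_split. The file name analytics_data.csv and the target column are hypothetical placeholders, not part of any particular project:

```python
# Sketch: partition a dataset into training and test subsets.
# "analytics_data.csv" and the "target" column are hypothetical names.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("analytics_data.csv")
X = df.drop(columns=["target"])  # input variables
y = df["target"]                 # variable the model should predict

# Hold aside 20% of the data for testing the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```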
Free or open-source Tools:
R and PL/R, Octave, WEKA, Python
Commercial Tools:
MATLAB, STATISTICA
Common Tools for the Model Building Phase:
R and PL/R:
R was described earlier in the model planning phase; PL/R is a procedural language for PostgreSQL that wraps R. Using this approach means that R commands can be executed inside the database.
Octave:
A free, open-source programming language for computational modeling that has some of the functionality of MATLAB. Because it is freely available, Octave is used in many major universities when teaching machine learning.
WEKA:
A free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.
Python:
A programming language that provides toolkits for machine learning and analysis, such as scikit-learn, NumPy, SciPy, and pandas, along with data visualization using Matplotlib.
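As a small, self-contained illustration of these toolkits, the sketch below fits a scikit-learn model to synthetic NumPy data and visualizes it with Matplotlib. The data is invented purely for the example:

```python
# Sketch: fit and visualize a simple model using the Python toolkits above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))        # NumPy array of inputs
y = 3.0 * X.ravel() + rng.normal(0, 2, 100)  # noisy linear relationship

model = LinearRegression().fit(X, y)         # scikit-learn estimator
print("estimated slope:", model.coef_[0])

plt.scatter(X, y, s=10, label="data")        # Matplotlib visualization
plt.plot(X, model.predict(X), color="red", label="fit")
plt.legend()
plt.show()
```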
SQL:
In-database SQL implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.
MADlib:
An open-source library of machine learning algorithms that can be executed in the database, for PostgreSQL or Greenplum.
Lifecycle of Model Building:
- Define success
- Explore data
- Condition data
- Select variables
- Balance data
- Build models
- Validate
- Deploy
- Maintain
Data exploration is used to get a sense of the data and to develop a first-pass assessment of its quality, quantity, and characteristics. Visualization techniques can also be applied, although this can be a difficult task in high-dimensional spaces with many input variables. Data conditioning groups the data into a form suitable for the chosen modeling techniques and then rescales it; rescaling can be an issue when variables are coupled. Variable selection is very important for developing a quality model, as sketched below.
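A brief sketch of exploration, conditioning, and variable selection, reusing the hypothetical X_train and y_train from the earlier split. StandardScaler and SelectKBest are one possible choice of rescaler and selector, not the only ones, and the sketch assumes a numeric regression target with at least five input variables:

```python
# Sketch: explore, rescale, and select variables from the training data.
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

print(X_train.describe())   # first-pass check of quality and characteristics
X_train.hist(bins=30)       # visual exploration of each input variable
plt.show()

scaler = StandardScaler()   # rescaling, which matters when variables are coupled
X_scaled = scaler.fit_transform(X_train)

# Keep the 5 variables most related to the target (assumes >= 5 inputs
# and a numeric regression target).
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X_scaled, y_train)
```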
This selection process is implicitly model-dependent, since it is used to decide which combination of variables should be used in ongoing model development. Data balancing partitions the data into appropriate subsets for training, testing, and validation. Model building then focuses on the chosen algorithms; symbolic regression is a well-known technique, but other techniques can also be preferred.
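Continuing the sketch, balancing is shown as a further split of the training data into fitting and validation subsets. Since symbolic regression needs a dedicated library, a random forest is used below purely as a stand-in model:

```python
# Sketch: balance the data into fitting/validation subsets, then build a model.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

X_fit, X_val, y_fit, y_val = train_test_split(
    X_selected, y_train, test_size=0.25, random_state=42
)

# A random forest stands in here for whatever algorithm the team chose.
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_fit, y_fit)
```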
Model validation is important for developing trust in the model prior to its use. A good model is robust and has well-defined accuracy; an untrusted or inaccurate model is potentially dangerous, both financially and physically, so a trust metric is very important for techniques such as symbolic regression and stacked analytic networks.
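To round out the sketch, validation can combine a score on the held-out validation subset with cross-validation as a rough robustness check:

```python
# Sketch: validate the model before trusting or deploying it.
from sklearn.model_selection import cross_val_score

print("validation R^2:", model.score(X_val, y_val))

# Cross-validation gives a rough sense of robustness across data subsets.
scores = cross_val_score(model, X_selected, y_train, cv=5)
print("cross-validated R^2: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```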