June 1, 2018

Navigating Data Science Projects Part III – Model Development

In the previous posts, we focused on data and data engineering, specifically their roles in a data science project. In this post, we will focus on model development.

Modeling

Modeling is a wide term denoting methods (based on statistics, mathematics, machine learning or combinations thereof) used to address questions related to process dynamics in various business contexts such as:

  1. Why do things happen the way they do in the context of a given, measurable business process?
  2. Is there a way for us to control a certain business process?
  3. Can we forecast what the next value or the next set of values is going to be, given a series of historical values for our variable of interest?
  4. Can we segment our measurements/observations based on a set of business criteria?
  5. Can we extract knowledge out of our document repository?

Typically, the types of machine learning algorithms used to address questions (1), (2) and (4) above fall under the broad categories of:

  • Supervised, where labeled data is available (naive Bayes, random forest, linear regression, gradient boosting, neural nets, etc.)
  • Unsupervised, where labeled data is not available (k-means clustering, hierarchical clustering, association rules, etc.)
  • Semi-supervised, where some labels – but not enough – are available; in practice, these methods boil down to one of the two previous categories
  • Reinforcement learning, an iterative learning process that aims to optimize a selected cost function (maximizing reward or minimizing loss, as the case may be)

The questions in modeling categories (1 & 2):

  1. Why do things happen the way they do in the context of a given, measurable business process?
  2. Is there a way for us to control a certain business process?

These can be cast as classification or regression problems if labeled data exist. The former pertains to predicting the class of a binary or multi-class categorical dependent variable (yes or no; red, blue or white; etc.), while the latter pertains to predicting a continuous dependent variable, given a number of independent ones.

Examples of regression modeling:

  • How does the number of defective manufactured parts depend on temperature?
  • How does signal strength vary by distance and type of receiver?
  • How do wages vary by education level, age, location and type of occupation?
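As a minimal sketch of the first bullet, a one-predictor ordinary least squares fit relating defect counts to temperature; the (temperature, defect-count) pairs below are purely hypothetical illustrative values.

```python
# Ordinary least squares for a single predictor: fit defects ~ temperature.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

temps = [20, 25, 30, 35, 40]   # independent variable (temperature)
defects = [2, 3, 5, 6, 8]      # dependent variable (defective parts)
slope, intercept = fit_line(temps, defects)
predicted_at_45 = slope * 45 + intercept
```

The fitted slope answers the "how does it depend" part of the question directly: each extra degree adds roughly `slope` defective parts under this toy data.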

Examples of classification:

  • Can we predict whether a client will churn given his/her profile (e.g. age, income and location)?
  • Should a credit card be issued given a specific applicant profile?
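A minimal sketch of the churn question, using a nearest-centroid classifier; the (age, income) feature pairs and the class labels are hypothetical.

```python
# Nearest-centroid classification: label a client by the closest class mean.

def centroid(points):
    """Component-wise mean of a list of feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def predict(x, centroids):
    """Assign x the label of the nearest class centroid."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: dist2(x, centroids[label]))

churned = [(22, 30), (25, 28), (30, 35)]   # (age, income) of clients who left
stayed = [(45, 80), (50, 90), (40, 75)]    # clients who stayed
centroids = {"yes": centroid(churned), "no": centroid(stayed)}
label = predict((24, 32), centroids)
```

A production model would of course use a richer profile and a stronger algorithm (any of the supervised methods listed earlier), but the shape of the problem – features in, class label out – is the same.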

Questions of modeling category (4):

  1. Can we segment our measurements/observations based on a set of business criteria?

can be addressed as classification or regression problems if labeled data are available, or as clustering problems if labeled data are not.
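For the unlabeled case, a minimal one-dimensional k-means sketch segmenting customers by spend; the spend values, the choice of k=2 and the initial centers are hypothetical.

```python
# 1-D k-means: alternate assignment and update steps for a fixed number
# of iterations to split values into k segments.

def kmeans_1d(values, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [10, 12, 11, 90, 95, 88]           # monthly spend per customer
centers, clusters = kmeans_1d(spend, centers=[0, 100])
```

The resulting clusters are the segments; in practice the business criteria mentioned above determine which features go into the distance computation.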

Questions pertaining to modeling category (3):

  1. Can we forecast what the next value or the next set of values is going to be, given a series of historical values for our variable of interest?

This can be addressed via time series analysis. This is a phenomenological approach in the sense that, in its basic form, it does not provide insight into relationships between the dependent variable and possible independent predictors.

Examples of time series analysis include:

  • What will demand be in three days?
  • What will our competitor’s price be in seven days?
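A minimal sketch of the demand question using simple exponential smoothing as a one-step-ahead forecaster; the demand series and the smoothing factor alpha=0.5 are hypothetical.

```python
# Simple exponential smoothing: the forecast is a running level that
# blends each new observation with the previous level.

def ses_forecast(series, alpha=0.5):
    """Return the one-step-ahead forecast after smoothing the series."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

demand = [100, 104, 101, 106, 103]   # historical daily demand
next_day = ses_forecast(demand, alpha=0.5)
```

Note that, consistent with the phenomenological point above, the forecast uses only the history of the variable itself, with no external predictors.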

That leaves modeling category (5):

  1. Can we extract knowledge out of our document repository?

This involves a different set of tools. Document repositories may contain text, labeled images, captions, etc. The general question in such cases is: “Can we set up an independent, autonomous process that can analyse a corpus of documents and provide insight into their contents?”
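One common building block for such text analysis is TF-IDF, which scores terms highly when they are frequent in one document but rare across the corpus. A minimal sketch, over three hypothetical one-line "documents":

```python
# TF-IDF: term frequency in a document weighted by the inverse of the
# term's document frequency across the corpus.

import math
from collections import Counter

docs = [
    "invoice payment overdue payment",
    "invoice shipped on time",
    "contract renewal invoice terms",
]
tokenized = [d.split() for d in docs]
df = Counter(w for doc in tokenized for w in set(doc))  # document frequency
n_docs = len(docs)

def tfidf(doc_tokens):
    """Map each term in the document to its TF-IDF score."""
    tf = Counter(doc_tokens)
    return {w: (c / len(doc_tokens)) * math.log(n_docs / df[w])
            for w, c in tf.items()}

scores = tfidf(tokenized[0])
# "invoice" appears in every document, so its score collapses to zero,
# while "payment" stands out as characteristic of the first document.
```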

Model tuning

Model tuning has to do with the level of accuracy sought. The term “accuracy” is used loosely here, as each model may employ a different metric as a measure of “accuracy”. Indicatively, such metrics can be:

  • The root mean square error (RMSE)
  • The receiver operating characteristic (ROC) curve
  • The area under the ROC curve (AUC)
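Two of these metrics can be sketched from their definitions: RMSE for regression, and AUC for a binary classifier via its pairwise-ranking interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative). The predictions and labels below are hypothetical.

```python
# RMSE and AUC computed from first principles on toy predictions.

import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def auc(labels, scores):
    """P(random positive scored above random negative); ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n_) + 0.5 * (p == n_) for p in pos for n_ in neg)
    return wins / (len(pos) * len(neg))

reg_error = rmse([3, 5, 7], [2, 5, 9])                  # sqrt((1 + 0 + 4) / 3)
clf_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```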

Whereas a modest “accuracy” level can be achieved in the first stages of modeling, raising it further may require disproportionate amounts of additional effort, time and resources: gathering and using more data, generating new features, trying different data transformations, or adopting entirely new models and derivatives thereof.

Setting up prototype tests and deployment

At the end of the day, whatever outcome the data science aspect of a project generates, it needs to be measured against a baseline. The baseline can either be some client-established method and associated metrics, or independently established.

Regardless of the baseline type, a time interval needs to be set aside to run the modeled process and compare results in real-life settings. For example, if we generate a stock price model forecasting certain stock prices 3 days ahead, we need to test how our forecasts compare to real-life prices as posted on a daily basis for a number of days. In the general case, the amount of time needed to carry out such tests can be affected by a number of factors (confidence interval chosen, the frequency of data sampling, etc). The rule of thumb is that the longer the time allotted to testing, the more reliable the test results are expected to be.
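The stock example above can be sketched as a simple scoreboard over the test window: the model's forecasts against a naive "persistence" baseline (last known price carried forward), compared on mean absolute error. All prices and forecasts below are hypothetical.

```python
# Compare a model's forecasts against a naive baseline over a test window.

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual_prices = [101.0, 103.0, 102.5, 104.0]   # posted over the test days
model_forecasts = [100.5, 102.0, 103.0, 104.5]
naive_baseline = [100.0] * 4                   # persistence: last price seen

model_err = mae(actual_prices, model_forecasts)
baseline_err = mae(actual_prices, naive_baseline)
beats_baseline = model_err < baseline_err      # the bar the model must clear
```

Whether the baseline is client-established or independently established, the comparison takes this same form; only the metric and the baseline forecasts change.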

Once prototype tests have concluded successfully, the model enters its production phase. It now needs to be converted to a scaled version in the language of choice as per business requirements. Some considerations include:

  • Should the production version of the model be a plain old Java object (POJO), a Python module, or an R .rda/.rds file?
  • What infrastructure is it supposed to run on?
  • Is there a need for a cluster environment (for example, any combination of Hadoop / Spark / H2O clusters), or would a standalone server do?
  • What memory and network requirements should be taken into account for production performance, given the size of the training set and the frequency of generating an updated model?
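As a minimal sketch of one packaging option from the list above, a fitted model can be serialized to disk with pickle so that a separate serving process can load it; the "model" here is just a hypothetical dictionary of fitted coefficients standing in for a real object.

```python
# Serialize a fitted model so the training side and the serving side
# can be separate processes (or separate machines sharing storage).

import os
import pickle
import tempfile

model = {"slope": 0.3, "intercept": -4.2}   # stand-in for a fitted model

path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(path, "wb") as f:                 # what the training side does
    pickle.dump(model, f)

with open(path, "rb") as f:                 # what the serving side does
    restored = pickle.load(f)
```

The same handoff pattern applies to the POJO and R options; what changes is the serialization format and the runtime that loads it.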

This is a non-exhaustive list of questions that need to be addressed well in advance, preferably as early as the project requirements phase. Leaving them to the very end introduces significant risk to deploying the model into production.

Model maintenance

Model maintenance is about gauging the validity of the model over time. The validity of a model generated from a given data set is not, in general, expected to remain constant: if the current data follow a different distribution than the historical data the model was built on, the model may start to falter. It is therefore necessary that the client or business stakeholder is informed about this aspect of modeling, and that some form of maintenance is in place, involving, among other things, re-modeling based on new data and re-tuning the model.
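A minimal sketch of such a maintenance check: flag the model for re-training when the mean of incoming data drifts from the training mean by more than a threshold, expressed in training standard deviations. The data and the 2-sigma threshold are hypothetical; real drift monitoring would typically compare full distributions, not just means.

```python
# Flag re-training when the mean of current data drifts too far from
# the mean of the training data, measured in training std deviations.

import math

def needs_retraining(train, current, sigmas=2.0):
    n = len(train)
    mean_t = sum(train) / n
    std_t = math.sqrt(sum((x - mean_t) ** 2 for x in train) / n)
    mean_c = sum(current) / len(current)
    return abs(mean_c - mean_t) > sigmas * std_t

train_data = [10, 11, 9, 10, 12, 8, 10]
stable = [10, 9, 11]     # looks like the training distribution
shifted = [18, 19, 17]   # large shift: time to re-model
```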

Summary

In closing, we should emphasize that each data science project is usually unique, in the sense of posing a new problem that calls for a custom solution. All points presented across the three blog posts (Data, Data Engineering, Model Development) should therefore be refined and viewed in light of the particular circumstances, constraints and conditions of the project at hand.