May 18, 2018

Navigating Data Science Projects Part II – Data Engineering

In our previous post, we discussed the role data plays in any data science project. In this post, we focus on the activities that occur once data has been collected. In most scenarios, we partner a data engineer with a data scientist. This approach allows us to scale the team and make the best use of each skill set.

Data checks and descriptives

Once data has been accessed and reviewed with the business, the next step is to document the data available. Business definitions for table columns, entity relationship diagrams for relational databases, the points of contact who supplied the information, and similar details should be thoroughly documented. Our preference is to use tools like Lucidchart and Google Drive to collaborate on this documentation, and Atlassian Confluence to centralize the knowledge base and distribute information across a diverse data science team.

Descriptive statistics need to be generated for every dimension involved, i.e., any summary measure or plot that helps us understand how the data in question is distributed. Examples include:

  • Mean
  • Median
  • Mode
  • Density plots for continuous variables
  • Bar charts of category frequencies for categorical variables, broken down by the response variable of interest
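
As a rough illustration of the measures listed above, the following sketch computes basic descriptives with Python's standard library. The sample values and the names `describe_numeric`, `category_counts_by_response`, `order_values`, `plan` and `churned` are hypothetical and chosen only for this example; plots are left out, and the category counts stand in for the data behind a frequency chart.

```python
from collections import Counter
from statistics import mean, median, multimode


def describe_numeric(values):
    """Return basic descriptives for a list of numbers, skipping None entries."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": mean(present),
        "median": median(present),
        "modes": multimode(present),  # every most-frequent value
    }


def category_counts_by_response(categories, responses):
    """Count category frequencies separately for each response value."""
    counts = {}
    for category, response in zip(categories, responses):
        counts.setdefault(response, Counter())[category] += 1
    return counts


# Hypothetical sample data, for illustration only.
order_values = [12.0, 15.5, None, 15.5, 40.0]
plan = ["basic", "pro", "basic", "pro", "pro"]
churned = [True, False, True, False, True]

summary = describe_numeric(order_values)
by_response = category_counts_by_response(plan, churned)

print(summary)
print(by_response)
```

Keeping the missing count next to the mean and median makes it harder to overlook how much of a column the summary actually covers.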

At the same time, questions regarding the integrity and validity of the data should be addressed. These include:

  • What is the meaning of NULLs, NAs, 0s?
  • What about different data formats encountered or gaps in the data timeline?
  • Are date and time columns consistent and accurate?
  • Can the client or business stakeholder provide feedback on any changes to data processes that took place during the data collection phase and might explain strange behavior observed in the data?
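
Some of these questions can be partly answered by profiling the data before talking to the business. The sketch below is one possible approach using only Python's standard library: it counts the different kinds of "empty" values separately, flags date strings that match none of the expected formats, and lists gaps in the timeline. The token list, the date formats, the one-day gap threshold and all sample values are assumptions made for the example.

```python
from datetime import date, datetime, timedelta

# Assumed spellings of "missing"; real data may use others.
MISSING_TOKENS = {"", "NA", "N/A", "NULL", "null", "NaN"}
# Assumed date formats; extend to match the formats found in the source.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def profile_missing(values):
    """Count None, missing-like text tokens, and zeros as separate cases."""
    profile = {"none": 0, "token": 0, "zero": 0}
    for v in values:
        if v is None:
            profile["none"] += 1
        elif isinstance(v, str):
            if v.strip() in MISSING_TOKENS:
                profile["token"] += 1
        elif v == 0:
            profile["zero"] += 1
    return profile


def parse_date(text):
    """Return a date if text matches a known format, otherwise None."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def find_date_gaps(dates, max_gap_days=1):
    """Return pairs of consecutive observed dates that are too far apart."""
    ordered = sorted(set(dates))
    limit = timedelta(days=max_gap_days)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical sample data, for illustration only.
readings = [3, 0, None, "NA", 7, "NULL", 0.0]
raw_dates = ["2018-05-01", "05/02/2018", "2018-05-03", "2018-05-07", "not a date"]

missing_profile = profile_missing(readings)
parsed = [parse_date(s) for s in raw_dates]
unparsed = [s for s, d in zip(raw_dates, parsed) if d is None]
gaps = find_date_gaps([d for d in parsed if d is not None])

print(missing_profile, unparsed, gaps)
```

The counts do not say what a NULL or a 0 means; they only give the conversation with the business stakeholder something concrete to start from.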

Next, the data needs to be checked for outliers. There are several methods to identify outliers, depending on the distribution of the data involved. Some are purely statistical and others machine-learning based. Once outliers have been identified, the client or business stakeholder needs to be informed about their existence, and a sample of such outliers needs to be shared so it can be confirmed they coincide with the business criteria for identifying outliers. Finally, a decision needs to be reached on how to address them.
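
As one example of a purely statistical method, the sketch below flags values that fall outside Tukey's fences, i.e., more than 1.5 times the interquartile range below the first quartile or above the third. This is only one of several options and is not necessarily the method used on any given project; the function name `iqr_outliers` and the sample data are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]


# Hypothetical daily unit sales, for illustration only.
daily_units = [10, 12, 11, 13, 12, 11, 95]

flagged = iqr_outliers(daily_units)
print(flagged)
```

The flagged values are the kind of sample that can be shared with the client or business stakeholder to confirm whether they are outliers by business criteria as well.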

Feature engineering

To form the most informative set of predictive variables, we usually need to draw on all available business knowledge, either to transform existing variables or to generate new ones that we expect to capture most of the variance of the dependent variable. This step involves addressing questions such as:

  1. Does the data need to be transformed to be useful?
  2. If so, what type of transformation(s) do we need to apply?
  3. Can we think of new metrics/variables we may be able to construct from existing ones?
  4. Do we need to enrich our data set with external data from third parties?

Once these questions have been addressed, a data engineer executes and automates the transformations using the technologies available. These may include analytic workbench tools, or native languages combined with code deployment solutions.
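
To make the idea concrete, here is a minimal sketch of what an automated transformation step might look like in plain Python. The fields `revenue` and `orders` and the derived features `log_revenue` and `avg_order_value` are assumptions for illustration; the right transformations depend on the business problem.

```python
import math


def engineer_features(record):
    """Return a copy of the record with two hypothetical derived features."""
    features = dict(record)
    # Transformation: compress a skewed, non-negative variable.
    features["log_revenue"] = math.log1p(record["revenue"])
    # New metric built from existing variables; 0.0 when there are no orders.
    orders = record["orders"]
    features["avg_order_value"] = record["revenue"] / orders if orders else 0.0
    return features


# Hypothetical customer records, for illustration only.
customers = [
    {"customer_id": 1, "revenue": 200.0, "orders": 4},
    {"customer_id": 2, "revenue": 0.0, "orders": 0},
]

engineered = [engineer_features(c) for c in customers]
print(engineered)
```

Wrapping each transformation in a small function keeps it easy to rerun on new data and to review with the data scientist.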

Summary

The data we collect from the client, while a good starting point, is rarely in a format ready for data science solutions. Time must be invested to carefully analyze the data and create a data set that accurately represents the business problem. Partnering a data engineer with a data scientist ensures that your data scientists spend most of their time modeling rather than data wrangling. Clear communication channels and well-defined roles and responsibilities between the data engineer and data scientist ensure this partnership operates efficiently. Finally, one can greatly improve the chances of success in the Modeling phase by implementing a validation step with the client to ensure the data has been interpreted and used appropriately for the business context.

Continue Reading: Navigating Data Science Projects Part III – Model Development