March 23, 2018

Analytic DevOps in AWS - Part 1

For many of our clients, the early days of analytics are often filled with ad-hoc analytic development processes. Each project is tackled in a different manner, and no standard analytic development process exists. Data validation, analytic validation, and QA are rarely part of the process or project plan, and the development and production environments are one and the same – often someone’s work laptop. While this approach works fine in the early days and allows analytic groups to react quickly to business needs, as the analytics organization matures and data-driven decisions become a reality, the need for a standard, robust, and reliable analytics DevOps process grows.

The Analytics DevOps Framework

At WebbMason Analytics, we define Analytics DevOps as a framework that ensures trusted, validated analytic capabilities are developed and deployed through a transparent, governed, repeatable, and highly automated process into a secure, scalable, and robust production environment. There are many components required to make this a reality, but the two we will focus on in these blog posts are the Analytics DevOps Process and the Analytics DevOps Architecture.

Analytics DevOps Process

The purpose of the Analytics DevOps process is to ensure that business stakeholders have the necessary, reliable information to make confident, data-driven decisions. This process outlines a step-by-step workflow from discovery through the release of data assets and decision support capabilities. Several areas of validation are performed to ensure the consistency and quality of the data being produced. The process also outlines the steps to deploy from the development environment to production. The process is represented below in a linear format; however, it is expected and encouraged that it execute in an agile fashion, so that after the final step of gathering feedback from end users, the process restarts to better refine and adapt to users’ needs.

Discovery

During the Discovery phase, the Data Engineer or Data Analyst engages with business stakeholders. This engagement involves defining specific business and technical requirements, including a definition of the final analytic product (usually in the form of a dataset, report, or dashboard). The expected timeframe for completion is also discussed to prioritize resources and manage expectations.

After the Data Engineer has gathered enough information from the business unit, they determine the requirements of the dataset to support that use case and develop a dataset contract. Areas that will be explored include, but are not limited to, the following (a minimal illustration of a contract follows the list):

  • Current data that is available
  • Frequency and history of the available datasets
  • Data that is not available but can become available through an additional data feed
  • Current datasets that are already in production
  • Desired schema to support the use case
  • The expected size of datasets
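
To make the idea concrete, here is a minimal, hypothetical sketch of what a dataset contract might capture, expressed as a Python dictionary. Every name and value below is illustrative rather than prescriptive; in practice the contract is usually a richer document stored alongside the project documentation.

```python
# A minimal, hypothetical sketch of a dataset contract captured as a Python
# dictionary. Field names and values are illustrative only.
dataset_contract = {
    "dataset_name": "customer_orders_daily",        # hypothetical dataset
    "business_owner": "Marketing Analytics",         # requesting business unit
    "refresh_frequency": "daily",                    # how often it is rebuilt
    "history_required": "24 months",                 # how far back data must go
    "expected_row_count": "~5M rows per month",      # rough size expectation
    "schema": [
        {"column": "order_id",    "type": "string",  "nullable": False},
        {"column": "order_date",  "type": "date",    "nullable": False},
        {"column": "customer_id", "type": "string",  "nullable": False},
        {"column": "order_total", "type": "decimal", "nullable": True},
    ],
    "sources": ["orders_feed", "customer_master"],   # upstream feeds already available
    "gaps": ["loyalty data requires a new feed"],    # data not yet available
}
```

Capturing the contract in a structured form like this makes it straightforward to reuse during validation later in the process.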

In some cases, the Data Engineer will determine that there is insufficient data to support the requested use case.  At this time, the Data Engineer will confer with the business to adjust their expectations or cancel the effort altogether.

Once the final dataset contract is developed, the Data Engineer gains approval to proceed from the business stakeholders.

Development

During the Development phase, the Data Engineer will analyze the needed data sources and implement any additional data feeds. Initializing additional data feeds will require inter-team and, in some cases, inter-company collaboration. This is often a significant time expense and should start as soon as the Data Engineer identifies the need for additional feeds.

After all necessary data sources are available, the Data Engineer will begin to develop the workflow to produce the desired dataset(s). Workflow automation or analytics workbench technologies will be utilized to produce and orchestrate the workflow. This portion of the development can take several days or several months and can employ various technologies such as Hive, Spark, Python, R, Athena, and Redshift.
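
As a simple illustration of one such workflow step, the PySpark sketch below joins two source tables and writes the resulting dataset to S3. The table names, columns, and S3 paths are hypothetical, and the actual workflow would typically be orchestrated by the workflow automation or workbench tool.

```python
# Minimal, hypothetical PySpark step: join raw orders to the customer master
# and write the curated dataset to S3 as Parquet. All names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("build_customer_orders_daily").getOrCreate()

orders = spark.read.parquet("s3://example-raw-zone/orders/")        # raw feed
customers = spark.read.parquet("s3://example-raw-zone/customers/")  # reference data

curated = (
    orders.join(customers, on="customer_id", how="left")
          .withColumn("order_date", F.to_date("order_timestamp"))
          .select("order_id", "order_date", "customer_id", "segment", "order_total")
)

# Partition by date so downstream tools (Athena, Redshift Spectrum, Tableau)
# can prune partitions efficiently.
curated.write.mode("overwrite").partitionBy("order_date") \
       .parquet("s3://example-curated-zone/customer_orders_daily/")
```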

Following the initial build of the dataset, work can begin on the decision support solution, either in the form of a report, dashboard, or API. For this post, we will focus on deploying a Tableau workbook as an example. The Data Engineer will create a new workbook using Tableau Desktop and configure its data source to point to the newly built dataset. From there, the Data Engineer or Analyst will develop the initial dashboard using this data source. A skilled Tableau developer can usually accomplish this task within a few days to a few weeks.

Initial Validation

Once the dataset is built, the Data Engineer will perform an initial round of validation. Upon satisfactory validation, the Data Engineer will place the dataset into QA, where an additional Data Engineer will conduct further validation. All validation efforts, for both datasets and Tableau dashboards, will follow the acceptance criteria developed during the Discovery and Development phases. Business stakeholders will conduct UAT of all validated artifacts.
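
As a hedged example, the sketch below shows the kind of lightweight, programmatic checks an initial validation pass might include, using pandas. The column names, thresholds, and acceptance criteria are assumptions that would in practice come from the dataset contract.

```python
# Hypothetical validation pass against the dataset contract.
# Column names and rules are illustrative assumptions.
import pandas as pd

def validate_dataset(df: pd.DataFrame) -> list:
    """Return a list of human-readable validation failures (empty list = pass)."""
    failures = []

    required_columns = {"order_id", "order_date", "customer_id", "order_total"}
    missing = required_columns - set(df.columns)
    if missing:
        failures.append(f"Missing columns: {sorted(missing)}")

    if df.empty:
        failures.append("Dataset is empty")

    # Key columns should never contain nulls.
    for col in ("order_id", "order_date", "customer_id"):
        if col in df.columns and df[col].isna().any():
            failures.append(f"Null values found in required column '{col}'")

    # The primary key should be unique.
    if "order_id" in df.columns and df["order_id"].duplicated().any():
        failures.append("Duplicate order_id values found")

    return failures

if __name__ == "__main__":
    # Hypothetical sample extract of the built dataset.
    sample = pd.read_parquet("customer_orders_daily_sample.parquet")
    problems = validate_dataset(sample)
    print("PASS" if not problems else "\n".join(problems))
```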

Following the successful validation of the underlying dataset, the Tableau workbook will be validated. As with the dataset validation, the Data Engineer who created the workbook will perform an initial validation.  After satisfactory validation, the Data Engineer will then place the workbook into QA and upload to Tableau Server. An additional Data Engineer will conduct further validation. Business stakeholders will again conduct UAT of all validated artifacts.

Any discrepancies found during validation will be immediately addressed by moving back to the Development phase to correct the issues.

Deployment

When moving into the deployment phase, it is expected that both the underlying dataset and the Tableau workbook have been successfully validated using real data.

The Data Engineer will package the workflow automation solution. For example, using an analytics workbench technology such as Dataiku’s Data Science Studio (DSS), the Data Engineer would package (or bundle) the DSS project in the Design node. They will then import that bundle into the Automation node, which is hosted on a separate instance. After the project has been imported, the Data Engineer will run the workflow to build the dataset and schedule any subsequent automated builds. The workflow should be monitored for a few days after initialization in the Automation node to ensure proper function.
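
As a rough sketch of how this hand-off could be scripted, the snippet below uses Dataiku’s dataikuapi Python client to export a bundle from the Design node and then import and activate it on the Automation node. The hosts, API keys, project key, and bundle id are placeholders, and the method names should be verified against the API documentation for the DSS version in use.

```python
# Hedged sketch: move a DSS project bundle from the Design node to the
# Automation node with the dataikuapi client. All hosts, keys, and ids are
# placeholders; verify method names against your DSS version's API docs.
from dataikuapi import DSSClient

DESIGN_HOST = "https://dss-design.example.com:11200"          # placeholder
AUTOMATION_HOST = "https://dss-automation.example.com:11200"  # placeholder
PROJECT_KEY = "CUSTOMER_ORDERS"                               # placeholder
BUNDLE_ID = "v1"                                              # placeholder

# 1. Export the bundle from the Design node and download the archive.
design = DSSClient(DESIGN_HOST, "design-node-api-key")
design_project = design.get_project(PROJECT_KEY)
design_project.export_bundle(BUNDLE_ID)
design_project.download_exported_bundle_archive_to_file(BUNDLE_ID, "/tmp/bundle.zip")

# 2. Import and activate the bundle on the Automation node.
automation = DSSClient(AUTOMATION_HOST, "automation-node-api-key")
automation_project = automation.get_project(PROJECT_KEY)
automation_project.import_bundle_from_archive("/tmp/bundle.zip")
automation_project.activate_bundle(BUNDLE_ID)
```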

To deploy the Tableau workbook to production, the Data Engineer will download the current workbook from Tableau Server into Tableau Desktop. They will then switch the data source to the production dataset. After verifying that the data source is populating correctly, the Data Engineer will publish the workbook to Tableau Server.
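
The publish step itself can also be scripted rather than done through Tableau Desktop. The sketch below uses the tableauserverclient Python library to publish a workbook file to a Tableau Server project; the server URL, credentials, project id, and file path are all placeholders.

```python
# Hedged sketch: publish a workbook to Tableau Server with the
# tableauserverclient library. Server URL, credentials, project id,
# and the .twbx path are placeholders.
import tableauserverclient as TSC

tableau_auth = TSC.TableauAuth("publisher_user", "password", site_id="analytics")
server = TSC.Server("https://tableau.example.com")

with server.auth.sign_in(tableau_auth):
    workbook = TSC.WorkbookItem(project_id="production-project-id")
    workbook = server.workbooks.publish(
        workbook,
        "/path/to/customer_orders_daily.twbx",
        TSC.Server.PublishMode.Overwrite,  # replace the existing workbook
    )
    print(f"Published workbook id: {workbook.id}")
```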

Final Validation

The final validation follows the same format as the initial validation. The Data Engineer will perform validation on the production dataset and then place it into QA for another Data Engineer to validate. Last, business stakeholders will conduct UAT of the validated artifacts. If the dataset passes both sets of validation, it can then be considered ready for release.

Following the successful validation of the production dataset, the Data Engineer will perform validation on the Tableau workbook.  The Data Engineer will then place the workbook into QA for further validation by an additional Data Engineer. Last, business stakeholders will conduct UAT of validated artifacts.  After successful validation, the workbook will be considered ready for release.

If either the dataset or the Tableau workbook fails validation, the Data Engineer will return to the Development phase to resolve any issues.

Release and Monitor

Following the successful validation of the production dataset and the production workbook, the Data Engineer will inform the business stakeholders of the published workbook through a defined release process. The Data Engineer will follow an established process for ‘publishing’ the final product, including migrating the dataset into a dedicated zone. If necessary, the Data Engineer, or other groups, will adjust permissions to allow access to this new workbook. Once the dataset and Tableau dashboard are released, ongoing monitoring will be performed. Areas that should be monitored include, but are not limited to, the following (a minimal monitoring sketch follows the list):

  • Performance of the workbook
  • Performance of the underlying DSS workflow
  • Number of users utilizing the workbook
  • Issues related to the data sources and ingest
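
As one hedged example of what this monitoring can look like in AWS, the snippet below pushes a couple of custom metrics (row count and build duration) to Amazon CloudWatch after each dataset build, where alarms and dashboards can then be configured. The namespace and metric names are illustrative assumptions.

```python
# Hedged sketch: publish custom build metrics to Amazon CloudWatch with boto3
# so alarms and dashboards can track the pipeline. The namespace and metric
# names are illustrative assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def report_build_metrics(row_count: int, duration_seconds: float) -> None:
    cloudwatch.put_metric_data(
        Namespace="AnalyticsDevOps/CustomerOrdersDaily",  # hypothetical namespace
        MetricData=[
            {"MetricName": "RowCount", "Value": float(row_count), "Unit": "Count"},
            {"MetricName": "BuildDurationSeconds", "Value": duration_seconds, "Unit": "Seconds"},
        ],
    )

# Example usage after a successful workflow run:
report_build_metrics(row_count=5_200_000, duration_seconds=842.0)
```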

Following a reasonable amount of time, the Data Engineer will engage business stakeholders for feedback. After receiving the feedback, the Data Engineer will return to the Discovery phase to further refine the solution and enrich the user experience.

Throughout this entire process, documentation templates should be pre-defined, populated, and stored in a central repository. Many of our projects leverage Atlassian Confluence for this task, and we have pre-built a number of documentation templates, including dataset contracts to support collaboration between Data Engineers and business stakeholders, dashboard mockups/requirements, and data pipeline/workflow documentation.

Summary

As analytic groups begin to scale and data & analytics governance becomes a concern, implementing a formal Analytics DevOps process can ensure consistency across analytics projects and support a diverse, distributed team of analytics professionals and business stakeholders.

Read Part 2 – Analytic DevOps in AWS