This document introduces basic notions related to data science projects. It should be useful for analytics leaders, data analysts, consultants, and data engineers. We attempt to present the connections between the phases that typically comprise a data science project, as well as the main ideas in each of them. Although many of the facets covered appear in the Cross-Industry Standard Process for Data Mining (CRISP-DM), the emphasis here is on the communication channel between the data science team and the client or business stakeholder. What should data science consultants, data analysts, and engineers focus on while communicating and staying in sync with the client or business stakeholder in data science-oriented projects?
We have divided this discussion into three parts: Data, Data Engineering, and Model Development. This first part focuses on data.
Although “big data” is the term most often encountered nowadays in data science-related projects, data can come in various sizes. Big data are characterized mainly by:
- Volume: the data set size is on the order of gigabytes or more, depending on the domain under consideration
- Variety: data sets may originate from disparate sources, differ in nature (text, images, waveforms, etc.), and may be structured, semi-structured, or unstructured
- Velocity: how often new data are acquired (sampling could occur every millisecond, hour, day, week, month, year, etc.)
However, after filtering, sub-setting, sampling, and aggregation have taken place on a big data set, it is quite probable that the final data size will be considerably smaller.
Two complementary schools of thought exist when it comes to data size. The first favors sampling and statistical treatment of the metrics of interest, whereas the second resorts to brute-force processing of big data to extract metrics over the entire body of data available. As mentioned, these approaches complement rather than compete with each other. The context and constraints of the project dictate the best approach, and a mixture of both often works best (for example, compute descriptive statistics on the entire data set, but run neural network configuration and hyperparameter searches on samples).
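The mixed approach can be sketched in a few lines of Python; the data set and sample size below are hypothetical stand-ins for a much larger, more expensive workload:

```python
import random
import statistics

random.seed(42)

# Stand-in for a "big" data set: one million simulated measurements
full_data = [random.gauss(100, 15) for _ in range(1_000_000)]

# Cheap, exact descriptives are computed over everything
overall_mean = statistics.fmean(full_data)
overall_stdev = statistics.stdev(full_data)

# Expensive work (model fitting, hyperparameter search, ...) runs on a sample
sample = random.sample(full_data, k=10_000)
sample_mean = statistics.fmean(sample)

print(f"full mean={overall_mean:.2f} stdev={overall_stdev:.2f}")
print(f"sample mean={sample_mean:.2f}")
```

In a real project the descriptives would come from a query against the full store, while the sample feeds the model-search loop.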
Data time reference
As far as the time scale of data is concerned, it is usually the case that data are obtained as:
- Real-time/streaming: for example, as is the case with Internet of Things (IoT) data sets
- Historical: data stored in connected or disconnected repositories
Clarifying the nature of datasets
The following questions need to be addressed in consultation with the client or business stakeholder:
- What is actually there (i.e. what is the nature of the data in reference)?
- What inconsistencies in the data is the client or business stakeholder aware of (e.g., changes in data collection methods over time, different data formats in use, non-continuous data streams)?
- How is the data stored and what are the data governance rules in place regarding accessing and sharing?
- Is the data:
- Structured (data residing in relational databases, etc)?
- Semi-structured (XML, JSON, etc)?
- Unstructured (text, videos, photos, audio files, etc)?
- Some combination of the above?
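As a small illustration of how these categories interact, semi-structured data (here, a hypothetical JSON record) can often be flattened into the structured, tabular form most analysis tools expect:

```python
import json

# Hypothetical semi-structured record, e.g. from an API or log stream
raw = '{"customer": {"id": 17, "name": "Acme"}, "calls": [3, 5, 2]}'
record = json.loads(raw)

# Flatten the nested structure into a single tabular row
row = {
    "customer_id": record["customer"]["id"],
    "customer_name": record["customer"]["name"],
    "total_calls": sum(record["calls"]),
}
print(row)
```

Knowing up front how much of the data needs this kind of flattening (or heavier processing, for text and media) directly shapes the engineering effort discussed in Part II.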
This feedback is to be re-evaluated and confirmed once the data analysis starts.
Data to consider
Once the business questions have been formulated, the subject of what are the preferred data sets to use emerges. In consultation with the client or business stakeholder, the following need to be determined as far as data sets to consider:
- Comprehensive option: include all the data pertaining to the questions to be addressed, so that we have a holistic view of the data
- Short of the comprehensive option, we can determine, in consultation with the client, the best sample or subset of data to use
- In any case, personally identifiable information (PII) and business-sensitive information should be removed, if possible, before data reaches the analysis phase, and certainly before data is shared beyond protection constraints
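One common, minimal approach is to drop or pseudonymize PII fields before data leaves the source system. The sketch below (with hypothetical field names) replaces PII values with salted SHA-256 digests while leaving analytic fields intact:

```python
import hashlib

PII_FIELDS = {"email", "phone"}  # hypothetical PII columns

def pseudonymize(record: dict, salt: str = "project-salt") -> dict:
    """Replace PII values with salted SHA-256 digests; keep other fields."""
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            out[key] = hashlib.sha256((salt + str(value)).encode()).hexdigest()
        else:
            out[key] = value
    return out

record = {"email": "jane@example.com", "phone": "555-0100", "call_volume": 42}
clean = pseudonymize(record)
print(clean["call_volume"], clean["email"][:8])
```

Note that hashing alone is not a complete anonymization scheme; the client's data governance rules (mentioned above) should dictate what treatment is actually required.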
- Equally important: are there data establishing some sort of baseline pertaining to the business problem(s) we are to address? For example, if the client or business stakeholder currently uses some internal forecasting method for predicting the call volume expected 3 days ahead of a given date, it would be very useful to have this information and establish a forecasting baseline against which to compare our models.
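To make the baseline idea concrete, here is a sketch comparing a naive "same value 3 days ago" forecast against observed daily call volumes, scored with mean absolute error (the numbers are made up):

```python
# Hypothetical daily call volumes (made-up numbers)
actual = [120, 135, 128, 140, 150, 145, 160, 155, 148, 162]

# Naive baseline: predict the value observed 3 days earlier
horizon = 3
predicted = actual[:-horizon]  # forecasts for days 3..9
observed = actual[horizon:]    # the matching actuals

mae = sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)
print(f"baseline MAE: {mae:.1f} calls")
```

Any candidate model should beat this number (or the client's existing method, where one exists) before it is worth putting into production.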
At the end of the day, access to the data set(s) agreed upon will be needed. Typically, this can be achieved through:
- Simple file transfer
- VPN access granted by the client or business stakeholder to data analysts/engineers to a data repository
- Client laptop with the data sets loaded on it
- Data dump to the cloud (e.g., AWS S3)
- In some cases, depending on data volume and time constraints, it may be preferable that data are physically transferred on a medium from client premises to the location the analysis is going to take place.
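Whichever transfer mechanism is used, it is worth verifying data integrity on arrival. A simple sketch using streamed SHA-256 checksums (the demo file here is a hypothetical stand-in for a transferred data set):

```python
import hashlib
import pathlib

def file_sha256(path: pathlib.Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in chunks so large transfers fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a small temporary file standing in for a transferred data set
demo = pathlib.Path("transfer_demo.bin")
demo.write_bytes(b"sample data set contents")
checksum = file_sha256(demo)
print(checksum[:12])
demo.unlink()
```

Comparing the checksum computed at the source against the one computed on arrival catches truncated or corrupted transfers before any analysis time is spent on them.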
Data plays a critical role in the success of any data science project. Accessing, understanding, and evaluating the data early in the process mitigates many of the risks involved in turning analytic insights into action.
In the next blog post, Navigating Data Science Projects Part II – Data Engineering, we will discuss data engineering efforts conducted during a typical data science project.