The goal of the ingestion process is to take raw, compressed, and/or encrypted files and output plain text files that are ready to be read into Hive, Athena, and similar tools. The process should handle files of any size or format with little or no extra configuration. Wherever possible it should use serverless processes, and it should limit the number of perpetual instances by relying on managed services when they are needed. This reduces costs and the time needed to administer and maintain resources.
Using AWS Batch and Lambda
The ingestion process starts at the creation of the object in the S3 landing directory and ends with the decompressed, decrypted file. It does not ingest data into any database or do any further transformation of the data. Decoupling file preparation from the actual ingestion into a database makes the solution more scalable and modular: multiple platforms can read and process the output files rather than being constrained to a single solution.
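As a rough illustration of the "compressed in, plain text out" step, the sketch below streams a gzip-compressed object into a plain output stream in fixed-size chunks, so memory use does not grow with file size. It assumes gzip compression only; decryption and the actual S3 reads and writes are left out, and the function name `decompress_gzip_stream` is hypothetical.

```python
import gzip
import io
import shutil


def decompress_gzip_stream(src, dst, chunk_size=1024 * 1024):
    """Copy gzip-compressed bytes from src to dst as plain bytes, chunk by chunk.

    src and dst are binary file-like objects. In a real pipeline they might be
    an S3 download stream and an upload buffer (an assumption, not shown here).
    """
    with gzip.GzipFile(fileobj=src, mode="rb") as unzipped:
        shutil.copyfileobj(unzipped, dst, chunk_size)


if __name__ == "__main__":
    # Simulate a small compressed file arriving in the landing directory.
    raw = b"id,name\n1,alpha\n2,beta\n"
    landed = io.BytesIO(gzip.compress(raw))
    output = io.BytesIO()
    decompress_gzip_stream(landed, output)
    print(output.getvalue().decode())
```

Because the copy is chunked, the same function works for a few kilobytes or many gigabytes; only the storage behind `src` and `dst` needs to change.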
Key Differences: Azure vs. AWS
Most legacy ingest processes used in the Azure environment have constraints that limit efficiency and scalability. Below are the main limitations of these environments and how the new AWS processes address them.
One of the main constraints of the legacy ingest was where the files were processed. Often, all files were ingested on the same instance that ran the data science workflows. This placed unnecessary load on the instance and, in some cases, limited the invocation of workflows because of concurrent job limits.
In AWS, this was addressed by using Lambda and Batch instead of the DSS instance itself. The CPU- and memory-intensive work of extracting and decrypting the files has been completely offloaded from the DSS machine. Lambda allows massively concurrent processing of smaller files at a cost of fractions of a penny per file. Batch handles larger files, which need more storage capacity and memory than Lambda offers.
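One way to picture the split between Lambda and Batch is a small router that reads the object size from the S3 event notification and picks a target. The sketch below is a minimal illustration, not the implementation described here: the 400 MB threshold and the function names are hypothetical, and the real cutoff depends on the memory and temporary storage configured for the Lambda function.

```python
from urllib.parse import unquote_plus

# Hypothetical cutoff: objects larger than this are sent to Batch.
SIZE_THRESHOLD_BYTES = 400 * 1024 * 1024


def parse_s3_record(record):
    """Extract bucket, key, and size from one S3 event notification record."""
    s3 = record["s3"]
    return {
        "bucket": s3["bucket"]["name"],
        # Object keys arrive URL-encoded in S3 event notifications.
        "key": unquote_plus(s3["object"]["key"]),
        "size": s3["object"].get("size", 0),
    }


def choose_target(size_bytes, threshold=SIZE_THRESHOLD_BYTES):
    """Return 'batch' for large objects and 'lambda' for everything else."""
    return "batch" if size_bytes > threshold else "lambda"


def route_event(event, threshold=SIZE_THRESHOLD_BYTES):
    """Annotate every object in an S3 event with its processing target."""
    routed = []
    for record in event.get("Records", []):
        obj = parse_s3_record(record)
        obj["target"] = choose_target(obj["size"], threshold)
        routed.append(obj)
    return routed


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "landing-bucket"},
                    "object": {"key": "landing/small+file.csv.gz", "size": 2048}}},
            {"s3": {"bucket": {"name": "landing-bucket"},
                    "object": {"key": "landing/huge.tar.gz", "size": 5 * 1024 ** 3}}},
        ]
    }
    for item in route_event(sample_event):
        print(item["key"], "->", item["target"])
```

In a deployed version, the `lambda` branch would process the file in place and the `batch` branch would submit a job to a queue; both calls are omitted here to keep the sketch free of AWS dependencies.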
Duplication of Data and Processing
In Azure, there was no shared storage that could be accessed by both environments. Therefore, the data transfer, processing, and loading had to occur in both environments, even though the exact same data was being used. This doubled the processing capacity needed and introduced additional points of failure into the ingest.
Amazon S3 gives us the ability to share data across all of our environments. Processing of a source file now only has to occur once. This will reduce costs, limit the amount of oversight that is needed for monitoring and maintenance, and increase reliability by removing breakpoints and simplifying the overall process.
Because a perpetual instance ingested all files in Azure, we were limited in the number of files we could process at one time. In that situation, the only way to gain more concurrency was to scale vertically by increasing the size of the instance and using all cores on the machine. This approach is both expensive and wasteful, especially when the extra capacity is needed only during certain times of the day.
Using Lambda and Batch, we can run massively parallel operations, processing thousands of files at any one time. Lambda’s concurrency is limited only by the account’s total concurrent Lambda function limit, which AWS can raise on request given a strong use case. With Batch, the only limitation is the maximum number of vCPUs that you specify for each compute environment. Therefore, whether processing 10 files or 100,000 files, no additional capacity or modifications are needed.
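The two limits above translate into simple arithmetic. The sketch below estimates how many Batch jobs can run at once for a given vCPU ceiling and how many "waves" of work a backlog would take. All numbers are hypothetical, and the result is an upper bound only, since memory limits and instance packing can lower the real figure.

```python
import math


def max_concurrent_batch_jobs(max_vcpus, vcpus_per_job):
    """Upper bound on simultaneous Batch jobs for one compute environment."""
    if vcpus_per_job <= 0:
        raise ValueError("vcpus_per_job must be positive")
    return max_vcpus // vcpus_per_job


def waves_needed(file_count, concurrent_slots):
    """Number of full passes needed if each slot handles one file at a time."""
    if concurrent_slots <= 0:
        raise ValueError("concurrent_slots must be positive")
    return math.ceil(file_count / concurrent_slots)


if __name__ == "__main__":
    # Hypothetical settings: 256 max vCPUs, 2 vCPUs per ingest job.
    slots = max_concurrent_batch_jobs(max_vcpus=256, vcpus_per_job=2)
    print("Concurrent Batch jobs:", slots)
    print("Waves for 10 files:", waves_needed(10, slots))
    print("Waves for 100,000 files:", waves_needed(100_000, slots))
```

The configuration stays the same for both workloads; only the number of waves, and therefore the elapsed time, changes with the backlog.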
Monitoring and Management
There was very little monitoring and error notification in Azure. In most cases, we discovered issues only when an end user notified us that data was missing. By that time, fixing the issue was both time-consuming and difficult. There was also no easy way to track incoming files and determine whether they were successfully received and processed.
File auditing and error notification are important considerations in a new architecture. Being able to debug file ingest errors easily saves both time and money and improves end-user confidence in the system. All files should be tracked, and errors on ingest should be reported to the team immediately.
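A minimal sketch of such tracking follows. It keeps one audit record per file and calls a notifier as soon as a file fails. The in-memory dictionary and the `notify` callback are stand-ins chosen for illustration; a real deployment would likely use a durable table and a messaging service, neither of which is specified here, and all class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now():
    return datetime.now(timezone.utc).isoformat()


@dataclass
class FileAudit:
    key: str
    status: str = "RECEIVED"
    error: str = ""
    updated_at: str = field(default_factory=_now)


class IngestAuditLog:
    """Tracks every file from arrival to outcome and reports failures at once."""

    def __init__(self, notify):
        self._records = {}
        self._notify = notify  # any callable that accepts a message string

    def received(self, key):
        self._records[key] = FileAudit(key)

    def processed(self, key):
        self._update(key, "PROCESSED")

    def failed(self, key, error):
        self._update(key, "FAILED", error)
        self._notify(f"Ingest failed for {key}: {error}")

    def status(self, key):
        return self._records[key].status

    def pending(self):
        """Files that arrived but have no recorded outcome yet."""
        return [k for k, r in self._records.items() if r.status == "RECEIVED"]

    def _update(self, key, status, error=""):
        record = self._records.setdefault(key, FileAudit(key))
        record.status = status
        record.error = error
        record.updated_at = _now()


if __name__ == "__main__":
    log = IngestAuditLog(notify=print)
    log.received("landing/a.csv.gz")
    log.received("landing/b.csv.gz")
    log.processed("landing/a.csv.gz")
    log.failed("landing/b.csv.gz", "bad gzip header")
    print("Still pending:", log.pending())
```

Listing the pending files answers the question the legacy setup could not: which files were received but never finished processing.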