AWS multi-account data platform monitoring
Data platform monitoring in AWS means at least three separate things:
- AWS component-level monitoring.
- Data flow monitoring with impact analysis.
- Data quality monitoring with impact analysis.
At my current company, VR, we decided to build our data platform with a multi-account approach: dozens of accounts to monitor, with data pipelines that span multiple accounts. A typical pipeline looks like the following:
A Lambda function (account 1) fetches data and saves it to S3 (account 2). From there, an SNS trigger notifies another Lambda (account 3) to fetch the new file and process it. CloudWatch is used for monitoring within the accounts.
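To make the hand-off concrete, here is a minimal sketch of the account 3 Lambda, assuming the standard S3-event-wrapped-in-SNS payload; the bucket access and the processing step are hypothetical placeholders:

```python
import json
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # The SNS message body carries the original S3 "object created" event.
    for record in event["Records"]:
        s3_event = json.loads(record["Sns"]["Message"])
        for s3_record in s3_event["Records"]:
            bucket = s3_record["s3"]["bucket"]["name"]        # bucket in account 2
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            obj = s3.get_object(Bucket=bucket, Key=key)       # needs cross-account read access
            process(obj["Body"].read())                       # "do something to it"

def process(payload: bytes) -> None:
    ...  # the actual processing logic lives here
```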
This means that data needs to be monitored within each account, and the lineage between accounts needs to be monitored as well.
Lineage is the key here: to understand what a component affects, we need to fully understand the data pipelines across our accounts. We also need to correlate alarms: we want to eliminate duplicates, for instance when data is missing in S3 and we already know that the Lambda that delivers the data is throwing exceptions.
Sounds quite straightforward? Think a bit deeper and the problems start to appear:
- Integrating new accounts should be at least semi-automatic.
- Alarms should be defined in one place, and monitoring should be set up automatically for each new component. For example, all our Kinesis pipelines should have the same alarms, and every new Kinesis pipeline should automatically be monitored by the same rules.
- How to monitor data lineage? → You need master data that captures the lineage, and the tool must support enriching alarms on top of it.
- What if a Kinesis pipeline normally has irregular breaks in its data flow? → You need a dynamic threshold per pipeline to monitor the flow correctly.
- What if data arrives at intervals longer than 24 hours? Plain CloudWatch metric alarms cannot evaluate over a window that long.
- Where to define data quality rules? Are there even tools that support checking data in S3, for example?
- We want to know whether we meet our SLAs → the tool should keep the alarm state as NOK for as long as data is not coming in.
- What if the business wants dashboards on top of monitoring?
- We should also be able to monitor the EDW as part of the data pipeline.
- Terraform should be supported as a deployment tool for new alarms & dashboards.
- ... and of course, the tool should support multiple alarm targets (Slack, automation, dashboards, ticketing tools...).
First, we compared three different commercial tools that focus on monitoring cloud solutions and promise to handle multi-account monitoring.
All of them can monitor multiple accounts and promise easy integration of new accounts. All support dashboards and multiple alarm targets. Some focus more on log analysis, but all support CloudWatch metrics as well. Finally, all of them promise some AI/ML for dynamic thresholds.
→ On paper, all should meet at least most of the requirements.
But we found multiple problems:
- The main problem is lineage itself: the tools don't support enrichment at the level we need (explained in more detail below).
- Data quality monitoring has to be built outside these solutions, which means additional coding anyway: what if we want to monitor event content quality within Kinesis in real time? Or catch empty files arriving in S3?
- The dynamic threshold / AI features had problems: some didn't fully support Terraform, and others were clearly still in beta.
- Keeping the alarm state when data is not coming in (for example, data arriving irregularly in S3 with a Lambda triggered by SNS) seemed to be a problem in many solutions...
So we decided to build the monitoring ourselves from scratch, with the option to plug in commercial tools later on.
The first thing to solve was the metadata: how do we keep lineage metadata automatically up to date across all of our accounts?
We rely heavily on Terraform modules, and Terraform of course knows a lot about permissions and the actual components. So we decided to create a monitoring account with enough access to fetch every account's Terraform tfstate. From there it was quite easy to build lineage across the accounts (Kinesis, Kinesis Firehose, S3, Lambda, ECS...).
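A minimal sketch of the tfstate crawl, in Python. The state bucket/key layout and the set of monitored types are assumptions; the dependency walk relies on the `dependencies` lists that newer Terraform state versions record per resource instance (module-prefixed dependencies are simplified away here):

```python
import json

import boto3

# Component types we include in the lineage graph.
MONITORED_TYPES = {
    "aws_kinesis_stream",
    "aws_kinesis_firehose_delivery_stream",
    "aws_s3_bucket",
    "aws_lambda_function",
    "aws_ecs_service",
}

def load_tfstate(bucket: str, key: str) -> dict:
    """Fetch one account's tfstate from its state bucket (names are hypothetical)."""
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)

def lineage_edges(state: dict, account: str):
    """Yield (upstream, downstream) edges between monitored components."""
    for resource in state.get("resources", []):
        if resource.get("type") not in MONITORED_TYPES:
            continue
        node = f"{account}:{resource['type']}.{resource['name']}"
        for instance in resource.get("instances", []):
            # Terraform records what each instance depends on,
            # e.g. "aws_s3_bucket.raw_data".
            for dep in instance.get("dependencies", []):
                if dep.split(".")[0] in MONITORED_TYPES:
                    yield f"{account}:{dep}", node
```

Running something like this over every account's state and storing the edges in a table gives a lineage graph that stays current, as long as Terraform remains the source of truth.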
Okay, now we have lineage. The next thing is to build a monitoring solution that meets the requirements. We built a solution with the following main components:
1. Alarm producers
- Actual alarms are raised by CloudWatch alarms within individual accounts or by commercial tools.
- AWS component-level alarms are created in Terraform modules, which means they are standardized across our accounts. Alarms are sent to SNS topics in the monitoring account (see the sketch after this list).
- EDW exports its own alarms.
- Data quality alarms are custom-made.
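The real alarm definitions live in our Terraform modules; purely as an illustration, here is the same standardized shape expressed with boto3 for a Kinesis stream (the topic ARN, names, and thresholds are hypothetical). Note `TreatMissingData="breaching"`, which keeps the alarm in the ALARM (NOK) state while no data arrives, matching the SLA requirement above:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# SNS topic in the monitoring account (hypothetical ARN).
MONITORING_TOPIC = "arn:aws:sns:eu-west-1:111111111111:central-alarms"

def create_kinesis_flow_alarm(stream_name: str) -> None:
    """Standard data-flow alarm applied to every Kinesis stream."""
    cloudwatch.put_metric_alarm(
        AlarmName=f"{stream_name}-incoming-records",
        Namespace="AWS/Kinesis",
        MetricName="IncomingRecords",
        Dimensions=[{"Name": "StreamName", "Value": stream_name}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=3,
        Threshold=1,
        ComparisonOperator="LessThanThreshold",
        TreatMissingData="breaching",     # no data at all is also a breach
        AlarmActions=[MONITORING_TOPIC],  # route the alarm to the monitoring account
        OKActions=[MONITORING_TOPIC],     # send the acknowledge as well
    )
```

Because the same module is applied to every stream, every new Kinesis pipeline automatically gets monitored by the same rules.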
2. Common alarm processor (Lambda)
- Receives alarms from different sources.
- Enriches alarms with metadata to know what the alarm affects.
- Correlates alarms by lineage.
- Saves the state of alarms.
- Sends alarms/acknowledgements to the different alarm consumers (a condensed sketch of the processor follows below).
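A condensed sketch of the processor, assuming hypothetical `lineage` and `alarm-state` DynamoDB tables; the real module does more (SLA bookkeeping, multiple consumer targets), but the core flow looks roughly like this:

```python
import json

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
lineage = dynamodb.Table("lineage")          # edges: component -> upstream components
alarm_state = dynamodb.Table("alarm-state")  # latest state per component
sns = boto3.client("sns")

CONSUMER_TOPIC = "arn:aws:sns:eu-west-1:111111111111:alarm-consumers"  # hypothetical

def handler(event, context):
    for record in event["Records"]:
        alarm = json.loads(record["Sns"]["Message"])  # CloudWatch alarm payload
        component = alarm["AlarmName"]
        state = alarm["NewStateValue"]                # "ALARM" or "OK"

        # Correlate: if an upstream component is already in ALARM, this alarm is
        # a duplicate (e.g. missing S3 data caused by a failing Lambda).
        upstream = lineage.query(KeyConditionExpression=Key("component").eq(component))
        duplicate = any(
            alarm_state.get_item(Key={"component": edge["upstream"]})
            .get("Item", {}).get("state") == "ALARM"
            for edge in upstream["Items"]
        )

        # Save the state so dashboards and SLA reporting see NOK until OK arrives.
        alarm_state.put_item(Item={"component": component, "state": state})

        if state == "OK" or not duplicate:
            sns.publish(TopicArn=CONSUMER_TOPIC, Message=json.dumps(
                {"component": component, "state": state, "suppressed": duplicate}
            ))
```

The `suppressed` flag lets consumers mark correlated alarms as duplicates instead of paging twice for the same root cause.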
3. State and dashboards for real-time monitoring (DynamoDB + custom UI)
- State of alarms is kept in DynamoDB.
- UI fetches data from DynamoDB and shows the real-time status in a dashboard.
- SLA monitoring is done with BI tools, based on data fetched from DynamoDB.
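The read side is then simple; a minimal sketch, assuming the same hypothetical `alarm-state` table (pagination left out for brevity):

```python
import boto3

def current_status() -> dict:
    """Group monitored components by their latest alarm state for the dashboard."""
    table = boto3.resource("dynamodb").Table("alarm-state")
    status: dict[str, list[str]] = {}
    for item in table.scan()["Items"]:  # small table; pagination omitted
        status.setdefault(item["state"], []).append(item["component"])
    return status
```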
Conclusion
After six months, it seems we made a good decision:
- Lineage/enrichment is still the biggest gap in commercial tools, but of course we keep evaluating them for suitable lineage support.
- Commercial tools still have most of the same problems even beyond lineage. If a good solution appears, for example for dynamic threshold monitoring, we can integrate its alarms into our enrichment module as just another alarm producer.
- The custom solution adds flexibility: if we need, for example, monthly-level monitoring of some S3 bucket/prefix, or a new kind of data quality monitoring, we can just build it.