Grafana is an open-source platform used for monitoring, visualization, and alerting on metrics and log data. It allows users to create interactive and customizable dashboards that integrate data from various sources, such as Prometheus, InfluxDB, Elasticsearch, and many others. Grafana is widely recognized for its powerful visualization capabilities, enabling teams to gain real-time insights into the performance and health of their systems.

Why Coupling Grafana with Prometheus is a Powerful Choice


When building a robust monitoring and alerting system, coupling Grafana with Prometheus is a popular and powerful choice. These two open-source tools work seamlessly together to provide comprehensive insights into your infrastructure and applications.

How Grafana and Prometheus Work Together


Grafana’s primary strength lies in its ability to create customizable and interactive dashboards, which, when connected to Prometheus, can visualize the metrics collected by Prometheus. Grafana can query Prometheus to display time-series data in various formats, including graphs, heatmaps, and tables, making it easier to understand trends, spot anomalies, and diagnose issues in real time.

Prometheus also provides alerting: you define alerting rules based on conditions in your metrics data, and fired alerts are routed to Alertmanager, a companion component that handles grouping, deduplication, and notification. When combined with Grafana, you can not only visualize these alerts but also manage and refine them directly through Grafana’s interface, creating a centralized monitoring and alerting system.

PromQL, the query language used by Prometheus, supports complex queries on the stored time-series data. Grafana leverages this by providing an intuitive interface where users can build and test PromQL queries, making it easy to create and adjust dashboards on the fly. This flexibility is essential for tailoring monitoring to specific needs.

Both Grafana and Prometheus are highly scalable and can be configured to monitor large, distributed systems. Prometheus can scrape metrics from thousands of targets, and Grafana can aggregate data from multiple Prometheus instances, enabling large-scale monitoring across different environments.

Grafana’s plugin ecosystem further extends its functionality, allowing you to pull in data from other sources alongside Prometheus metrics. This can be particularly useful if you want to correlate metrics from Prometheus with logs from Loki or traces from Jaeger, all within the same Grafana dashboard.

Deploying Prometheus and Grafana in Kubernetes


However, even with all these benefits, deploying monitoring tools like Prometheus and Grafana in Kubernetes environments can be challenging due to the complexity of configuration and integration.

Helm: Simplifying Deployment


Helm is a package manager for Kubernetes that enables users to define, install, and manage Kubernetes applications. Helm charts are pre-configured templates that encapsulate all the necessary Kubernetes manifests (like deployments, services, and config maps) needed to run an application. This makes it easy to deploy complex applications with a single command, ensuring consistency and reproducibility.

Benefits of Deploying Prometheus and Grafana with Helm Charts


  • Simplified Deployment Process: Helm charts for Prometheus and Grafana are widely available and maintained by the community. These charts come with default configurations that work out-of-the-box, making the deployment process as simple as running a single Helm command. This eliminates the need to manually write and configure Kubernetes manifests, saving time and reducing errors.
  • Consistent and Repeatable Deployments: Helm charts encapsulate the configuration required to deploy Prometheus and Grafana, ensuring that deployments are consistent across different environments (e.g., development, staging, production). This consistency is crucial for maintaining stability and reliability in your monitoring setup.
  • Customization Through Values Files: Helm allows users to override the default configuration provided by the charts using values files. This means you can easily customize aspects of your Prometheus and Grafana deployments, such as resource limits, replica counts, data persistence, and alerting rules, without having to modify the underlying charts. You simply provide your custom configurations in a `values.yaml` file.
  • Easy Upgrades and Rollbacks: Helm simplifies the process of upgrading or rolling back Prometheus and Grafana deployments. When a new version of Prometheus or Grafana is released, you can update your deployment with a simple Helm upgrade command. If anything goes wrong, Helm’s rollback feature allows you to revert to a previous version effortlessly.
  • Seamless Integration with Kubernetes: Helm charts for Prometheus and Grafana are designed to integrate seamlessly with Kubernetes. For example, Prometheus can automatically discover services within your Kubernetes cluster that expose metrics, thanks to Kubernetes service discovery. Grafana, on the other hand, can be configured to automatically pick up data sources and dashboards that are deployed alongside it.
  • Community Support and Best Practices: The Prometheus and Grafana Helm charts are maintained by active open-source communities, meaning they are regularly updated to incorporate the latest features, best practices, and security patches. This ensures that you are always deploying a well-tested and secure monitoring stack.

Deploying Prometheus and Grafana with Helm


To deploy Prometheus and Grafana using Helm, you typically follow these steps:

Add the Helm Repository:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

Install Prometheus:

helm install prometheus prometheus-community/kube-prometheus-stack

This command deploys a full Prometheus monitoring stack, including the Prometheus server, Alertmanager, node exporters, and the Prometheus Operator. (The kube-prometheus-stack chart can also bundle Grafana; you can disable that component if you prefer to install Grafana separately, as shown below.)

Install Grafana:

helm install grafana grafana/grafana


By default, this will deploy Grafana with a basic configuration. You can customize your installation using a `values.yaml` file if needed.

In fast-moving DevOps and FinOps environments, it’s easy to lose track of stopped or non-compliant cloud resources. While Lambda or cron jobs can detect these resources, what matters just as much is where those results go. I don’t want alerts buried in email or tucked away in an S3 bucket—I want actionable messages delivered straight to my team’s Microsoft Teams channel.

This post focuses specifically on using Python to craft and send structured messages to Microsoft Teams using an Incoming Webhook.

Why Use Microsoft Teams Webhooks?

When integrated properly, Microsoft Teams becomes a lightweight dashboard for DevOps alerts. By posting messages through Teams webhooks, I can:

  • Provide quick visibility into weekly scans

  • Include links to S3-hosted reports

  • Highlight potential cost savings

  • Enable real-time discussion before any action is taken

Constructing the Message Card in Python

To get Teams to accept our message, we need to follow the schema for MessageCards. The Python script constructs the card as a dictionary and encodes it as JSON.

card = {
    "@type": "MessageCard",
    "@context": "http://schema.org/extensions",
    "summary": "AWS Resource Compliance Report",
    "sections": [{
        "activityTitle": "⚠️ AWS Resource Compliance Notification",
        "activitySubtitle": "Resources requiring attention",
        "text": "This report lists AWS resources that require review for potential compliance issues.",
        "facts": [
            {"name": "Account", "value": "AWS_ACCOUNT_ID"},
            {"name": "Total Instances", "value": "NUMBER_OF_INSTANCES"},
            {"name": "Estimated Monthly Savings", "value": "$MONTHLY_SAVINGS"},
            {"name": "Report Location", "value": "S3_REPORT_PATH"}
        ],
        "markdown": True
    }]
}

Let’s break down the key components of this card:

  • @type and @context: These tell Microsoft Teams what kind of message this is and how to interpret it using the MessageCard schema.

  • summary: A short summary string used in notifications and previews.

  • sections: A list of structured data blocks. Each section supports text, images, facts, and headers.

    • activityTitle: Bold header that grabs the user’s attention (like a subject line).

    • activitySubtitle: Smaller subheading for additional context.

    • text: A brief description of the message, giving users clarity on what they’re looking at.

    • facts: A list of key-value pairs presented in a tabular format. These are ideal for showing metadata like the AWS account number, instance count, estimated costs, or links to more detailed reports.

    • markdown: A boolean flag that allows for bold, italic, and other rich-text elements inside the text and facts fields.

This formatting makes it easy for Teams users to quickly interpret what’s going on and where to go next.
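
In a real run, the placeholder strings above (AWS_ACCOUNT_ID, NUMBER_OF_INSTANCES, and so on) would be replaced with values computed earlier in the script. Here is a minimal sketch of that substitution, using hypothetical variable names:

# Hypothetical values gathered earlier in the scan
account_id = "123456789012"
instance_count = 7
monthly_savings = 42.50
report_path = "s3://example-report-bucket/weekly/report.csv"

# 'card' is the dictionary defined above
card["sections"][0]["facts"] = [
    {"name": "Account", "value": account_id},
    {"name": "Total Instances", "value": str(instance_count)},
    {"name": "Estimated Monthly Savings", "value": f"${monthly_savings:.2f}"},
    {"name": "Report Location", "value": report_path}
]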

Sending the Message

After formatting the card, we send it to Teams using a simple HTTP POST to the webhook URL.

import urllib3
import json

http = urllib3.PoolManager()
response = http.request(
    "POST",
    TEAMS_WEBHOOK_URL,
    body=json.dumps(card).encode("utf-8"),
    headers={"Content-Type": "application/json"}
)

if response.status == 200:
    print("Teams notification sent successfully.")
else:
    print(f"Failed to send Teams notification. Status: {response.status}")

  • First, a connection pool is created using urllib3.PoolManager(), which manages and reuses HTTP connections efficiently.

  • Then, the request() method sends a POST request to the Teams webhook URL. The request includes:

    • The HTTP method (POST), as Teams expects messages to be submitted this way

    • The TEAMS_WEBHOOK_URL, which is the destination for our message

    • A request body containing our message card, JSON-encoded and UTF-8 encoded

    • HTTP headers declaring the payload as application/json

After making the request, the script evaluates the response code. If it's 200 OK, the message was successfully delivered to Teams. Otherwise, an error message is logged with the returned HTTP status.

This minimal setup works seamlessly in serverless environments like AWS Lambda and avoids the need for heavier dependencies like requests, making it ideal for quick, automated reporting.

Getting Your Microsoft Teams Webhook URL

Before you can post messages to Teams, you need to set up an Incoming Webhook in your desired channel:

  • In Microsoft Teams, navigate to the channel where you want the messages to appear.
  • Click the ellipsis (⋯) next to the channel name, then select Connectors.
  • Search for Incoming Webhook, then click Configure.
  • Name your webhook (e.g., AWS Resource Alerts) and optionally upload an icon.
  • Click Create, and Teams will generate a Webhook URL.
  • Copy that URL—you’ll need to paste it into your Python script.

Now, in your Python script, you can define it like this:

TEAMS_WEBHOOK_URL = "https://your.webhook.office.com/..."

Be sure to store your webhook URL securely. For production environments, consider injecting it as an environment variable or retrieving it securely through a tool like AWS Secrets Manager.
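
For example, a small helper can read the URL from an environment variable and fall back to AWS Secrets Manager; the secret name below is a placeholder:

import os
import boto3

def get_webhook_url():
    """Return the Teams webhook URL from the environment or AWS Secrets Manager."""
    url = os.environ.get("TEAMS_WEBHOOK_URL")
    if url:
        return url
    # Fallback: fetch a plain-text secret (the secret name is hypothetical)
    secrets = boto3.client("secretsmanager")
    return secrets.get_secret_value(SecretId="teams/webhook-url")["SecretString"]

TEAMS_WEBHOOK_URL = get_webhook_url()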

In my setup, this script runs every Thursday. It collects the week’s findings, uploads a CSV report to S3, and posts a summary to our Microsoft Teams channel. This gives everyone a chance to review the results and provide feedback before any cleanup tasks are scheduled.

The purpose of automation isn’t just speed—it’s alignment. By connecting Python scripts to Teams, we’re not just executing processes automatically; we’re keeping the right people informed. If you’re running regular audits or cost-optimization checks, make the results visible in the tools your team already uses. It’s a simple way to drive transparency and collaboration.



In several of my recent posts, I’ve discussed using Lambda scripting to identify and clean up unused resources in AWS environments. While these tasks traditionally fell under DevOps, they are now part of a broader discipline known as FinOps. Short for Financial Operations, FinOps merges financial management with operational efficiencies to maximize the value organizations derive from cloud computing.

Although the FinOps Foundation formally established the concept in 2019, its principles date back to the early 2010s. During this time, businesses began focusing on managing cloud costs as the shift from capital expenditure (CapEx) to operational expenditure (OpEx) models made cost efficiency a priority.


Tagging in cloud environments, particularly in development settings, is a foundational practice that can transform the way organizations manage and optimize their resources. Beyond simple organization, tagging serves as a critical tool for financial operations, resource accountability, and operational efficiency. By implementing a robust tagging strategy, teams can address common challenges in cloud resource management, such as uncontrolled costs, unclear ownership, and untracked manual processes.


Why Tagging Matters

One of the major reasons for tagging is its indispensable role in financial operations. By associating resources with specific tags, such as owner=cloudformation or owner=terraform, organizations can:

1. Tie Costs to Individual Parties: Tags enable teams to allocate costs accurately to the individuals, teams, or projects responsible for creating and maintaining resources. This not only promotes accountability but also fosters better budget management.

2. Distinguish Between Automation and Manual Processes: Tags can identify whether resources were created through automated pipelines or manual efforts. For example:

  •    Resources tagged with owner=cloudformation or owner=terraform are created through Infrastructure-as-Code (IaC) platforms, ensuring consistency and reliability.
  •    Resources lacking these tags often indicate manual creation, which can be prone to errors and higher costs.

3. Enable Proactive Cleanup: Automated processes can periodically scan for resources without proper tags (a minimal example of such a scan follows this list). These orphaned or manually created resources can be flagged for cleanup, leading to two key benefits:

  •    Cost Savings: Unused or untracked resources can quietly accumulate costs. Tagging ensures these resources are identified and removed, saving money.
  •    Encouraging Best Practices: Developers are incentivized to use automated pipelines or collaborate with DevOps teams to build resources through code, fostering a culture of efficiency and predictability.
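
As a concrete illustration of point 3, the sketch below uses boto3 to list EC2 instances missing an owner tag; the tag key and output handling are illustrative assumptions, not a prescribed implementation:

import boto3

REQUIRED_TAG = "owner"  # hypothetical required tag key

ec2 = boto3.client("ec2")
untagged_instances = []

# Walk every instance in the region and collect the ones missing the tag
for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tag_keys = {tag["Key"] for tag in instance.get("Tags", [])}
            if REQUIRED_TAG not in tag_keys:
                untagged_instances.append(instance["InstanceId"])

print(f"Instances missing the '{REQUIRED_TAG}' tag: {untagged_instances}")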


Automating Resource Management

By leveraging tagging, organizations can implement automated processes to manage their cloud environments effectively. For example, scheduled cleanup scripts or tools like AWS Config can identify resources lacking specific tags and delete them in development environments. This approach:

1. Saves Money: Automatically removing untracked resources reduces unnecessary costs.

2. Pushes for Infrastructure-as-Code Adoption: Retaining only resources created through automation encourages teams to adopt IaC practices or collaborate with DevOps. This ensures that resources are built with long-term sustainability and scalability in mind.


Strengthening DevOps Collaboration

Tagging is not just about resource management—it’s a communication tool. When developers tag resources appropriately, it provides DevOps teams with critical visibility into ongoing projects. This:

  • Improves Planning: Un-tagged resources alert DevOps to new implementations, allowing them to prepare infrastructure and tools ahead of time.
  • Prevents Emergencies: With early awareness of new resource needs, DevOps can prevent rushed deployments and ensure proper architectural planning.


Standardized Tagging in Practice

To implement a tagging standard, organizations can follow these guidelines:

1. Use Clear, Consistent Tags:

  •    owner=cloudformation: For resources created via CloudFormation.
  •    billing=operations: For resources owned, maintained, and used by the operations team.
  •    instance=linux: Identifies the operating system of a server or Kubernetes node.

2. Retroactively Tag Existing Resources: Apply owner tags to all previously created resources to maintain consistency and accountability.

3. Integrate Tagging into Automation: Ensure that all IaC templates include the appropriate tags:

   - CloudFormation:

     Resources:
       MyResource:
         Type: AWS::EC2::Instance
         Properties:
           Tags:
             - Key: owner
               Value: cloudformation

   - Terraform:

     resource "aws_instance" "example" {
       ami           = "ami-123456"
       instance_type = "t2.micro"

       tags = {
         owner = "terraform"
       }
     }


4. Automate Tag Compliance Checks: Use tools like AWS Config or custom scripts to ensure all resources adhere to the tagging policy (a sample script follows this list).

5. Set a Cleanup Policy: In development environments, resources missing appropriate tags should be flagged and removed during scheduled cleanup cycles, unless exceptions are explicitly communicated. Tools such as aws-nuke and cloud-nuke can automate this kind of bulk cleanup.
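
For point 4, one possible custom check uses the Resource Groups Tagging API to sweep all taggable resource types at once. This is only a sketch, and the required tag keys are assumptions:

import boto3

REQUIRED_KEYS = {"owner", "billing"}  # hypothetical mandatory tag keys

tagging = boto3.client("resourcegroupstaggingapi")
non_compliant = []

# get_resources lists resources known to the tagging API along with their tags.
# Note: resources that have never been tagged may not appear in these results.
for page in tagging.get_paginator("get_resources").paginate():
    for mapping in page["ResourceTagMappingList"]:
        present = {tag["Key"] for tag in mapping.get("Tags", [])}
        if not REQUIRED_KEYS.issubset(present):
            non_compliant.append(mapping["ResourceARN"])

print(f"{len(non_compliant)} resources missing required tags")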


The Impact of Effective Tagging

When implemented effectively, tagging enables:

  • Enhanced Visibility: Teams can easily differentiate between automated and manual resources.
  • Cost Optimization: Proper tagging ensures that all resources are accounted for, reducing the risk of unnecessary expenses.
  • Streamlined Operations: DevOps teams gain better insight into development activities, enabling proactive support and infrastructure planning.

By adopting a standardized tagging policy and embedding it into daily workflows, organizations can unlock the full potential of their cloud environments. Tagging is more than a technical practice; it’s a cultural shift that promotes accountability, collaboration, and efficiency.



Recently, we discovered an account with over 25 TB of EBS snapshots, some dating back to 2017. These old snapshots had been piling up, creating substantial, unnecessary costs. We realized that without cleanup, costs would only increase, especially in our dev environment, where frequent changes were generating excess snapshot storage overhead. This Lambda function was developed to automate the cleanup of outdated snapshots and refine our volume snapshot policy, allowing us to regain control over storage costs.

1. Cost Management and Optimization

  • Accruing Storage Costs: Each EBS snapshot incurs storage costs based on the amount of data stored. Over time, as snapshots accumulate, these costs can become significant, especially for organizations with multiple environments or large volumes of data.
  • Automated Cleanup: A Lambda function helps to automate the deletion of older, unnecessary snapshots, ensuring that only the most recent backups are retained, which optimizes storage costs by freeing up space occupied by outdated snapshots.

2. Improved Operational Efficiency

  • Avoid Manual Intervention: Managing snapshots manually can be time-consuming, especially in large-scale environments where multiple snapshots are created daily. By automating snapshot cleanup, the Lambda function eliminates the need for manual review and deletion, saving valuable time for the operations team.
  • Consistency and Reliability: Automation ensures that snapshots are managed consistently according to defined policies. This prevents oversight and ensures a reliable, predictable process for snapshot lifecycle management.

3. Risk Mitigation

  • Avoid Accidental Deletion of Important Snapshots: By automating with a well-defined Lambda function, you can set policies to retain only the latest snapshots, significantly reducing the risk of accidentally deleting snapshots that are crucial for disaster recovery or compliance.
  • Streamlined Backup Management: With snapshots cleaned up regularly, only relevant backups are kept, simplifying recovery processes. This means that if data needs to be restored, engineers don’t need to sift through an excess of snapshots to locate the right one, which is especially critical during high-pressure situations like system recovery.

4. Scalability

  • Handles Growing Data Volumes: As infrastructure scales, so does the number of snapshots created. A Lambda function can automatically scale to handle snapshot cleanup across all regions and accounts without requiring additional infrastructure.
  • Facilitates Cross-Region and Multi-Account Management: For enterprises with complex, multi-region, or multi-account setups, a Lambda function can centralize snapshot cleanup policies across environments, streamlining overall backup management.

5. Compliance and Audit Readiness

  • Retention Policies: Many organizations must comply with data retention policies that dictate how long certain data must be kept. A Lambda function can enforce these retention rules consistently, ensuring snapshots are kept or deleted according to compliance requirements.
  • Audit-Friendly: The function can be configured to generate a report of deleted snapshots and the remaining storage, making it easy to demonstrate compliance and cost efficiency during audits.

6. Enhanced Security

  • Reduce Attack Surface: Unnecessary snapshots can potentially expose outdated or unpatched data. By regularly deleting unused snapshots, you reduce the attack surface, helping to protect sensitive information that may be stored in older snapshots.
  • Automated Logs and Notifications: This function can log and notify when snapshots are deleted, offering visibility into backup and cleanup processes, which can be monitored to ensure secure and compliant operations.


Overview

This AWS Lambda function is designed to manage Amazon Elastic Block Store (EBS) snapshots by identifying old snapshots, optionally deleting them, storing a summary of the old snapshots in an Amazon S3 bucket, and notifying a specified Slack channel about the results. This function is especially useful for managing snapshot storage costs by ensuring that only the latest snapshots are retained while tracking the total space occupied by old snapshots.

Key Components

  1. Logging Configuration: Configures logging for information and error handling.

  2. Constants:

    • DELETE_SNAPSHOTS: Enables or disables the deletion of old snapshots.
    • SEND_SLACK_MESSAGE: Controls whether a Slack notification will be sent.
    • S3_BUCKET and S3_KEY: Define the S3 bucket and file name where the CSV report of old snapshots will be saved.
    • SLACK_WEBHOOK_URL and SLACK_CHANNEL: Specify the Slack Webhook URL and the Slack channel to send a notification about the uploaded report.
  3. Dependencies: Uses the boto3 library to interact with AWS services, csv for generating CSV output, and urllib3 for HTTP requests (to send Slack notifications).
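
For reference, the constants described above might look like the following at the top of the function; every value here is a placeholder:

DELETE_SNAPSHOTS = False          # Set to True to actually delete old snapshots
SEND_SLACK_MESSAGE = True         # Set to False to skip the Slack notification
S3_BUCKET = "example-report-bucket"       # Placeholder bucket for the CSV report
S3_KEY = "reports/old-ebs-snapshots.csv"  # Placeholder object key
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # Placeholder webhook
SLACK_CHANNEL = "#cloud-costs"            # Placeholder channel name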

Function Details

lambda_handler(event, context)

This is the main function that runs when the Lambda function is triggered. It performs several steps:

  1. Retrieve EBS Snapshots:

    • Uses the describe_snapshots API to get all EBS snapshots owned by the account, organized by volume.
    • Sorts the snapshots for each volume by the creation time (StartTime), from the most recent to the oldest.
  2. Organize and Filter Snapshots:

    • Groups snapshots by volume and retains only the newest 4 snapshots for each volume.
    • Flags the older snapshots as candidates for deletion or reporting (a simplified sketch of this logic appears after this list).
  3. Generate CSV Report:

    • Creates a CSV file in memory using StringIO, which will hold details of each old snapshot, including:
      • SnapshotId, StartTime, VolumeId, State, Description, Size (GiB), and Name.
    • Accumulates and logs the total storage size (in GiB) occupied by the snapshots for each volume and overall.
  4. Optional Deletion of Snapshots:

    • If DELETE_SNAPSHOTS is enabled, the function deletes each identified old snapshot.
    • Logs each deletion action.
  5. Upload CSV to S3:

    • The CSV report is uploaded to the specified S3 bucket and key, which provides a historical record of old snapshots and their associated storage costs.
  6. Slack Notification:

    • If SEND_SLACK_MESSAGE is enabled, the function sends a notification to the specified Slack channel with a link to the S3 report and the total storage size marked for deletion.
  7. Error Handling:

    • Uses try-except blocks to handle and log any errors encountered during the snapshot processing, deletion, or Slack notification stages.
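
A simplified sketch of steps 1 and 2 above (gathering snapshots, grouping them by volume, and keeping only the four newest per volume) is shown below; variable names are illustrative, not the exact code:

import boto3
from collections import defaultdict

KEEP_PER_VOLUME = 4  # number of recent snapshots to retain for each volume

ec2 = boto3.client("ec2")
snapshots_by_volume = defaultdict(list)

# Step 1: collect all snapshots owned by this account
for page in ec2.get_paginator("describe_snapshots").paginate(OwnerIds=["self"]):
    for snapshot in page["Snapshots"]:
        snapshots_by_volume[snapshot["VolumeId"]].append(snapshot)

# Step 2: sort newest-first per volume and flag everything past the first four
old_snapshots = []
for volume_id, snaps in snapshots_by_volume.items():
    snaps.sort(key=lambda s: s["StartTime"], reverse=True)
    old_snapshots.extend(snaps[KEEP_PER_VOLUME:])

total_gib = sum(s["VolumeSize"] for s in old_snapshots)
print(f"Found {len(old_snapshots)} old snapshots totaling {total_gib} GiB")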

Helper Function: get_instance_name(ec2, volume_id)

This function retrieves the name of an EC2 instance attached to a given volume ID:

  • Uses the describe_volumes and describe_instances APIs to fetch the instance associated with the specified volume.
  • Searches the instance tags to find and return the Name tag value.
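
A possible implementation of this helper, based on the behavior described above (a sketch, not the exact repository code):

def get_instance_name(ec2, volume_id):
    """Return the Name tag of the instance attached to volume_id, or None."""
    volumes = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"]
    for volume in volumes:
        for attachment in volume.get("Attachments", []):
            instance_id = attachment["InstanceId"]
            reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
            for reservation in reservations:
                for instance in reservation["Instances"]:
                    for tag in instance.get("Tags", []):
                        if tag["Key"] == "Name":
                            return tag["Value"]
    return None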

Sample Output and Notifications

  • CSV Output: The CSV contains details about old snapshots by volume, with information about each snapshot’s size and other metadata. A total size summary is also added to the CSV.
  • Slack Notification: Sends a message to the specified Slack channel with details about the report location in S3 and the total size (in GiB) of snapshots targeted for deletion.

Configuration Tips

  • S3 Bucket and Slack Channel: Update S3_BUCKET, S3_KEY, SLACK_WEBHOOK_URL, and SLACK_CHANNEL as per your environment.
  • Permissions: Ensure the Lambda function’s IAM role has permissions to access EC2 (for snapshots), S3 (for uploading CSV files), and the necessary logging and notification services.

IAM Role policies

To allow this Lambda function to perform the necessary actions on EBS snapshots, S3, and CloudWatch Logs, you’ll need to create an IAM role for the Lambda function with specific policies attached. Here are the policies and permissions required:

1. EC2 Permissions (for managing snapshots)

These permissions enable the Lambda function to describe and delete snapshots:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeSnapshots",
                "ec2:DeleteSnapshot",
                "ec2:DescribeVolumes",
                "ec2:DescribeInstances"
            ],
            "Resource": "*"
        }
    ]
}

2. S3 Permissions (for uploading the CSV report)

These permissions allow the Lambda function to write the CSV file with snapshot details to an S3 bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::verato-snapshot-inspection/*"  // Replace with your bucket name
        }
    ]
}

3. CloudWatch Logs Permissions (for logging)

This permission enables the Lambda function to create log groups and write logs for monitoring and debugging purposes:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*"
        }
    ]
}

Combine these policies into a custom IAM role and attach it to the Lambda function to allow snapshot management, S3 access, and logging capabilities. This will cover the function’s requirements for cleanup, reporting, and logging. 

If you need to send notifications via Slack, the Lambda function requires no additional AWS permissions, as the Slack API communication is handled over HTTPS within the code itself.