Although the FinOps Foundation formally established the concept in 2019, its principles date back to the early 2010s. During this time, businesses began focusing on managing cloud costs as the shift from capital expenditure (CapEx) to operational expenditure (OpEx) models made cost efficiency a priority.
Tagging in cloud environments, particularly in development settings, is a foundational practice that can transform the way organizations manage and optimize their resources. Beyond simple organization, tagging serves as a critical tool for financial operations, resource accountability, and operational efficiency. By implementing a robust tagging strategy, teams can address common challenges in cloud resource management, such as uncontrolled costs, unclear ownership, and untracked manual processes.
Why Tagging Matters
One of the major reasons for tagging is its indispensable role in financial operations. By associating resources with specific tags, such as **owner=cloudformation** or **owner=terraform**, organizations can:
1. Tie Costs to Individual Parties: Tags enable teams to allocate costs accurately to the individuals, teams, or projects responsible for creating and maintaining resources. This not only promotes accountability but also fosters better budget management.
2. Distinguish Between Automation and Manual Processes: Tags can identify whether resources were created through automated pipelines or manual efforts. For example:
- Resources tagged with owner=cloudformation or owner=terraform are created through Infrastructure-as-Code (IaC) platforms, ensuring consistency and reliability.
- Resources lacking these tags often indicate manual creation, which can be prone to errors and higher costs.
3. Enable Proactive Cleanup: Automated processes can periodically scan for resources without proper tags. These orphaned or manually created resources can be flagged for cleanup, leading to two key benefits:
- Cost Savings: Unused or untracked resources can quietly accumulate costs. Tagging ensures these resources are identified and removed, saving money.
- Encouraging Best Practices: Developers are incentivized to use automated pipelines or collaborate with DevOps teams to build resources through code, fostering a culture of efficiency and predictability.
Automating Resource Management
By leveraging tagging, organizations can implement automated processes to manage their cloud environments effectively. For example, scheduled cleanup scripts or tools like AWS Config can identify resources lacking specific tags and delete them in development environments. This approach:
1. Saves Money: Automatically removing untracked resources reduces unnecessary costs.
2. Pushes for Infrastructure-as-Code Adoption: Retaining only resources created through automation encourages teams to adopt IaC practices or collaborate with DevOps. This ensures that resources are built with long-term sustainability and scalability in mind.
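The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's script: the resource dictionaries, the `Id` field, and the function names are hypothetical, and the `Tags` list only mimics the `[{"Key": ..., "Value": ...}]` shape that AWS APIs commonly return.

```python
# Tag values that mark a resource as created through IaC (taken from the text above).
IAC_OWNERS = {"cloudformation", "terraform"}


def tags_to_dict(tag_list):
    """Convert AWS-style [{'Key': k, 'Value': v}, ...] into a plain dict."""
    return {tag["Key"]: tag["Value"] for tag in (tag_list or [])}


def find_cleanup_candidates(resources, iac_owners=IAC_OWNERS):
    """Return the IDs of resources whose 'owner' tag is missing or not an IaC tool.

    `resources` is a hypothetical inventory: dicts with an 'Id' and optional 'Tags'.
    """
    candidates = []
    for resource in resources:
        owner = tags_to_dict(resource.get("Tags")).get("owner", "").lower()
        if owner not in iac_owners:
            candidates.append(resource["Id"])
    return candidates


if __name__ == "__main__":
    inventory = [
        {"Id": "i-001", "Tags": [{"Key": "owner", "Value": "terraform"}]},
        {"Id": "i-002", "Tags": [{"Key": "Name", "Value": "scratch-box"}]},
        {"Id": "i-003"},  # no tags at all
    ]
    print(find_cleanup_candidates(inventory))
```

In a real cleanup job the inventory would come from the cloud provider's API, and the candidates would more safely be reported for review before anything is deleted.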
Strengthening DevOps Collaboration
Tagging is not just about resource management—it’s a communication tool. When developers tag resources appropriately, it provides DevOps teams with critical visibility into ongoing projects. This:
- Improves Planning: Untagged resources alert DevOps to new implementations, allowing them to prepare infrastructure and tools ahead of time.
- Prevents Emergencies: With early awareness of new resource needs, DevOps can prevent rushed deployments and ensure proper architectural planning.
Standardized Tagging in Practice
To implement a tagging standard, organizations can follow these guidelines:
1. Use Clear, Consistent Tags:
- owner=cloudformation: For resources created via CloudFormation.
- billing=operations: For resources owned, maintained, and used by the operations team.
- instance=linux: Identifies the platform (here, Linux) of a server or Kubernetes node.
2. Retroactively Tag Existing Resources: Apply owner tags to all previously created resources to maintain consistency and accountability.
3. Integrate Tagging into Automation: Ensure that all IaC templates include the appropriate tags:
- CloudFormation:
    Resources:
      MyResource:
        Type: AWS::EC2::Instance
        Properties:
          Tags:
            - Key: owner
              Value: cloudformation
- Terraform:
    resource "aws_instance" "example" {
      ami           = "ami-123456"
      instance_type = "t2.micro"
      tags = {
        owner = "terraform"
      }
    }
4. Automate Tag Compliance Checks: Use tools like AWS Config or custom scripts to ensure all resources adhere to the tagging policy.
5. Set a Cleanup Policy: In development environments, resources missing appropriate tags should be flagged and removed during scheduled cleanup cycles, unless exceptions are explicitly communicated. Tools such as aws-nuke and cloud-nuke can automate this removal.
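A custom compliance check of the kind mentioned in guideline 4 could look roughly like the sketch below. The policy structure and the function name `check_tags` are assumptions made for illustration; only the tag keys and values come from the examples above.

```python
# Hypothetical tagging policy: each required key maps to its allowed values,
# or to None when any non-empty value is acceptable.
TAG_POLICY = {
    "owner": {"cloudformation", "terraform"},
    "billing": None,
}


def check_tags(tags, policy=TAG_POLICY):
    """Return a list of human-readable policy violations for one resource's tags."""
    violations = []
    for key, allowed in policy.items():
        value = tags.get(key)
        if not value:
            violations.append(f"missing tag '{key}'")
        elif allowed is not None and value not in allowed:
            violations.append(f"tag '{key}' has unexpected value '{value}'")
    return violations


if __name__ == "__main__":
    print(check_tags({"owner": "terraform", "billing": "operations"}))
    print(check_tags({"owner": "alice"}))
```

An empty list means the resource complies; anything else can be logged or fed into the scheduled cleanup cycle.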
The Impact of Effective Tagging
When implemented effectively, tagging enables:
- Enhanced Visibility: Teams can easily differentiate between automated and manual resources.
- Cost Optimization: Proper tagging ensures that all resources are accounted for, reducing the risk of unnecessary expenses.
- Streamlined Operations: DevOps teams gain better insight into development activities, enabling proactive support and infrastructure planning.
By adopting a standardized tagging policy and embedding it into daily workflows, organizations can unlock the full potential of their cloud environments. Tagging is more than a technical practice; it’s a cultural shift that promotes accountability, collaboration, and efficiency.
Recently, we faced a situation where we found an account with over 25 TB of EBS snapshots, some of which dated back to 2017. These old snapshots had been piling up, creating substantial, unnecessary costs. We realized that without cleanup, costs would only increase, especially in our dev environment, where frequent changes to snapshots were generating excess storage overhead. This Lambda function was developed as a solution to automate the cleanup of outdated snapshots and refine our volume snapshot policy, allowing us to regain control over storage costs effectively.
1. Cost Management and Optimization
- Accruing Storage Costs: Each EBS snapshot incurs storage costs based on the amount of data stored. Over time, as snapshots accumulate, these costs can become significant, especially for organizations with multiple environments or large volumes of data.
- Automated Cleanup: A Lambda function helps to automate the deletion of older, unnecessary snapshots, ensuring that only the most recent backups are retained, which optimizes storage costs by freeing up space occupied by outdated snapshots.
2. Improved Operational Efficiency
- Avoid Manual Intervention: Managing snapshots manually can be time-consuming, especially in large-scale environments where multiple snapshots are created daily. By automating snapshot cleanup, the Lambda function eliminates the need for manual review and deletion, saving valuable time for the operations team.
- Consistency and Reliability: Automation ensures that snapshots are managed consistently according to defined policies. This prevents oversight and ensures a reliable, predictable process for snapshot lifecycle management.
3. Risk Mitigation
- Avoid Accidental Deletion of Important Snapshots: By automating with a well-defined Lambda function, you can set policies to retain only the latest snapshots, significantly reducing the risk of accidentally deleting snapshots that are crucial for disaster recovery or compliance.
- Streamlined Backup Management: With snapshots cleaned up regularly, only relevant backups are kept, simplifying recovery processes. This means that if data needs to be restored, engineers don’t need to sift through an excess of snapshots to locate the right one, which is especially critical during high-pressure situations like system recovery.
4. Scalability
- Handles Growing Data Volumes: As infrastructure scales, so does the number of snapshots created. A Lambda function can automatically scale to handle snapshot cleanup across all regions and accounts without requiring additional infrastructure.
- Facilitates Cross-Region and Multi-Account Management: For enterprises with complex, multi-region, or multi-account setups, a Lambda function can centralize snapshot cleanup policies across environments, streamlining overall backup management.
5. Compliance and Audit Readiness
- Retention Policies: Many organizations must comply with data retention policies that dictate how long certain data must be kept. A Lambda function can enforce these retention rules consistently, ensuring snapshots are kept or deleted according to compliance requirements.
- Audit-Friendly: The function can be configured to generate a report of deleted snapshots and the remaining storage, making it easy to demonstrate compliance and cost efficiency during audits.
6. Enhanced Security
- Reduce Attack Surface: Unnecessary snapshots can potentially expose outdated or unpatched data. By regularly deleting unused snapshots, you reduce the attack surface, helping to protect sensitive information that may be stored in older snapshots.
- Automated Logs and Notifications: This function can log and notify when snapshots are deleted, offering visibility into backup and cleanup processes, which can be monitored to ensure secure and compliant operations.
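To put the accruing storage costs from point 1 into numbers, here is a back-of-the-envelope sketch. The per-GiB price is an assumed placeholder, not a quoted rate, so check current pricing for your region and storage tier. Snapshots are also incremental, so the amount actually billed can be lower than the sum of the volume sizes.

```python
def estimate_monthly_snapshot_cost(total_gib, price_per_gib_month):
    """Rough monthly cost: stored GiB multiplied by the per-GiB monthly price."""
    return round(total_gib * price_per_gib_month, 2)


if __name__ == "__main__":
    stored_gib = 25 * 1024            # the 25 TB mentioned above, expressed in GiB
    assumed_price = 0.05              # ASSUMPTION: placeholder USD per GiB-month
    print(estimate_monthly_snapshot_cost(stored_gib, assumed_price))
```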
Overview
This AWS Lambda function is designed to manage Amazon Elastic Block Store (EBS) snapshots by identifying old snapshots, optionally deleting them, storing a summary of the old snapshots in an Amazon S3 bucket, and notifying a specified Slack channel about the results. This function is especially useful for managing snapshot storage costs by ensuring that only the latest snapshots are retained while tracking the total space occupied by old snapshots.
Key Components
Logging Configuration: Configures logging for information and error handling.
Constants:
- DELETE_SNAPSHOTS: Enables or disables the deletion of old snapshots.
- SEND_SLACK_MESSAGE: Controls whether a Slack notification will be sent.
- S3_BUCKET and S3_KEY: Define the S3 bucket and file name where the CSV report of old snapshots will be saved.
- SLACK_WEBHOOK_URL and SLACK_CHANNEL: Specify the Slack webhook URL and the Slack channel to send a notification about the uploaded report.
Dependencies: Uses the boto3 library to interact with AWS services, csv for generating CSV output, and urllib3 for HTTP requests (to send Slack notifications).
Function Details
lambda_handler(event, context)
This is the main function that runs when the Lambda function is triggered. It performs several steps:
Retrieve EBS Snapshots:
- Uses the describe_snapshots API to get all EBS snapshots owned by the account, organized by volume.
- Sorts the snapshots for each volume by the creation time (StartTime), from the most recent to the oldest.
Organize and Filter Snapshots:
- Groups snapshots by volume and retains only the newest 4 snapshots for each volume.
- Flags the older snapshots as candidates for deletion or reporting.
Generate CSV Report:
- Creates a CSV file in memory using StringIO, which will hold details of each old snapshot, including: SnapshotId, StartTime, VolumeId, State, Description, Size (GiB), and Name.
- Accumulates and logs the total storage size (in GiB) occupied by the snapshots for each volume and overall.
Optional Deletion of Snapshots:
- If DELETE_SNAPSHOTS is enabled, the function deletes each identified old snapshot.
- Logs each deletion action.
Upload CSV to S3:
- The CSV report is uploaded to the specified S3 bucket and key, which provides a historical record of old snapshots and their associated storage costs.
Slack Notification:
- If SEND_SLACK_MESSAGE is enabled, the function sends a notification to the specified Slack channel with a link to the S3 report and the total storage size marked for deletion.
Error Handling:
- Uses try-except blocks to handle and log any errors encountered during the snapshot processing, deletion, or Slack notification stages.
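The grouping, retention, and CSV steps can be sketched as follows. This is an illustrative reconstruction rather than the original function: the helper names are hypothetical, the input is assumed to be the list of snapshot dicts returned by describe_snapshots (with its SnapshotId, VolumeId, StartTime, State, Description, and VolumeSize fields), and the Name column is left out because it depends on a separate instance lookup.

```python
import csv
from collections import defaultdict
from datetime import datetime, timezone
from io import StringIO

KEEP_PER_VOLUME = 4  # the text above retains the newest 4 snapshots per volume


def select_old_snapshots(snapshots, keep=KEEP_PER_VOLUME):
    """Group snapshots by VolumeId and return those beyond the newest `keep`."""
    by_volume = defaultdict(list)
    for snap in snapshots:
        by_volume[snap["VolumeId"]].append(snap)
    old = []
    for volume_id in sorted(by_volume):
        newest_first = sorted(
            by_volume[volume_id], key=lambda s: s["StartTime"], reverse=True
        )
        old.extend(newest_first[keep:])  # everything after the newest `keep`
    return old


def build_csv_report(old_snapshots):
    """Write old snapshots to an in-memory CSV; return (csv_text, total_gib)."""
    buffer = StringIO()
    writer = csv.writer(buffer)
    writer.writerow(
        ["SnapshotId", "StartTime", "VolumeId", "State", "Description", "Size (GiB)"]
    )
    total_gib = 0
    for snap in old_snapshots:
        writer.writerow([
            snap["SnapshotId"],
            snap["StartTime"].isoformat(),
            snap["VolumeId"],
            snap.get("State", ""),
            snap.get("Description", ""),
            snap["VolumeSize"],
        ])
        total_gib += snap["VolumeSize"]
    writer.writerow(["Total", "", "", "", "", total_gib])
    return buffer.getvalue(), total_gib


if __name__ == "__main__":
    # Six made-up snapshots of one volume, one per day.
    sample = [
        {"SnapshotId": f"snap-{day}", "VolumeId": "vol-a", "VolumeSize": 10,
         "StartTime": datetime(2024, 1, day, tzinfo=timezone.utc)}
        for day in range(1, 7)
    ]
    report, total = build_csv_report(select_old_snapshots(sample))
    print(report)
    print("GiB in old snapshots:", total)
```

The returned CSV text could then be passed to the S3 upload step, and the list of old snapshots to the optional deletion step.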
Helper Function: get_instance_name(ec2, volume_id)
This function retrieves the name of an EC2 instance attached to a given volume ID:
- Uses the describe_volumes and describe_instances APIs to fetch the instance associated with the specified volume.
- Searches the instance tags to find and return the Name tag value.
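A helper matching that description might look like the sketch below. The body is an assumption about how such a lookup could be written, not the original code; `ec2` is expected to be a boto3 EC2 client, and returning None for unattached volumes is a choice made here for illustration.

```python
def get_instance_name(ec2, volume_id):
    """Return the Name tag of the instance attached to `volume_id`, or None."""
    volumes = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"]
    if not volumes or not volumes[0].get("Attachments"):
        return None  # volume not returned, or not attached to any instance

    instance_id = volumes[0]["Attachments"][0]["InstanceId"]
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            for tag in instance.get("Tags", []):
                if tag["Key"] == "Name":
                    return tag["Value"]
    return None  # instance exists but has no Name tag
```

Snapshots often outlive their source volume, in which case describe_volumes raises a client error instead of returning an empty list, so the caller will likely want to wrap this call in a try-except block.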
Sample Output and Notifications
- CSV Output: The CSV contains details about old snapshots by volume, with information about each snapshot’s size and other metadata. A total size summary is also added to the CSV.
- Slack Notification: Sends a message to the specified Slack channel with details about the report location in S3 and the total size (in GiB) of snapshots targeted for deletion.
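The Slack notification step could be sketched as below. The message wording and both function names are hypothetical; `http` is assumed to be a `urllib3.PoolManager()` instance, passed in so the function is easy to exercise without network access.

```python
import json


def build_slack_payload(channel, bucket, key, total_gib):
    """Build the JSON body for a Slack incoming-webhook message."""
    text = (
        f"EBS snapshot report uploaded to s3://{bucket}/{key}. "
        f"Old snapshots total {total_gib} GiB."
    )
    return {"channel": channel, "text": text}


def send_slack_message(http, webhook_url, payload):
    """POST the payload to the webhook and return the HTTP status code."""
    response = http.request(
        "POST",
        webhook_url,
        body=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return response.status
```

In the Lambda function this would typically be called only when SEND_SLACK_MESSAGE is enabled, with any failure caught and logged so that it does not abort the run.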
Configuration Tips
- S3 Bucket and Slack Channel: Update S3_BUCKET, S3_KEY, SLACK_WEBHOOK_URL, and SLACK_CHANNEL as per your environment.
- Permissions: Ensure the Lambda function’s IAM role has permissions to access EC2 (for snapshots), S3 (for uploading CSV files), and the necessary logging and notification services.
IAM Role policies
1. EC2 Permissions (for managing snapshots)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:DescribeSnapshots", "ec2:DeleteSnapshot", "ec2:DescribeVolumes", "ec2:DescribeInstances"],
      "Resource": "*"
    }
  ]
}
2. S3 Permissions (for uploading the CSV report; replace the bucket name with your own)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::verato-snapshot-inspection/*"
    }
  ]
}
3. CloudWatch Logs Permissions (for logging)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*"
    }
  ]
}
Cleaning up stopped EC2 instances is a crucial maintenance task for AWS accounts: it avoids unnecessary costs, such as charges for Elastic IP addresses (EIPs) that are no longer attached to a running resource. To streamline this process, I’ve set up two versions of a Lambda function to automate the identification and deletion of stopped instances.
Each week, one version of the Lambda function runs on Thursday to inspect stopped instances and log them for review, while another version runs on Saturday to delete the identified instances. This two-phase approach allows time to verify which instances are flagged for deletion before executing the cleanup.
Step 1: Schedule the Lambda Function with Amazon EventBridge
To run the Lambda function on a specific day, you can use Amazon EventBridge (formerly CloudWatch Events). EventBridge allows you to create a scheduled rule that triggers the Lambda function at a specific time.
- Navigate to EventBridge in the AWS Management Console.
- Create a Rule:
- Define the cron expression or rate expression for the desired schedule. For example:
- To run at 7 AM every Thursday:
cron(0 7 ? * 5 *)
- This cron expression means: "At 07:00 AM UTC on every Thursday."
- Set Target to your Lambda function.
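The same rule can also be created from code rather than the console. The sketch below is one possible approach under stated assumptions: `events` is a boto3 EventBridge client (`boto3.client("events")`), and the function names, rule name, and target Id are hypothetical. The helper also makes the day-of-week numbering explicit, since AWS cron expressions count from 1 = Sunday.

```python
# AWS cron day-of-week numbering starts at 1 = Sunday, so Thursday is 5.
AWS_DAY_OF_WEEK = {"SUN": 1, "MON": 2, "TUE": 3, "WED": 4, "THU": 5, "FRI": 6, "SAT": 7}


def weekly_cron(day, hour_utc, minute=0):
    """Build an EventBridge cron expression for one run per week (times in UTC)."""
    return f"cron({minute} {hour_utc} ? * {AWS_DAY_OF_WEEK[day.upper()]} *)"


def schedule_lambda(events, rule_name, schedule, lambda_arn):
    """Create or update a scheduled rule and point it at the Lambda function."""
    events.put_rule(Name=rule_name, ScheduleExpression=schedule, State="ENABLED")
    events.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": lambda_arn}])


if __name__ == "__main__":
    print(weekly_cron("THU", 7))  # the Thursday 07:00 UTC schedule from the text
```

When a rule is created this way instead of through the console, the Lambda function usually also needs a resource-based permission that allows events.amazonaws.com to invoke it.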
Step 2: Pass a Specific Set of Environment Variables
To use specific environment variables for a particular run:
Use AWS Lambda Versions and Aliases:
- You can create different versions of the Lambda function, each with its own set of environment variables.
- For example, you can create a version with inspection variables (DELETE_QUEUES=False) and another with deletion variables (DELETE_QUEUES=True).
- Assign an alias to each version (e.g., inspection and deletion).
EventBridge Rule Target Configuration:
- In the target configuration of the EventBridge rule, specify the alias for the Lambda version you want to run.
- This allows you to run different versions of the Lambda function based on the schedule.
Step 3: Use Code Variables
If you need to dynamically set environment variables for each run:
Update Environment Variables in Code:
- Modify the Lambda function code to accept environment variable overrides via the event payload.
import os

def lambda_handler(event, context):
    # Override environment variables if provided in the event.
    # str() guards against JSON booleans in the payload, which have no .lower().
    delete_queues = str(event.get('DELETE_QUEUES', os.getenv('DELETE_QUEUES', 'True'))).lower() == 'true'
    send_slack_message = str(event.get('SEND_SLACK_MESSAGE', os.getenv('SEND_SLACK_MESSAGE', 'True'))).lower() == 'true'
    # Your logic here...
- Create EventBridge rules that target different versions of the code and variables:
- Assuming you already have a Lambda function that checks for non-running EC2 instances and deletes them if required, you’ll need to create two separate versions:
- Version 1: For inspection (running every Thursday, without deleting).
- Version 2: For deletion (running every Saturday, with deletion enabled).
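As a rough illustration of the two runs, the sketch below resolves a flag from the event payload first and the environment second. The helper name and the two sample payloads are hypothetical, and the fallback default here is "False" (inspect only), which is a more conservative choice than the 'True' default in the snippet above.

```python
import os


def resolve_flag(event, name, default="False"):
    """Event payload wins over the environment variable; accepts str or bool."""
    raw = event.get(name, os.getenv(name, default))
    return str(raw).strip().lower() == "true"


# Hypothetical constant inputs that the two scheduled rules could pass as the event.
INSPECT_EVENT = {"DELETE_QUEUES": "False", "SEND_SLACK_MESSAGE": "True"}  # Thursday
DELETE_EVENT = {"DELETE_QUEUES": True, "SEND_SLACK_MESSAGE": "True"}      # Saturday


if __name__ == "__main__":
    for label, event in (("inspect", INSPECT_EVENT), ("delete", DELETE_EVENT)):
        print(label, resolve_flag(event, "DELETE_QUEUES"))
```

With this pattern a single function version can serve both schedules, because each rule supplies its own payload.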
Step 4: Use Environment Variables
When using ArgoCD to automate your application deployments, you might encounter an issue where a deployment gets stuck in the Progressing state. This often happens because Kubernetes finalizers are not released properly, preventing the related pods and resources from being destroyed.
Finalizers are keys in an object's metadata that tell Kubernetes to wait until certain clean-up tasks are completed before the object is fully deleted. If the controller responsible for them malfunctions or never removes them, they block deletion and cause your deployment to hang indefinitely.
I’ll walk you through a step-by-step process to resolve this issue by manually removing the finalizers and then successfully deleting the stuck ArgoCD app.
Step-by-Step Guide
Step 1: Identify the Problem
When an ArgoCD app gets stuck in the "Progressing" state, you can confirm the issue by inspecting the status of the app and looking for finalizers that are preventing deletion.
You can do this by running the following command:
kubectl get app APP_NAME -o yaml
Look for the metadata.finalizers field. If you see finalizers listed but the app cannot progress to completion, that’s the cause of the problem.
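If you prefer to script this check, the sketch below inspects the app manifest as parsed JSON (for example, the output of the same command with -o json). The function name is hypothetical. It relies on the fact that an object waiting on finalizers carries both a metadata.deletionTimestamp and a non-empty metadata.finalizers list.

```python
import json


def blocking_finalizers(manifest):
    """Return the finalizers holding up deletion, or [] if deletion is not pending."""
    metadata = manifest.get("metadata", {})
    if not metadata.get("deletionTimestamp"):
        return []  # nobody has asked to delete this object yet
    return list(metadata.get("finalizers") or [])


if __name__ == "__main__":
    # Trimmed, made-up manifest in the shape `kubectl get app APP_NAME -o json` returns.
    raw = """
    {"metadata": {"name": "demo",
                  "deletionTimestamp": "2024-01-01T00:00:00Z",
                  "finalizers": ["resources-finalizer.argocd.argoproj.io"]}}
    """
    print(blocking_finalizers(json.loads(raw)))
```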
Step 2: Patch the App to Remove Finalizers
To resolve the stuck state, you need to remove the finalizers from the ArgoCD app. You can do this by running the following **kubectl** command:
kubectl patch app APP_NAME -p '{"metadata": {"finalizers": null}}' --type merge
This command removes the finalizers from the app, allowing Kubernetes to bypass the finalizer logic and proceed with the app deletion.
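When several apps are stuck, the patch can be wrapped in a small script. The sketch below only builds the kubectl argument list; the function name and the optional namespace argument are assumptions added for illustration, and running the command is left to the caller.

```python
import json


def build_remove_finalizers_cmd(kind, name, namespace=None):
    """Build the kubectl argument list that clears metadata.finalizers via a merge patch."""
    patch = json.dumps({"metadata": {"finalizers": None}})  # None serializes to null
    cmd = ["kubectl", "patch", kind, name, "--type", "merge", "-p", patch]
    if namespace:
        cmd += ["-n", namespace]
    return cmd


if __name__ == "__main__":
    # Print instead of executing; pass the list to subprocess.run(cmd, check=True) to apply it.
    print(" ".join(build_remove_finalizers_cmd("app", "APP_NAME")))
```

Passing the arguments as a list avoids shell quoting problems with the JSON patch.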
Step 3: Patch the CRD (If Necessary)
In some cases, the Custom Resource Definition (CRD) associated with the app may also have finalizers that are causing the stuck state. To remove the finalizers from the CRD, use the following command:
kubectl patch crd CRD_NAME -p '{"metadata": {"finalizers": null}}' --type merge
This ensures that any finalizer present in the CRD is also removed, allowing for complete deletion of the resources.
Step 4: Delete the Stuck App
After patching the finalizers, you can safely delete the stuck app by using the following command:
kubectl delete app APP_NAME
Step 5: Delete the CRD
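If needed, you can also delete the CRD after patching its finalizers. Be aware that deleting a CRD also deletes every custom resource of that kind in the cluster, so only do this when nothing else depends on it: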
kubectl delete crd CRD_NAME
This should fully remove the app and any related resources, resolving the stuck progressing state.
Deployments getting stuck in the Progressing state in ArgoCD can often be traced back to finalizers not being properly removed, which blocks resource deletion. By manually patching the app and any related CRDs to remove finalizers, you can resolve this issue and successfully delete the resources.
If you’re facing this issue frequently, consider reviewing the finalizer behavior in your environment and ArgoCD configurations to ensure that finalizers are being handled correctly during normal operations. Properly configured finalizers will help avoid these kinds of issues in the future.