Terraform Cheatsheet¶
AI Assisted (Grok 3)
A cheatsheet that is actually useful for setting up a production-grade IaC project using Terraform and GitLab, covering:
- Multi-Environment setup
- CI/CD
- State Management
- Rollbacks
- Capturing and Logging Outputs
And more..
1. Core Concepts¶
- Infrastructure as Code (IaC): Define infrastructure (e.g., Lambdas, databases, data lakes, compute clusters) in `.tf` files.
- Providers: Plugins for cloud platforms (e.g., `aws`, `google`, `azurerm`) or tools (e.g., `snowflake`, `databricks`).
- Resources: Infrastructure components (e.g., S3 buckets, RDS instances, BigQuery datasets).
- Modules: Reusable, parameterized templates for common infrastructure patterns.
- State: Tracks infrastructure state, stored locally (`terraform.tfstate`) or remotely (e.g., S3, GCS).
- Workspaces: Isolate environments (e.g., `dev`, `staging`, `prod`) within a single configuration directory using multiple state files. Note: This project uses folder-based isolation (separate directories for each environment) instead of workspaces for better clarity and flexibility.
2. Key Terraform Components and State Management¶
- State Management:
  - The `terraform.tfstate` file records the current state of your infrastructure, mapping Terraform configurations to real-world resources, including dynamically generated attributes (e.g., S3 bucket ARNs, RDS endpoints).
  - Local State: Stored locally by default; suitable for solo or small projects but risky for teams due to potential conflicts or loss.
  - Remote State: Store state in a remote backend (e.g., S3, GCS, Azure Blob) with locking (e.g., DynamoDB) to prevent concurrent modifications and enable team collaboration.
  - Environment Isolation: This project uses separate directories (`environments/dev`, `environments/staging`, `environments/prod`) with distinct state files (e.g., `data-infra/dev/terraform.tfstate`) instead of Terraform workspaces, providing clear separation and environment-specific configurations.
  - Best Practices: Use remote backends for production, encrypt state files, restrict access via IAM, and regularly back up state files.
  - State Drift: Occurs when manual changes bypass Terraform. Use `terraform refresh` to update the state or `terraform import` to bring resources under management (see the import sketch at the end of this section).
- outputs.tf:
  - Defines output values exposed after `terraform apply`, such as resource IDs, endpoints, or computed values (e.g., S3 bucket ARN, RDS endpoint).
  - How It Works: Outputs reference attributes stored in the state file, which captures dynamically generated values during resource creation. `outputs.tf` itself does not store values but declares what to extract from the state.
  - Example:
    output "s3_bucket_arn" {
      value       = aws_s3_bucket.bucket.arn
      description = "ARN of the S3 bucket"
    }
  - After `terraform apply`, the state file stores the bucket's ARN, and the output retrieves it.
  - Updating Outputs: Edit `outputs.tf` in the repository to add, modify, or remove outputs. Commit changes to Git, and they take effect on the next `terraform apply` or `terraform output`.
  - Usage: Outputs are displayed in the CLI (`terraform output`), exported to JSON (`terraform output -json`), used in CI/CD pipelines, or referenced by other modules via remote state.
  - Best Practices: Use descriptive names and descriptions, avoid sensitive data in outputs, and capture outputs in pipelines for automation.
- variables.tf: Declares input variables with types, defaults, and validations to parameterize configurations.
- terraform.tfvars: Provides default variable values, overridden by environment-specific `.tfvars` files or CLI flags.
- backend.tf: Configures the remote state backend, specifying where and how state is stored and locked.
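A minimal sketch of bringing a manually created resource under management with `terraform import`, as mentioned in the State Drift bullet above. The bucket name and resource address are hypothetical, matching the dev data-lake module shown later in this cheatsheet:

# Import a manually created bucket into the module's state,
# then confirm the configuration matches reality.
cd environments/dev
terraform import module.data_lake.aws_s3_bucket.bucket dev-data-lake
terraform plan   # should report no changes once code and reality agree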
3. Repository Folder Structure¶
A well-organized repository structure ensures clarity, modularity, and scalability for managing data infrastructure across multiple environments.
data-infra/
├── modules/ # Reusable Terraform modules
│ ├── s3_bucket/ # S3 bucket module
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── rds/ # RDS module
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── databricks/ # Databricks module
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ └── glue_job/ # Glue job module
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
├── environments/ # Environment-specific configurations
│ ├── dev/
│ │ ├── main.tf # Dev-specific resources
│ │ ├── variables.tf
│ │ ├── outputs.tf # Environment-specific outputs
│ │ ├── backend.tf
│ │ └── terraform.tfvars # Dev variable values
│ ├── staging/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ ├── outputs.tf
│ │ ├── backend.tf
│ │ └── terraform.tfvars
│ └── prod/
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ ├── backend.tf
│ └── terraform.tfvars
├── scripts/ # Supporting scripts (e.g., PySpark for Glue)
│ └── transform.py
├── .gitignore # Git ignore file
├── README.md # Project documentation
└── .gitlab-ci.yml # CI/CD pipeline configuration
Explanation¶
- modules/: Contains reusable modules for resources like S3, RDS, Databricks, and Glue jobs, each with its own `outputs.tf`.
- environments/: Separates configurations for `dev`, `staging`, and `prod`, each with its own `main.tf`, `variables.tf`, `outputs.tf`, `backend.tf`, and `terraform.tfvars`, providing isolation without workspaces.
- scripts/: Stores scripts (e.g., PySpark scripts for Glue jobs) uploaded to S3.
- .gitignore: Ignores `.terraform/`, `*.tfstate`, `*.tfstate.backup`, and sensitive files like `*.tfvars` with secrets.
- README.md: Documents setup instructions, module usage, CI/CD pipeline details, artifact review, rollback procedures, output management, and environment isolation.
4. Basic Commands¶
terraform init # Initialize project, download providers
terraform plan # Preview changes
terraform apply # Apply changes
terraform destroy # Tear down infrastructure
terraform state list # View managed resources
terraform state show <resource> # Inspect resource state
terraform output # Display all output values
terraform output -json # Export outputs as JSON
# Note: The following workspace commands are included for reference but are not used in this project, which uses folder-based environment isolation instead.
terraform workspace list # List workspaces (not used)
terraform workspace select <env> # Switch to environment (not used)
terraform workspace new <env> # Create new workspace (not used)
5. Provider Configuration¶
# environments/dev/main.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
snowflake = {
source = "Snowflake-Labs/snowflake"
version = "~> 0.76"
}
}
}
provider "aws" {
region = var.region
assume_role {
role_arn = "arn:aws:iam::${var.account_id}:role/TerraformRole"
}
}
provider "snowflake" {
account = var.snowflake_account
role = "ACCOUNTADMIN"
}
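The Databricks module referenced later needs its own provider block alongside AWS and Snowflake. A minimal sketch, assuming the `databricks/databricks` provider source and token-based authentication; the variable names are assumptions, and the token should come from a CI/CD variable or secrets manager, not a committed file:

# Add to required_providers:
#   databricks = {
#     source  = "databricks/databricks"
#     version = "~> 1.0"
#   }
provider "databricks" {
  host  = var.databricks_host   # e.g., the workspace URL
  token = var.databricks_token  # sensitive; injected via CI/CD variable
}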
6. Multi-Environment Configuration¶
Variables¶
# environments/dev/variables.tf
variable "region" {
type = string
default = "us-east-1"
}
variable "account_id" {
type = string
}
variable "environment" {
type = string
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be dev, staging, or prod."
}
}
Environment-Specific .tfvars¶
# environments/dev/terraform.tfvars
region = "us-east-1"
account_id = "123456789012"
environment = "dev"
Applying Environment Config¶
cd environments/dev
terraform init
terraform apply -var-file=terraform.tfvars
7. Remote State Management¶
# environments/dev/backend.tf
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "data-infra/dev/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-locks"
}
}
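The state bucket and lock table referenced above must exist before any environment can run `terraform init`. A one-time bootstrap sketch, applied from a separate minimal configuration; the names match the backend block above, and the billing mode is an assumption:

resource "aws_s3_bucket" "tf_state" {
  bucket = "my-terraform-state"
}

# Versioning protects against accidental state corruption or deletion.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# DynamoDB table used by the S3 backend for state locking; the hash key must be "LockID".
resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}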
8. Module Example: S3 Bucket¶
# modules/s3_bucket/main.tf
resource "aws_s3_bucket" "bucket" {
bucket = "${var.environment}-${var.bucket_name}"
acl = "private"
versioning {
enabled = true
}
tags = {
Environment = var.environment
}
}
# modules/s3_bucket/variables.tf
variable "bucket_name" {
type = string
}
variable "environment" {
type = string
}
# modules/s3_bucket/outputs.tf
output "bucket_arn" {
value = aws_s3_bucket.bucket.arn
description = "ARN of the S3 bucket"
}
output "bucket_name" {
value = aws_s3_bucket.bucket.id
description = "Name of the S3 bucket"
}
Using the Module¶
# environments/dev/main.tf
module "data_lake" {
source = "../../modules/s3_bucket"
bucket_name = "data-lake"
environment = var.environment
}
# environments/dev/outputs.tf
output "data_lake_arn" {
value = module.data_lake.bucket_arn
description = "ARN of the data lake S3 bucket"
}
9. Data Infrastructure Examples¶
Snowflake Database¶
# Fetch existing Snowflake role
data "snowflake_role" "admin" {
name = "ACCOUNTADMIN"
}
resource "snowflake_database" "data_warehouse" {
name = "${var.environment}_DATA_WAREHOUSE"
}
resource "snowflake_schema" "raw" {
database = snowflake_database.data_warehouse.name
name = "RAW"
}
# outputs.tf
output "snowflake_database_name" {
value = snowflake_database.data_warehouse.name
description = "Name of the Snowflake database"
}
Databricks Cluster¶
# Fetch existing Databricks workspace
data "databricks_spark_version" "latest" {
latest = true
}
resource "databricks_cluster" "data_processing" {
cluster_name = "${var.environment}-data-processing"
spark_version = data.databricks_spark_version.latest.id
node_type_id = var.environment == "prod" ? "i3.2xlarge" : "i3.xlarge"
autotermination_minutes = 30
}
# outputs.tf
output "databricks_cluster_id" {
value = databricks_cluster.data_processing.cluster_id
description = "ID of the Databricks cluster"
}
AWS Glue Crawler¶
# Fetch existing Glue database
data "aws_glue_catalog_database" "main" {
name = "${var.environment}_catalog"
}
resource "aws_glue_crawler" "s3_crawler" {
name = "${var.environment}-s3-crawler"
database_name = data.aws_glue_catalog_database.main.name
role = aws_iam_role.glue_role.arn
s3_target {
path = "s3://${module.data_lake.bucket_name}"
}
}
# outputs.tf
output "glue_crawler_name" {
value = aws_glue_crawler.s3_crawler.name
description = "Name of the Glue crawler"
}
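The crawler above references `aws_iam_role.glue_role`, which is not defined elsewhere in this cheatsheet. A hedged sketch of what that role could look like; the role name and the AWS-managed `AWSGlueServiceRole` policy are assumptions, and S3 permissions should be scoped down in practice:

resource "aws_iam_role" "glue_role" {
  name = "${var.environment}-glue-role"

  # Allow the Glue service to assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "glue.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}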
AWS RDS Database¶
# Fetch existing VPC and subnets
data "aws_vpc" "default" {
default = true
}
data "aws_subnets" "default_subnets" {
filter {
name = "vpc-id"
values = [data.aws_vpc.default.id]
}
}
# modules/rds/main.tf
resource "aws_db_subnet_group" "rds_subnet_group" {
name = "${var.environment}-rds-subnet-group"
subnet_ids = var.subnet_ids
tags = {
Environment = var.environment
}
}
resource "aws_db_instance" "rds_instance" {
identifier = "${var.environment}-${var.db_name}"
engine = var.db_engine
engine_version = var.engine_version
instance_class = var.environment == "prod" ? var.prod_instance_class : var.dev_instance_class
allocated_storage = var.allocated_storage
username = var.db_username
password = var.db_password
db_subnet_group_name = aws_db_subnet_group.rds_subnet_group.name
vpc_security_group_ids = var.security_group_ids
multi_az = var.environment == "prod" ? true : false
backup_retention_period = var.environment == "prod" ? 7 : 1
skip_final_snapshot = var.environment != "prod"
tags = {
Environment = var.environment
}
}
# modules/rds/variables.tf
variable "db_name" {
type = string
}
variable "environment" {
type = string
}
variable "db_engine" {
type = string
default = "postgres"
}
variable "engine_version" {
type = string
default = "15.3"
}
variable "prod_instance_class" {
type = string
default = "db.m5.large"
}
variable "dev_instance_class" {
type = string
default = "db.t3.micro"
}
variable "allocated_storage" {
type = number
default = 20
}
variable "db_username" {
type = string
}
variable "db_password" {
type = string
sensitive = true
}
variable "subnet_ids" {
type = list(string)
}
variable "security_group_ids" {
type = list(string)
}
# modules/rds/outputs.tf
output "rds_endpoint" {
value = aws_db_instance.rds_instance.endpoint
description = "Endpoint of the RDS instance"
}
# Using the RDS Module
# environments/dev/main.tf
module "rds_postgres" {
source = "../../modules/rds"
db_name = "datawarehouse"
environment = var.environment
db_engine = "postgres"
engine_version = "15.3"
prod_instance_class = "db.m5.large"
dev_instance_class = "db.t3.micro"
allocated_storage = 100
db_username = var.db_username
db_password = var.db_password
subnet_ids = data.aws_subnets.default_subnets.ids
security_group_ids = [aws_security_group.rds_sg.id]
}
# environments/dev/outputs.tf
output "rds_postgres_endpoint" {
value = module.rds_postgres.rds_endpoint
description = "Endpoint of the RDS PostgreSQL instance"
}
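The best practices section recommends keeping `db_password` out of `.tfvars`. A minimal sketch, assuming the password is stored in AWS Secrets Manager under an environment-scoped secret name (the name is hypothetical):

# Read the secret at plan/apply time; the value never lives in the repository.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "${var.environment}/rds/db_password"
}

Pass `data.aws_secretsmanager_secret_version.db_password.secret_string` as the `db_password` argument in the `rds_postgres` module call above instead of `var.db_password`.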
AWS Glue PySpark Job¶
# Fetch S3 bucket from another configuration
data "terraform_remote_state" "data_lake" {
backend = "s3"
config = {
bucket = "my-terraform-state"
key = "data-infra/${var.environment}/terraform.tfstate"
region = "us-east-1"
}
}
# modules/glue_job/main.tf
resource "aws_glue_job" "pyspark_job" {
name = "${var.environment}-${var.job_name}"
role_arn = var.glue_role_arn
glue_version = "4.0"
worker_type = var.environment == "prod" ? "G.2X" : "G.1X"
number_of_workers = var.environment == "prod" ? 10 : 2
max_retries = 1
command {
script_location = "s3://${var.script_bucket}/${var.script_path}"
python_version = "python3"
name = "glueetl"
}
default_arguments = {
"--job-name" = "${var.environment}-${var.job_name}"
"--TempDir" = "s3://${var.temp_bucket}/temp/"
"--enable-metrics" = "true"
"--enable-continuous-log" = "true"
"--environment" = var.environment
}
tags = {
Environment = var.environment
}
}
# modules/glue_job/variables.tf
variable "job_name" {
type = string
}
variable "environment" {
type = string
}
variable "glue_role_arn" {
type = string
}
variable "script_bucket" {
type = string
}
variable "script_path" {
type = string
}
variable "temp_bucket" {
type = string
}
# modules/glue_job/outputs.tf
output "glue_job_name" {
value = aws_glue_job.pyspark_job.name
description = "Name of the Glue PySpark job"
}
# Using the Glue Job Module
# environments/dev/main.tf
module "glue_job" {
source = "../../modules/glue_job"
job_name = "data-transform"
environment = var.environment
glue_role_arn = aws_iam_role.glue_role.arn
script_bucket = data.terraform_remote_state.data_lake.outputs.bucket_name
script_path = "scripts/transform.py"
temp_bucket = data.terraform_remote_state.data_lake.outputs.bucket_name
}
# environments/dev/outputs.tf
output "glue_job_name" {
value = module.glue_job.glue_job_name
description = "Name of the Glue job"
}
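The job's `script_location` points at `scripts/transform.py` in S3. One way to keep that object in sync with the repository is to let Terraform upload it; a sketch, assuming the same data-lake bucket holds the scripts and the relative path from `environments/dev` shown here:

# Upload the committed PySpark script so the Glue job always runs the reviewed version.
resource "aws_s3_object" "transform_script" {
  bucket = data.terraform_remote_state.data_lake.outputs.bucket_name
  key    = "scripts/transform.py"
  source = "${path.root}/../../scripts/transform.py"
  etag   = filemd5("${path.root}/../../scripts/transform.py")  # re-upload when the file changes
}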
10. GitLab CI/CD Pipeline for Terraform¶
A GitLab CI/CD pipeline automates validation, planning, and deployment of Terraform configurations across environments, with artifacts stored for review, auditing, and output capture. This project uses folder-based environment isolation (folders like `environments/dev`, `environments/staging`, etc.) instead of Terraform workspaces for clarity and flexibility.
Terraform Artifacts¶
Terraform artifacts are files generated during pipeline jobs (e.g., `terraform plan`, `terraform apply`, or `terraform output`) and saved in GitLab for review, auditing, or downstream use. They include:
- Plan Artifacts:
  - `tfplan`: Binary plan file from `terraform plan -out=tfplan`, used by `terraform apply`.
  - `tfplan.json`: JSON representation (`terraform show -json tfplan`) for programmatic analysis.
  - `tfplan.txt`: Human-readable text (`terraform show tfplan`) for manual review.
- Apply Artifacts:
  - `apply.log`: Log file capturing `terraform apply` output, detailing infrastructure changes.
- Output Artifacts:
  - `outputs.json`: JSON file capturing `terraform output -json`, containing dynamically generated values (e.g., S3 bucket ARNs, RDS endpoints).
- Purpose:
  - Review: Plan artifacts for verifying changes before applying.
  - Audit: Apply logs and outputs for compliance and debugging.
  - Consistency: The binary `tfplan` ensures `apply` matches the reviewed plan.
  - Automation: Outputs enable integration with other systems or modules.
- Retention:
  - GitLab artifacts can be stored for a specified period (e.g., 1 year or forever, subject to the GitLab plan).
  - For long-term audit, upload artifacts to S3 with custom retention policies.
Capturing Terraform Outputs¶
Outputs defined in `outputs.tf` are captured in the pipeline to expose dynamically generated values for review, automation, or integration with other systems.
Steps to Capture Outputs¶
- Generate Outputs: Run `terraform output -json > outputs.json` in the `apply` job to save outputs as a JSON file.
- Store as Artifacts: Include `outputs.json` in the `artifacts` section of `.gitlab-ci.yml`.
- Review Outputs: Download `outputs.json` from the GitLab pipeline UI or parse it programmatically (e.g., `jq '.s3_bucket_arn.value' outputs.json`).
- Use Outputs: Reference `outputs.json` in downstream jobs (e.g., to configure applications) or upload it to S3 for long-term storage.
Example Workflow¶
- After `apply_dev` runs, `outputs.json` is generated, containing values like `data_lake_arn` and `rds_postgres_endpoint`.
- The team downloads `outputs.json` to verify resource endpoints.
- Outputs are uploaded to S3 for audit or used in another job to configure a data pipeline (a parsing sketch follows).
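A sketch of how a downstream job or local script might consume the `outputs.json` artifact with `jq`; the output names match the dev examples above, and the artifact path assumes the `apply_dev` job layout used below:

# terraform output -json produces: { "<name>": { "value": ..., "type": ..., "sensitive": ... } }
DATA_LAKE_ARN=$(jq -r '.data_lake_arn.value' environments/dev/outputs.json)
RDS_ENDPOINT=$(jq -r '.rds_postgres_endpoint.value' environments/dev/outputs.json)
echo "Data lake: ${DATA_LAKE_ARN}, RDS endpoint: ${RDS_ENDPOINT}"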
Pipeline Configuration¶
# .gitlab-ci.yml
stages:
- validate
- plan
- apply
- backup
variables:
AWS_REGION: "us-east-1"
TF_VAR_account_id: "123456789012"
# Cache Terraform plugins
cache:
key: ${CI_PROJECT_ID}
paths:
- .terraform/
- environments/dev/.terraform/
- environments/staging/.terraform/
- environments/prod/.terraform/
# Base job template
.terraform_base:
image: hashicorp/terraform:1.5.7
before_script:
- terraform --version
- apk add --no-cache aws-cli
- aws configure set aws_access_key_id $AWS_ACCESS_KEY_ID
- aws configure set aws_secret_access_key $AWS_SECRET_ACCESS_KEY
- aws configure set region $AWS_REGION
# Validate Terraform configuration
validate:
stage: validate
extends: .terraform_base
script:
- cd environments/dev
- terraform init -backend-config=backend.tf
- terraform validate
- cd ../staging
- terraform init -backend-config=backend.tf
- terraform validate
- cd ../prod
- terraform init -backend-config=backend.tf
- terraform validate
# Plan for Dev
plan_dev:
stage: plan
extends: .terraform_base
script:
- cd environments/dev
- terraform init -backend-config=backend.tf
- terraform plan -var-file=terraform.tfvars -out=tfplan
- terraform show -json tfplan > tfplan.json
- terraform show tfplan > tfplan.txt
artifacts:
paths:
- environments/dev/tfplan
- environments/dev/tfplan.json
- environments/dev/tfplan.txt
expire_in: 1 week
only:
- main
- merge_requests
# Plan for Staging
plan_staging:
stage: plan
extends: .terraform_base
script:
- cd environments/staging
- terraform init -backend-config=backend.tf
- terraform plan -var-file=terraform.tfvars -out=tfplan
- terraform show -json tfplan > tfplan.json
- terraform show tfplan > tfplan.txt
artifacts:
paths:
- environments/staging/tfplan
- environments/staging/tfplan.json
- environments/staging/tfplan.txt
expire_in: 1 week
only:
- main
- merge_requests
# Plan for Prod
plan_prod:
stage: plan
extends: .terraform_base
script:
- cd environments/prod
- terraform init -backend-config=backend.tf
- terraform plan -var-file=terraform.tfvars -out=tfplan
- terraform show -json tfplan > tfplan.json
- terraform show tfplan > tfplan.txt
artifacts:
paths:
- environments/prod/tfplan
- environments/prod/tfplan.json
- environments/prod/tfplan.txt
expire_in: 1 week
only:
- main
- merge_requests
# Apply for Dev
apply_dev:
stage: apply
extends: .terraform_base
script:
- cd environments/dev
- terraform init -backend-config=backend.tf
- terraform apply -auto-approve tfplan > apply.log 2>&1
- terraform output -json > outputs.json
- aws s3 cp apply.log s3://my-audit-logs/terraform-apply/dev/$(date +%Y-%m-%d_%H-%M-%S).log || true
- aws s3 cp outputs.json s3://my-audit-logs/terraform-outputs/dev/$(date +%Y-%m-%d_%H-%M-%S).json || true
artifacts:
paths:
- environments/dev/apply.log
- environments/dev/outputs.json
expire_in: 1 year
dependencies:
- plan_dev
when: manual
only:
- main
# Apply for Staging
apply_staging:
stage: apply
extends: .terraform_base
script:
- cd environments/staging
- terraform init -backend-config=backend.tf
- terraform apply -auto-approve tfplan > apply.log 2>&1
- terraform output -json > outputs.json
- aws s3 cp apply.log s3://my-audit-logs/terraform-apply/staging/$(date +%Y-%m-%d_%H-%M-%S).log || true
- aws s3 cp outputs.json s3://my-audit-logs/terraform-outputs/staging/$(date +%Y-%m-%d_%H-%M-%S).json || true
artifacts:
paths:
- environments/staging/apply.log
- environments/staging/outputs.json
expire_in: 1 year
dependencies:
- plan_staging
when: manual
only:
- main
# Apply for Prod
apply_prod:
stage: apply
extends: .terraform_base
script:
- cd environments/prod
- terraform init -backend-config=backend.tf
- terraform apply -auto-approve tfplan > apply.log 2>&1
- terraform output -json > outputs.json
- aws s3 cp apply.log s3://my-audit-logs/terraform-apply/prod/$(date +%Y-%m-%d_%H-%M-%S).log || true
- aws s3 cp outputs.json s3://my-audit-logs/terraform-outputs/prod/$(date +%Y-%m-%d_%H-%M-%S).json || true
artifacts:
paths:
- environments/prod/apply.log
- environments/prod/outputs.json
expire_in: 1 year
dependencies:
- plan_prod
when: manual
only:
- main
environment:
name: production
# Backup State for Dev
backup_state_dev:
  stage: backup
  extends: .terraform_base
  script:
    - cd environments/dev
    - terraform init -backend-config=backend.tf
    - aws s3 cp s3://my-terraform-state/data-infra/dev/terraform.tfstate s3://my-terraform-state-backup/data-infra/dev/$(date +%Y-%m-%d_%H-%M-%S).tfstate || true
  when: always
  only:
    - main
  dependencies:
    - apply_dev
# Backup State for Staging
backup_state_staging:
stage: backup
extends: .terraform_base
script:
- cd environments/staging
- terraform init -backend-config=backend.tf
- aws s3 cp s3://my-terraform-state/data-infra/staging/terraform.tfstate s3://my-terraform-state-backup/data-infra/staging/$(date +%Y-%m-%d_%H-%M-%S).tfstate || true
when: always
only:
- main
dependencies:
- apply_staging
# Backup State for Prod
backup_state_prod:
stage: backup
extends: .terraform_base
script:
- cd environments/prod
- terraform init -backend-config=backend.tf
- aws s3 cp s3://my-terraform-state/data-infra/prod/terraform.tfstate s3://my-terraform-state-backup/data-infra/prod/$(date +%Y-%m-%d_%H-%M-%S).tfstate || true
when: always
only:
- main
dependencies:
- apply_prod
Reviewing Plan, Apply, and Output Artifacts¶
Plan artifacts (`tfplan`, `tfplan.json`, `tfplan.txt`), apply logs (`apply.log`), and output files (`outputs.json`) are generated during the `plan` and `apply` stages and stored in GitLab for review, auditing, and automation.
Steps to Review¶
- Access Artifacts: In GitLab, navigate to the pipeline or merge request, go to the `plan_dev`, `plan_staging`, `plan_prod`, `apply_dev`, `apply_staging`, or `apply_prod` job, and download the artifacts.
- Review Plan Artifacts:
  - Open `tfplan.txt` for a human-readable summary of proposed changes.
  - Use `tfplan.json` for programmatic analysis with tools like `jq` (e.g., `jq '.resource_changes[] | {address: .address, change: .change.actions}' tfplan.json`).
- Review Apply Logs:
  - Open `apply.log` to review the changes applied, including resources created, updated, or deleted.
  - For long-term audit, retrieve logs from S3: `aws s3 cp s3://my-audit-logs/terraform-apply/<env>/<timestamp>.log .`
- Review Output Artifacts:
  - Open `outputs.json` to inspect dynamically generated values (e.g., S3 bucket ARNs, RDS endpoints).
  - Parse with `jq` (e.g., `jq '.data_lake_arn.value' outputs.json`).
  - For audit or integration, retrieve from S3: `aws s3 cp s3://my-audit-logs/terraform-outputs/<env>/<timestamp>.json .`
- Team Review:
  - Share artifact links in merge request comments or integrate with notification tools (e.g., Slack) to alert reviewers.
  - Require approval from team members before triggering `apply` jobs.
- Apply Changes: After approving plan artifacts, trigger the corresponding `apply` job, which uses the `tfplan` artifact, generates `apply.log`, and captures `outputs.json`.
Example Workflow¶
- A developer pushes a change to the `main` branch, triggering `plan_dev`.
- The team downloads `environments/dev/tfplan.txt` to review changes (e.g., a new S3 bucket).
- They check `tfplan.json` with `jq` to verify compliance (e.g., correct instance types).
- After approval, a team member triggers `apply_dev`, generating `apply.log` and `outputs.json`, which are stored as artifacts and uploaded to S3 for audit.
- The team reviews `outputs.json` to confirm resource endpoints (e.g., `data_lake_arn`).
11. Best Practices (General)¶
- Modularize: Break infrastructure into reusable modules (e.g., S3, RDS, Databricks, Glue).
- Version Control: Store Terraform code in Git, with separate branches for environments.
- State Security: Use remote backends with encryption and locking (e.g., S3 + DynamoDB).
- Least Privilege: Use IAM roles with minimal permissions for Terraform execution.
- Tagging: Enforce consistent tagging for cost allocation and resource tracking.
- DRY Principle: Use variables, modules, and `for_each` to avoid duplication.
- Testing: Use `terraform plan` to validate changes; consider tools like `terratest`.
- CI/CD Integration: Integrate with GitLab CI/CD for automated deployments.
- Environment Isolation: Use separate folders or state files for `dev`, `staging`, and `prod` instead of workspaces for better clarity and flexibility.
- RDS-Specific Best Practices:
  - Use `multi_az` for production to ensure high availability.
  - Store `db_password` in AWS Secrets Manager and reference it via `data` sources.
  - Enable automated backups with appropriate retention periods.
  - Use parameter groups to tune database performance per environment.
- Glue-Specific Best Practices:
  - Store PySpark scripts in S3 and reference them in `script_location`.
  - Use environment-specific worker types and counts to optimize cost and performance.
  - Enable CloudWatch metrics and logs for job monitoring.
  - Use IAM roles with specific permissions for S3, CloudWatch, and other services.
12. Best Practices for CI/CD Pipelines with IaC¶
- Separate Environments: Use distinct jobs for each environment (`dev`, `staging`, `prod`) to prevent cross-contamination. This project uses separate folders (`environments/<env>`) instead of Terraform workspaces for clear isolation and environment-specific configurations.
- Manual Approvals: Require manual triggers for `apply` jobs, especially for production, to avoid unintended changes.
- State Management: Use remote backends with locking (e.g., S3 with DynamoDB) to prevent concurrent state modifications.
- Secrets Management: Store sensitive data (e.g., AWS credentials, DB passwords) in GitLab CI/CD variables or a secrets manager like AWS Secrets Manager.
- Linting and Validation: Run `terraform validate` and tools like `tflint` in the pipeline to catch errors early (a lint job sketch follows this list).
- Plan Artifacts: Store `terraform plan` outputs (`tfplan`, `tfplan.json`, `tfplan.txt`) for review before applying, with JSON for automation and text for human-readable review.
- Apply Log Retention: Store `terraform apply` logs (`apply.log`) as GitLab artifacts (e.g., for 1 year) or upload them to S3 (e.g., `my-audit-logs`) for long-term audit retention, with encryption and access controls to meet compliance requirements.
- Output Management: Capture `terraform output -json` as `outputs.json` artifacts for review, audit, or integration with other systems. Avoid sensitive data in outputs and use secure storage (e.g., S3) for long-term retention.
- Data Source Management:
  - Minimize data source queries to reduce API calls and improve pipeline performance.
  - Use precise filters (e.g., tags, IDs) to avoid ambiguous results.
  - Handle missing resources with `count` or `try` to prevent pipeline failures.
  - Validate data source resolution in the `plan` stage to catch errors early.
  - Document data source dependencies in `README.md` or comments in `.tf` files.
- Secure Artifacts: Restrict access to artifacts to authorized users to protect sensitive data in plan, apply, log, and output files.
- Role-Based Access: Use least-privilege IAM roles for CI/CD runners, scoped to specific environments.
- Testing: Integrate testing tools like `terratest` or `checkov` for unit, integration, and compliance testing of Terraform modules and plans.
- Pipeline Triggers: Run pipelines on merge requests and main branch pushes to catch issues early.
- Version Pinning: Pin Terraform and provider versions in the pipeline to avoid breaking changes.
- Rollback Strategy:
  - State Backups: Regularly back up `terraform.tfstate` to a separate S3 bucket after `apply` to enable recovery from failures or unintended changes. Use versioning on the backup bucket for additional protection.
  - Terraform Destroy: Document procedures for `terraform destroy` to remove all managed resources in an environment, including prerequisites (e.g., emptying S3 buckets) and post-cleanup steps (e.g., removing state files).
  - Manual Rollback: Maintain documentation for manual rollback of specific resources (e.g., reverting RDS instance types, deleting Glue jobs) when `destroy` is too destructive, including steps to restore from backups or previous state files.
  - Implementation: Automate state backups in the CI/CD pipeline post-`apply`. Include rollback instructions in `README.md` or a dedicated `ROLLBACK.md`.
- Monitoring and Logging: Enable verbose logging (`TF_LOG=DEBUG`) for debugging and monitor pipeline logs in GitLab.
- Code Reviews: Require peer reviews for merge requests to ensure quality and catch potential issues.
- Documentation: Include pipeline setup, artifact review, log retention, rollback procedures, output management, and data source usage instructions in `README.md`.
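A minimal sketch of the lint job mentioned in the linting bullet above; the tflint image, entrypoint override, and flags are assumptions and should be pinned to the version the team uses:

lint:
  stage: validate
  image:
    name: ghcr.io/terraform-linters/tflint:latest
    entrypoint: [""]
  script:
    - tflint --init
    - tflint --recursive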
Rollback Strategy Details¶
State Backups¶
- Why: The `terraform.tfstate` file tracks resource mappings, which is critical for recovery after a failed `apply` or unintended changes.
- How:
  - Configure the pipeline to copy `terraform.tfstate` to a backup S3 bucket post-`apply` (see the `backup_state_*` jobs).
  - Enable versioning on the backup bucket (`my-terraform-state-backup`) to retain historical states.
  - Example: `aws s3 cp s3://my-terraform-state/data-infra/dev/terraform.tfstate s3://my-terraform-state-backup/data-infra/dev/$(date +%Y-%m-%d_%H-%M-%S).tfstate`
- Restore Process:
  - Identify the backup state file (e.g., via the S3 console or `aws s3 ls`).
  - Copy it to the active state location: `aws s3 cp s3://my-terraform-state-backup/data-infra/dev/<timestamp>.tfstate s3://my-terraform-state/data-infra/dev/terraform.tfstate`.
  - Run `terraform plan` to verify alignment with infrastructure.
  - Apply changes if needed: `terraform apply`.
- Best Practices:
  - Encrypt backups with S3 server-side encryption (see the sketch below).
  - Restrict access via IAM policies.
  - Schedule regular backup jobs or trigger them post-`apply`.
  - Test restoration periodically to ensure reliability.
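A sketch of the versioning and encryption settings for the backup bucket, assuming the `my-terraform-state-backup` name used in the backup jobs and the separate-resource syntax of AWS provider v4+:

resource "aws_s3_bucket_versioning" "state_backup" {
  bucket = "my-terraform-state-backup"

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state_backup" {
  bucket = "my-terraform-state-backup"

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}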
Terraform Destroy¶
- Why: `terraform destroy` removes all Terraform-managed resources in an environment, useful for a complete rollback.
- How:
  - Run `terraform destroy` in the environment directory (e.g., `environments/dev`).
  - Example: `cd environments/dev; terraform init; terraform destroy -var-file=terraform.tfvars`.
- Prerequisites:
  - Empty S3 buckets (delete objects or disable versioning).
  - Remove dependent resources not managed by Terraform (e.g., manually created RDS snapshots).
  - Verify state file integrity before execution.
- Post-Cleanup:
  - Remove the state file from S3: `aws s3 rm s3://my-terraform-state/data-infra/dev/terraform.tfstate`.
  - Update the lock table (e.g., DynamoDB) if necessary.
- Documentation:
  - Create a `ROLLBACK.md` with steps:
    # Terraform Destroy Rollback
    1. Navigate to the environment: `cd environments/<env>`.
    2. Initialize: `terraform init -backend-config=backend.tf`.
    3. Empty S3 buckets: `aws s3 rm s3://<bucket> --recursive`.
    4. Run destroy: `terraform destroy -var-file=terraform.tfvars`.
    5. Remove state: `aws s3 rm s3://my-terraform-state/data-infra/<env>/terraform.tfstate`.
- Risks:
  - Destructive; unsuitable for partial rollbacks.
  - Requires careful validation to avoid data loss.
Manual Rollback¶
- Why: For targeted rollbacks when `destroy` is too aggressive (e.g., reverting an RDS instance type change).
- How:
  - Modify the Terraform code to revert changes (e.g., set `instance_class` back to its original value).
  - Use a previous state backup to restore resource mappings.
- Example (RDS rollback):
  - Restore state: `aws s3 cp s3://my-terraform-state-backup/data-infra/dev/<timestamp>.tfstate s3://my-terraform-state/data-infra/dev/terraform.tfstate`.
  - Update `main.tf` or variables (e.g., `instance_class = "db.t3.micro"`).
  - Run `terraform plan` and `terraform apply`.
- Specific Cases:
  - S3: Delete objects or restore from versioning.
  - RDS: Revert instance type, restore from snapshot, or adjust parameters.
  - Glue: Delete jobs or revert configurations via Terraform.
  - Snowflake: Drop schemas/tables or restore from Time Travel.
- Documentation:
  - In `ROLLBACK.md`, detail resource-specific rollback steps:
    ## Manual Rollback
    ### RDS Example
    1. Restore state: `aws s3 cp s3://my-terraform-state-backup/...`.
    2. Update `instance_class` in `modules/rds/main.tf`.
    3. Run `terraform plan` and `terraform apply`.
    ### S3 Bucket
    1. Delete objects: `aws s3 rm s3://<bucket> --recursive`.
    2. Update Terraform to remove the bucket or adjust policies.
- Best Practices:
  - Test rollbacks in `dev` before applying to production (a targeted-apply sketch follows this list).
  - Maintain backups (e.g., RDS snapshots) before applying changes.
  - Log all rollback actions for audit purposes.
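For the targeted rollbacks described above, a hedged sketch using `-target` to limit the change to a single module after reverting its configuration in code; resource targeting is intended for exceptional cases like this, not routine applies:

# Revert the module's configuration in code first, then apply only that module.
cd environments/dev
terraform plan -var-file=terraform.tfvars -target=module.rds_postgres -out=rollback.tfplan
terraform apply rollback.tfplan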
13. Common Gotchas¶
- State Drift: Use `terraform refresh` to sync state with actual infrastructure.
- Provider Versions: Pin versions to avoid breaking changes.
- Resource Dependencies: Use `depends_on` or implicit dependencies (e.g., referencing outputs).
- Sensitive Data: Store secrets in AWS Secrets Manager or HashiCorp Vault, not `.tfvars`.
- Rate Limits: Handle cloud provider API limits with `terraform plan` retries or delays.
- RDS-Specific Gotchas:
  - Changing `allocated_storage` or `instance_class` may cause downtime.
  - Ensure `subnet_ids` are in the same VPC as `security_group_ids`.
  - Avoid `skip_final_snapshot` in production to prevent data loss.
- Glue-Specific Gotchas:
  - Ensure the Glue IAM role has permissions for the S3 script and temp directories.
  - Verify `glue_version` compatibility with PySpark script dependencies.
  - Monitor job execution time to avoid unexpected costs in production.
- CI/CD-Specific Gotchas:
  - Ensure CI/CD runners have correct permissions for each environment.
  - Avoid caching sensitive state files or logs in CI/CD pipelines.
  - Test pipeline changes in a non-production branch first.
- Output-Specific Gotchas:
  - Avoid outputting sensitive data (e.g., database passwords) in `outputs.tf`.
  - Outputs are only updated after `terraform apply` or `terraform refresh` if the state changes.
  - Outputs missing from `outputs.tf` can't be retrieved without redefining them.
- Data Source-Specific Gotchas:
  - Missing resources cause errors unless handled (e.g., with `count` or `try`).
  - Ambiguous data sources (e.g., no filters) may return multiple results, causing failures.
  - Excessive API queries from data sources can hit provider rate limits, slowing pipelines.
  - Ensure provider versions support the required data source attributes.
14. Advanced Features¶
- Dynamic Blocks: Generate repetitive nested configuration. With AWS provider v4+, lifecycle rules live on a separate resource:
  resource "aws_s3_bucket_lifecycle_configuration" "bucket" {
    bucket = aws_s3_bucket.bucket.id

    dynamic "rule" {
      for_each = var.lifecycle_rules
      content {
        id     = rule.value.id
        status = "Enabled"
        filter {}

        expiration {
          days = rule.value.days
        }
      }
    }
  }
- Count and For_Each: Create multiple resources dynamically.
  resource "aws_s3_bucket" "buckets" {
    for_each = var.bucket_names
    bucket   = "${each.key}-${var.environment}"
  }
- Data Sources:
  - Definition: Query existing resources or external data without managing them, to retrieve attributes (e.g., VPC IDs, AMI IDs, database names).
  - Syntax:
    data "<PROVIDER>_<TYPE>" "<NAME>" {
      [CONFIG]
    }
    Example: Fetch a VPC:
    data "aws_vpc" "default" {
      default = true
    }
  - Use Cases:
    - Reference existing infrastructure (e.g., VPCs, IAM roles).
    - Fetch dynamic values (e.g., latest AMI ID, availability zones).
    - Integrate with other Terraform state files (e.g., `terraform_remote_state`).
    - Query external APIs or scripts (e.g., the `external` provider).
  - Examples:
    - AWS: Fetch subnets for an RDS instance:
      data "aws_vpc" "default" {
        default = true
      }
      data "aws_subnets" "default_subnets" {
        filter {
          name   = "vpc-id"
          values = [data.aws_vpc.default.id]
        }
      }
      resource "aws_db_subnet_group" "rds" {
        name       = "rds-subnet-group"
        subnet_ids = data.aws_subnets.default_subnets.ids
      }
      resource "aws_db_instance" "rds" {
        # engine, instance class, and credentials omitted for brevity
        db_subnet_group_name   = aws_db_subnet_group.rds.name
        vpc_security_group_ids = [aws_security_group.rds.id]
      }
    - Remote State: Access another project's outputs:
      data "terraform_remote_state" "network" {
        backend = "s3"
        config = {
          bucket = "my-terraform-state"
          key    = "networking/terraform.tfstate"
          region = "us-east-1"
        }
      }
      resource "aws_s3_bucket" "bucket" {
        bucket = "${var.environment}-data-lake"
        tags = {
          VPC = data.terraform_remote_state.network.outputs.vpc_id
        }
      }
    - External Data: Fetch data from a script:
      data "external" "config" {
        program = ["bash", "${path.module}/get_config.sh"]
      }
      resource "aws_s3_bucket" "bucket" {
        bucket = "${var.environment}-${data.external.config.result.bucket_suffix}"
      }
  - Best Practices:
    - Minimize queries to reduce API calls and improve performance.
    - Use specific filters (e.g., `tags`, `id`) to avoid ambiguity.
    - Handle missing resources with `count` or `try`:
      data "aws_vpc" "existing" {
        count = var.vpc_id != "" ? 1 : 0
        id    = var.vpc_id
      }
    - Avoid sensitive data in data source outputs.
    - Pin provider versions to ensure data source compatibility.
    - Document usage with comments in `.tf` files.
  - Gotchas:
    - Missing resources cause errors unless handled.
    - Ambiguous results (e.g., no filters) cause failures.
    - Rate limits may slow pipelines with many data sources.
    - Ensure state file access for `terraform_remote_state`.
  - Provider-Specific Examples:
    - Snowflake: `data.snowflake_role` to fetch an existing role.
    - Databricks: `data.databricks_spark_version` to get the latest Spark version.
    - Glue Crawler: `data.aws_glue_catalog_database` to reference an existing database.
    - RDS: `data.aws_vpc` and `data.aws_subnets` to fetch the VPC and subnets.
    - Glue Job: `data.terraform_remote_state` to access the S3 bucket from another configuration.
15. Debugging¶
- Verbose Logging: `export TF_LOG=DEBUG`
- State Inspection: `terraform state show aws_s3_bucket.bucket`
- Plan Analysis: `terraform plan -out=tfplan; terraform show -json tfplan`
- Output Inspection: `terraform output -json` or `terraform output <output_name>`
- Data Source Debugging: Check `TF_LOG=DEBUG` output for data source query details; use `terraform plan` to verify fetched values.
- Error Handling: Check provider-specific error codes in logs.
16. Tools & Extensions¶
- Terraform CLI: Core tool for managing infrastructure.
- tfenv: Manage multiple Terraform versions.
- tflint: Linter for Terraform code.
- terraform-docs: Generate documentation from `.tf` files.
- VS Code Terraform Extension: Syntax highlighting and autocompletion.
Notes¶
- Provider Requirements: Ensure providers (e.g., `aws`, `snowflake`, `databricks`) are configured in `terraform.tf` or `main.tf` for data sources to work. Section 5 of this cheatsheet already includes the AWS and Snowflake providers.
- IAM Permissions: Data sources require read permissions for the queried resources (e.g., `ec2:DescribeVpcs` for `data.aws_vpc`). Ensure the CI/CD runner's IAM role includes these.
- Pipeline Integration: Data sources are resolved in the `plan` and `apply` jobs without changes to `.gitlab-ci.yml`. Monitor pipeline logs for data source errors.
- Testing: Run `terraform plan` locally in `environments/dev` to verify data sources fetch expected values. Check `tfplan.json` for details.
- Use `data` blocks to fetch existing resources (e.g., VPCs, Snowflake roles) or dynamic values (e.g., AMI IDs).
- Define them in `main.tf` (e.g., `data.aws_vpc.default`) and reference them in resources or modules.
- Document dependencies in `main.tf` comments.
- Validate in CI/CD `plan` jobs to catch errors.
- Minimize queries, use filters, handle missing resources, validate in `plan`, and document dependencies.
Appendix¶
A More Detailed Look at Data Sources¶
- Definition: Data sources in Terraform allow you to fetch information about existing resources or external data not managed by your current Terraform configuration. They provide a way to query cloud providers, APIs, or other systems to retrieve attributes like IDs, ARNs, or configurations without creating or modifying resources.
- Purpose:
  - Reference resources created outside Terraform (e.g., an existing VPC or IAM role).
  - Access computed values (e.g., the latest AMI ID, availability zones).
  - Integrate with other Terraform configurations via remote state.
  - Avoid hardcoding values by dynamically retrieving them.
- How They Work:
  - Data sources are defined using the `data` block in Terraform configuration files (e.g., `main.tf`).
  - They query the provider's API (e.g., AWS, Snowflake, Databricks) during `terraform plan` or `terraform apply` to fetch data.
  - The retrieved data is stored in the Terraform state file for the duration of the operation but does not manage the underlying resource.
  - Data source attributes can be referenced in other resources, outputs, or modules, just like managed resources.
Syntax and Structure¶
A data source block follows this format:
data "<PROVIDER>_<TYPE>" "<NAME>" {
[CONFIGURATION]
}
- PROVIDER: The provider (e.g., `aws`, `snowflake`, `databricks`).
- TYPE: The resource type to query (e.g., `aws_vpc`, `snowflake_database`).
- NAME: A local name for referencing the data source in your configuration.
- CONFIGURATION: Filters or parameters to identify the resource (e.g., `id`, `name`, `tags`).
Example:
data "aws_vpc" "default" {
default = true
}
- This queries AWS for the default VPC and makes its attributes (e.g., `id`, `cidr_block`) available as `data.aws_vpc.default.<attribute>`.
Key Characteristics¶
- Read-Only: Data sources only fetch data; they cannot create, update, or delete resources.
- Provider Dependency: Each data source is tied to a specific provider, which must be configured in your Terraform setup (e.g., `provider "aws"`).
- State Integration: Data source results are cached in the state file during execution but do not persist as managed resources.
- Dynamic Values: Data sources are ideal for retrieving values that change over time (e.g., latest AMI IDs, external resource IDs).
- Lifecycle: Data sources are refreshed during `terraform plan` or `terraform apply` unless explicitly skipped (e.g., using `-refresh=false`).
Common Use Cases¶
- Referencing Existing Infrastructure:
  - Fetch details of resources created manually or by other Terraform configurations (e.g., VPCs, subnets, IAM roles).
  - Example: Use an existing security group for an RDS instance.
- Dynamic Resource Configuration:
  - Retrieve dynamic values like the latest AMI ID or availability zones to configure resources.
  - Example: Select the latest Ubuntu AMI for an EC2 instance.
- Cross-Module Integration:
  - Access outputs from another Terraform configuration using the `terraform_remote_state` data source.
  - Example: Reference an S3 bucket ARN from a different project's state.
- Environment-Specific Configurations:
  - Query environment-specific resources (e.g., a Snowflake database in `prod`).
  - Example: Fetch a Snowflake database name for a schema.
- External Data:
  - Use the `http` or `external` provider to fetch data from APIs or scripts.
  - Example: Retrieve a configuration value from an external API.
Example: Fetching an Existing VPC¶
# Fetch the default VPC
data "aws_vpc" "default" {
default = true
}
# Fetch subnets in the VPC
data "aws_subnets" "default_subnets" {
filter {
name = "vpc-id"
values = [data.aws_vpc.default.id]
}
}
# Use the subnet IDs in an RDS instance
module "rds_postgres" {
source = "../../modules/rds"
db_name = "datawarehouse"
environment = var.environment
subnet_ids = data.aws_subnets.default_subnets.ids
security_group_ids = [aws_security_group.rds_sg.id]
}
- Here, `data.aws_vpc.default` retrieves the default VPC's ID, and `data.aws_subnets.default_subnets` fetches its subnets, which are then used in the RDS module.
Example: Remote State Data Source¶
# Fetch outputs from another Terraform configuration
data "terraform_remote_state" "network" {
backend = "s3"
config = {
bucket = "my-terraform-state"
key = "networking/terraform.tfstate"
region = "us-east-1"
}
}
# Use the remote state output in a resource
resource "aws_s3_bucket" "bucket" {
bucket = "${var.environment}-data-lake"
tags = {
VPC = data.terraform_remote_state.network.outputs.vpc_id
}
}
- This references the `vpc_id` output from a separate Terraform configuration's state file.
Example: External Data Source¶
# Use the external provider to run a script
data "external" "config" {
program = ["bash", "${path.module}/get_config.sh"]
}
# Use the result in a resource
resource "aws_s3_bucket" "bucket" {
bucket = "${var.environment}-${data.external.config.result.bucket_suffix}"
}
- The `get_config.sh` script returns JSON (e.g., `{"bucket_suffix": "data-lake"}`), which is used to name the bucket.
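A sketch of what `get_config.sh` could look like; the `external` data source requires the program to print a single flat JSON object of string values on stdout, and the static value here is purely illustrative:

#!/usr/bin/env bash
# get_config.sh - contract for the "external" data source: emit one JSON object of strings.
set -euo pipefail

# In practice this might query an internal API or a config service.
echo '{"bucket_suffix": "data-lake"}'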
Integration with CI/CD Setup¶
- Folder-Based Structure: The project uses separate directories (`environments/dev`, `environments/staging`, `environments/prod`) for environment isolation. Data sources can be defined in each environment's `main.tf` to fetch environment-specific resources (e.g., a `prod` VPC).
- CI/CD Pipeline: Data sources are queried during the `terraform plan` and `terraform apply` jobs in the GitLab pipeline, ensuring dynamic values are resolved automatically.
- Modules: Data sources can be used in modules (e.g., `modules/rds`) to reference existing infrastructure, as shown in the RDS example above.
- State Management: Data source results are stored in the environment-specific state file (e.g., `data-infra/dev/terraform.tfstate`) during execution.
Best Practices¶
- Minimize Queries: Avoid excessive data source queries to reduce API calls and improve performance.
- Use Filters: Specify precise filters (e.g., `tags`, `id`) to avoid ambiguous results.
- Handle Missing Resources:
  - Use `count` or `try` to handle cases where a data source might not return results.
  - Example:
    data "aws_vpc" "existing" {
      count = var.vpc_id != "" ? 1 : 0
      id    = var.vpc_id
    }
- Avoid Sensitive Data: Don't expose sensitive attributes (e.g., secrets) in data source outputs.
- Pin Provider Versions: Ensure the provider version supports the data source type and attributes.
- Document Usage: Include comments in `main.tf` explaining why a data source is used.
- Test in CI/CD: Verify data sources resolve correctly in the pipeline's `plan` stage.
Common Gotchas¶
- Missing Resources: If a data source can't find a resource, Terraform will error unless handled (e.g., with `count` or `try`).
- Stale Data: Data sources rely on provider APIs; keep providers up to date to avoid stale results.
- Rate Limits: Excessive data source queries may hit provider API limits, causing failures in `plan` or `apply`.
- State Dependency: Data sources are resolved during execution, so ensure the state file is accessible (e.g., in remote backends).
- Ambiguous Results: Without proper filters, data sources may return multiple results, causing errors.
A deeper look into Environment Isolation Approaches¶
Folder-Based Isolation¶
Folder-based isolation organizes Terraform configurations into separate directories for each environment (e.g., `environments/dev/`, `environments/staging/`, `environments/prod/`). Each directory contains its own `main.tf`, `variables.tf`, `outputs.tf`, `backend.tf`, and `terraform.tfvars`, with distinct state files and configurations tailored to the environment.
Terraform Workspaces¶
Terraform workspaces allow multiple environments to share the same configuration files within a single directory, using different state files for each workspace (e.g., `default`, `dev`, `staging`, `prod`). Workspaces are managed with commands like `terraform workspace new <env>` and `terraform workspace select <env>`, and variables are typically controlled via CLI flags or conditional logic.
Advantages of Folder-Based Isolation¶
The folder-based isolation approach, as demonstrated in the cheatsheet's `environments/` structure, offers several advantages over workspaces, particularly for data infrastructure projects involving complex resources like S3 buckets, RDS databases, Snowflake, Databricks, and Glue jobs.
1. Clear Separation of Configurations¶
- Advantage: Each environment has its own dedicated directory with independent configuration files, making it easier to customize resources, variables, and outputs without relying on conditional logic.
- Example: In the cheatsheet, `environments/dev/main.tf` might define a smaller RDS instance (`db.t3.micro`) compared to `environments/prod/main.tf` (`db.m5.large`). This is explicit and avoids complex `if` statements or workspace-specific variables.
- Workspace Challenge: With workspaces, all environments share the same `main.tf`, requiring conditional logic (e.g., `count = terraform.workspace == "prod" ? 1 : 0`) or variable overrides, which can lead to errors or reduced readability.
- Impact: Folder-based isolation improves maintainability and reduces the risk of misconfiguration, especially for data engineers managing diverse infrastructure across environments.
2. Independent State Files and Backends¶
- Advantage: Each environment has its own state file and backend configuration, stored in separate paths (e.g., `s3://my-terraform-state/data-infra/dev/terraform.tfstate` vs. `s3://my-terraform-state/data-infra/prod/terraform.tfstate`), ensuring complete isolation.
- Example: The cheatsheet's `environments/dev/backend.tf` specifies:
  terraform {
    backend "s3" {
      bucket         = "my-terraform-state"
      key            = "data-infra/dev/terraform.tfstate"
      region         = "us-east-1"
      dynamodb_table = "terraform-locks"
    }
  }
  This isolates `dev` state from `prod`, preventing accidental cross-environment changes.
- Workspace Challenge: Workspaces store state files in the same backend with a workspace prefix (e.g., `env:/dev/terraform.tfstate`), which can lead to accidental state overwrites if the wrong workspace is selected. Additionally, all workspaces share the same backend configuration, limiting flexibility.
- Impact: Folder-based isolation enhances security and auditability by ensuring state files are distinctly managed, critical for compliance in data infrastructure projects.
3. Simplified CI/CD Pipelines¶
- Advantage: Folder-based isolation aligns naturally with CI/CD pipelines, as each environment's directory can be targeted independently, reducing complexity in pipeline scripts.
- Example: The cheatsheet's `.gitlab-ci.yml` defines separate jobs (`plan_dev`, `apply_dev`, `plan_staging`, etc.) that operate on specific directories (e.g., `cd environments/dev`). This avoids the need to switch workspaces in the pipeline, simplifying configuration and reducing errors.
  plan_dev:
    stage: plan
    script:
      - cd environments/dev
      - terraform init -backend-config=backend.tf
      - terraform plan -var-file=terraform.tfvars -out=tfplan
- Workspace Challenge: With workspaces, pipelines must run `terraform workspace select <env>` before each job, increasing the risk of selecting the wrong workspace or encountering state conflicts. This adds complexity to pipeline scripts and requires careful error handling.
- Impact: Folder-based isolation streamlines GitLab CI/CD pipelines, improving reliability and auditability for automated deployments of data infrastructure.
4. Enhanced Team Collaboration¶
- Advantage: Separate directories make it easier for teams to work on different environments simultaneously without conflicts, as each environment's configuration and state are isolated.
- Example: In the cheatsheet, a developer can modify `environments/dev/main.tf` to test a new Glue job while another team member updates `environments/prod/main.tf` to adjust RDS settings, without risking state or configuration clashes.
- Workspace Challenge: Workspaces share a single directory, so concurrent changes to `main.tf` or state files can lead to conflicts, especially in large teams. Developers must carefully coordinate workspace selection to avoid overwriting each other's changes.
- Impact: Folder-based isolation supports parallel development, critical for data engineering teams managing complex, multi-environment setups.
5. Better Auditability and Traceability¶
- Advantage: Folder-based isolation provides clear, environment-specific configuration files and state files, making it easier to audit changes and track infrastructure history.
- Example: The cheatsheet's structure allows auditors to review `environments/prod/terraform.tfvars` and `apply.log` (stored as artifacts) to verify production settings and changes. State backups in `s3://my-terraform-state-backup/data-infra/prod/` provide a clear history for rollback or compliance.
- Workspace Challenge: Workspaces mix configurations in a single `main.tf`, requiring auditors to parse conditional logic or variable overrides to understand environment-specific settings. State files are less distinctly separated, complicating audit trails.
- Impact: Folder-based isolation simplifies compliance with regulatory requirements (e.g., GDPR, HIPAA) by providing transparent, environment-specific records.
6. Flexibility for Environment-Specific Customization¶
- Advantage: Each environment can have unique configurations, providers, or even Terraform versions without affecting others, offering maximum flexibility.
- Example: In the cheatsheet, `environments/staging/` might use a different AWS region or provider version than `environments/prod/`, defined in their respective `main.tf` and `backend.tf`. This is straightforward with separate directories.
- Workspace Challenge: Workspaces share the same provider configuration and Terraform version, limiting customization unless complex workarounds (e.g., provider aliases) are used.
- Impact: Folder-based isolation supports diverse data infrastructure requirements, such as regional differences or provider-specific settings for Snowflake or Databricks.
7. Reduced Risk of Human Error¶
- Advantage: Operating within a specific directory (e.g., `environments/dev/`) eliminates the need to select a workspace, reducing the chance of applying changes to the wrong environment.
- Example: Running `terraform apply` in `environments/dev/` affects only the `dev` environment, with no risk of accidentally targeting `prod`. The cheatsheet's pipeline reinforces this by scoping jobs to directories.
- Workspace Challenge: Forgetting to run `terraform workspace select prod` before `terraform apply` can result in catastrophic changes to the wrong environment, a common error in high-pressure scenarios.
- Impact: Folder-based isolation enhances safety, particularly for production data infrastructure where errors can lead to data loss or downtime.
When to Use Workspaces¶
While folder-based isolation is generally preferred, workspaces may be suitable for:
- Simple Projects: Small projects with minimal configuration differences between environments (e.g., a single S3 bucket with different names).
- Temporary Environments: Spinning up ephemeral environments (e.g., feature branches) where maintaining separate directories is overkill.
- Legacy Projects: Existing setups already using workspaces, where migration to folders is not cost-effective.
However, for data infrastructure projects with complex resources, multiple team members, and strict compliance requirements, folder-based isolation (as in the cheatsheet) is superior.
Recommendations¶
- Adopt Folder-Based Isolation: Use the cheatsheet's `environments/` structure to organize `dev`, `staging`, and `prod` configurations in separate directories.
- Document in `README.md`: Include setup instructions for navigating environment directories and running Terraform commands, as shown in the cheatsheet:
  ## Running Terraform
  - Navigate to the environment directory: `cd environments/dev`
  - Initialize: `terraform init`
  - Apply: `terraform apply -var-file=terraform.tfvars`
- Leverage CI/CD: Configure pipelines (e.g., `.gitlab-ci.yml`) to target specific directories, as in the cheatsheet, to enforce environment isolation.
- Secure State Files: Use distinct backend paths and versioning (e.g., `s3://my-terraform-state/data-infra/<env>/terraform.tfstate`) to protect state files, as shown in the cheatsheet's `backend.tf`.