Automatically Managing Data Pipeline Infrastructures With Terraform | by João Pedro | May, 2023

Photograph by EJ Yao on Unsplash

A few weeks ago, I wrote a post about developing a data pipeline using both on-premise and AWS tools. That post is part of my recent effort to bring more cloud-oriented data engineering posts.

However, when mentally reviewing that post, I noticed a big drawback: the manual work.

Whenever I develop a new project, whether real or fictional, I always try to reduce the friction of configuring the environment (installing dependencies, configuring folders, obtaining credentials, etc.), and that’s why I always use Docker. With it, I just pass you a docker-compose.yaml file plus a few Dockerfiles and you are able to create exactly the same environment as me with just one command: docker compose up.

However, when we want to develop a new data project with cloud tools (S3, Lambda, Glue, EMR, etc.), Docker can’t help us, because the components need to be instantiated in the providers’ infrastructure. There are two main ways of doing this: manually in the UI or programmatically through service APIs.

For example, you can access the AWS UI in your browser, search for S3, and create a new bucket manually, or write Python code that creates the same bucket by making a request to the AWS API.

In the post mentioned earlier, I described step by step how to create the needed components MANUALLY through the AWS web interface. The result? Even trying to summarize as much as possible (and even omitting parts!), the post ended up being a 17-minute read, 7 minutes more than I usually write, full of SCREENSHOTS of which screen you should access, where you should click, and which settings to choose.

In addition to being a costly, confusing, and time-consuming process, it is still prone to human error, which ends up bringing more headaches and possibly even bad surprises in the monthly bill. Definitely an unpleasant process.

And this is exactly the kind of problem that Terraform comes to solve.

not sponsored.

Terraform is an IaC (Infrastructure as Code) tool that manages infrastructure in cloud providers in an automatic and programmatic manner.

In Terraform, the desired infrastructure is described using a declarative language called HCL (HashiCorp Configuration Language), in which the components are specified, e.g. an S3 bucket named “my-bucket” and an EC2 server with Ubuntu 22 in the us-east-1 region.
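To make this concrete, here is a minimal sketch of how that example could be described in HCL. The variable ubuntu_ami_id and the t3.micro instance type are assumptions added for illustration; the AMI ID of an Ubuntu 22 image has to be looked up for your own account and region.

```hcl
# Hypothetical input: the ID of an Ubuntu 22 AMI available in us-east-1.
variable "ubuntu_ami_id" {
  type = string
}

provider "aws" {
  region = "us-east-1"
}

# An S3 bucket named "my-bucket" (bucket names must be globally unique,
# so this exact name is very likely taken already).
resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-bucket"
}

# An EC2 server booted from the Ubuntu image passed in above.
resource "aws_instance" "ubuntu_server" {
  ami           = var.ubuntu_ami_id
  instance_type = "t3.micro" # assumed instance size
}
```

Nothing here says how to create the bucket or the server; the file only states what should exist, and Terraform works out the API calls.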

The described resources are materialized by Terraform through calls to the cloud provider’s service APIs. Beyond creation, it is also capable of destroying and updating the infrastructure, adding or removing only the resources needed to move from the actual state to the desired state, e.g. if 4 EC2 instances are requested, it will create only 2 new instances if 2 others already exist. This behavior is possible because Terraform stores the actual state of the infrastructure in state files.

Because of this, it’s possible to manage a project’s infrastructure in a much more agile and secure way, as it removes the manual work of configuring each individual resource.

Terraform’s proposal is to be a cloud-agnostic IaC tool, so it uses a standardized language to mediate the interaction with the cloud providers’ APIs, removing the need to learn how to interact with them directly. Still along this line, HCL also supports variable manipulation and a certain degree of ‘flow control’ (if-statements and loops), allowing the use of conditionals and loops in resource creation, e.g. creating 100 EC2 instances.
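As a rough illustration of that flow control, the sketch below uses for_each as a loop and count with a conditional expression as an if-statement. All variable names and bucket names are hypothetical and exist only for this example.

```hcl
# Hypothetical inputs used only in this sketch.
variable "environments" {
  type    = set(string)
  default = ["dev", "test"]
}

variable "create_logs_bucket" {
  type    = bool
  default = false
}

# Loop: one bucket is declared per element of the set.
resource "aws_s3_bucket" "per_env" {
  for_each = var.environments
  bucket   = "example-pipeline-${each.key}"
}

# Conditional: count evaluates to 1 or 0 depending on the flag,
# so the bucket is only created when create_logs_bucket is true.
resource "aws_s3_bucket" "logs" {
  count  = var.create_logs_bucket ? 1 : 0
  bucket = "example-pipeline-logs"
}
```

The "100 EC2 instances" case from the text follows the same idea: an instance resource with count = 100.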

Last but not least, Terraform also allows infrastructure versioning, as its plain-text files can be easily tracked with git.

As mentioned earlier, this post seeks to automate the infrastructure creation process of my previous post.

To recap, that project aimed at creating a data pipeline to extract questions from the Brazilian ENEM (National High School Exam, in a literal translation) tests using the PDFs available on the MEC (Ministry of Education) website.

The process involved three steps, controlled by a local Airflow instance. These steps included downloading and uploading the PDF file to S3 storage, extracting text from the PDFs using a Lambda function, and segmenting the extracted text into questions using a Glue Job.

Note that, for this pipeline to work, many AWS components must be created and correctly configured.

0. Setting up the environment

All the code used in this project is available in this GitHub Repository.

You’ll need a machine with Docker and an AWS account.

The first step is configuring a new AWS IAM user for Terraform; this will be the only step executed in the AWS web console.

Create a new IAM user with FullAccess to S3, Glue, Lambda, and IAM and generate programmatic access credentials for it.

This is a lot of permission for a single user, so keep the credentials safe.

I’m using FullAccess permissions because I want to make things easier for now, but always consider the ‘least privilege’ approach when dealing with credentials.

Now, back to the local environment.

On the same path as the docker-compose.yaml file, create a .env file and write your credentials:


These variables will be passed to the docker-compose file to be used by Terraform.

version: '3'
services:
  terraform:
    image: hashicorp/terraform:latest
    volumes:
      - ./terraform:/terraform
    working_dir: /terraform
    command: ["init"]

1. Create the Terraform file

Still in the same folder, create a new directory called terraform. Inside it, create a new file; this will be our main Terraform file.

This folder will be mapped inside the container when it runs, so the Terraform inside the container will be able to see this file.

2. Configure the AWS Provider

The first thing we need to do is configure the cloud provider used.

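terraform {
  required_version = ">= 0.12"

  required_providers {
    aws = ">= 3.51.0"
  }
}

variable "AWS_ACCESS_KEY_ID" {
  type = string
}

variable "AWS_SECRET_ACCESS_KEY" {
  type = string
}

# Region variable; its name is assumed here
variable "AWS_DEFAULT_REGION" {
  type = string
}

provider "aws" {
  access_key = var.AWS_ACCESS_KEY_ID
  secret_key = var.AWS_SECRET_ACCESS_KEY
  region     = var.AWS_DEFAULT_REGION
}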

This is what a Terraform configuration file looks like: a set of blocks of different types, each with a specific function.

The terraform block pins the versions of Terraform itself and of the AWS provider.

A variable is exactly what the name suggests: a value assigned to a name that can be referenced throughout the code.

As you have probably noticed, our variables don’t have a value assigned to them, so what’s going on? The answer is back in the docker-compose.yaml file: the values of these variables are set using environment variables in the system. When a variable’s value is not defined, Terraform looks at the environment variable TF_VAR_<var_name> and uses its value. I opted for this approach to avoid hard-coding the keys.
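A small sketch of how this lookup plays out, written as a variant of the declarations above rather than something to add next to them. The description text and the project_name variable are assumptions for illustration, and the sensitive flag needs Terraform 0.14 or later.

```hcl
# No default value: Terraform falls back to the environment variable
# TF_VAR_AWS_SECRET_ACCESS_KEY, if it is set.
variable "AWS_SECRET_ACCESS_KEY" {
  type        = string
  description = "Secret key of the IAM user created for Terraform"
  sensitive   = true # hides the value in plan/apply output (Terraform >= 0.14)
}

# Hypothetical variable with a default value. If TF_VAR_project_name is set
# in the environment, it takes precedence over the default.
variable "project_name" {
  type    = string
  default = "enem-pipeline"
}
```

If neither a default nor a TF_VAR_ environment variable is available, Terraform asks for the value interactively when it runs.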

The provider block is also self-explanatory: it references the cloud provider we’re using and configures its credentials. We set the provider’s arguments (access_key, secret_key, and region) with the variables defined earlier, referenced with the var.<var_name> notation.

With this block defined, run:

docker compose run terraform init

to set up Terraform.

3. Creating our first resource: The S3 bucket

Terraform uses the resource block to reference infrastructure components such as S3 buckets and EC2 instances, as well as actions like granting permissions to users or uploading files to a bucket.

The code below creates a new S3 bucket for our project.

resource "aws_s3_bucket" "enem-data-bucket" {
  bucket = "enem-data-bucket"
}

A resource definition follows the syntax:

resource <resource_type> <resource_name> {
  argument_1 = "blah blah blah blah"
  argument_2 = "blah blah blah"
  argument_3 {
    # nested block arguments
  }
}

In the case above, “aws_s3_bucket” is the resource type and “enem-data-bucket” is the resource name, used to reference this resource in the file (it is not the bucket name in the AWS infrastructure). The argument bucket = “enem-data-bucket” assigns a name to our bucket.

Now, with the command:

docker compose run terraform plan

Terraform will compare the current state of the infrastructure with the file and infer what needs to be done to achieve the desired state described in it.

Because this bucket still doesn’t exist, Terraform will plan to create it.

To apply Terraform’s plan, run

docker compose run terraform apply

And, with only these few commands, our bucket is already created.

Easy, right?

To destroy the bucket, just type:

docker compose run terraform destroy

And Terraform takes care of the remainder.

These are the basic commands that will follow us until the end of the post: plan, apply, destroy. From now on, all we are going to do is configure the file, adding the resources needed to materialize our data pipeline.

4. Configuring the Lambda Function part I: Roles and permissions

Now on to the Lambda Function definition.

This was one of the trickiest parts of my previous post because, by default, Lambda functions already need a set of basic permissions and, on top of that, we also had to give it read and write permissions to the S3 bucket previously created.

First of all, we must create a new IAM role.

# ==========================

resource "aws_iam_role" "lambda_execution_role" {
  name = "lambda_execution_role_terraform"
  assume_role_policy = jsonencode({
    # This is the policy document that allows the role to be assumed by Lambda
    # other services cannot assume this role
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
}

When developing these things, I strongly suggest that you first ask what you want in ChatGPT, GitHub Copilot, or any other LLM friend and then check the provider’s documentation on how this type of resource works.

The code above creates a new IAM role and allows AWS Lambda Functions to assume it. The next step is to attach the Lambda Basic Execution policy to this role to allow the Lambda Function to execute without errors.

resource "aws_iam_role_policy_attachment" "lambda_basic_execution" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
  role       = aws_iam_role.lambda_execution_role.name
}

The nice thing to note in the code above is that we can reference resource attributes and pass them as arguments in the creation of new resources. In the case above, instead of hard-coding the ‘role’ argument with the name of the previously created role ‘lambda_execution_role_terraform’, we can reference this attribute using the syntax <resource_type>.<resource_name>.<attribute>, here aws_iam_role.lambda_execution_role.name.

If you take some time to look into the Terraform documentation of a resource, you’ll notice that it has arguments and attributes. Arguments are what you pass in order to create/configure a new resource, and attributes are read-only properties of the resource available after its creation.

Because of this, attributes are used by Terraform to implicitly manage dependencies between resources, establishing the appropriate order of their creation.
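A minimal sketch of such an implicit dependency, using a hypothetical bucket that is not part of the project: the second resource reads an attribute of the first, so Terraform knows the bucket has to exist before the access block is created.

```hcl
# Hypothetical bucket used only to illustrate the dependency.
resource "aws_s3_bucket" "example" {
  bucket = "example-dependency-demo-bucket"
}

# The reference aws_s3_bucket.example.id is what creates the dependency:
# no depends_on is needed, Terraform orders the two resources by itself.
resource "aws_s3_bucket_public_access_block" "example" {
  bucket                  = aws_s3_bucket.example.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

On destroy, the order is reversed: the access block is removed before the bucket.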

The code below creates a new access policy for our S3 bucket, allowing basic CRUD operations on it.

resource "aws_iam_policy" "s3_access_policy" {
  name = "s3_access_policy"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        # Basic CRUD actions; this exact list is assumed
        Action   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
        Resource = aws_s3_bucket.enem-data-bucket.arn
      }
    ]
  })
}

resource "aws_iam_policy_attachment" "s3_access_attachment" {
  name       = "s3_and_lambda_execution_access_attachment"
  policy_arn = aws_iam_policy.s3_access_policy.arn
  roles      = [aws_iam_role.lambda_execution_role.name]
}

Again, instead of hard-coding the bucket’s ARN, we can reference this attribute using aws_s3_bucket.enem-data-bucket.arn.
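One detail worth keeping in mind when writing S3 policies: object-level actions such as s3:GetObject are evaluated against object ARNs (the bucket ARN followed by /*), while bucket-level actions such as s3:ListBucket use the bucket ARN itself. The sketch below shows one way to express both with the aws_iam_policy_document data source. The variable bucket_arn, the action list, and the policy name are assumptions for illustration, not the project’s actual policy.

```hcl
# Hypothetical input: the ARN of the bucket the policy should cover.
variable "bucket_arn" {
  type = string
}

data "aws_iam_policy_document" "s3_crud" {
  # Bucket-level action: the resource is the bucket itself.
  statement {
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [var.bucket_arn]
  }

  # Object-level actions: the resource is every object inside the bucket.
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["${var.bucket_arn}/*"]
  }
}

resource "aws_iam_policy" "s3_crud" {
  name   = "s3_crud_policy_sketch"
  policy = data.aws_iam_policy_document.s3_crud.json
}
```

The data source renders the JSON document for you, which is an alternative to calling jsonencode by hand.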

With the Lambda role correctly configured, we can finally create the function itself.

resource "aws_lambda_function" "lambda_function" {
  function_name = "my-lambda-function-aws-terraform-jp"
  role          = aws_iam_role.lambda_execution_role.arn
  handler       = "lambda_function.lambda_handler"
  runtime       = "python3.8"
  filename      = ""
}

The file passed in filename is a compressed (.zip) folder that must contain a lambda_function.py file with a lambda_handler(event, context) function inside, matching the handler argument. It must be on the same path as the main Terraform file.

def lambda_handler(event, context):
    return "Hello from Lambda!"
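If you would rather not zip the code by hand, one option is to let Terraform build the archive with the archive_file data source from the hashicorp/archive provider. This is only a sketch under a few assumptions: the file names, the function name, the lambda_role_arn variable, and the python3.12 runtime are all chosen for illustration (the article itself uses python3.8).

```hcl
# Hypothetical input: ARN of an existing execution role for the function.
variable "lambda_role_arn" {
  type = string
}

# Builds lambda_function.zip from lambda_function.py next to this file.
# Needs the hashicorp/archive provider, which terraform init downloads.
data "archive_file" "lambda_zip" {
  type        = "zip"
  source_file = "${path.module}/lambda_function.py"
  output_path = "${path.module}/lambda_function.zip"
}

resource "aws_lambda_function" "from_archive" {
  function_name = "example-function-from-archive"
  role          = var.lambda_role_arn
  handler       = "lambda_function.lambda_handler"
  runtime       = "python3.12" # assumed runtime
  filename      = data.archive_file.lambda_zip.output_path

  # Changes whenever the code changes, so apply redeploys the function.
  source_code_hash = data.archive_file.lambda_zip.output_base64sha256
}
```

The handler value still follows the <file name without .py>.<function name> pattern described above.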

5. Configuring the Lambda Function part II: Attaching a trigger

Now, we need to configure a trigger for the Lambda Function: it must execute every time a new PDF is uploaded to the bucket.


resource "aws_lambda_permission" "allow_bucket_execution" {
  statement_id  = "AllowExecutionFromS3Bucket"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.lambda_function.arn
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.enem-data-bucket.arn
}

resource "aws_s3_bucket_notification" "bucket_notification" {
  bucket = aws_s3_bucket.enem-data-bucket.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.lambda_function.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = ".pdf"
  }

  depends_on = [aws_lambda_permission.allow_bucket_execution]
}

This is a case where we must specify an explicit dependency between resources, as the “bucket_notification” resource needs to be created after the “allow_bucket_execution”.

This can be easily achieved by using the depends_on argument.

And we’re done with the Lambda function, just run:

docker compose run terraform apply

And the Lambda Function will be created.

6. Adding a module for the Glue job

Our file is getting quite big, and remember that this is just a simple data pipeline. To improve the organization and reduce its size, we can use the concept of modules.

A module is a set of resources grouped in a separate file that can be referenced and reused by other configuration files. Modules let us abstract complex parts of the infrastructure to make our code more manageable, reusable, organized, and modular.

So, instead of coding all the resources needed to create our Glue job in the main file, we’ll put them inside a module.

In the ./terraform folder, create a new folder ‘glue’ with a file inside it.

Then add a new S3 bucket resource in that file:

# Create a new bucket to store the job script
resource "aws_s3_bucket" "enem-bucket-terraform-jobs" {
  bucket = "enem-bucket-terraform-jobs"
}

Back in the main file, just reference this module with:

module "glue" {
  source = "./glue"
}

And reinitialize Terraform:

docker compose run terraform init

Terraform will restart its backend and initialize the module with it.

Now, if we run terraform plan, it should include this new bucket in the creation list.

Using this module, we’ll be able to encapsulate all the logic of creating the job in a single external file.
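Modules can also hand values back to the configuration that calls them through output blocks. A minimal sketch, assuming an output named jobs_bucket_name (a hypothetical name, not part of the original project); the two blocks live in different files, as the comments indicate.

```hcl
# In the module's file, inside ./glue:
# expose the name of the bucket created by the module.
output "jobs_bucket_name" {
  value = aws_s3_bucket.enem-bucket-terraform-jobs.bucket
}

# In the main file: read the value through module.<module_name>.<output_name>.
output "glue_jobs_bucket" {
  value = module.glue.jobs_bucket_name
}
```

Resources inside a module are not visible from the outside by default, so outputs are the way to expose exactly what the caller needs.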

A requirement of AWS Glue jobs is that their job files are stored in an S3 bucket, and that’s why we created “enem-bucket-terraform-jobs”. Now, we must upload the job’s file itself.

In the terraform path, I’ve included a file; it’s just an empty file used to simulate this behavior. To upload a new object to a bucket, just use the “aws_s3_object” resource:

resource "aws_s3_object" "myjob" {
  bucket = aws_s3_bucket.enem-bucket-terraform-jobs.id
  key    = ""
  source = ""
}

From now on, it’s just a matter of implementing the Glue role and creating the job itself.

resource "aws_iam_role" "glue_execution_role" {
  name = "glue_execution_role_terraform"
  assume_role_policy = jsonencode({
    # This is the policy document that allows the role to be assumed by Glue
    # other services cannot assume this role
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "glue.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "glue_basic_execution" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
  role       = aws_iam_role.glue_execution_role.name
}

Not so fast. We must ensure that this job has the same read and write permissions to the bucket “enem-data-bucket” as the Lambda Function, i.e. we need to attach the aws_iam_policy.s3_access_policy to its role.

But, because this policy was defined in the main file, we cannot reference it directly in our module. The following attempt, placed inside the module, does not work:

# ATTACH THE S3 ACCESS POLICY s3_access_policy TO THE ROLE glue_execution_role
resource "aws_iam_policy_attachment" "s3_access_attachment_glue" {
  name       = "s3_and_glue_execution_access_attachment"
  policy_arn = aws_iam_policy.s3_access_policy.arn
  roles      = [aws_iam_role.glue_execution_role.name]
}

In order to achieve this behavior, we must pass the access policy ARN as an argument to the module, and that’s pretty simple.

First, in the module’s file, create a new variable to receive the value.

variable "enem-data-bucket-access-policy-arn" {
  type = string
}

Go back to the main file and, in the module reference, pass a value to this variable.

module "glue" {
  source                             = "./glue"
  enem-data-bucket-access-policy-arn = aws_iam_policy.s3_access_policy.arn
}

Finally, in the glue file, use the value of the variable in the resource.

# ATTACH THE S3 ACCESS POLICY s3_access_policy TO THE ROLE glue_execution_role
resource "aws_iam_policy_attachment" "s3_access_attachment_glue" {
  name       = "s3_and_glue_execution_access_attachment"
  policy_arn = var.enem-data-bucket-access-policy-arn
  roles      = [aws_iam_role.glue_execution_role.name]
}

Now, take a minute to think about the power of what we have just done. With modules and arguments, we can create fully parametrized complex infrastructures.

The code above doesn’t just create a specific job for our pipeline. By just changing the value of the enem-data-bucket-access-policy-arn variable, we can create a new job to process data from an entirely different bucket.

And that logic applies to anything you want. It’s possible, for example, to simultaneously create a complete infrastructure for a project for the development, testing, and production environments, using just variables to alternate between them.
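As a hedged sketch of that idea, the snippet below parametrizes a bucket name by environment. The environment variable and the bucket naming scheme are assumptions made up for this example, and the validation block needs Terraform 0.13 or later.

```hcl
# Hypothetical variable selecting the target environment.
variable "environment" {
  type    = string
  default = "dev"

  # Rejects any value other than the three expected environments.
  validation {
    condition     = contains(["dev", "test", "prod"], var.environment)
    error_message = "The environment must be one of: dev, test, prod."
  }
}

# The same definition yields a differently named bucket per environment.
resource "aws_s3_bucket" "data" {
  bucket = "example-pipeline-data-${var.environment}"
}
```

The value can be supplied with -var="environment=prod" on the command line or through TF_VAR_environment. To keep several environments alive at the same time, each one also needs its own state, for instance through Terraform workspaces; otherwise switching the variable would replace the existing bucket instead of adding a new one.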

Without further ado, all that remains is to create the Glue job itself, and there’s no novelty in that:

resource "aws_glue_job" "myjob" {
  name         = "myjob"
  role_arn     = aws_iam_role.glue_execution_role.arn
  glue_version = "4.0"
  command {
    script_location = "s3://${aws_s3_bucket.enem-bucket-terraform-jobs.bucket}/${aws_s3_object.myjob.key}"
  }
  default_arguments = {
    "--job-language"        = "python"
    "--job-bookmark-option" = "job-bookmark-disable"
    "--enable-metrics"      = ""
  }
  depends_on = [aws_s3_object.myjob]
}

And our infrastructure is done. Run terraform apply to create the remaining resources.

docker compose run terraform apply

And terraform destroy to get rid of everything.

docker compose run terraform destroy

I met Terraform a few days after publishing my 2nd post about creating data pipelines using cloud providers, and it blew my mind. I immediately thought about all the manual work that I did to set up the project, all the screenshots captured to showcase the process, and all the undocumented details that will haunt my nightmares when I need to reproduce the process.

Terraform solves all these problems. It’s simple, easy to set up, and easy to use; all it needs are a few .tf files along with the providers’ credentials and we’re ready to go.

Terraform tackles the kind of problem that people are usually not so excited to think about. When developing data products, we all think about performance, optimization, latency, quality, accuracy, and other data-specific or domain-specific aspects of our product.

Don’t get me wrong, we all study to apply our best mathematical and computational knowledge to solve these problems, but we also need to think about important aspects of the development process of our product, like reproducibility, maintainability, documentation, versioning, integration, modularization, and so on.

These are aspects that our software engineer colleagues have been concerned about for a long time, so we don’t have to reinvent the wheel, just learn a thing or two from their best practices.

That’s why I always use Docker in my projects and that’s also why I’ll probably add Terraform to my basic toolset.

I hope this post helped you understand this tool, Terraform, including its objectives, basic functionalities, and practical benefits. As always, I’m not an expert in any of the subjects addressed in this post, and I strongly recommend further reading; see some references below.

Thank you for reading! 😉

All the code is available in this GitHub repository.
Data used: ENEM PDFs, [CC BY-ND 3.0], MEC-Brazilian Gov.
All the images are created by the Author, unless otherwise specified.

[1] Add trigger to AWS Lambda functions via Terraform. Stack Overflow.
[2] AWSLambdaBasicExecutionRole, AWS Managed Policy.
[3] Brikman, Y. (2022, October 11). Terraform tips & tricks: loops, if-statements, and gotchas. Medium.
[4] Create Resource Dependencies | Terraform | HashiCorp Developer.
[5] TechWorld with Nana. (2020, July 4). Terraform explained in 15 mins | Terraform Tutorial for Beginners [Video]. YouTube.
[6] Terraform Registry. AWS provider.
