This program equips you with the open-source tools and architectural thinking used by professional data engineers to build scalable, reliable data systems from the ground up. You will work hands-on with Apache Spark for distributed data processing, dbt for modular SQL-based transformation, and Apache Airflow for workflow orchestration — the same stack powering data infrastructure at leading technology and data-driven organizations worldwide.
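To make the stack concrete, here is a minimal PySpark sketch of the kind of distributed transformation you will write in the program. The bucket paths and column names are hypothetical; the job reads raw event data and computes a per-user daily aggregate.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; in production this would run on a cluster.
spark = SparkSession.builder.appName("daily_event_counts").getOrCreate()

# Hypothetical input: raw events with event_time and user_id columns.
events = spark.read.parquet("s3a://example-bucket/raw/events/")

# Aggregate events per user per day -- a typical distributed transformation.
daily_counts = (
    events
    .withColumn("event_date", F.to_date("event_time"))
    .groupBy("event_date", "user_id")
    .agg(F.count("*").alias("event_count"))
)

# Write the result partitioned by date for efficient downstream reads.
daily_counts.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://example-bucket/curated/daily_event_counts/"
)
```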
Across the courses, you will gain practical expertise in designing dimensional data models, implementing incremental load strategies, optimizing Spark job performance, enforcing data quality with automated testing frameworks, and deploying pipelines through CI/CD workflows. You will also develop foundational skills in cloud storage provisioning, containerization with Docker, and version control best practices that mirror real production environments.
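As an illustration of one of these skills, a common incremental load strategy is the high-watermark pattern: load only the rows newer than the latest timestamp already present in the target. Below is a minimal PySpark sketch of that idea; the table paths and the updated_at column are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName("incremental_load").getOrCreate()

# Hypothetical target: the warehouse table being incrementally loaded.
target_path = "s3a://example-bucket/warehouse/orders/"

# Determine the high watermark: the latest timestamp already loaded.
try:
    existing = spark.read.parquet(target_path)
    watermark = existing.agg(F.max("updated_at")).first()[0]
except AnalysisException:
    # First run: the target does not exist yet, so load everything.
    watermark = None

# Read the source and keep only rows newer than the watermark.
source = spark.read.parquet("s3a://example-bucket/raw/orders/")
new_rows = source if watermark is None else source.filter(
    F.col("updated_at") > watermark
)

# Append only the new rows instead of reprocessing the full table.
new_rows.write.mode("append").parquet(target_path)
```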
By the end of this program, you will be able to design and deploy end-to-end data pipelines that ingest from diverse sources, transform data through well-tested models, and deliver analytics-ready datasets to downstream consumers — demonstrating job-ready engineering skills valued across analytics engineering, data platform, and data infrastructure roles.
Applied Learning Project
Throughout this program, you will complete hands-on projects that mirror real production data engineering challenges — from building modular ETL pipelines that ingest CRM and streaming data into a cloud data warehouse, to authoring Airflow DAGs with retry logic and SLA monitoring, to diagnosing Spark performance bottlenecks and implementing Delta Lake versioning. Each project asks you to work in your own development environment, producing portfolio-ready artifacts that demonstrate your ability to design, optimize, and deploy reliable data infrastructure using open-source tools.
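To give a flavor of the Airflow projects, here is a minimal DAG sketch with the retry logic and SLA setting those projects call for. It assumes Airflow 2.x, and the DAG id, schedule, and task are illustrative placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_crm_data():
    """Placeholder extract step; a real task would call the CRM API."""
    print("extracting CRM data...")


# Retry and SLA settings of the kind the Airflow projects exercise.
default_args = {
    "retries": 3,                         # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
    "sla": timedelta(hours=1),            # flag the task if it runs past 1 hour
}

with DAG(
    dag_id="crm_ingestion",               # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(
        task_id="extract_crm_data",
        python_callable=extract_crm_data,
    )
```

Settings placed in default_args apply to every task in the DAG, so retries and SLAs are enforced uniformly without repeating them on each operator.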