Grow Data Skills | Most Affordable Courses | Become Top Data Professional


Complete Data Engineering With AWS - Basic To Advance

Data Engineering With AWS

By Shashank In Grow Data Skills

Self Paced
English
Completion Certificate


Validity: 1 Year (from the date of enrollment)
Duration: 160 Hours
Sessions: 40
Projects: 16

INR 7600

This course includes

Content Duration: 160 Hours
Total Video Sessions: 40
Total Industry Projects: 16
GCP & AWS Cloud
Quality Assignments & quizzes after each module
Interview preparation guide
Dedicated placement assistance
Resume & LinkedIn profile making
Doubt support on private Discord community
Certificate of completion



Course Content

✅ Class - 1

What is Database?
Difference between Transactional Databases and NoSQL databases
What is DBMS & RDBMS?
Transactions & ACID Properties
Setup MySQL Workbench
Setup MySQL Using Docker
DDL, DML, DQL, DCL
CREATE Command
INSERT Command
Integrity Constraints

✅ Class - 2

Alter Command
Drop, Truncate and Delete
Primary Key vs Foreign Key
Referential Integrity
Select Query, In-Built Functions, Aliases
UPDATE Command
Auto Increment in create table
Limit
Order By Clause
Conditional Operators
Logical Operators
Like Operation
User Defined Functions (UDFs)

✅ Class - 3

IS NULL, IS NOT NULL
Group By, Having Clause
Group Concat, Group Rollup
Sub Queries, IN and NOT IN
CASE-When
SQL Joins

✅ Class - 4

Exists and Not Exists
Window Functions
Frame Clause
Coalesce Function
Common Table Expressions - Iterative and Recursive

✅ Class - 1

BigData Fundamentals
5 V’s of BigData
Distributed Computation
Distributed Storage
Cluster, Commodity Hardware
File Formats
Types of Data
History of Hadoop
Hadoop Architecture & Components

✅ Class - 2

Map-Reduce Architecture
YARN Architecture

✅ Class - 3

Hive Complete Architecture
Hadoop Cluster Setup on GCP (Dataproc)

✅ Class - 4

Data Types in Hive
Create Database
Create Table
Load Data From Local
Load Data From HDFS
Internal Table
External Table
Array & Map Data Types
SerDe in Hive
File Formats in Hive - ORC, Parquet, Avro

✅ Class - 5

CSV SerDe
JSON SerDe
Parquet SerDe
ORC SerDe
Static Partitioning
Dynamic Partitioning
Bucketing
Map-Side Join, Bucket Map Join, Sorted Merge Join, Skew Join

✅ Class - 1

Kafka Cluster Architecture
Brokers
Topics
Partitions
Producer-Consumer, Consumer Group
Offset Management
Replicas
Commits
Sync & Async Commits

✅ Class - 2

Confluent Kafka Setup
Topic Creation
Schema Registry
Key, Value Message
Message in Kafka Topics based on Random and Constant Keys
Kafka Producer Code with Serialization
Kafka Consumer Code with Deserialization
Consumer Groups
Working with JSON, CSV Data
GCP Pub-Sub Setup
Producer & Consumer for GCP Pub-Sub Setup

✅ Class - 1

CAP Theorem
What is MongoDB and MongoDB Atlas?
MongoDB vs Relational Database
MongoDB features
MongoDB use cases and applications
MongoDB architecture
Node
Data Centre
Cluster
Data replication
Write operation
Read operation
Indexing

✅ Class - 2

MongoDB Atlas Setup
MongoDB Cluster Creation
MongoDB Compass Setup
Database & Collection in MongoDB
Connect with MongoDB Cluster from MongoDB Compass
Import JSON data in MongoDB Collection
Queries on MongoDB Collection from Python Application
ksqlDB in Confluent Kafka
Streams in ksqlDB
Tables in ksqlDB
Persistent Queries in ksqlDB
JOIN queries on streams in ksqlDB
McDonald's Payments Stream data ingestion from Kafka to MongoDB
Setup Orders & Payments Streams using ksqlDB
Setup windowed JOIN streams using ksqlDB
Setup MongoDB Sink Connector

✅ Class - 3

CAP Theorem
What is Apache Cassandra?
Cassandra Database vs Relational Database
Apache Cassandra features
Cassandra use cases and applications
Cassandra architecture
Node
Data Centre
Cluster
Commit log
Mem-table
SSTable
Data replication
Read operation

✅ Class - 4

Data Partitioning and Token
VNodes in Cassandra
Read Operation in Cassandra
Compaction in Cassandra
Gossip Protocol in Cassandra
Write consistency in Cassandra
Read consistency in Cassandra
Partition Key, Clustering Key, Row Key Declaration
Cassandra Setup Using Docker
CQL in Cassandra
Cassandra Free Tier Setup On DataStax
Queries in Cassandra using Python

✅ Class - 1

Problems with Hadoop Map-Reduce
What is Apache Spark?
Features of Spark
Spark ecosystem
RDD in Spark
Properties of RDD
How does Spark perform data partitioning?
Transformation in Spark
Narrow Transformation vs Wide Transformation
Action in Spark
Are Read & Write operations in Spark transformations or actions?
Lazy evaluation in Spark
Lineage graph or DAG in Spark
How does a DAG look on the Spark Web UI?
Job, Stage and Task in Spark
What if Spark cluster capacity is less than the size of data to be processed?
Spark in-depth architecture and its components
Spark with Standalone Cluster Manager Type
Spark with YARN Cluster Manager Type
Deployment modes of Spark Application
Internals of Spark Job over the cluster

✅ Class - 2

Persist and Caching in Spark
Storage Levels in Persist
How does data skewness occur in Spark?
Techniques to deal with data skewness
Repartition vs Coalesce
Example of Key Salting technique
RDD vs Dataframe vs Dataset
How to use Spark-Submit utility?
Memory management in Spark
Memory components in Executor Container
Dynamic occupancy mechanism
How to process 1 TB of data in Spark?
Resource allocation case study - 1 : 6 Nodes, each with 16 cores & 64 GB RAM
Resource allocation case study - 2 : 6 Nodes, each with 32 cores & 64 GB RAM
Resource allocation case study - 3 : When more memory isn't required for the executors
Broadcast and Accumulators in Spark
Different types of failures in Spark and how to resolve them
Out Of Memory failures
Code and Resource level optimizations in Spark
Best practices to design Spark Applications

✅ Class - 3

Spark Cluster Setup On GCP Dataproc
Spark Session creation
Create dataframe with custom schema
Read csv data from HDFS
Partitions and Partition size in Spark job
Select operation
withColumn operation
withColumnRenamed operation
Filter operation
Drop column operation
Drop Duplicates operation
Order By operation
Group By operation
Accumulator
Case-When operation
Window functions
Join & Broadcast join operation
Spark SQL, Register dataframe as table
Write CSV data in HDFS without partition key
Write Parquet data in HDFS
Write CSV data in HDFS with partition key
Write CSV data in HDFS with Coalesce
Read JSON Data in Spark and explode columns

✅ Class - 4

Execution of Spark application using Spark-Submit Utility
Monitor, Debug & Understand Spark DAG on Spark Web UI - Practical Example
What is Stream Processing?
Spark structured streaming
Spark streaming with word count example
Output modes in writeStream in Spark structured streaming
What if memory due to state management is full?
DStream vs Spark Structured Streaming
Spark structured streaming with File as source
Triggers in Spark structured streaming

✅ Class - 5

Checkpointing
Exactly once in spark structured streaming
Stateless and Stateful Processing
Global aggregation and Windowed aggregation
Windowing
Sliding window
Tumbling window vs Sliding window
When and why we should use windowing?
Windowed aggregations example
Arbitrary stateful transformations
Watermarking
Working example of handling delayed events using watermarking
Code implementation for Stateless spark structured streaming with source as Confluent Kafka Topic
Code implementation for Stateful spark structured streaming with source as Confluent Kafka Topic - Global aggregation and Windowed aggregation
Spark structured streaming pipeline implementation where Source is Confluent Kafka Topic and Destination is MongoDB

✅ Class - 1

What is orchestration in BigData?
Need for dependency management in Data Pipeline design
What is Airflow?
Architecture & Different Components of Airflow
Operators in Airflow
How to write Airflow DAG Scripts?
Attribute description
How to execute parallel tasks?

✅ Class - 2

Setup Airflow on GCP using Composer
Create and schedule Airflow DAG with sequential tasks using BashOperator and PythonOperator
Create and schedule Airflow DAG with parallel tasks using BashOperator and PythonOperator
Airflow Exercise - 1 : End-To-End Airflow DAG to Create Dataproc Cluster, Run PySpark Job on cluster and Delete GCP Dataproc cluster
Airflow Exercise - 2 : Airflow DAG to support data backfilling via parameterized date inputs and use of Variables in Airflow
Project 1 - Flight Booking Data Pipeline with Airflow & CICD (Industrial Project)

Tech Stack - GitHub, GitHub Actions, Google Storage, PySpark, Dataproc Serverless, Airflow, BigQuery

✅ Class - 1

What is Databricks?
Unity Catalog
Delta Lake & Delta Tables
Databricks Account Setup on GCP
Workspace Setup
Metastore Setup
Managed & External Catalog Setup
Volumes In Databricks
Databricks Cluster Setup
PySpark Notebook Setup
Read/Write from Databricks Volume in PySpark Notebook
Create Delta table and Write data in PySpark using DeltaTable Python API
Write partitioned data in Delta table
Read from Delta table in PySpark
Time travel in Delta table (Read from specific version or timestamp)

✅ Class - 2

Project 1 - Order Tracking Event Driven Data Ingestion (Industrial Project)

Tech Stack: Google Storage, PySpark, Databricks, Delta Lake, Databricks Workflows, GitHub

Project 2 - UPI Transactions Real Time CDC Feed Processing (Industrial Project)

Tech Stack - Databricks, Spark Structured Streaming, Delta Lake

Project 3 - Travel Bookings Data Ingestion Pipeline With SCD2 Merge (Industrial Project)

Tech Stack - Databricks, PySpark, Google Storage, Delta Lake, Databricks Workflows, PyDeequ
What is DLT in Databricks?
How to create materialized views & streaming tables with DLT pipeline?
How to set up a DLT pipeline job?
Validation & Execution of DLT pipeline with lineage
Checkpointing in DLT pipeline
Project 4 - Healthcare Delta Live Table Pipeline with Medallion Architecture (Industrial Project)

Tech Stack - Databricks, PySpark, Delta Lake, Delta Live Table Job

✅ Class - 1

OLAP vs OLTP
What is a Data Warehouse?
Difference between Data Warehouse, Data Lake and Data Mart
Fact Tables
Dimension Tables
Slowly changing Dimensions
Types of SCDs
Star Schema Design
Snowflake Schema Design
Galaxy Schema Design

✅ Class - 2

Case Study - 1: Uber Data Warehouse Design Case Study
Case Study - 2: AirBnB Data Warehouse Design Case Study

✅ Class - 1

Snowflake free tier account setup
Snowflake UI walkthrough
Load data from UI and create Snowflake table
Hands On - Event driven data ingestion in Snowflake table using Snowpipe

Tech Stack Used : Google Storage Bucket, GCP Pub-Sub, Snowflake
How to create and schedule a task in Snowflake

✅ Class - 2

Project - 1: News Data Analysis with event driven incremental load in Snowflake table (Industrial Project)

Tech Stack: Airflow, Google Cloud Storage, Python, Snowflake
Project - 2: Movie Booking CDC data real time aggregation in Snowflake Dynamic Table (Industrial Project)

Tech Stack: Python, Snowflake Dynamic Table, Snowflake Stream, Snowflake Tasks, Streamlit
Project - 3: Car rental data batch ingestion with SCD2 merge in Snowflake table (Industrial Project)

Tech Stack: Python, PySpark, GCP Dataproc, Airflow, Snowflake

✅ Class - 3

BigQuery Overview
BigQuery Architecture
Capacitor — Columnar format
Colossus — Storage
Dremel — Execution Engine
Borg — Compute
Jupiter — Network
Project - 1: IRCTC Streaming data ingestion into BigQuery (Industrial Project)

Tech Stack: Python, GCP Storage, GCP Pub-Sub, BigQuery, Dataflow
Project - 2: Walmart data ingestion into BigQuery (Industrial Project)

Tech Stack: Python, GCP Storage, Airflow, BigQuery

AWS Services Covered

EventBridge Scheduler, EventBridge Pipes, Kinesis, Kinesis Firehose, DynamoDB, SNS, SQS

S3, Lambda, IAM, CloudWatch, EC2

Step Functions, Glue, RDS, Athena, Redshift

✅ Class - 1

AWS Free Tier Account Setup
AWS Console Walkthrough
S3 Bucket Creation
AWS CLI Setup
IAM User Setup
Access S3 Buckets using AWS CLI
S3 Bucket ARN
AWS Lambda Basics
Create Hello World Lambda function with Python
Execution and Testing of Lambda Function
Trigger Lambda Function with S3 Create Object Notification
Deployment of Lambda Functions with other dependencies
How to create and use Layers in Lambda

✅ Class - 2

Read data from S3 file in Lambda Function with event driven notification & boto3 library
AWS SNS Basics
Create topics in SNS
Setup Email subscription of SNS topic
S3 Create object notification to SNS topic
Publish custom messages in SNS topic from Lambda function
AWS SQS Basics
SQS vs Kafka
Create SQS in AWS
Send and Receive messages in SQS
Read stream of messages in Lambda Function from SQS
AWS EventBridge & EventBridge Pipes
Scheduled trigger of Lambda function using EventBridge
EventBridge pipe to read stream of data from SQS and send to Lambda function with intermediate filters

✅ Class - 3

Create EC2 instance in AWS
SSH in EC2 machine from terminal
AWS RDS
Setup MySQL database with AWS RDS
Login & Access MySQL Database from terminal
Connect and manipulate data in MySQL database using Python
AWS Athena Basics
Athena vs Spark
Create & Query Athena Tables
Setup Datasources in AWS Glue Catalog
Table metadata preparation with AWS Glue Crawler
Run Athena queries from Lambda Function

✅ Class - 4

AWS Redshift fundamentals & architecture
Setup Redshift cluster
Table operations on sample data in Redshift
Load data from S3 into Redshift table
Unload query command in Redshift
Unload data from Redshift into S3 with Manifest file
Create external table in Redshift (Redshift Spectrum)
Materialized views in Redshift
AWS Glue fundamentals & components
AWS Glue Catalog & Glue Crawler
Create table in AWS Glue Catalog by crawling partitioned data in S3
Setup Redshift connector in Glue
Data pipeline using AWS Glue Visual ETL with S3 as Source and Redshift as Destination
AWS Glue job execution and insights

Challenges with traditional data lake storage
What is an open table format?
Challenges solved by open table formats
What is the small file problem?
How do open table formats solve the small file problem?
Apache Iceberg Overview
Apache Iceberg Architecture Overview
How does the Iceberg catalog work?
Metadata layer - Metadata files, Manifest lists and Manifest files
Data Layer - Data Files
Backend representation of Iceberg tables after CRUD operations
After Create Table Command
After Insert Command
After MERGE INTO / UPSERT Command
Copy-On-Write Approach (CoW)
Delete files
Positional Delete Files
Equality Delete Files
Merge-on-Read Approach (MoR)
How to choose between COW and MOR?
Behind-the-scenes process while running Select commands on Iceberg tables
Create Table Command in Iceberg
Insert Command in Iceberg
Delete & Update Commands in Iceberg
Alter Command in Iceberg
Merge Query in Iceberg
Time travel Query (Read) in Iceberg
Compaction in Iceberg
Data pipeline with Iceberg on AWS & Snowflake (Case Study)
Medallion Architecture with Iceberg (Case Study)
Apache Hudi Overview
Apache Hudi Architecture
Data Sources
Hudi Core - ACID Guarantees, Incremental Pipelines, Multimodal Indexes, Managed Tables
Lakehouse Platform
Metadata
Data Sinks
Apache Hudi Storage Layout - Base Path, Meta Path, Partition Paths, Data Files
Apache Hudi Query Types
Snapshot Queries
Time Travel Queries
Read Optimized Queries (Only MoR tables)
Incremental Queries (Latest State)
Incremental Queries (CDC)
Create Table Command in Hudi
Alter Command in Hudi
Insert, Update & Delete Command in Hudi
Merge Query in Hudi
Time Travel Queries in Hudi
Case Study of Data pipeline of Apna with Hudi
Develop and execute PySpark application in AWS Glue environment for Write
Configure Glue catalog for Iceberg tables in Spark session along with S3 warehouse path
Build dummy DataFrame and use createOrReplace() DataFrame V2 API to create Iceberg table with partitions & other table properties
Perform append() operation on Iceberg table
Perform overwritePartitions() operation on Iceberg table
Perform merge query on Iceberg table
Understanding of Iceberg metadata file after each write operation
Develop and execute PySpark application in AWS Glue environment for Read
Read Iceberg table data for latest snapshot
Query snapshot information of Iceberg table
Read Iceberg table for specific snapshot_id
Read Iceberg table for specific committed timestamp

Overview of Stream Processing
Introduction to Apache Flink
Apache Flink APIs
DataStream API (Core API)
Table API (Declarative DSL)
SQL API (High-Level Language)
Programs and Dataflows in Apache Flink
Source
Transformations
Sink
Dataflow Graph - Streams, Operators
Apache Flink Architecture
Job Manager - Resource Manager, Dispatcher, Job Master
Task Managers
Operators
Tasks
Parallelism
Operator Chaining
Task Slots
Calculate Total Tasks on a Cluster
Task Slots on Task Manager
State Management in Flink
State Backend - HashMapStateBackend, RocksDBStateBackend
How to configure State Backend
Checkpointing in Flink
How Does Checkpointing Work?
Types of Checkpointing - Full Checkpointing, Incremental Checkpointing
Checkpointing Modes - EXACTLY_ONCE, AT_LEAST_ONCE
How to configure checkpointing in the code?
How to configure checkpointing in the config file?
Savepointing in Flink
Why is Savepointing Needed?
Savepoint Lifecycle
What is Backpressure in Flink?
How Does Backpressure Occur?
How Does Backpressure Work in Flink?
Impact of Backpressure on Checkpointing
How to Tackle Backpressure in Apache Flink?
Complete setup and configuration of Flink cluster in local dev environment
Complete walkthrough of Flink UI
Create and execute PyFlink Stateless Data Pipeline in local using DataStream APIs
Source = Confluent Kafka Topic
Transformations = Map & Filters
Sink = Confluent Kafka Topic

✅ Project - 1: Flight Booking Data Pipeline with Airflow & CICD (Covered In Module 6)

Tech Stack - GitHub, GitHub Actions, Google Storage, PySpark, Dataproc Serverless, Airflow, BigQuery
✅ Project - 2: Order Tracking Event Driven Data Ingestion (Covered In Module 7)

Tech Stack - Google Storage, PySpark, Databricks, Delta Lake, Databricks Workflows, GitHub
✅ Project - 3: UPI Transactions Real Time CDC Feed Processing (Covered In Module 7)

Tech Stack - Databricks, Spark Structured Streaming, Delta Lake
✅ Project - 4: Travel Bookings Data Ingestion Pipeline With SCD2 Merge (Covered In Module 7)

Tech Stack - Databricks, PySpark, Google Storage, Delta Lake, Databricks Workflows, PyDeequ
✅ Project - 5: Healthcare Delta Live Table Pipeline with Medallion Architecture (Covered In Module 7)

Tech Stack - Databricks, PySpark, Delta Lake, Delta Live Table Job
✅ Project - 6: News Data Analysis with Event-Driven Incremental Load in Snowflake Table (Covered In Module 9)

Tech Stack: Airflow, Google Cloud Storage, Python, Snowflake
✅ Project - 7: Movie Booking CDC data real time aggregation in Snowflake Dynamic Table (Covered In Module 9)

Tech Stack: Python, Snowflake Dynamic Table, Snowflake Stream, Snowflake Tasks, Streamlit
✅ Project - 8: Car Rental Data Batch Ingestion with SCD2 Merge in Snowflake Table (Covered In Module 9)

Tech Stack: Python, PySpark, GCP Dataproc, Airflow, Snowflake
✅ Project - 9: IRCTC Streaming Data Ingestion into BigQuery (Covered In Module 9)

Tech Stack: Python, GCP Storage, GCP Pub-Sub, BigQuery, Dataflow
✅ Project - 10: Walmart Data Ingestion into BigQuery (Covered In Module 9)

Tech Stack: Python, Airflow, GCP Storage, BigQuery
✅ Project - 11: Ad Tech Real Time Data Analysis Project

Tech Stack: Python, AWS Kinesis, AWS Managed Flink, AWS Glue, Spark Streaming, Apache Iceberg, AWS S3, Glue Catalog, AWS Athena
✅ Project - 12: Betting App Real Time Data Analysis Project

Tech Stack: Python, AWS Kinesis, AWS Managed Flink, AWS Data Firehose, AWS S3, Glue Catalog, AWS Athena
✅ Project - 13: Quality Movie Data Analysis Project

Tech Stack: S3, Glue Crawler, Glue Catalog, Glue Catalog Data Quality, Glue Low Code ETL, Redshift, Event Bridge, SNS
✅ Project - 14: Airline Data Ingestion Incrementally

Tech Stack: Python, AWS S3, AWS Step Function, AWS Glue, AWS Glue Crawler, AWS Glue Catalog, AWS Redshift, AWS Event Bridge, AWS SNS
✅ Project - 15: Crypto Data Analysis Near Realtime Data Pipeline

Tech Stack: Python, AWS DynamoDB, AWS Kinesis, AWS Data Firehose, AWS Lambda, AWS S3, AWS Glue
✅ Project - 16: Credit Card Transactional Analysis For Fraud Risk

Tech Stack: Python, PySpark, Google Storage, GCP Dataproc Serverless, GCP BigQuery, GCP Composer (Airflow), PyTest, GitHub, GitHub Actions (For CI/CD)

Attention Seeking Resume Preparation and Interview Strategies
Strategies To Crack Tech Interviews
LinkedIn Profile Making
How To Expand Your Professional Network On LinkedIn
How To Use Various Job Portals
How To Approach For Referrals

Course Schedule

Mode Of The Course: Recorded
Course Duration: 160 Hours
Total Sessions: 40
Total Projects: 16
Validity: 1 Year (Starting From The Date Of Enrollment)
Class Recording Provided: Yes
Programming Language Used: Python
Prerequisite: Python

⚠️ Important Notice :

The video may not work on Linux due to DRM restrictions. It is only accessible on Chrome when using Windows or macOS.

Workaround: To access the video on Linux, you can create a Windows virtual machine (VM) and watch the video through the VM. Alternatively, you can use our Android or iOS application to view the video on your mobile device.

Instructor


Shashank Mishra

Shashank Mishra is a seasoned Data Engineer with over 7 years of experience at top companies like Expedia, Amazon, PayTM, and McKinsey & Company. He specializes in Big Data, Cloud, and architecting scalable data pipelines across industries. A proud MCA graduate from NIT Allahabad, Shashank is passionate about sharing his expertise. Through his YouTube channel, E-Learning Bridge (177k+ subscribers), and LinkedIn (175k+ followers), he has mentored over 14,000 aspiring data professionals, helping them launch successful careers in Data Engineering.


Copyright © 2024 Regex Data Learning Pvt Ltd. All Rights Reserved.
