Data Engineering 4.0 With AWS - Basic To Advance (Live Classes)


  • Language: English
  • Sessions: 26
  • Duration: 3 Months
  • Starts On: In Progress
  • Validity: 1 Year
  • Mode: Live


INR 7500

Admission Open

This course includes
  • Content Duration - 100+ Hours
  • 12 Industry Projects (End-to-End Implementation)
  • Hands-on Exercises, Quizzes & Interview Preparation Material
  • Placement Assistance
  • Resume & LinkedIn Profile Making
  • Private Discord Community For Networking
  • Live & Offline Doubt Support
  • Certificate of Completion

Course Content

  • Class - 1 (Live Classes)
    • What is Database?
    • Difference between Transactional Databases and NoSQL databases
    • What is DBMS & RDBMS?
    • Transactions & ACID Properties
    • Setup MySQL Workbench
    • Setup MySQL Using Docker
    • DDL, DML, DQL, DCL
    • CREATE Command
    • INSERT Command
    • Integrity Constraints

  • Class - 2 (Live Classes)
    • Alter Command
    • Drop, Truncate and Delete
    • Primary Key vs Foreign Key
    • Referential Integrity
    • Select Query, In-Built Functions, Aliases
    • UPDATE Command
    • Auto Increment in create table
    • Limit
    • Order By Clause
    • Conditional Operators
    • Logical Operators
    • Like Operation
    • User Defined Functions (UDFs)

  • Class - 3 (Live Classes)
    • IS NULL, IS NOT NULL
    • Group By, Having Clause
    • GROUP_CONCAT, GROUP BY with ROLLUP
    • Sub Queries, IN and NOT IN
    • CASE-When
    • SQL Joins

  • Class - 4 (Live Classes)
    • Exists and Not Exists
    • Window Functions
    • Frame Clause
    • Coalesce Function
    • Common Table Expressions - Iterative and Recursive

  • Class - 1 (Only Theory Part Recorded)
    • BigData Fundamentals
    • 5 V’s of BigData
    • Distributed Computation
    • Distributed Storage
    • Cluster, Commodity Hardware
    • File Formats
    • Types of Data
    • History of Hadoop
    • Hadoop Architecture & Components

  • Class - 2 (Only Theory Part Recorded)
    • Map-Reduce Architecture
    • YARN Architecture

  • Class - 3 (Only Theory Part Recorded)
    • Hive Complete Architecture
    • Hadoop Cluster Setup on GCP (Dataproc)

  • Class - 4 (Live Classes)
    • Data Types in Hive
    • Create Database
    • Create Table
    • Load Data From Local
    • Load Data From HDFS
    • Internal Table
    • External Table
    • Array & Map Data Types
    • SerDe in Hive
    • File Formats in Hive - ORC, Parquet, Avro

  • Class - 5 (Live Classes)
    • CSV SerDe
    • JSON SerDe
    • Parquet SerDe
    • ORC SerDe
    • Static Partitioning
    • Dynamic Partitioning
    • Bucketing
    • Map-Side Join, Bucket Map Join, Sorted Merge Join, Skew Join

  • Class - 1 (Only Theory Part Recorded)
    • Kafka Cluster Architecture
    • Brokers
    • Topics
    • Partitions
    • Producer-Consumer, Consumer Group
    • Offset Management
    • Replicas
    • Commits
    • Sync & Async Commits

  • Class - 2 (Live Classes)
    • Confluent Kafka Setup
    • Topic Creation
    • Schema Registry
    • Key, Value Message
    • Message in Kafka Topics based on Random and Constant Keys
    • Kafka Producer Code with Serialization
    • Kafka Consumer Code with Deserialization
    • Consumer Groups
    • Working with JSON, CSV Data
    • GCP Pub-Sub Setup
    • Producer & Consumer for GCP Pub-Sub Setup

  • Class - 1
    • CAP Theorem
    • What is MongoDB and MongoDB Atlas?
    • MongoDB vs Relational Database
    • MongoDB features
    • MongoDB use cases and applications
    • MongoDB architecture
    • Node
    • Data Centre
    • Cluster
    • Data replication
    • Write operation
    • Read operation
    • Indexing

  • Class - 2
    • MongoDB Atlas Setup
    • MongoDB Cluster Creation
    • MongoDB Compass Setup
    • Database & Collection in MongoDB
    • Connect with MongoDB Cluster from MongoDB Compass
    • Import JSON data in MongoDB Collection
    • Queries on MongoDB Collection from Python Application
    • ksqlDB in Confluent Kafka
    • Streams in ksqlDB
    • Tables in ksqlDB
    • Persistent Queries in ksqlDB
    • JOIN queries on streams in ksqlDB
    • McDonald's Payments Stream data ingestion from Kafka to MongoDB
      • Setup Orders & Payments Streams using ksqlDB
      • Setup windowed JOIN streams using ksqlDB
      • Setup MongoDB Sink Connector

  • Class - 3
    • CAP Theorem
    • What is Apache Cassandra?
    • Cassandra Database vs Relational Database
    • Apache Cassandra features
    • Cassandra use cases and applications
    • Cassandra architecture
    • Node
    • Data Centre
    • Cluster
    • Commit log
    • Mem-table
    • SSTable
    • Data replication
    • Read operation

  • Class - 4
    • Data Partitioning and Token
    • VNodes in Cassandra
    • Read Operation in Cassandra
    • Compaction in Cassandra
    • Gossip Protocol in Cassandra
    • Write consistency in Cassandra
    • Read consistency in Cassandra
    • Partition Key, Clustering Key, Row Key Declaration
    • Cassandra Setup Using Docker
    • CQL in Cassandra
    • Cassandra Free Tier Setup On DataStax
    • Queries in Cassandra using Python

  • Class - 1 (Only Theory Part Recorded)
    • Problems with Hadoop Map-Reduce
    • What is Apache Spark?
    • Features of Spark
    • Spark ecosystem
    • RDD in Spark
    • Properties of RDD
    • How does Spark perform data partitioning?
    • Transformation in Spark
    • Narrow Transformation vs Wide Transformation
    • Action in Spark
    • Are read & write operations in Spark transformations or actions?
    • Lazy evaluation in Spark
    • Lineage graph or DAG in Spark
    • How does a DAG look on the Spark Web UI?
    • Job, Stage and Task in Spark
    • What if Spark cluster capacity is less than the size of data to be processed?
    • Spark in-depth architecture and its components
    • Spark with Standalone Cluster Manager Type
    • Spark with YARN Cluster Manager Type
    • Deployment modes of Spark Application
    • Internals of Spark Job over the cluster

  • Class - 2 (Only Theory Part Recorded)
    • Persist and Caching in Spark
    • Storage Levels in Persist
    • How does data skewness occur in Spark?
    • Techniques to deal with data skewness
    • Repartition vs Coalesce
    • Example of Key Salting technique
    • RDD vs Dataframe vs Dataset
    • How to use Spark-Submit utility?
    • Memory management in Spark
    • Memory components in Executor Container
    • Dynamic occupancy mechanism
    • How to process 1 TB of data in Spark?
    • Resource allocation case study - 1 : 6 Nodes and each node has 16 cores & 64 GB RAM
    • Resource allocation case study - 2 : 6 Nodes and each node has 32 cores & 64 GB RAM
    • Resource allocation case study - 3 : When more memory isn't required for the executors
    • Broadcast and Accumulators in Spark
    • Different types of failures in Spark and how to resolve them
    • Out Of Memory failures
    • Code and Resource level optimizations in Spark
    • Best practices to design Spark Applications
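
The first two resource allocation case studies above can be worked through with a commonly cited rule of thumb. The sketch below is not taken from the course material: the function name `size_executors`, the reservation of 1 core and 1 GB per node, the choice of 5 cores per executor, the single executor set aside for the YARN application master, and the 10% memory overhead are all assumptions for illustration, and the memory figure is only an approximation.

```python
import math


def size_executors(nodes, cores_per_node, ram_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Estimate executor count, cores, and heap size for a cluster."""
    # Assumption: leave 1 core and 1 GB per node for the OS and daemons.
    usable_cores = cores_per_node - 1
    usable_ram_gb = ram_gb_per_node - 1

    executors_per_node = usable_cores // cores_per_executor
    # Assumption: one executor slot is given up for the YARN application master.
    total_executors = nodes * executors_per_node - 1

    # Split each node's usable RAM evenly, then set aside the overhead share.
    ram_per_executor_gb = usable_ram_gb / executors_per_node
    heap_gb = math.floor(ram_per_executor_gb * (1 - overhead_fraction))

    return {
        "num_executors": total_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": heap_gb,
    }


# Case study 1: 6 nodes, 16 cores and 64 GB RAM each.
case_1 = size_executors(nodes=6, cores_per_node=16, ram_gb_per_node=64)
# Case study 2: 6 nodes, 32 cores and 64 GB RAM each.
case_2 = size_executors(nodes=6, cores_per_node=32, ram_gb_per_node=64)
print(case_1)
print(case_2)
```

Under these assumptions, the three values correspond to the `--num-executors`, `--executor-cores`, and `--executor-memory` options of `spark-submit`; a real cluster may call for different reservations.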

  • Class - 3 (Live Classes)
    • Spark Cluster Setup On GCP Dataproc
    • Spark Session creation
    • Create dataframe with custom schema
    • Read csv data from HDFS
    • Partitions and Partition size in Spark job
    • Select operation
    • withColumn operation
    • withColumnRenamed operation
    • Filter operation
    • Drop column operation
    • Drop Duplicates operation
    • Order By operation
    • Group By operation
    • Accumulator
    • Case-When operation
    • Window functions
    • Join & Broadcast join operation
    • Spark SQL, Register dataframe as table
    • Write CSV data in HDFS without partition key
    • Write Parquet data in HDFS
    • Write CSV data in HDFS with partition key
    • Write CSV data in HDFS with Coalesce
    • Read JSON Data in Spark and explode columns

  • Class - 4 (Live Classes)
    • Execution of Spark application using Spark-Submit Utility
    • Monitor, Debug & Understand Spark DAG on Spark Web UI - Practical Example
    • What is Stream Processing?
    • Spark structured streaming
    • Spark streaming with word count example
    • Output modes in writeStream in Spark structured streaming
    • What if memory due to state management is full?
    • DStream vs Spark Structured Streaming
    • Spark structured streaming with File as source
    • Triggers in Spark structured streaming

  • Class - 5 (Live Classes)
    • Checkpointing
    • Exactly-once semantics in Spark Structured Streaming
    • Stateless and Stateful Processing
    • Global aggregation and Windowed aggregation
    • Windowing
    • Sliding window
    • Tumbling window vs Sliding window
    • When and why should we use windowing?
    • Windowed aggregations example
    • Arbitrary stateful transformations
    • Watermarking
    • Working example of handling delayed events using watermarking
    • Code implementation for stateless Spark Structured Streaming with a Confluent Kafka topic as the source
    • Code implementation for stateful Spark Structured Streaming with a Confluent Kafka topic as the source - Global aggregation and Windowed aggregation
    • Spark Structured Streaming pipeline implementation where the source is a Confluent Kafka topic and the destination is MongoDB

  • Class - 1 (Only Theory Part Recorded)
    • What is orchestration in BigData?
    • Need for dependency management in Data Pipeline design
    • What is Airflow?
    • Architecture & Different Components of Airflow
    • Operators in Airflow
    • How to write Airflow DAG Scripts?
    • Attribute description
    • How to execute parallel tasks?

  • Class - 2 (Live Classes)
    • Setup Airflow on GCP using Composer
    • Create and schedule Airflow DAG with sequential tasks using BashOperator and PythonOperator
    • Create and schedule Airflow DAG with parallel tasks using BashOperator and PythonOperator
    • Airflow Project - 1 : End-To-End Airflow DAG to Create, Run PySpark Job and Destroy GCP Dataproc cluster
    • Airflow Project - 2 : Airflow DAG to use user defined variables and pass external config parameters

  • Class - 1 (Live Classes)
    • What is Databricks?
    • Databricks Architecture, Delta Lake & Delta Tables
    • Databricks Account Setup on GCP
    • Workspace setup
    • Compute in Databricks & Spark cluster setup
    • Unity catalog
    • Spark Job Execution on Databricks Cluster
    • Workflow Creation in Databricks
    • Project - Incremental Logistics Data Ingestion and perform merge operation in Delta tables

  • Class - 2 (Live Classes)
    • Project - 1: Real time healthcare data processing with DLT (Delta Live Tables) in Databricks
    • Tech Stack:
      • PySpark
      • Databricks
      • Delta Tables
      • Databricks DLT Workflow
    • Project - 2: Booking.com incremental SCD2 Merge ingestion
    • Tech Stack:
      • PySpark
      • Databricks
      • Delta Tables
      • Databricks DLT Workflow
      • PyDeequ

  • Class - 1
    • OLAP vs OLTP
    • What is a Data Warehouse?
    • Difference between Data Warehouse, Data Lake and Data Mart
    • Fact Tables
    • Dimension Tables
    • Slowly changing Dimensions
    • Types of SCDs
    • Star Schema Design
    • Snowflake Schema Design
    • Galaxy Schema Design

  • Class - 2
    • Uber Data Warehouse Design Case Study
    • AirBnB Data Warehouse Design Case Study

  • Class - 1 (Live Classes)
    • Snowflake free tier account setup
    • Snowflake UI walkthrough
    • Load data from the UI and create a Snowflake table
    • Event-driven data ingestion into a Snowflake table using Snowpipe (Tech Stack Used: Google Storage Bucket, GCP Pub-Sub, Snowflake)
    • How to create and schedule a task in Snowflake

  • Class - 2 (Live Classes)
    • Project - 1: News Data Analysis with event driven incremental load in Snowflake table
    • Tech Stack:
      • Airflow
      • Google Cloud Storage
      • Python
      • Snowflake
    • Project - 2: Ecommerce CDC data real time aggregation in Snowflake Dynamic Table
    • Tech Stack:
      • Python
      • Snowflake Dynamic Table
    • Project - 3: Car rental data batch ingestion with SCD2 merge in Snowflake table
    • Tech Stack:
      • Python
      • PySpark
      • GCP Dataproc
      • Snowflake
      • Airflow

  • Class - 3 (Recorded)
    • BigQuery Overview
    • BigQuery Architecture
    • Capacitor — Columnar format
    • Colossus — Storage
    • Dremel — Execution Engine
    • Borg — Compute
    • Jupiter — Network
    • Project - 1: IRCTC Streaming data ingestion into BigQuery
    • Tech Stack:
      • Python
      • GCP Storage
      • GCP Pub-Sub
      • BigQuery
      • Dataflow
    • Project - 2: Walmart data ingestion into BigQuery
    • Tech Stack:
      • Python
      • Airflow
      • GCP Storage
      • BigQuery

  • AWS Services Covered
    • S3, Lambda, IAM, CloudWatch, EC2, SNS, SQS
    • EventBridge Scheduler, EventBridge Pipes, Kinesis, Kinesis Firehose, DynamoDB
    • Step Functions, EMR, Glue, RDS, Athena, Redshift
  • Class - 1 (Live Classes)
    • AWS Free Tier Account Setup
    • AWS Console Walkthrough
    • S3 Bucket Creation
    • AWS CLI Setup
    • IAM User Setup
    • Access S3 Buckets using AWS CLI
    • S3 Bucket ARN
    • AWS Lambda Basics
    • Create Hello World Lambda function with Python
    • Execution and Testing of Lambda Function
    • Trigger Lambda Function with S3 Create Object Notification
    • Deployment of Lambda Functions with other dependencies
    • How to create and use Layers in Lambda
  • Class - 2 (Live Classes)
    • Read data from S3 file in Lambda Function with event driven notification & boto3 library
    • AWS SNS Basics
    • Create topics in SNS
    • Setup Email subscription of SNS topic
    • S3 Create object notification to SNS topic
    • Publish custom messages in SNS topic from Lambda function
    • AWS SQS Basics
    • SQS vs Kafka
    • Create SQS in AWS
    • Send and Receive messages in SQS
    • Read stream of messages in Lambda Function from SQS
    • AWS EventBridge & EventBridge Pipes
    • Scheduled trigger of Lambda function using EventBridge
    • EventBridge Pipe to read stream of data from SQS and send to Lambda function with intermediate filters
  • Class - 3 (Live Classes)
    • Create EC2 instance in AWS
    • SSH in EC2 machine from terminal
    • AWS RDS
    • Setup MySQL database with AWS RDS
    • Login & Access MySQL Database from terminal
    • Connect and manipulate data in MySQL database using Python
    • AWS Athena Basics
    • Athena vs Spark
    • Create & Query Athena Tables
    • Setup Datasources in AWS Glue Catalog
    • Table metadata preparation with AWS Glue Crawler
    • Run Athena queries from Lambda Function
  • Class - 4 (Live Classes)
    • Crawl partitioned data in S3 with Glue Crawler
    • Read partitioned data from S3 in Athena
    • AWS Redshift fundamentals & architecture
    • Setup Redshift cluster
    • Table operations on sample data in Redshift
    • Load data from S3 into Redshift table
    • Unload query command in Redshift
    • Unload data from Redshift into S3 with Manifest file
    • Create external table in Redshift
    • Materialized views in Redshift
    • AWS Glue fundamentals & components
    • AWS Glue Catalog & Glue Crawler
    • Setup Redshift connector in Glue
    • Data pipeline using AWS Glue Visualizer with S3 as Source and Redshift as Destination
    • AWS Glue job execution and insights

  • Introduction to Apache Iceberg: Overview, need, and key features like schema evolution, partitioning, and ACID compliance.
  • Iceberg Architecture: Understanding Iceberg’s metadata layer, file formats, and its role in data lakehouse architecture.
  • Installation & Setup: How to set up Iceberg on AWS, GCP, or Databricks with connectors for Hive, Spark, and Flink.
  • Creating Iceberg Tables: Steps to create, manage, and partition Iceberg tables efficiently.
  • Inserts, Updates, and Deletes: Handling data ingestion, updating records, and safely deleting data in Iceberg.
  • Time Travel & Snapshot Management: Using Iceberg’s time travel feature to view historical data and manage table snapshots.
  • Schema & Partition Evolution: Handling schema changes and evolving partitions without rewriting the table.
  • Query Optimization: Techniques for optimizing queries, using hidden partitioning, and improving performance in Spark and Flink.
  • Integrations & Use Cases: Integrating Iceberg with big data tools (Spark, Flink) and real-world use cases in large-scale data lakes.

  • Introduction to Apache Hudi: Overview of Hudi, key features like incremental data processing, ACID transactions, and how it compares to other table formats (Iceberg, Delta).
  • Hudi Architecture: Understanding Hudi’s core components—metadata, timeline, file groups, and the distinction between COW (Copy-On-Write) and MOR (Merge-On-Read) storage types.
  • Setting Up Hudi: Installation and configuration of Hudi on AWS, GCP, or Databricks, with integration in Spark, Flink, and Hive environments.
  • Creating Hudi Tables: Steps to create, manage, and work with COW and MOR tables in Apache Hudi for different use cases.
  • Hudi Upserts & Deletes: Efficiently handling updates, deletes, and managing incremental data with Hudi’s ACID capabilities.
  • Time Travel and Versioning: Utilizing Hudi’s time travel features to access historical data and manage table snapshots.
  • Hudi DeltaStreamer: Introduction to Hudi DeltaStreamer for streaming data ingestion and handling both batch and real-time updates.
  • Incremental Data Processing: How to leverage Hudi’s incremental pull mechanism for processing only changed data, reducing data processing overhead.
  • Integrations & Real-World Use Cases: Working with Hudi in big data tools (Spark, Hive, Flink) and exploring real-world examples of using Hudi in data lakes and streaming environments.

  • Introduction to Apache Flink: Overview of Flink, its core capabilities in real-time stream processing, and comparison with other streaming platforms like Kafka Streams and Spark Streaming.
  • Flink Architecture: Understanding Flink's core components—Job Manager, Task Manager, and how Flink manages state and fault tolerance.
  • Setting Up Flink on AWS: Configuration of AWS Managed Flink (Kinesis Data Analytics for Flink), setting up clusters, and integrating with AWS services like Kinesis and S3.
  • Working with Flink Streaming: Introduction to DataStream API, defining sources, sinks, and transformations for building real-time streaming pipelines.
  • State Management in Flink: Explaining Flink’s stateful streaming capabilities, how it maintains and manages state, and how to leverage keyed streams for stateful operations.
  • Event Time & Windowing: Using event time processing and defining window operators (Tumbling, Sliding, Session windows) for time-based aggregations in real-time streams.
  • Fault Tolerance & Checkpointing: Leveraging Flink’s exactly-once semantics, checkpointing, and savepoints to ensure reliable and fault-tolerant stream processing.
  • Flink SQL: Introduction to Flink SQL for querying real-time data streams, building end-to-end pipelines with SQL, and integrating with other systems like Kafka or Kinesis.
  • Flink on AWS Kinesis: Setting up real-time streaming jobs with Kinesis as a source and S3 as a sink, utilizing AWS services for monitoring, scaling, and managing Flink jobs.
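
The difference between the tumbling and sliding windows named under "Event Time & Windowing" can be illustrated in plain Python. This is a minimal sketch and not Flink's API: the function names `tumbling_window` and `sliding_windows` are hypothetical, timestamps are assumed to be non-negative integers (for example, seconds), and windows are represented as half-open `(start, end)` intervals.

```python
def tumbling_window(timestamp, size):
    """Return the single fixed-size window that contains the timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def sliding_windows(timestamp, size, slide):
    """Return every overlapping window that contains the timestamp."""
    # Start from the latest window that begins at or before the timestamp,
    # then step backwards by the slide while the window still covers it.
    start = timestamp - (timestamp % slide)
    windows = []
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


# An event at t=7 falls into exactly one tumbling window of size 5 ...
print(tumbling_window(7, size=5))
# ... but into two sliding windows of size 10 that advance every 5.
print(sliding_windows(7, size=10, slide=5))
```

When the slide equals the window size, the sliding case reduces to the tumbling case, which is why a tumbling window is often described as a sliding window without overlap.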

  • All In Live Classes
  • Project - 1: Real-time Healthcare Data Processing with DLT (Delta Live Tables) in Databricks
    • Tech Stack:
      • PySpark
      • Databricks
      • Delta Tables
      • Databricks DLT Workflow
  • Project - 2: Booking.com Incremental SCD2 Merge Ingestion
    • Tech Stack:
      • PySpark
      • Databricks
      • Delta Tables
      • Databricks DLT Workflow
      • PyDeequ
  • Project - 3: News Data Analysis with Event-Driven Incremental Load in Snowflake Table
    • Tech Stack:
      • Airflow
      • Google Cloud Storage
      • Python
      • Snowflake
  • Project - 4: E-commerce CDC Data Real-time Aggregation in Snowflake Dynamic Table
    • Tech Stack:
      • Python
      • Snowflake
      • Dynamic Table
  • Project - 5: Car Rental Data Batch Ingestion with SCD2 Merge in Snowflake Table
    • Tech Stack:
      • Python
      • PySpark
      • GCP Dataproc
      • Snowflake
      • Airflow
  • Project - 6: IRCTC Streaming Data Ingestion into BigQuery
    • Tech Stack:
      • Python
      • GCP Storage
      • GCP Pub-Sub
      • BigQuery
      • Dataflow
  • Project - 7: Walmart Data Ingestion into BigQuery
    • Tech Stack:
      • Python
      • Airflow
      • GCP Storage
      • BigQuery
  • Project - 8: Quality Movie Data Analysis
    • Tech Stack:
      • S3
      • Glue Crawler
      • Glue Catalog
      • Glue Catalog Data Quality
      • Glue Low Code ETL (With PySpark)
      • Redshift
      • EventBridge
      • SNS
  • Project - 9: Gadget Sales Data Projection
    • Tech Stack:
      • Python
      • DynamoDB
      • DynamoDB Streams
      • Kinesis Streams
      • EventBridge Pipes
      • Kinesis Firehose
      • S3
      • Lambda
      • Athena
  • Project - 10: Airline Data Ingestion
    • Tech Stack:
      • S3
      • S3 CloudTrail Notification
      • EventBridge Pattern Rule
      • Glue Crawler
      • Glue Visual ETL (With PySpark)
      • SNS
      • Redshift
      • Step Functions
  • Project - 11: Logistics Data Warehouse Management
    • Tech Stack:
      • GCP Storage
      • Airflow (GCP Composer)
      • Hive Operators
      • PySpark With GCP Dataproc
      • Hive
  • Project - 12: Sales Order & Payment Data Real Time Ingestion
    • Tech Stack:
      • GCP Pub-Sub
      • Python
      • Docker
      • Cassandra
  • New Projects With AWS, Flink, Iceberg, Hudi

Course Schedule

Course Starts On:
Live
9-Nov-2024
Course Duration:
100+ Hours
Session:
26
Validity:
1 Year (Starting From The Date Of Enrollment)
Class Timing:
Saturday & Sunday [9:00 AM - 12:00 PM Live Teaching, 12:00 PM to 1:00 PM Live Doubt Session] (IST)
Class Duration:
3 Hours Live Teaching, 60 minutes Doubt Solving
Class Recording Provided:
Yes
Programming Language Used:
Python
Prerequisite:
⚠️ Important Notice:
Course videos may not play on Linux due to DRM restrictions. They are accessible only in Chrome on Windows or macOS.

Workaround: To access the video on Linux, you can create a Windows virtual machine (VM) and watch the video through the VM. Alternatively, you can use our Android or iOS application to view the video on your mobile device.

Instructor

Shashank Mishra is a seasoned Data Engineer with over 6 years of experience at top companies like Expedia, Amazon, PayTM, and McKinsey & Company. He specializes in Big Data, Cloud, and architecting scalable data pipelines across industries. A proud MCA graduate from NIT Allahabad, Shashank is passionate about sharing his expertise. Through his YouTube channel, E-Learning Bridge (175k+ subscribers), and LinkedIn (170k+ followers), he has mentored over 14,000 aspiring data professionals, helping them launch successful careers in Data Engineering.

Why Data Engineer?

Map your interests to check whether the Data Engineer profile is a close match for you.
