Real-Time SQL Task Questions and Answers for Azure Data Engineers

Questions

  1. Can you explain what SCD Type 1, Type 2, and Type 3 are?
  2. Can you write PySpark code to implement SCD Type 2 in Azure Synapse?
  3. Why is SCD important in Data Warehousing?
  4. Did you use Data Profiling, Data Lineage, or Data Cleansing?
  5. Can you write PySpark code to detect and remove Duplicate Records?
  6. Write an SQL Query to fetch the Second Highest Salary from the Employee Table.
  7. What if I want the 3rd Highest or Nth Highest Salary?
  8. Write an SQL Query to identify users with 3 or more consecutive failed login attempts.
  9. Write an Incremental Load Query to only capture Newly Inserted Records from Source.
  10. Write an SQL Query to Identify Duplicate Records from the Orders Table.
  11. Write an SQL Query to calculate the Age of Each Customer based on Date of Birth.
  12. Write an SQL Query to Rank Employees by Highest Salary per Department.

 

Questions & Answers

  1. Can you explain what SCD Type 1, Type 2, and Type 3 are?

In Data Warehousing, Slowly Changing Dimensions (SCD) refer to how changes in dimension data are managed and tracked over time. There are three main types (a short SQL sketch of Type 1 and Type 3 follows the list):

  • SCD Type 1 (Overwrite the Data):
    • In SCD Type 1, we do not keep any history of changes. When there is an update in the source data, it directly overwrites the existing data in the target.
    • Example: If a customer’s address changes from A to B, the address field will simply update to B without retaining the history of A.
  • SCD Type 2 (Maintain Historical Data):
    • SCD Type 2 keeps a history of changes by adding a new record with a new surrogate key whenever there is a change in the source data.
    • Example: If a customer’s address changes from A to B, a new record will be inserted with the new address and a new surrogate key.
    • It uses Effective Date, End Date, and Current Flag to identify active records.
  • SCD Type 3 (Maintain Partial History):
    • In SCD Type 3, we only store limited history by maintaining two columns:
      • Current Value: Stores the latest data.
      • Previous Value: Stores the old data.
    • Example: If a customer’s address changes from A to B, the old address will be moved to the Previous Address column, and the new address will become the Current Address.
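To make Type 1 and Type 3 concrete, here is a minimal SQL sketch; the DimCustomer table, CustomerID 101, and the Address/PreviousAddress columns are only placeholders (Type 2 is implemented in PySpark in the next answer):

-- SCD Type 1: overwrite in place, no history is kept
UPDATE DimCustomer
SET Address = 'B'
WHERE CustomerID = 101;

-- SCD Type 3: move the old value to a "previous" column before overwriting
UPDATE DimCustomer
SET PreviousAddress = Address,
    Address = 'B'
WHERE CustomerID = 101;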

 

  2. Can you write PySpark code to implement SCD Type 2 in Azure Synapse?

Here is PySpark code for implementing SCD Type 2 with a Delta Lake merge. The target is assumed to be a Delta table stored at /mnt/data/target_table:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, current_date, col
from delta.tables import DeltaTable

# Create Spark Session
spark = SparkSession.builder.appName("SCD Type 2").getOrCreate()

# Source Data (Incremental Data from Source System)
source_data = [
    (1, 'John', 'Texas'),
    (2, 'Mike', 'New York'),
    (3, 'Sara', 'California')
]
source_df = spark.createDataFrame(source_data, ['CustomerID', 'Name', 'City'])

# Target Data (Existing Data in Data Warehouse), written once as a Delta table
target_data = [
    (1, 'John', 'California', '2022-01-01', '9999-12-31', 'Y'),
    (2, 'Mike', 'New York', '2022-01-01', '9999-12-31', 'Y')
]
target_df = spark.createDataFrame(
    target_data,
    ['CustomerID', 'Name', 'City', 'StartDate', 'EndDate', 'CurrentFlag']
)
target_df.write.format("delta").mode("overwrite").save("/mnt/data/target_table")

target_table = DeltaTable.forPath(spark, "/mnt/data/target_table")

# Step 1: Merge - expire the current record when the City has changed,
# and insert brand-new customers as current records
target_table.alias("tgt").merge(
    source_df.alias("src"),
    "tgt.CustomerID = src.CustomerID AND tgt.CurrentFlag = 'Y'"
).whenMatchedUpdate(
    condition="tgt.City != src.City",
    set={
        "EndDate": current_date().cast("string"),
        "CurrentFlag": lit("N")
    }
).whenNotMatchedInsert(
    values={
        "CustomerID": "src.CustomerID",
        "Name": "src.Name",
        "City": "src.City",
        "StartDate": current_date().cast("string"),
        "EndDate": lit("9999-12-31"),
        "CurrentFlag": lit("Y")
    }
).execute()

# Step 2: Append new current-version rows for the customers expired in Step 1,
# so each changed customer gets a fresh active record
expired_today = target_table.toDF().where(
    (col("CurrentFlag") == "N") & (col("EndDate") == current_date().cast("string"))
).select("CustomerID")

new_versions = (
    source_df.join(expired_today, "CustomerID")
    .withColumn("StartDate", current_date().cast("string"))
    .withColumn("EndDate", lit("9999-12-31"))
    .withColumn("CurrentFlag", lit("Y"))
)
new_versions.write.format("delta").mode("append").save("/mnt/data/target_table")
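A single MERGE cannot both expire a matched record and re-insert a new version for the same key, which is why the changed rows are appended in a second step. An equivalent approach is to stage the source by unioning it with itself under a NULL merge key so the new versions are inserted within the same MERGE.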

 

  3. Why is SCD important in Data Warehousing?

SCD (Slowly Changing Dimensions) is crucial in Data Warehousing because:

  • It helps in tracking historical data changes.
  • It enables time-travel analysis, where businesses can see how a customer’s behavior has changed over time.
  • It is critical for reporting and analytics.
  • Without SCD, data changes would overwrite old data, resulting in data loss.

 

  4. Did you use Data Profiling, Data Lineage, or Data Cleansing?

Yes, I have extensively worked on Data Profiling, Data Lineage, and Data Cleansing in my projects.

  • Data Profiling: I performed data profiling using Azure Data Factory and Databricks to understand data distribution, null values, duplicates, and patterns (a sample profiling query is sketched after this list).
  • Data Lineage: Implemented data lineage using Azure Purview to trace data flow from source to destination, ensuring data governance and compliance.
  • Data Cleansing: Used PySpark and SQL to remove duplicates, handle null values, and standardize data for better quality.
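As a simple illustration, column-level profiling can be expressed directly in SQL; the Customers table and the CustomerID and Email columns below are only placeholders:

SELECT
    COUNT(*) AS TotalRows,
    COUNT(DISTINCT CustomerID) AS DistinctCustomerIDs,
    SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END) AS NullEmails,
    COUNT(*) - COUNT(DISTINCT CustomerID) AS PotentialDuplicates
FROM Customers;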

 

  5. Can you write PySpark code to detect and remove Duplicate Records?

from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession.builder.appName("Remove Duplicates").getOrCreate()

# Sample Data
data = [(1, 'John', 'Texas'),
        (1, 'John', 'Texas'),
        (2, 'Mike', 'New York'),
        (3, 'Sara', 'California')]

df = spark.createDataFrame(data, ['ID', 'Name', 'City'])

# Remove Duplicates
df_no_duplicates = df.dropDuplicates()

df_no_duplicates.show()
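Note that dropDuplicates() with no arguments removes only fully identical rows; to de-duplicate on a business key, pass the key columns explicitly, for example df.dropDuplicates(['ID']).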

 

  6. Write an SQL Query to fetch the Second Highest Salary from the Employee Table.

SELECT MAX(Salary) AS SecondHighestSalary
FROM Employee
WHERE Salary < (SELECT MAX(Salary) FROM Employee);

 

  7. What if I want the 3rd Highest or Nth Highest Salary?

For the Nth highest salary (LIMIT/OFFSET syntax, as in MySQL or PostgreSQL; replace N-1 with a literal value):

SELECT DISTINCT Salary
FROM Employee
ORDER BY Salary DESC
LIMIT 1 OFFSET (N-1);

For the 3rd highest salary:

SELECT DISTINCT Salary
FROM Employee
ORDER BY Salary DESC
LIMIT 1 OFFSET 2;
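In Azure Synapse / SQL Server, where LIMIT is not available, the same result can be sketched with a window function over the same Employee table (here for the 3rd highest; change the rank filter for N):

SELECT DISTINCT Salary
FROM (
    SELECT Salary,
           DENSE_RANK() OVER (ORDER BY Salary DESC) AS SalaryRank
    FROM Employee
) ranked
WHERE SalaryRank = 3;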

 

  8. Write an SQL Query to identify users with 3 or more consecutive failed login attempts.

Consecutive attempts cannot be detected with a plain GROUP BY; a gaps-and-islands pattern with window functions identifies runs of consecutive failures (assuming each row in LoginAttempts has UserID, Status, and a LoginDate that orders the attempts):

WITH Attempts AS (
    SELECT UserID, LoginDate, Status,
           ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY LoginDate)
         - ROW_NUMBER() OVER (PARTITION BY UserID, Status ORDER BY LoginDate) AS grp
    FROM LoginAttempts
)
SELECT DISTINCT UserID
FROM Attempts
WHERE Status = 'Failed'
GROUP BY UserID, grp
HAVING COUNT(*) >= 3;

 

  9. Write an Incremental Load Query to only capture Newly Inserted Records from Source.

SELECT S.*
FROM SourceTable S
LEFT JOIN TargetTable T
    ON S.ID = T.ID
WHERE T.ID IS NULL;
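When the source carries a modification timestamp, an alternative is a watermark-based incremental load, which is the pattern typically used with Azure Data Factory; the LastModifiedDate column and WatermarkTable below are placeholders:

SELECT *
FROM SourceTable
WHERE LastModifiedDate > (SELECT MAX(WatermarkValue) FROM WatermarkTable);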

 

  10. Write an SQL Query to Identify Duplicate Records from the Orders Table.

SELECT OrderID, COUNT(*) AS DuplicateCount
FROM Orders
GROUP BY OrderID
HAVING COUNT(*) > 1;
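To return the full duplicate rows rather than just the offending keys, a ROW_NUMBER() sketch over the same Orders table works (keying on OrderID as above):

SELECT *
FROM (
    SELECT o.*,
           ROW_NUMBER() OVER (PARTITION BY OrderID ORDER BY OrderID) AS rn
    FROM Orders o
) d
WHERE rn > 1;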

 

  11. Write an SQL Query to calculate the Age of Each Customer based on Date of Birth.

SELECT Name, DateOfBirth,
       YEAR(CURRENT_DATE) - YEAR(DateOfBirth) AS Age
FROM Customers;
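The year difference above ignores whether the birthday has already occurred this year; a more precise sketch in T-SQL (Azure Synapse / SQL Server) on the same Customers table:

SELECT Name, DateOfBirth,
       DATEDIFF(YEAR, DateOfBirth, GETDATE())
       - CASE WHEN DATEADD(YEAR, DATEDIFF(YEAR, DateOfBirth, GETDATE()), DateOfBirth) > GETDATE()
              THEN 1 ELSE 0 END AS Age
FROM Customers;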

 

  12. Write an SQL Query to Rank Employees by Highest Salary per Department.

SELECT EmployeeID, Name, Department, Salary,
       RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS SalaryRank
FROM Employees;
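RANK() leaves gaps in the ranking after ties, DENSE_RANK() does not, and ROW_NUMBER() breaks ties arbitrarily; to return only the top earner in each department, wrap the query in a subquery or CTE and filter on SalaryRank = 1.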