Morgan Stanley Interview Experience for Data Engineer 2021
Morgan Stanley is a leading global investment bank and wealth management firm organized around three main business units: institutional securities, wealth management, and investment management.
I am a B.Tech graduate in computer science and engineering from a tier-3 college. I got a referral through LinkedIn for the Data Engineer position at Morgan Stanley’s Mumbai/Bengaluru division. The hiring process for the Data Engineer role consisted of five rounds before the final decision was made.
The recruitment process was as follows:
Round 1: Preliminary Round (Online Test)
This was the TEDRA-India-Data-Engineering-Recruiting-Test, conducted on the HackerRank platform. The total duration of the test was 2 hours, and it consisted of the following individually timed sections:
- SQL coding questions
- MCQs on Python & SQL
- One coding question based on data structures
- Scenario-based questions on relational database transactions & administration
- MCQs on Unix & operating systems
I did decently on all the SQL questions and the technical MCQ sections, and I passed all test cases of the coding question. I completed the test in 1 hour 50 minutes. The coding question was of medium difficulty, and the MCQs ranged from medium to hard. Some tips for preparing for this round:
- Focus on basic level data structure topics such as Array, String, Stack, Queue, Linked List, and Tree (only Binary Search Tree).
- Primarily focus on SQL concepts & queries (aggregate functions, window functions (very important), the different types of joins and when to use each, union operations, GROUP BY, subqueries, the HAVING clause, etc.). You can refer to https://www.geeksforgeeks.org/sql-tutorial/ to get familiar with SQL concepts.
- Focus on DBMS concepts (entity-relationship concepts, normalization, transactions & concurrency control). You can refer to https://www.geeksforgeeks.org/dbms/ for learning them.
- Primarily focus on Python data structures such as List, Tuple, Set & Dictionary (medium level). You can also dive deeper into pandas (a Python library).
- Go through Unix commands & how to write shell scripts (easy level). Most people are already familiar with UNIX & Linux commands, so you can simply go through https://www.geeksforgeeks.org/essential-linuxunix-commands/ before attempting the assessment.
Round 2: Technical Interview 1
- I got a call from HR saying that I had cleared the online test and was shortlisted for the technical discussion. This round lasted about 1 hour 20 minutes and was conducted by a manager at Morgan Stanley.
- This interview mainly focused on SQL-based questions, Python coding questions, Big Data concepts, Spark, the data ingestion tool Sqoop, Hive, HDFS, MapReduce concepts, cloud computing concepts, SDLC, Agile methodology (based on the Scrum framework at a high level), DevOps strategy (basic level), CI/CD pipelines, Git, database concepts, NoSQL databases (Stardog graph database), AWS services-based scenario questions, and simple data structure-based questions (Array & Stack).
Some of the SQL coding questions that were asked:
- Consider two tables A & B, each with a column ‘id’, as follows:
I was asked to find the number of rows in the final output in all four cases: inner join, left join, right join, and full outer join.
- Write an SQL query to find the Nth largest value in a column: https://www.geeksforgeeks.org/sql-query-to-find-the-nth-largest-value-in-a-column-using-limit-and-offset/.
- Given an employee table with attributes empId, empSalary, empDeptId and a department table with attributes deptId, depName, CourseOffered, I was asked to write an SQL query on a notepad to find the employee with the highest salary in each department using window functions. I used the dense_rank window function to construct the query, and I was asked to explain why I used dense_rank instead of the rank function.
- Some questions were based on the lead, lag, and ntile window functions & their uses. I explained them well by taking examples and writing SQL queries for each.
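The dense_rank and Nth-largest questions above can be sketched roughly as follows. This is only an illustration, not the exact queries from the interview: the sample rows are invented, and running the SQL through Python's built-in sqlite3 module (which needs SQLite 3.25 or newer for window functions) is an assumption made to keep the example self-contained. Note that for the top salary, rank and dense_rank give the same result; they only differ after a tie, which the second query shows.

```python
import sqlite3

# Hypothetical sample data: column names follow the employee table described
# above, but the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (empId INTEGER, empSalary INTEGER, empDeptId INTEGER);
    INSERT INTO employee VALUES
        (1, 900, 10), (2, 900, 10), (3, 700, 10),
        (4, 800, 20), (5, 600, 20);
""")

# Highest-paid employee(s) per department: rank salaries inside each
# department and keep rank 1. Tied employees share rank 1 and are all returned.
top_per_dept = conn.execute("""
    SELECT empDeptId, empId, empSalary
    FROM (
        SELECT empDeptId, empId, empSalary,
               DENSE_RANK() OVER (
                   PARTITION BY empDeptId ORDER BY empSalary DESC
               ) AS salary_rank
        FROM employee
    )
    WHERE salary_rank = 1
    ORDER BY empDeptId, empId
""").fetchall()

# RANK leaves a gap after a tie (1, 1, 3); DENSE_RANK does not (1, 1, 2).
ranks_dept_10 = conn.execute("""
    SELECT empId,
           RANK()       OVER (ORDER BY empSalary DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY empSalary DESC) AS dense_rnk
    FROM employee
    WHERE empDeptId = 10
    ORDER BY empId
""").fetchall()


def nth_largest_salary(n):
    """Return the Nth largest distinct salary via LIMIT/OFFSET, or None."""
    row = conn.execute(
        "SELECT DISTINCT empSalary FROM employee "
        "ORDER BY empSalary DESC LIMIT 1 OFFSET ?",
        (n - 1,),
    ).fetchone()
    return row[0] if row else None


print(top_per_dept)
print(ranks_dept_10)
print(nth_largest_salary(2))
```

The gap that rank leaves after a tie is the usual reason to prefer dense_rank when the question is about the Nth highest salary rather than only the highest.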
Questions Based on Apache Hive & Sqoop:
- First, he asked me to explain Hive as per the official definition, and which database Hive uses by default for storing metadata. I told him, “By default, Hive uses a built-in Derby SQL server”.
- Some deep-dive, scenario-based questions on Hive bucketing & partitioning and their differences.
- The difference between external and managed Hive tables, in terms of table metadata and ACID transactions.
- How can we recover the data of a Hive external table that was dropped by mistake?
- One question on Sqoop incremental load; he asked me to write the Sqoop command for it using any use case.
- I was asked to write a Sqoop command to import all relational tables from MySQL into HDFS.
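The two Sqoop commands asked for above could look roughly like the ones built below. This is a sketch under assumptions, not what I wrote in the interview: the host, database, user, table, column, and HDFS paths are all hypothetical placeholders, and the commands are assembled as Python argument lists only so they can be printed or handed to subprocess.run.

```python
import shlex

# All connection details below are hypothetical placeholders.
JDBC_URL = "jdbc:mysql://dbhost:3306/sales"
COMMON = [
    "--connect", JDBC_URL,
    "--username", "etl_user",
    "--password-file", "/user/etl/.mysql.password",
]


def incremental_import(table, check_column, last_value, target_dir):
    """Build a `sqoop import` that appends only rows whose check column
    is greater than the last value imported so far."""
    return [
        "sqoop", "import", *COMMON,
        "--table", table,
        "--target-dir", target_dir,
        "--incremental", "append",
        "--check-column", check_column,
        "--last-value", str(last_value),
    ]


def import_all_tables(warehouse_dir):
    """Build a `sqoop import-all-tables` that lands every table of the
    database in its own subdirectory of warehouse_dir on HDFS."""
    return ["sqoop", "import-all-tables", *COMMON,
            "--warehouse-dir", warehouse_dir]


incremental_cmd = incremental_import("orders", "order_id", 1000, "/data/raw/orders")
all_tables_cmd = import_all_tables("/data/raw/sales")

# Print shell-ready commands; the lists can also be passed to subprocess.run.
print(shlex.join(incremental_cmd))
print(shlex.join(all_tables_cmd))
```

For tables whose existing rows get updated, `--incremental lastmodified` with a timestamp check column is the other incremental mode Sqoop offers.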
Questions Based on CI/CD, Git, and DevOps (Basic Level):
- I was asked for the full form of CI/CD. I said that CI/CD stands for Continuous Integration and Continuous Delivery/Continuous Deployment. Then he asked me the difference between continuous delivery and continuous deployment.
- I was also asked to explain how CI/CD works in detail. I explained it in depth using the example of GitLab's built-in CI/CD.
- Questions based on GitLab Runner and the DevOps lifecycle (as I had worked on GitLab). He asked me to give some examples of DevOps tools that I am familiar with. I explained Jira and Jenkins at a high level.
A few other questions that I was asked were:
- I was asked to explain rack awareness in HDFS and the internal working of the Apache Spark architecture.
- He asked me to explain what happens when you submit a Spark application to the Spark engine, and the difference between narrow and wide transformations, with examples.
- The difference between coalesce and repartition in Spark, and which one performs better. I was also asked whether the number of partitions created after applying coalesce and repartition is the same or different for the same dataset.
- How to schedule Spark jobs using Databricks.
- A discussion on how Hadoop achieves high availability.
- Conceptual questions on data lakes, data warehouse schemas (star & snowflake schema), and cloud services (e.g. a little about AWS EC2 machines, IAM policies & roles, and how an S3 bucket stores data).
- He asked me to write code for uploading CSV files to an S3 bucket using the boto3 library. I wrote the code in Python with boto3 on a notepad. I didn’t remember the exact syntax, but he was OK with the approach & pseudo code.
- 2-3 questions on SDLC and Agile methodology. I was asked how familiar I am with Agile. I explained Agile through the Scrum framework, covering sprints, the Jira board, and the iterative approach in detail, and why Agile is preferred over the waterfall model.
Finally, he asked me whether I had any questions. I asked him about the tech stack used by their data platform team, and he explained it to me in detail.
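For the boto3 question above, a minimal sketch of the upload could look like this. It is not the code I wrote in the interview: the function name and its parameters are my own illustration, and it assumes AWS credentials are already configured in the environment. The only boto3 calls used are boto3.client("s3") and the client's upload_file(Filename, Bucket, Key) method.

```python
import os


def upload_csv_to_s3(local_path, bucket, key=None, s3_client=None):
    """Upload a local CSV file to s3://bucket/key and return the key used.

    upload_csv_to_s3 is a hypothetical helper; s3_client may be injected so
    the function can be exercised with a stub instead of a real AWS client.
    """
    if not local_path.lower().endswith(".csv"):
        raise ValueError(f"not a CSV file: {local_path}")
    if key is None:
        # Default the object key to the file name.
        key = os.path.basename(local_path)
    if s3_client is None:
        # Imported lazily so boto3 is only required for real uploads.
        import boto3
        s3_client = boto3.client("s3")
    # upload_file reads from disk and handles multipart uploads for large files.
    s3_client.upload_file(local_path, bucket, key)
    return key
```

A call such as upload_csv_to_s3("reports/daily.csv", "my-example-bucket", "incoming/daily.csv") would then place the file under the incoming/ prefix; the bucket and paths here are placeholders.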
Round 3: Technical Interview 2
I got an email for the second technical round, which was conducted by a Vice President at Morgan Stanley. This round lasted about 45-50 minutes. The interview started with my introduction, my expertise, and the tech skill set I had worked with. Most of the questions were on data modeling, Databricks Lakehouse architecture, PySpark & architecture design (ETL design). First, he asked me how to create a data model for relational databases. I used normalization & denormalization techniques to create the relational data model. Then:
- I was asked to explain the working & internal lakehouse architecture of Databricks.
- I was asked to explain the data ingestion & data transformation concepts of an ETL pipeline in depth. I explained all the steps in detail using the example of the AWS Glue ETL tool & the Redshift data warehouse.
- The interviewer gave me a scenario and asked me to explain what preliminary checks I should take care of while designing the ETL pipeline.
- One or two questions on batch processing & stream processing using Spark.
- How the staging layer works in a data pipeline & its uses.
- Questions on data warehouse architecture.
- I was asked to explain the use of unit test cases & how I would create them using SQL & PySpark code (as I had worked on creating unit test cases in my previous project).
- He asked me to write pseudo code for building a dummy ETL pipeline using Python. I used Python data structures & the pandas library for extracting and cleansing the data, transforming it, and loading the final dataset into CSV format.
- Questions on Spark monitoring & Spark performance management. I answered all of them in depth with practical examples.
This round focused entirely on data modeling, PySpark & ETL pipeline design.
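A dummy ETL pipeline of the kind mentioned above can be sketched in a few functions. This is an illustration only, with invented input data and column names; it also swaps pandas for Python's standard csv module so that it runs without any extra installation, which is a deliberate change from what I used in the interview.

```python
import csv
import io

# Hypothetical raw input with typical problems: stray spaces, mixed case,
# a missing amount, and a non-numeric amount.
RAW_CSV = """order_id,department,amount
1, Sales ,100.50
2,sales,
3,Engineering,200
4,engineering,abc
5,Sales,49.50
"""


def extract(text):
    """Read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Normalise fields and drop rows whose amount is missing or not numeric."""
    cleaned = []
    for row in rows:
        try:
            amount = float((row["amount"] or "").strip())
        except ValueError:
            continue
        cleaned.append({"department": row["department"].strip().lower(),
                        "amount": amount})
    return cleaned


def transform(rows):
    """Aggregate the total amount per department, sorted by department."""
    totals = {}
    for row in rows:
        totals[row["department"]] = totals.get(row["department"], 0.0) + row["amount"]
    return [{"department": dept, "total_amount": round(total, 2)}
            for dept, total in sorted(totals.items())]


def load(rows):
    """Serialise the final rows as CSV text (a file handle works the same way)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["department", "total_amount"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


result_csv = load(transform(clean(extract(RAW_CSV))))
print(result_csv)
```

Keeping each stage as a small pure function also makes unit testing straightforward: every stage can be checked on a handful of rows without touching a real source or target.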
Round 4: Techno Managerial Round
This interview was conducted by an Executive Director at Morgan Stanley and lasted about 45 minutes. I was asked to introduce myself. Then there was a discussion on the college projects I had worked on, my internship experience at ZS Associates, and my roles & responsibilities in the project. I was also asked to explain my research papers on Web Crawler for Ranking of Websites Based on Web Traffic and Page Views, which I published in international conferences of IEEE & Springer. He liked my strengths and my willingness to write research papers during my B.Tech course. Some of the questions were related to the core principles of Morgan Stanley and my inspirations. Then he asked questions on team management & leadership qualities. These were mainly situation-based questions such as “How can you overcome challenges faced in the team?” Then he moved to my resume and asked me some technical questions on Spark, Databricks, AWS & Delta Lake. Some questions that I remember are:
- How does AWS Glue fetch the metadata of data sources (either Parquet files or CSV files stored in an S3 bucket)?
- What is the Parquet file format & what is its significance in Delta tables?
- What is the use of delta tables in your project? What are the advantages of delta tables?
- What format does Delta Lake use to store data?
- I was asked to explain the Spark cluster configuration of my project in terms of executor memory, number of worker nodes, number of executors per worker node, number of cores per executor & driver memory.
- Some questions on the S3 bucket, Redshift data warehouse & relational databases.
- Briefly explain the internal working of Hadoop.
Finally, he asked about my B.Tech grades. I told him that I was the university topper of my B.Tech course. He was very impressed with my answers and with my B.Tech percentage.
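For the cluster configuration question above, one commonly cited rule of thumb for sizing executors on a YARN-style cluster can be worked through as below. This is only a sketch of that heuristic and not the configuration of my project: the cluster size is hypothetical, and the reserved core, the reserved 1 GB, the 5 cores per executor, and the 10% memory overhead are all assumptions of the heuristic.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Rule-of-thumb executor sizing (illustrative, not a tuned config)."""
    # Leave one core and 1 GB on every node for the OS and cluster daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_gb_per_node - 1

    executors_per_node = usable_cores // cores_per_executor
    # Reserve one executor slot cluster-wide for the driver / application master.
    total_executors = nodes * executors_per_node - 1

    # Each executor container holds the heap plus off-heap overhead, so
    # heap * (1 + overhead_fraction) must fit in the per-executor share.
    container_gb = usable_mem_gb / executors_per_node
    executor_memory_gb = int(container_gb / (1 + overhead_fraction))

    return {
        "spark.executor.instances": total_executors,
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{executor_memory_gb}g",
    }


# Hypothetical cluster: 10 worker nodes with 16 cores and 64 GB RAM each.
config = size_executors(nodes=10, cores_per_node=16, mem_gb_per_node=64)
print(config)
```

With these inputs the arithmetic gives 3 executors per node, 29 executors in total, and a 19 GB heap per executor; real values depend on the workload and should be validated through Spark monitoring.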
Round 5: HR Round
- This round lasted about 15-20 minutes. I was asked about my Big Data project experience, my hobbies, and my strengths & weaknesses. He asked about my family background, previous interview experiences, and my ultimate goal in life. Then he asked me “Why should we hire you?” & “What inspires you to join Morgan Stanley?”. Finally, there was a salary discussion with HR.
- The next day, I got positive feedback from HR. Fortunately, I was selected for the Data Engineer position at Morgan Stanley.
Finally, I am part of my dream company, Morgan Stanley.