Advanced DataWarehousing Concepts and Interview Questions.
How do you implement Slowly Changing Dimension type 2? I am not looking for the definition, but the practical implementation e.g. table structure, ETL/loading. {M}Answer: Create the dimension table as normal, i.e. first the dim key column as an integer, then the attributes as varchar (or varchar2 if you use Oracle). Then I’d create 3 additional columns: IsCurrent flag, “Valid From” and “Valid To” (they are datetime columns). With regards to the ETL, I’d check first if the row already exists by comparing the natural key. If it exists then “expire the row” and insert a new row. Set the “Valid From” date to today’s date or the current date time.
An experienced candidate (particularly DW ETL developer) will not set the “Valid From” date to the current date time, but to the time when the ETL started. This is so that all the rows in the same load will have the same Valid From, which is 1 millisecond after the expiry time of the previous version thus avoiding issue with ETL workflows that run across midnight.
Purpose: SCD 2 is the one of the first things that we learn in data warehousing. It is considered the basic/fundamental. The purpose of this question is to separate the quality candidate from the ones who are bluffing. If the candidate can not answer this question you should worry.
How do you index a fact table? And explain why. {H}
Answer: Index all the dim key columns, individually, non clustered (SQL Server) or bitmap (Oracle). The dim key columns are used to join to the dimension tables, so if they are indexed the join will be faster. An exceptional candidate will suggest 3 additional things: a) index the fact key separately, b) consider creating a covering index in the right order on the combination of dim keys, and c) if the fact table is partitioned the partitioning key must be included in all indexes.
Purpose: Many people know data warehousing only in theory or only in logical data model. This question is designed to separate those who have actually built a data warehouse and those who haven’t.
In the source system, your customer record changes like this: customer1 and customer2 now becomes one company called customer99. Explain a) impact to the customer dim (SCD1), b) impact to the fact tables. {M}
Answer: In the customer dim we update the customer1 row, changing it to customer99 (remember that it is SCD1). We do soft delete on the customer2 row by updating the IsActive flag column (hard delete is not recommended). On the fact table we find the Surrogate Key for customer1 and 2 and update it with customer99’s SK.
Purpose: This is a common problem that everybody in data warehousing encounters. By asking this question we will know if the candidate has enough experience in data warehousing. If they have not come across this (probably they are new in DW), we want to know if they have the capability to deal with it or not.
Question: What are the differences between Kimball approach and Inmon’s? Which one is better and why?
Answer: if you are looking for a junior role e.g. a developer, then the expected answer is: in Kimball we do dimension modelling, i.e. fact and dim tables whereas in Inmon’s we do CIF, i.e. EDW in normalised form and we then create a DM/DDS from the EDW. Junior candidates usually prefer Kimball, because of query performance and flexibility, or because that’s the only one they know; which is fine. But if you are interviewing for a senior role e.g. senior data architect then they need to say that the approach depends on the situation. Both Kimball & Inmon’s approaches have advantages and disadvantages.
Purpose: a) to see if the candidate understands the core principles of data warehousing or they just “know the skin”, b) to find out if the candidate is open minded, i.e. the solution depends on what we are trying to achieve (there’s right or wrong answer) or if they are blindly using Kimball for every situation.
Suppose a fact row has unknown dim keys, do you load that row or not? Can you explain the advantage/disadvantages?
Answer: We need to load that row so that the total of the measure/fact is correct. To enable us to load the row, we need to either set the unknown dim key to 0 or the dim key of the newly created dim rows. We can also not load that row (so the total of the measure will be different from the source system) if the business requirement prefer it. In this case we load the fact row to a quarantine area complete with error processing, DQ indicator and audit log. On the next day, after we receive the dim row, we load the fact row. This is commonly known as Late Arriving Dimension Rows and there are many sources for further information; .
Purpose: Again this is a common problem that we encounter in regular basis in data warehousing. With this question we want to see if the candidate’s experience level is up to the expectation or not.
Please tell me your experience on your last 3 data warehouse projects. What were your roles in those projects? What were the issues and how did you solve them?
Answer: There’s no wrong or right answer here. With this question you are looking for
a) whether they have done similar things to your current project,
b) whether their have done the same role as the role you are offering,
c) whether they faces the same issues as your current DW project.
Purpose: Some of the reasons why we pay more to certain candidates compared to the others are:
a) they have done it before they can deliver quicker than those who haven’t.
b) they come from our competitors so we would know what’s happening there and we can make a better system than theirs.
c) they have solved similar issues so we could “borrow their techniques”.
What are the advantages of having a normalised DW compared to dimensional DW? What are the advantages of dimensional DW compared to normalised DW?
Answer: The advantages of dimensional DW are:
a) flexibility, e.g. we can accommodate changes in the requirements with minimal changes on the data model, b) performance, e.g. you can query it faster than normalised model,
c) it’s quicker and simpler to develop than normalised DW and easier to maintain.
Purpose: to see if the candidate has seen “the other side of the coin”. Many people in data warehousing only knows Kimball/dimensional. Second purpose of this question is to check if the candidate understands the benefit of dimensional modelling, which is a fundamental understanding in data warehousing.
What is 3rd normal form? {L} Give me an example of a situation where the tables are not in 3rdrd NF.
Answer: No column is transitively depended on the PK. For example, column1 is dependant on column2 and column2 is dependant on column3. In this case column3 is “transitively dependant” on column1. To make it 3rd NF we need to split it into 2 tables: table1 which has column1 & column2 and table2 which has column2 and column3.
Purpose: A lot of people talk about “3rd normal form” but they don’t know what it means. This is to test if the candidate is one of those people. If they can’t answer 3rd NF, ask 2nd NF. If they can’t answer 2nd NF, ask 1st NF.
Tell me how to design a data warehouse, i.e. what are the steps of doing dimensional modelling?
Answer: There are many ways, but it should not be too far from this order: 1. Understand the business process, 2. Declare the grain of the fact table, 3. Create the dimension tables including attributes, 4. Add the measures to the fact tables (from Kimball’s Toolkit book chapter 2). Step 3 and 4 could be reversed (add the fact first, then create the dims), but step 1 & 2 must be done in that order. Understanding the business process must always be the first, and declaring the grain must always be the second.
Purpose: This question is for data architect or data warehouse architect to see if they can do their job. It’s not a question for an ETL, report or cube developer.
How do you join 2 fact tables?
Answer: It’s a trap question. You don’t usually join 2 fact tables especially if they have different grain. When designing a dimensional model, you include all the necessary measures into the same fact table. If the measure you need is located on another fact table, then there’s something wrong with the design. You need to add that measure to the fact table you are working with. But what if the measure has a different grain? Then you add the lower grain measure to the higher grain fact table. What if the fact table you are working with has a lower grain? Then you need to get the business logic for allocating the measure.
It is possible to join 2 fact tables, i.e. using the common dim keys. But the performance is usually horrible, hence people don’t do this in practice, except for small fact tables (<100k rows). For example: if FactTable1 has dim1key, dim2key, dimkey3 and FactTable2 has dim1key and dim2key then you could join them like this:
select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2So if we don’t join 2 fact tables that way, how do we do it? The answer is using the fact key column. It is a good practice (especially in SQL Server because of the concept of cluster index) to have a fact key column to enable us to identify rows on the fact table . The performance would be much better (than joining on dim keys), but you need to plan this in advance as you need to include the fact key column on the other fact table.
from
( select dim1key, dim2key, sum(measure1) as measure1
from FactTable1
group by dim1key, dim2key
) f1
join FactTable2 f2
on f1.dim1key = f2.dim1key and f1.dim2key = f2.dim2key
select f2.dim1key, f2.dim2key, f1.measure1, f2.measure2I implemented this technique originally for self joining, but then expand the usage to join to other fact table. But this must be used on an exception basis rather than the norm.
from FactTable1 f1
join FactTable2 f2
on f2.fact1key = f1.factkey
Purpose: not to trap the candidate of course. But to see if they have the experience dealing with a problem which doesn’t happen every day.
Related Posts
No comments :