Advanced DataWarehousing Concepts and Interview Questions

DataWarehousing Concepts and Questions Answers.

How do you index a dimension table?
Answer: clustered index on the dim key, and non clustered index (individual) on attribute columns which are used on the query’s “where clause”.

Purpose: this question is critical to be asked if you are looking for a Data Warehouse Architect (DWA) or a Data Architect (DA). Many DWA and DA only knows logical data model. Many of them don’t know how to index. They don’t know how different the physical tables are in Oracle compared to in Teradata. This question is not essential if you are looking for a report or ETL developer. It’s good for them to know, but it’s not essential

Tell me what you know about William Inmon?
Answer: He was the one who introduced the concept of data warehousing. Arguably Barry Devlin was the first one, but he’s not as popular as Inmon. If you ask who is Barry Devlin or who is Claudia Imhoff 99.9% of the candidates wouldn’t know. But every decent practitioner in data warehousing should know about Inmon and Kimball.

Purpose: to test if the candidate is a decent practitioner in data warehousing or not. You’ll be surprise (especially if you are interviewing a report developer) how many candidates don’t know the answer. If someone is applying for a BI architect role and he never heard about Inmon you should worry.

How do we build a real time data warehouse?
Answer: if the candidate asks “Do you mean real time or near real time” it may indicate that they have a good amount of experience dealing with this in the past. There are two ways we build a real time data warehouse (and this is applicable for both Normalised DW and Dimensional DW):
a) By storing previous periods’ data in the warehouse then putting a view on top of it pointing to the source system’s current period data. “Current period” is usually 1 day in DW, but in some industries e.g. online trading and ecommerce, it is 1 hour.
b) By storing previous periods’ data in the warehouse then use some kind of synchronous mechanism to propagate current period’s data. An example of synchronous data propagation mechanism is SQL Server 2008’s Change Tracking or the old school’s trigger.
Near real time DW is built using asynchronous data propagation mechanism, aka mini batch (2-5 mins frequency) or micro batch (30s – 1.5 mins frequency).

Purpose: to test if the candidate understands complex, non-traditional mechanism and follows the latest trends. Real time DW was considered impossible 5 years ago and only developed in the last 5 years. If the DW is normalised it’s easier to make it real time than if the DW is dimensional as there’s dim key lookup involved.

What is the difference between a data mart and a data warehouse?
Answer: Most candidates will answer that one is big and the other is small. Some good candidates (particularly Kimball practitioners) will say that data mart is one star. Whereas DW is a collection of all stars. An excellent candidate will say all the above answers, plus they will say that a DW could be the normalized model that store EDW, whereas DM is the dimensional model containing 1-4 stars for specific department (both relational DB and multidimensional DB).

Purpose: The question has 3 different levels of answer, so we can see how deep the candidate’s knowledge in data warehousing.

What the purpose of having a multidimensional database?
Answer: Many candidates don’t know what a multidimensional database (MDB) is. They have heard about OLAP, but not MDB. So if the candidate looks puzzled, help them by saying “an MDB is an OLAP database”. Many will say “Oh… I see” but actually they are still puzzled so it will take a good few moments before they are back to earth again. So ask again: “What is the purpose of having an OLAP database?” The answer is performance and easier data exploration. An MDB (aka cube) is a hundred times faster than relational DB for returning an aggregate. An MDB will be very easy to navigate, drilling up and down the hierarchies and across attributes, exploring the data.

Purpose: This question is irrelevant to report or ETL developer, but a must for a cube developer and DWA/DA. Every decent cube developer (SSAS, Hyperion, Cognos) should be able to answer the question as it’s their bread and butter.

Why do you need a staging area?
Answer: Because:
a) Some data transformations/manipulations from source system to DWH can’t be done on the fly, but requires several stages and therefore needs to “be landed on disk first”
b) The time to extract data from the source system is limited (e.g. we were only given 1 hour window) so we just “get everything we need out first and process later”
c) For traceability and consistency, i.e. some data transform are simple and some are complex but for consistency we put all of them on stage first, then pick them up from stage for further processing
d) Some data is required by more than 1 parts of the warehouse (e.g. ODS and DDS) and we want to minimise the impact to the source system’s workload. So rather than reading twice from the source system, we “land” the data on the staging then both the ODS and the DDS read the data from staging.

Purpose: This question is intended more for an ETL developer than a report/cube developer. Obviously a data architect needs to know this too.

How do you decide that you need to keep it as 1 dimension or split it into 2 dimensions? Take for example dim product: there are attributes which are at product code level and there are attributes which are at product group level. Should we keep them all in 1 dimension (product) or split them into 2 dimensions (product and product group)?
Answer: Depends on how they are going to be used, as I explained in my article “One or two dimensions”
Purpose: To test if the candidate is conversant in dimensional modelling. This question especially is relevant for data architects and cube developers and less relevant for a report or ETL developer.

Fact table columns usually numeric. In what case does a fact table have a varchar column?
Answer: degenerate dimension
Purpose: to check if the candidate has ever involved in detailed design of warehouse tables.

What kind of dimension is a “degenerate dimension”? Give me an example.
Answer: A “dimension” which stays in the fact table. It is usually the reference number of the transaction. For example: Transaction ID, payment ref and order ID
Purpose: Just another question to test the fundamentals.

What is show flaking? What are the advantages and disadvantages?
Answer: In dimensional modelling, snow flaking is breaking a dimension into several tables by normalising it. The advantages are:
a) performance when processing dimensions in SSAS
b) flexibility if the sub dim is used in several places e.g. city is used in dim customer and dim supplier (or in insurance DW: dim policy holder and dim broker),
c) one place to update, and d) the DW load is quicker as there are less duplications of data.
The disadvantages are:
a) more difficult in “navigating the star*”, i.e. need joins a few tables
b) worse “sum group by*” query performance (compared to “pure star*”)
c) more flexible in accommodating requirements, i.e. the city attributes for dim supplier don’t have to be the same as the city attributes for dim customer, d) the DW load is simpler as you don’t have to integrate the city.

*: a “star” is a fact table with all its dimensions, “navigating” means browsing/querying, “sum group by” is a SQL select statement with a “group by” clause, pure star is a fact table with all its dimensions and none of the dims are snow-flaked.

Purpose: Snow flaking is one of the classic debates in dimensional modelling community. It is useful to check if the candidate understands the reasons of just “following blindly”. This question is applicable particularly for data architect and OLAP designer. If their answers are way off then you should worry. But it also relevant to ETL and report developers as they will be populating and querying the structure

Advanced DataWarehousing Concepts and Interview Questions

DataWarehousing Concepts and Questions Answers.

No comments :