
DW design flow from Source

Started by cognostechie, 28 Aug 2014 03:49:42 PM


cognostechie

Don't know where else to post this, so posting it here -

This is more of a design question than a technical one.

When you build a data warehouse/data mart and have a source system to get the data from, what approach do people take, and why (pros and cons)?

1> Connect to the source system, read the data and put it in another DB which becomes a replica of the source system. After that, read the data from the replica, do the transformation using a staging DB and then put the data in the DW/DM.

2> Create a copy of the source system DB, read the data from the copy, do the transformation and then put it in the DW/DM.

3> Do an online backup of the source system, or use a replication tool (most DB vendors offer these), then use the replicated DB as the source, do the transformation and then put it in the DW/DM.

4> Any other method?

Which method do you use, and which do you prefer? Pros and cons? Experiences?

MFGF

Quote from: cognostechie on 28 Aug 2014 03:49:42 PM

My approach has often leaned towards your Option 1. It's a very rare occurrence that *all* the tables and columns from a source system are required for use in your data warehouse, so it has always seemed wasteful to dump the whole thing (either by copy or backup/restore) into a separate database.

Generally I would advocate that you have processes that read the required data (and only the required data) from a source system and write it into a staging area as quickly and efficiently as possible, to keep the overhead on the source database to a minimum. This often means a straight table copy (with no changes or transformation) from source to staging for each required table. My approach was "get the data off there as quickly and efficiently as possible", so that the window during which the source system DB needed to be static could be as short as possible. In Data Manager this would be a series of "Data Transfer" builds, often run in parallel. The staging data is not necessarily a replica of the source system, since it contains only a subset of the tables. I would repeat this process for all source systems, ending up with a collection of tables from all of the sources populating my staging area.
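For illustration, that straight source-to-staging copy might look something like the sketch below - plain Python with pandas and SQLAlchemy standing in for the Data Transfer builds, with made-up connection strings and table names:

import pandas as pd
import sqlalchemy as sa

# Hypothetical connections: the source system and the staging database.
source = sa.create_engine("postgresql://user:pass@source-host/erp")
staging = sa.create_engine("postgresql://user:pass@dw-host/staging")

# Only the tables the warehouse actually needs, not the whole source DB.
REQUIRED_TABLES = ["customers", "orders", "order_lines", "products"]

def copy_table(table_name: str) -> None:
    """Land one required table in staging unchanged, reading in chunks."""
    first = True
    for chunk in pd.read_sql_table(table_name, source, chunksize=50_000):
        chunk.to_sql(table_name, staging,
                     if_exists="replace" if first else "append", index=False)
        first = False

# The copies are independent of each other, so in practice they can be run
# in parallel, much like parallel Data Transfer builds.
for t in REQUIRED_TABLES:
    copy_table(t)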

Next would come a series of transformation/cleaning/merging builds to assemble clean, consistent data for use as a source for the final presentation layer builds. I often kept these tables in a separate schema from Staging - I would term it my "Enterprise Data" layer.
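A cleaning/merging build into that "Enterprise Data" layer could be as simple as the sketch below (the column names and the fix-up rules are invented for the example):

import pandas as pd
import sqlalchemy as sa

staging = sa.create_engine("postgresql://user:pass@dw-host/staging")
edl = sa.create_engine("postgresql://user:pass@dw-host/enterprise_data")

# Standardise codes and de-duplicate before the presentation builds see the data.
cust = pd.read_sql("SELECT * FROM customers", staging)
cust["region"] = cust["region"].str.strip().str.upper()   # consistent coding
cust = cust.drop_duplicates(subset=["customer_code"])      # one row per customer
cust.to_sql("customers", edl, if_exists="replace", index=False)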

The final step would be the dimension and fact builds to populate the star schemas in the presentation database.
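As a rough sketch of that last step (again with invented table and column names), a dimension build assigns surrogate keys and the fact build swaps the natural keys for them:

import pandas as pd
import sqlalchemy as sa

edl = sa.create_engine("postgresql://user:pass@dw-host/enterprise_data")   # cleansed layer
mart = sa.create_engine("postgresql://user:pass@dw-host/presentation")     # star schemas

# Dimension build: give each cleansed customer a surrogate key.
dim_customer = pd.read_sql("SELECT customer_code, customer_name, region FROM customers", edl)
dim_customer.insert(0, "customer_key", list(range(1, len(dim_customer) + 1)))
dim_customer.to_sql("dim_customer", mart, if_exists="replace", index=False)

# Fact build: look up the surrogate key and drop the natural key.
sales = pd.read_sql("SELECT customer_code, order_date, quantity, revenue FROM sales", edl)
fact_sales = sales.merge(dim_customer[["customer_code", "customer_key"]], on="customer_code")
fact_sales.drop(columns=["customer_code"]).to_sql(
    "fact_sales", mart, if_exists="replace", index=False)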

Obviously this is a very simplistic, idealistic approach, and often there are constraints that mean you need to deviate from this to some degree. It would always be my ideal objective, though.

Cheers!

MF.
Meep!

cognostechie

Thanks MFGF! My only concern with that is: when the data is pulled from the source system (the production DB), if the tables are locked at that moment by some process/code of the source application, might the DW process have to wait until the tables become available?

bdbits

Over the last few years, we have come to take a very similar approach. We like to have a raw staging area that is normally populated with copies of the required source tables and may be blown away and rebuilt on every ETL run if desired. Then we have an ODS (operational data store) of cleansed data derived from the current staged data and from existing ODS data if it persists (likely), but this ODS is not dimensional. And finally, dimensional layer(s) spun out of the ODS as needed. One advantage of the multiple layers is that you can inject additional attributes, e.g. temporal attributes to create history, even when the source system(s) do not have them. You can also collapse similar data from multiple systems into one table (e.g. customers, clients, etc. - think Master Data Management). And the dimensional systems you spin out of the ODS can be completely rebuilt, if necessary, since you never lose the original data.
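To make those last two ideas concrete - stamping on temporal attributes the sources lack, and collapsing "customer-like" tables from several systems into one ODS table - here is a small sketch; all of the table, column and system names are hypothetical:

from datetime import datetime, timezone

import pandas as pd
import sqlalchemy as sa

staging = sa.create_engine("postgresql://user:pass@dw-host/staging")
ods = sa.create_engine("postgresql://user:pass@dw-host/ods")

load_time = datetime.now(timezone.utc)

# One system calls them customers, another calls them clients - align the columns.
crm = pd.read_sql("SELECT cust_id AS source_id, cust_name AS name FROM crm_customers", staging)
crm["source_system"] = "CRM"
erp = pd.read_sql("SELECT client_no AS source_id, client_name AS name FROM erp_clients", staging)
erp["source_system"] = "ERP"

customers = pd.concat([crm, erp], ignore_index=True)
customers["effective_from"] = load_time   # temporal attribute the sources never had
customers.to_sql("ods_customer", ods, if_exists="append", index=False)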

Fortunately nearly all of our source systems are not really used much if at all late at night, which is when the extractions typically run. We plan around other scheduled uses. So record locking is not much of an issue. We do have one system that utilizes CDC (change data capture) to capture data on the fly to a new target database that we read from and thus do not run into application locks, but this is an expensive approach. You could use DB replication tools for much the same thing, if your data is all on one platform.

Obviously there are a lot of tradeoffs, and no one size fits all situations.

Gary

In addition to the approaches above:
We extract the data from the source systems in the form of .csv files (generally the delta extract) and make an incremental load into the so-called EDS system, where we maintain the history. Doing this increases reusability in case the load fails at the transformation/integration step - we don't need to connect to the source again to re-initialise the load.

Yes, there is the overhead of generating the files, but the direct connection to the source is of shorter duration.
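A sketch of that delta-to-csv pattern is below (the watermark column, control table, file paths and connection strings are all invented for the example):

from datetime import date

import pandas as pd
import sqlalchemy as sa

source = sa.create_engine("postgresql://user:pass@source-host/erp")
eds = sa.create_engine("postgresql://user:pass@dw-host/eds")

# Watermark of the last successful load, kept in a control table in EDS.
last_loaded = pd.read_sql("SELECT MAX(loaded_up_to) AS wm FROM load_control", eds)["wm"].iloc[0]

# 1. Short-lived connection to the source: pull only the rows changed since then.
delta = pd.read_sql(
    sa.text("SELECT * FROM orders WHERE last_updated > :wm"),
    source, params={"wm": last_loaded})
csv_path = f"extracts/orders_delta_{date.today():%Y%m%d}.csv"
delta.to_csv(csv_path, index=False)

# 2. Incremental load into EDS from the file; if a later transform step fails,
#    re-run from the csv instead of reconnecting to the source.
pd.read_csv(csv_path).to_sql("eds_orders", eds, if_exists="append", index=False)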
Gary

:)*Throw Parties Not Grenades*:)

cognostechie

Very interesting! Doesn't a CSV have any limitation on size?

Thanks everybody for the replies.

Gary

As a best practice, we keep each csv file to no more than about 4-5 GB.

But you can limit the number of records in each file, creating multiple smaller files, and then load them all together, reading them at the same time.
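For example, the extract can be written out in fixed-size chunks so that each part stays small (the chunk size, query and paths are arbitrary examples):

import pandas as pd
import sqlalchemy as sa

source = sa.create_engine("postgresql://user:pass@source-host/erp")

ROWS_PER_FILE = 1_000_000   # tune so each part stays well under the 4-5 GB mark

# Stream the extract and write one csv per chunk; the parts can then be
# loaded together on the other side.
for i, chunk in enumerate(pd.read_sql("SELECT * FROM orders", source, chunksize=ROWS_PER_FILE)):
    chunk.to_csv(f"extracts/orders_part_{i:03d}.csv", index=False)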

Gary

:)*Throw Parties Not Grenades*:)