If you are unable to create a new account, please email support@bspsoftware.com

 

News:

MetaManager - Administrative Tools for IBM Cognos
Pricing starting at $2,100
Download Now    Learn More

Main Menu

Concept behind partitions

Started by dssd, 26 Apr 2011 04:11:14 AM

Previous topic - Next topic

dssd

I went through the manual and read about  paritioning. It says partitioning presummarises data in a cube. What does it mean by pre summarize?

And that we end up with a parition which has the summary data and other partitions with detailed data.

Doesn't the cube summarize data anyway?  Is it that without parition the summary data is all over the place whereas after partition its in one section which can be accessed easily and faster. Got this doubt since cubes have some sort of indexing and as a result accessing data within a cube shouldnt be a issue with or without partitions

Arsenal

the easiest way to understand it it this - the goal of partioning (whether manual or automatic) is always and always to strike a balance between query response time and cube build time. Smaller the partition, quicker the query response (because it has to scan smaller partitions) but longer the cube build time. Now, this is where you need to strike the balance - you don't want a very quick cube build but very long query response time, nor do you want a quick query response but a large cube build time.

so what does partitioning really do? Partitioning will presummarize the data in the PowerCube and store it into partitions for quicker retrieval. If a cube is very large and doesn't have these partitions, user response time will be slow. But a cube with a large number of these partitions will just increase the build time.

partitioning performs well when the number of categories and their children are limited in that a few at the top level with a few hundred at the child level. But, if the top level itself contains a few hundred categories with tens of thousands in the level below it then partioning will not work well.

you can play around with the numbers on the auto partition tab of the powercube properties. Build the cube once with default settings and then look in the log file. Get the end count consolidation off it and put that in the estimated number of consolidation records in the tab. Change some of the other settings and build the cube a few times. The key, for each cube build, is to check the log file and compare the "Performing Pass 0" numbers and the Last Pass (Pass 6, 7 etc...whatever is the last pass number) numbers and ensure that the row count of the 0th pass is less than the row count of the last pass. Additionally, the category count of the last pass should be lesser than the 0th pass (although may not be significantly less)

You can also opt to choose manual partitioning but that is a can of worms you may want to avoid unless you know your data inside out (how many categories in each dimension, how are they expected to change over time etc. )and are prepared to spend a lot of time with trial and error to find out the best strategy

Arsenal

that the row count of the 0th pass is less than the row count of the last pass

Sorry, that should be the row count of the last pass should be less than the row count of the 0th pass  :-[

dssd

Thanks Arsenal for the detailed description.

The cube already summarizes data so what do we mean by the term partition "Pre Summarizes" data. I am not able to capture the distinction.

dssd

Also whats the logical reasoning behind the below

partitioning performs well when the number of categories and their children are limited in that a few at the top level with a few hundred at the child level. But, if the top level itself contains a few hundred categories with tens of thousands in the level below it then partioning will not work well.

Arsenal

Quote from: dssd on 26 Apr 2011 03:14:58 PM
Thanks Arsenal for the detailed description.

The cube already summarizes data so what do we mean by the term partition "Pre Summarizes" data. I am not able to capture the distinction.

same thing. I was using the two terms interchangeably. Think of it as - the cube stores the data it summarizes in partitions.

Arsenal

Quote from: dssd on 26 Apr 2011 03:17:42 PM
Also whats the logical reasoning behind the below

partitioning performs well when the number of categories and their children are limited in that a few at the top level with a few hundred at the child level. But, if the top level itself contains a few hundred categories with tens of thousands in the level below it then partioning will not work well.

when you have a very high parent to child (for example, a dimension where the level is countries but the child level is zip/postal codes codes. There are only about 190 odd countries but each country will have thousands of postal codes which is a very high parent to child category ratio. But, if you dimension was built like Countries--->States--->Cities---->Zip Codes then this dimension will have a more balanced parent to child category ratio in that cities will have a few dozen zip codes, states will have a few doizen cities and so on as compared to the Countries---> Zip code dimension), you need to set a large enough partition size to handle this "unbalanced" dimension. Often times, this "large enough" partition ends up defeating the purpose of partition which was mentioned in ther prior post - to select a balance between query response and cube build

You have to handle these types of dimensions with a manual partition strategy with a lot of trial and error or you have to comprise by having slightly slower query response or you have to redesign your cube model/data model to break up the "unbalanced" dimension

dssd

Quote from: Arsenal on 26 Apr 2011 03:23:57 PM
same thing. I was using the two terms interchangeably. Think of it as - the cube stores the data it summarizes in partitions.

Thanks. I think i got it now, just one more question to seal the matter.
So, if you dont partition, summarized data is all over the place within the cube, which makes it slower to access in a report, right?

Cubes are built with indexes. So, shouldn't that help faster navigation even if its all over the place. Or is it that cubes become so huge that even indexes don't help and hence we need partitions

Arsenal

Quote from: dssd on 27 Apr 2011 05:21:02 AM
Thanks. I think i got it now, just one more question to seal the matter.
So, if you dont partition, summarized data is all over the place within the cube, which makes it slower to access in a report, right?

Cubes are built with indexes. So, shouldn't that help faster navigation even if its all over the place. Or is it that cubes become so huge that even indexes don't help and hence we need partitions

To the best of my knowledge, there are no indexes within the cube. When a cube is being built, then an "index" on key columns etc. might help it read data faster FROM the DB (though I myself have a question on how indexing on FM ports over to Transformer) but the data itself that is stored on the cube does not have any "indexing" on it.

You can think of the partition as a sort of index if that helps, although that's not quite 100% correct in theory  ;)

RobsWalker68

Hi,

With regards to the question about indexes, when a powercube is partioned each category will be assigned an index to the individual rows that it needs within the various partitions.  If the category is within a single partition then the index size will be reduced as it only points to rows within that partition.

Kind Regards

Rob 

Arsenal

Thanks Rob. I didn't know about the indexes within the partitions

barrysaab

Thanks,Arsenal.Once again your are of great help.
Boy! Cognos getting on to me!!!