
Breaking multi-fact queries

Started by dougp, 25 Aug 2022 06:13:24 PM


dougp

The problem here may stem from everyone being defined as a "report developer," in that everyone claims to need ad hoc query capability.  I can't change that.

I have a variety of star schemas for different data subjects.  Many of these data subjects are included in the same FM model and package.  With a huge variety of possible combinations, and not knowing the questions ahead of time, I can't seem to design a FM model that will allow simple drag-and-drop capability for all users.  If a user has a need to combine data subjects, they are expected to create multiple queries -- 1 for each namespace -- and deliberately join those queries together in a way that suits their question.  But not all 3000 users have received training, some users refuse to follow instructions, and many users don't know when they should ask for help rather than blindly bulldozing through the data.

I need to prohibit users from asking Cognos to do the heavy lifting to create a multi-fact query.  There seem to be too many questions and too many variables in the data to accommodate "simple" querying involving multiple data subjects.  It seems I can block this by setting the "Outer joins" governor to Deny.

At the same time, I need to allow users to deliberately build reports wherein each query uses only one data subject (namespace) and then join those queries deliberately based on the nature of the question they are trying to answer.  This sometimes requires that the user use outer joins.  This requires that the "Outer joins" governor in the model be set to Allow.

Some dimensions are (or can be) conformed across some of the facts.  Other dimensions apply only to a particular fact or a group of facts (subject area).

I have developed a model that uses the 3-layer (Physical/Business/Presentation) style.  All of the relationships are in the presentation layer.  But when I create a report that uses Fact A and a bunch of dimensions, then add something from Fact B, the result in CQM is...
(QueryA) FULL OUTER JOIN (QueryB) on
   (QueryA.dimensionN = QueryB.dimensionN or (QueryA.dimensionN is null and QueryB.dimensionN is null))
  and (repeat for each dimension)


In DQM, Cognos defines columns:
SUM(1) OVER(
      ORDER BY
        FS1_inner.dimension1Column ASC,
        FS1_inner.dimension2Column ASC,
        FS1_inner.dimension3Column ASC,
        FS1_inner.dimension4Column ASC
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    )

...and...
SUM(1) OVER(
      ORDER BY FS2_inner.factColumn ASC
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    )

...and joins on those.
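Assembled, the stitch it produces looks loosely like this (my simplification, with hypothetical table and column names; not verbatim Cognos output):

select FS1.dim1, FS1.amountA, FS2.amountB
from (
  select dim1, amountA,
         sum(1) over(order by dim1 asc
                     rows between unbounded preceding and current row) as rc
  from FactA_summary
) FS1
full outer join (
  select amountB,
         sum(1) over(order by amountB asc
                     rows between unbounded preceding and current row) as rc
  from FactB_summary
) FS2
  on FS1.rc = FS2.rc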

We have tried restricting most users to using only packages that contain single namespaces.  This would actually work well for about 98% of my users.  To implement this, there is a security group that is allowed to have access to the more complicated packages.  Unfortunately, about 25% of the users are in this group because of politics.  ( D'oh! )

How can I keep users from dragging query items from multiple namespaces into a single query (meaning: How can I make the report not validate if they do this?), but also allow them to build their queries individually and deliberately join them (sometimes using an outer join)?

bus_pass_man

I've never seen a model done with the relationships in the presentation layer shortcuts, so I don't understand why you have done that.  I can't mentally picture your model in other ways either.  The design of the FM model should follow Kimball's 4 steps: 1. Select the business process. 2. Declare the grain. 3. Identify the dimensions. 4. Identify the facts.  You mention other things which confuse me too, such as the SQL examples, which are way off from what should be expected.

The data (aka query) layer exists for two purposes.  One is to store the metadata of the tables in the data sources.  The second is to store the query plan in the form of aliases and relationships.  The latter does not need to be in the query layer and can be modelled in the business layer.


If I remember correctly, the tradeoff between having the query plan in the query layer versus the business layer is that the latter buys control over table identification in the SQL at the price of more frequent metadata callbacks.

If you have the plan in the business layer you also need to have the relationships within a dimension (the Kimball relational thing, not the thing with hierarchies and levels, which is a concept built on top of the relational thing) planned so that ambiguity is eliminated.  For role-playing you would have a model query subject pointing to the data source query subject representing the table which role-plays, one for each dimension that uses it.  For example, if you have a Country table and country is used in your Employee dimension and others, you would have an mqs in the Employee dimension pointing to the Country table, another mqs in any other dimension that needs country, and so on.
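In plain SQL terms the role-playing amounts to joining the same table under different aliases, something like this (hypothetical table and column names):

select e.EmployeeName,
       ec.CountryName as EmployeeCountry,   -- Country in its "employee country" role
       cc.CountryName as CustomerCountry    -- the same Country table in its "customer country" role
from SalesFact f
  inner join Employee e  on f.EmployeeId = e.EmployeeId
  inner join Country  ec on e.CountryId  = ec.CountryId
  inner join Customer cu on f.CustomerId = cu.CustomerId
  inner join Country  cc on cu.CountryId = cc.CountryId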



If your ETL has created dimension tables, the data would already be in all the dimensions it belongs to, so the role-playing is built in.


You then model the determinants on the model query subjects.
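The effect of a determinant is, roughly, that the dimension is collapsed to the grain of the fact before the join so the fact rows are not multiplied.  For a month-grain fact against a day-grain date dimension that is, loosely (hypothetical names):

select m.MonthKey, m.MonthName, sum(f.Amount) as Amount
from MonthlySalesFact f
  inner join (
    select MonthKey, min(MonthName) as MonthName   -- day-grain DateDim rolled up to month
    from DateDim
    group by MonthKey
  ) m on f.MonthKey = m.MonthKey
group by m.MonthKey, m.MonthName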


The model should thus be able to deal with multi-fact queries without any action by a report author.  The query in the report can and should be formed by dragging the necessary query items into the report; no action to manually create a query should be necessary.  The Cognos query engine should be able to coalesce the fact streams.  Ordinary users' training should be fairly simple, and it should not be necessary to worry about users not following the rules, because not following the rules should either produce a report which does not run or simply not be possible.  The objects in an FM package should be just like bricks to them -- they throw this and that and a fact in and they get a report.

Quote: "they are expected to create multiple queries -- 1 for each namespace -- and deliberately join those queries together in a way that suits their question."
I'm guessing the query creation and joining are done in reporting?  Why is this seen as necessary?


Are your presentation layer objects shortcuts?   If you have multiple star schema groupings and someone pulls a query item from Dimension A in ssg 1 and it is conformed to the fact table of ssg 2 then, since both shortcuts point to the same target, there should not be a problem generating a valid query.




cognostechie

I second bus_pass_man on 'I've never seen a model done with the relationships in the presentation layer shortcuts', but what amazes me is that shortcuts would even allow you to create relationships.  It's the 'alias shortcuts' which are used to create relationships for role-playing dimensions, and alias shortcuts are never used in the Presentation Layer.

Another point to note is that a package should not be published which has query subjects spread out over multiple namespaces. If you do not have a consolidated model in which multiple facts are joined to conformed dimensions (let's presume your data is from different source systems which have no relation to each other), then publish one package for each namespace. This would not prevent them from using those packages in the same report, because DQM allows multiple packages in the same report, but at least you will be able to justify saying that they did it wrong (because you told them in advance that multiple packages used in the same report will not have relationships pre-defined).

Also keep in mind that whenever you drag and drop items from multiple namespaces which do not have relationships pre-defined, do not worry about the outer/inner joins that Cognos shows you, because in that case Cognos does not join them in the database but does local processing instead. It fetches the data for the queries onto the Cognos app server and then matches them using an algorithm; joins do not come into play here. This would be perfectly fine considering the performance improvement of DQM over CQM, and should only be a problem when your data is really huge and performance becomes an issue.

Just like you, I whole-heartedly support self-service BI, an integrated model, and the Kimball methodology, but if that is not possible then let the local processing go on until it becomes a problem the users see themselves. Lessons learnt the hard way stick better than you trying to be proactive and prevent problems from occurring. Let the problem come up, let them ask you for help, and then tell them where they screwed up. That will work better than you trying to prevent the disaster. Not everything is technical; psychology plays a big role.

dougp

Quote: "until it becomes a problem which the users see themselves."
Let's just say the problem can present in subtle ways that may not produce obviously wrong results, especially if the reader is not an expert in the business.


Quote: "the SQL examples"
are what Cognos produces, except that I obfuscated names.


Quote: "The Cognos query engine should be able to coalesce the fact streams."
Apparently, it's that feature that produces the SQL I posted.  Unfortunately:

  • I'm not convinced that this always works correctly.  Maybe that has something to do with my relationships or determinants.
  • The (a is null and b is null) part of the join is problematic.  For one, Cognos applies it inconsistently, causing some data to be missing from reports.  If it were applied consistently, it would be 100% reliable at returning the correct data, but it can make queries excruciatingly slow.  I've identified that Cognos does this sometimes when you use groups or sections on a list.  When the report is run, the query starts on the database server, but Cognos doesn't kill it when either Cognos or the user gives up.  Maybe the user is closing the browser rather than clicking the Cancel button.  I've had my DBAs implement a scheduled job that identifies any query with that pattern that comes from Cognos and has been running for more than 40 minutes (roughly the idea sketched below), because we had some run for 22+ hours and keep data mart loads from starting.  The same reports, once corrected (using slightly different visuals to produce the same results), run in seconds.
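Purely for illustration (this assumes SQL Server; the DMV names and the program_name filter are assumptions, and other platforms differ), the check amounts to something like:

select r.session_id, s.program_name, r.start_time,
       datediff(minute, r.start_time, getdate()) as minutes_running
from sys.dm_exec_requests r
  inner join sys.dm_exec_sessions s on r.session_id = s.session_id
where s.program_name like '%Cognos%'          -- assumption: Cognos connections identify themselves this way
  and r.start_time < dateadd(minute, -40, getdate())
-- the job would then deal with (e.g. kill) the offending sessions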


Looks like I have some work to do.

cognostechie

Quote from: dougp on 29 Aug 2022 03:52:13 PM
The (a is null and b is null) part of the join is problematic. 

Nulls in the join will definitely make it fail, and it's not just Cognos; SQL Server and Oracle don't match nulls in a join either. Looks like you have a bad Data Mart/Data Warehouse or maybe don't even have one, just some tables. ETL is the foundation of BI, and if you are trying to use BI to get the solutions without doing what should be done in ETL then that's just a bad idea.
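The underlying reason is that NULL never equals NULL in SQL, so rows with NULL keys simply drop out of an equi-join, for example (SQL Server syntax):

select a.k
from (select cast(null as int) as k) a
  inner join (select cast(null as int) as k) b
    on a.k = b.k    -- returns no rows: NULL = NULL evaluates to UNKNOWN, not TRUE

That is why Cognos adds the "(a is null and b is null)" predicate in the first place.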

I have seen some other organizations take the same approach. Instead of creating a proper Data Mart, they tell the BI developer to use staging tables and use Cognos to create all kinds of joins/unions etc. The result is always less than satisfactory, cost is high and performance issues remain. 

dougp

Quote: "bad Data Mart/Data Warehouse"

Nope.  Our data mart design is solid.  Cognos does that when I use group, section, or (apparently) when creating a join between two facts that I don't want people joining.  I think it's an attempt to ensure 100% correct data, but it doesn't always work correctly. 

I think the main problem I am trying to work around is that I was required to include all of the facts from all of my data marts in one model and make it available through one package.  While this enables talented report developers to incorporate data from multiple subjects in one report in a meaningful way, it is difficult to build properly in FM.  My shortcut was to assume that only talented users would use this model, so I could do less FM work and provide a little bit of training.  Unfortunately, everybody wants access to the "everything" package.


So, deviating from this topic...

When I use the "group span" feature, it often results in a query that (boiled down to something readable) changes from...


select wo.WorkOrderNumber AS 'Work Order Number'
, wo.CurrentTitle AS 'Current Title'
, wo.WorkOrderOpenDate AS 'Open Date'
, wo.WorkOrderCloseDate AS 'Close Date'
, sum(fact.ExpenditureAmount) as 'Expenditure Amount'

from ExpenditureMonthlySummaryFact fact
  inner join WorkOrder wo on fact.WorkOrderId = wo.WorkOrderId

where fact.FiscalBienniumId = 2015
  and wo.WorkOrderNumber like '00%'
  and wo.WorkOrderOpenDate between {d '2014-05-01'} and {d '2014-05-31'}

group by wo.WorkOrderNumber
, wo.CurrentTitle
, wo.WorkOrderOpenDate
, wo.WorkOrderCloseDate

order by 1 asc


...to...


select T2.C0 AS C0 -- Work Order Number
, T2.C2 AS C1 -- Open Date
, T2.C3 AS C2 -- Close Date
, T1.C4 AS C3 -- Expenditure Amount
, T2.C1 AS C4 -- Current Title
, T0.C1 AS C5 -- Current Title

from (
select wo.WorkOrderNumber AS C0
, min(wo.CurrentTitle) as C1

from ExpenditureMonthlySummaryFact fact
  inner join WorkOrder wo on fact.WorkOrderId = wo.WorkOrderId

where fact.FiscalBienniumId = 2015
  and wo.WorkOrderNumber like '00%'
  and wo.WorkOrderOpenDate between {d '2014-05-01'} and {d '2014-05-31'}

group by wo.WorkOrderNumber
) T0
, (
select wo.WorkOrderNumber AS C0
, wo.CurrentTitle AS C1
, wo.WorkOrderOpenDate AS C2
, wo.WorkOrderCloseDate AS C3
, sum(fact.ExpenditureAmount) as C4

from ExpenditureMonthlySummaryFact fact
  inner join WorkOrder wo on fact.WorkOrderId = wo.WorkOrderId

where fact.FiscalBienniumId = 2015
  and wo.WorkOrderNumber like '00%'
  and wo.WorkOrderOpenDate between {d '2014-05-01'} and {d '2014-05-31'}

group by wo.WorkOrderNumber
, wo.CurrentTitle
, wo.WorkOrderOpenDate
, wo.WorkOrderCloseDate
) T1
, (
select wo.WorkOrderNumber AS C0
, wo.CurrentTitle AS C1
, wo.WorkOrderOpenDate AS C2
, wo.WorkOrderCloseDate AS C3

from ExpenditureMonthlySummaryFact fact
  inner join WorkOrder wo on fact.WorkOrderId = wo.WorkOrderId

where fact.FiscalBienniumId = 2015
  and wo.WorkOrderNumber like '00%'
  and wo.WorkOrderOpenDate between {d '2014-05-01'} and {d '2014-05-31'}
) T2

where (T2.C0 = T0.C0 or T2.C0 is null and T0.C0 is null)
and (T2.C0 = T1.C0 or T2.C0 is null and T1.C0 is null)
and (T2.C1 = T1.C1 or T2.C1 is null and T1.C1 is null)
and T2.C2 = T1.C2
and T2.C3 = T1.C3

order by 1 asc
, 5 asc
, 2 asc
, 3 asc


Notice the final WHERE clause.  See how C2 and C3 are not joined the same way as the other columns.  That's what I mean by inconsistent.  I had a report that was missing a few rows.  This was why.
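If the null handling were applied consistently, you would expect those last two predicates to follow the same pattern as the others:

and (T2.C2 = T1.C2 or (T2.C2 is null and T1.C2 is null))
and (T2.C3 = T1.C3 or (T2.C3 is null and T1.C3 is null))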

This query structure also makes for very long-running queries.

cognostechie

It's definitely NOT a proper DW, because in a star schema DW you are not supposed to have the date in the Work Order table. A key for the date should be in the fact table, and it should be joined to a Date dimension. The filter as well as the display of the date would be on the Date dimension instead of the Work Order table (with a multi-fact model using conformed dimensions in FM).
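Roughly, I mean something like this (hypothetical key and dimension names):

select wo.WorkOrderNumber, sum(f.ExpenditureAmount) as ExpenditureAmount
from ExpenditureMonthlySummaryFact f
  inner join WorkOrder wo on f.WorkOrderId = wo.WorkOrderId
  inner join DateDim d    on f.OpenDateKey = d.DateKey   -- the date key lives on the fact, not on WorkOrder
where d.FullDate between '2014-05-01' and '2014-05-31'
group by wo.WorkOrderNumber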

Your understanding of joins is also incorrect. C2 and C3 are column names, not table names, so there would be no joins between them. The nulls are not used in the joins (as you mentioned earlier) but in the filter, which is ok.

I have never had a problem with performance when using grouping, and I have an FM model with more than 50 fact tables joined to the same dimensions, and that is for ERP data including manufacturing data. Not all of them are used in the same report, but many of them are. Joins and determinants should be defined properly and, above all, the Data Mart design should be proper. With the date in the Work Order table you will not be able to relate this to other facts easily, as there would be no common date dimension between them. I don't know how many more flaws there are, but that's one right there.

dougp

I had a long rant written, but let's just say, you're wrong in so many ways.  (For example, putting EmployeeBirthDate as an ID on the fact rather than an attribute on the Employee dimension doesn't seem like proper data mart design to me.)

One thing is for sure:  I need to reduce the number of users who can create reports against my multi-fact packages, improve my models by adding way more determinants, or both.  Since the important rarely gets prioritized as highly as the urgent, I'll probably attack the permissions problem first.  It's faster, so I have some hope of making progress between emergencies.

cognostechie

Quote from: dougp on 30 Aug 2022 05:12:03 PM
I had a long rant written, but let's just say, you're wrong in so many ways.  (For example, putting EmployeeBirthDate as an ID on the fact rather than an attribute on the Employee dimension doesn't seem like proper data mart design to me.)

Obviously you have never heard of the Kimball methodology. The Data Warehousing industry has the best practices already defined, mostly by Mr. Ralph Kimball, who is considered the great grandfather of Data Warehousing. The best practices have been made for a reason. Your mind is conditioned to think from a source system (snowflake) design perspective, and hence you will find this to be wrong, not because it is actually wrong.

Go with what you think is best. I am no longer going to post on this thread, as there is no point in helping someone who cannot differentiate between a source system design and a DW design, or between the need to increase redundancy to reduce joins vs. reducing redundancy at the cost of more joins -- someone who doesn't want to understand why master/transaction tables are used in source systems and why dimensions and facts are used in a DW.