If you are unable to create a new account, please email support@bspsoftware.com

 

News:

MetaManager - Administrative Tools for IBM Cognos
Pricing starting at $2,100
Download Now    Learn More

Main Menu

Can I replace outlier values with an Average() of all other values?

Started by psrpsrpsr, 10 Oct 2016 03:10:37 PM

Previous topic - Next topic

psrpsrpsr

Hi folks, I have a dataset that has major outliers that I want to replace with an average of all the other values. For example:

YearMonth    Location ID      Count
201602           1234                   345
201603           1234                   333
201604           1234                   18,455    <<< I would like this value replaced with the average of other values for that Location ID (=342.67 in this case)
201605           1234                   350

I would like to test and replace the outlier above using a Boolean test (or whatever else is a best practice). I know how to find the values I want to modify using a CASE statement, but I'm not sure how to subsequently remove that value from the average

Here's what I assume would be the CASE statement psuedo-code:

CASE
WHEN COUNT >= 5000 THEN AVERAGE( REPLACE(([Count] FOR [Location ID], [YearMonth]), >=5000) )

Any feedback is welcome. Thank you!

psrpsrpsr

Any thoughts on this, folks? How do others deal with outlier values in this way?

hespora

If I understood the requirement correnctly, this should be what you want:

case when [cnt] >= 5000 then

total (
case when [cnt] >= 5000 then 0 else [cnt] end
for [locid]
)
/
total (
case when [cnt] >= 5000 then 0 else 1 end
for [locid]
)

else [cnt]
end


Note that this is a separate field, this is *not* replacing values for your data item.

psrpsrpsr

Thank you hespora! This got me 90% of the way there. I was getting errors thrown that I believe were due to the FOR clause coming before the 'End' of each CASE.

I really appreciate it!

hespora

that's weird tho; this was an exact copy&paste of the data item as i built it. Anyways, glad I could help! :)

psrpsrpsr

Hi Hespora, it looks like I spoke too soon. Allow me to clarify my objective: I want to average all values belonging to a Location ID and YearMonth EXCEPT FOR the offending value. Please see attached image for clarification.

In the query calculation you sent, I believe the numerator is assigning a zero to all of the offending values, when the objective is instead to return the average of all other values that are not outliers. Here is the table again, and I'll walk through how I understand the logic of your statement:

YearMonth    Location ID       Count
201602           1234                   345
201603           1234                   333
201604           1234                   18,455    <<< I would like this value replaced with the average of other values for that Location ID (=342.67 in this case)
201605           1234                   350

~~~~~~~~~~~~~~~~
case when [cnt] >= 5000 then     <<< 201604 MEETS CASE

total (
case when [cnt] >= 5000 then 0 else [cnt] end      <<< 201604 MEETS CASE AGAIN, SO A ZERO IS RETURNED
for [locid]
)
~~~~~~~~~~~~~~~~

So, here's a summary of my understanding of the issue: Targeting the offending values by using "CASE WHEN [count] >= 5000" is essentially looking at the aggregated values behind the aforementioned 201604 value of 18,455. There are 18,455 rows that are being rolled up into one. I want to disregard all of these individual rows –
1.) Can I operate on the context of various groups of fields in Cognos?
2.) Is there a similar function in Cognos as the GROUP BY / HAVING clauses in SQL?
That way I could exclude those grouped rows having a count >5000.

THANKS!

psrpsrpsr


hespora

Hi psr,

first off, my data item works only when the counts are already summed up per YearNo and LocID. If your count is an actual value on the database, great. If your count, however, is a calculation on the query we are talking about, then you need to create a subquery referencing your counting query. From your screenshot, I cannot really determine which is the case.

Secondly, from what you wrote, I dont think you got the entire definition of my data item (as you stopped midway). Let's go through it: (I'm ditching the code tags now, as those apparently can't be color formatted)

case when [cnt] >= 5000 then

   total (
      case when [cnt] >= 5000 then 0 else [cnt] end
      for [locid]
   )

   /
   total (
      case when [cnt] >= 5000 then 0 else 1 end
      for [locid]
   )


else [cnt]
end


- The red part sums up all the values of [count] for [locid] which are not outliers. Outliers are valued as zero and as such do not affect the sum.
- The orange part counts all the values of [count] for [locid] which are not outliers.
- Therefore, the red, green and orange parts together make up the average of all non-outlier values of [count] for [locid]
- now, the outer blue part only takes this calculation if the count value of the row is an outlier. If it is not, the original count value is used.

Please see screenshot attached. The field "avg of non-outliers" is the above definition, minus the blue parts. The field "count replaced if outlier" is the exact definition above.

psrpsrpsr

Wow, thank you for the detailed answer. I will give this a try next Monday and let you know the result. Thanks again - so awesome to have a forum like this where n00bs like me can get educated :)

psrpsrpsr

Hi Hespora, I'm stumped. The attached image shows the Data Items in my query, and snapshots of the logic from your last (very detailed!) post applied to my data, which is not giving me the expected values, although I believe my logic and syntax is

I broke out your red/orange/blue logic into columns for the numerator, denominator, and the whole formula.

My main questions are:
1.) When you say:
Quote from: hespora on 14 Oct 2016 02:20:32 AMIf your count is an actual value on the database, great. If your count, however, is a calculation on the query we are talking about, then you need to create a subquery referencing your counting query. From your screenshot, I cannot really determine which is the case.
...do you mean: "If your count is a fact dimension in the framework model, then this calculation will work. If the count is derived from a CASE statement, (e.g. TOTAL(CASE WHEN value = x THEN 1 ELSE 0), then I need a separate subquery?

If I have paraphrased correctly, then the answer is: yes, the count is an actual fact field in the database.

2.) How could the numerator be returning the sum [Fact Count] for the entire club INCLUDING the outlier value, when it CLEARLY meets the Boolean test of being >= 5000, therefore it should be 'zeroed out'?!?!

Thanks for any and all input!






hespora

Hi psr,


sry for the late reply, had other things to deal with for a couple days.

Looking at your data, especially the denominator value rather than the numerator, tells me that your fact count must be an aggregate. The row level on your database is something smaller than YearMonth AND Location ID. The calculations we created, however, are done on row level rather than aggregate level, and so, apparently on row level, Fact Count never is >= 5000.

I find The quickest and easiest way to fix this, rather than amending the formulas, really is to create a subquery. Create a query1 with just YearMonth, location ID, and fact count. Remove all calculations.
Then create a query2 which should point to query1 instead of to the data source. In that query2, create your calculations.

hespora

Never mind, I suspect (untested, and cannot test right now) this should also work:

case when total ([cnt] for [locid],[yearmonth]) >= 5000 then

   total (
      case when total ([cnt] for [locid],[yearmonth]) >= 5000 then 0 else total ([cnt] for [locid],[yearmonth]) end
      for [locid]
   )

   /
   total (
      case when total ([cnt] for [locid],[yearmonth]) >= 5000 then 0 else 1 end
      for [locid]
   )


else total ([cnt] for [locid],[yearmonth])
end

Red and orange *probably* can be optimized, but I don't have access to a system right now where I could play around.

psrpsrpsr

The additional context using the FOR clause inside the Total() functions worked!!!! You are a scholar and a gentleman, thank you kindly for your help!

hespora


Lynn