Can I replace outlier values with an Average() of all other values?

psrpsrpsr · 10 Oct 2016 03:10:37 PM

Hi folks, I have a dataset that has major outliers that I want to replace with an average of all the other values. For example:

YearMonth Location ID Count
201602 1234 345
201603 1234 333
201604 1234 18,455 <<< I would like this value replaced with the average of other values for that Location ID (=342.67 in this case)
201605 1234 350

I would like to test and replace the outlier above using a Boolean test (or whatever else is a best practice). I know how to find the values I want to modify using a CASE statement, but I'm not sure how to subsequently remove that value from the average

Here's what I assume would be the CASE statement psuedo-code:

CASE
WHEN COUNT >= 5000 THEN AVERAGE( REPLACE(([Count] FOR [Location ID], [YearMonth]), >=5000) )

Any feedback is welcome. Thank you!

psrpsrpsr · 12 Oct 2016 07:46:33 AM

Any thoughts on this, folks? How do others deal with outlier values in this way?

hespora · 13 Oct 2016 03:22:23 AM

If I understood the requirement correnctly, this should be what you want:

Code Select

case when [cnt] >= 5000 then 

total (
case when [cnt] >= 5000 then 0 else [cnt] end
for [locid]
)
/
total (
case when [cnt] >= 5000 then 0 else 1 end
for [locid]
)

else [cnt]
end

Note that this is a separate field, this is *not* replacing values for your data item.

psrpsrpsr · 13 Oct 2016 08:55:02 AM

Thank you hespora! This got me 90% of the way there. I was getting errors thrown that I believe were due to the FOR clause coming before the 'End' of each CASE.

I really appreciate it!

hespora · 13 Oct 2016 09:37:24 AM

that's weird tho; this was an exact copy&paste of the data item as i built it. Anyways, glad I could help! :)

psrpsrpsr · 13 Oct 2016 12:06:16 PM

Hi Hespora, it looks like I spoke too soon. Allow me to clarify my objective: I want to average all values belonging to a Location ID and YearMonth EXCEPT FOR the offending value. Please see attached image for clarification.

In the query calculation you sent, I believe the numerator is assigning a zero to all of the offending values, when the objective is instead to return the average of all other values that are not outliers. Here is the table again, and I'll walk through how I understand the logic of your statement:

YearMonth Location ID Count
201602 1234 345
201603 1234 333
201604 1234 18,455 <<< I would like this value replaced with the average of other values for that Location ID (=342.67 in this case)
201605 1234 350

~~~~~~~~~~~~~~~~
case when [cnt] >= 5000 then <<< 201604 MEETS CASE

total (
case when [cnt] >= 5000 then 0 else [cnt] end <<< 201604 MEETS CASE AGAIN, SO A ZERO IS RETURNED
for [locid]
)
~~~~~~~~~~~~~~~~

So, here's a summary of my understanding of the issue: Targeting the offending values by using "CASE WHEN [count] >= 5000" is essentially looking at the aggregated values behind the aforementioned 201604 value of 18,455. There are 18,455 rows that are being rolled up into one. I want to disregard all of these individual rows –
1.) Can I operate on the context of various groups of fields in Cognos?
2.) Is there a similar function in Cognos as the GROUP BY / HAVING clauses in SQL? That way I could exclude those grouped rows having a count >5000.

THANKS!

psrpsrpsr · 13 Oct 2016 12:06:58 PM

Here the image I was referring to:

hespora · 14 Oct 2016 02:20:32 AM

Hi psr,

first off, my data item works only when the counts are already summed up per YearNo and LocID. If your count is an actual value on the database, great. If your count, however, is a calculation on the query we are talking about, then you need to create a subquery referencing your counting query. From your screenshot, I cannot really determine which is the case.

Secondly, from what you wrote, I dont think you got the entire definition of my data item (as you stopped midway). Let's go through it: (I'm ditching the code tags now, as those apparently can't be color formatted)

case when [cnt] >= 5000 then

total (
case when [cnt] >= 5000 then 0 else [cnt] end
for [locid]
)
/
total (
case when [cnt] >= 5000 then 0 else 1 end
for [locid]
)

else [cnt]
end

- The red part sums up all the values of [count] for [locid] which are not outliers. Outliers are valued as zero and as such do not affect the sum.
- The orange part counts all the values of [count] for [locid] which are not outliers.
- Therefore, the red, green and orange parts together make up the average of all non-outlier values of [count] for [locid]
- now, the outer blue part only takes this calculation if the count value of the row is an outlier. If it is not, the original count value is used.

Please see screenshot attached. The field "avg of non-outliers" is the above definition, minus the blue parts. The field "count replaced if outlier" is the exact definition above.

psrpsrpsr · 14 Oct 2016 02:38:12 PM

Wow, thank you for the detailed answer. I will give this a try next Monday and let you know the result. Thanks again - so awesome to have a forum like this where n00bs like me can get educated :)

psrpsrpsr · 18 Oct 2016 02:59:48 PM

Hi Hespora, I'm stumped. The attached image shows the Data Items in my query, and snapshots of the logic from your last (very detailed!) post applied to my data, which is not giving me the expected values, although I believe my logic and syntax is

I broke out your red/orange/blue logic into columns for the numerator, denominator, and the whole formula.

My main questions are:
1.) When you say:

Quote from: hespora on 14 Oct 2016 02:20:32 AMIf your count is an actual value on the database, great. If your count, however, is a calculation on the query we are talking about, then you need to create a subquery referencing your counting query. From your screenshot, I cannot really determine which is the case.

...do you mean: "If your count is a fact dimension in the framework model, then this calculation will work. If the count is derived from a CASE statement, (e.g. TOTAL(CASE WHEN value = x THEN 1 ELSE 0), then I need a separate subquery?

If I have paraphrased correctly, then the answer is: yes, the count is an actual fact field in the database.

2.) How could the numerator be returning the sum [Fact Count] for the entire club INCLUDING the outlier value, when it CLEARLY meets the Boolean test of being >= 5000, therefore it should be 'zeroed out'?!?!

Thanks for any and all input!

hespora · 21 Oct 2016 03:22:13 AM

Hi psr,

sry for the late reply, had other things to deal with for a couple days.

Looking at your data, especially the denominator value rather than the numerator, tells me that your fact count must be an aggregate. The row level on your database is something smaller than YearMonth AND Location ID. The calculations we created, however, are done on row level rather than aggregate level, and so, apparently on row level, Fact Count never is >= 5000.

I find The quickest and easiest way to fix this, rather than amending the formulas, really is to create a subquery. Create a query1 with just YearMonth, location ID, and fact count. Remove all calculations.
Then create a query2 which should point to query1 instead of to the data source. In that query2, create your calculations.

hespora · 21 Oct 2016 05:45:28 AM

Never mind, I suspect (untested, and cannot test right now) this should also work:

case when total ([cnt] for [locid],[yearmonth]) >= 5000 then

total (
case when total ([cnt] for [locid],[yearmonth]) >= 5000 then 0 else total ([cnt] for [locid],[yearmonth]) end
for [locid]
)
/
total (
case when total ([cnt] for [locid],[yearmonth]) >= 5000 then 0 else 1 end
for [locid]
)

else total ([cnt] for [locid],[yearmonth])
end
Red and orange *probably* can be optimized, but I don't have access to a system right now where I could play around.

psrpsrpsr · 21 Oct 2016 08:18:34 AM

The additional context using the FOR clause inside the Total() functions worked!!!! You are a scholar and a gentleman, thank you kindly for your help!

hespora · 21 Oct 2016 09:49:08 AM

Glad I could help! :)

Lynn · 21 Oct 2016 09:55:02 AM

Very well done indeed! Take an applaud from me :)

BSP Software	Resources	About Us
MetaManager	BSP Software Training	BSP Software
Integrated Control Suite	YouTube Channel	Micro Strategies Inc
Security Migration		IBM Cognos
Integrated Management Suite

If you are unable to create a new account, please email support@bspsoftware.com

COGNOiSe.com - The IBM Cognos Community

News:

Can I replace outlier values with an Average() of all other values?

psrpsrpsr

psrpsrpsr

hespora

psrpsrpsr

hespora

psrpsrpsr

psrpsrpsr

hespora

psrpsrpsr

psrpsrpsr

hespora

hespora

psrpsrpsr

hespora

Lynn