How often do you restart your Cognos services?

Started by mjcotter, 29 May 2024 09:52:55 AM

mjcotter

Lately, we have been experiencing some random CPU/RAM spikes that have caused performance issues.  Currently we restart the services monthly.  These spikes have occurred 3 times in the last 6 or so weeks.  For a week or 2 after these issues, we got into the habit of restarting the services and experienced no further issues.  For the past couple of weeks, we have not restarted.

Wondering how frequently people are restarting their Cognos services?

I plan to dig into finding an "automated" way to accomplish this, if we decide to do so more frequently.  We own MetaManager, including the Performance module, so I will look there first.  Otherwise, will explore something via PowerShell or Batch.

Thanks in advance for any insight!!

dougp

This is a bit of an XY problem.  It seems you should identify the source of the problem before assuming a service restart will help.

What version of Cognos Analytics?
I had a problem in my Cognos environment that was intermittently causing CPU usage spikes.  I had identified the problem, but there was nothing I could do about it.  An upgrade eventually fixed it.

What has changed in your environment recently?
For example, I started having some severe problems with Cognos while Cognos was running on on-prem servers and our databases, domain controllers, etc. were migrating from on-prem to cloud services, whether as a lift-and-shift to cloud-based VMs or into SaaS services.  Problems continued, and are ongoing, because we are not yet fully migrated.

Technically, a service restart is not a difficult task.  I have a scheduled maintenance window that is automated weekly.  I'm on Windows, so I use the Task Scheduler and PowerShell to perform this.  But a cron job on Linux would allow the same functionality.
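
For anyone who wants a starting point, here is a minimal sketch of a scheduled restart, written in Python rather than PowerShell. It is an illustration, not the script described above. The service name is a hypothetical placeholder (check the real one in services.msc), and the 30-second pause is an arbitrary guess. It relies only on the standard Windows `net stop` and `net start` commands.

```python
import subprocess
import time

# Hypothetical service name; replace with the name shown in services.msc.
SERVICE_NAME = "IBM Cognos"


def restart_service(name, run=subprocess.run, pause_seconds=30):
    """Stop a Windows service, pause, then start it again.

    `run` is injectable so the command sequence can be dry-run
    without touching a real service.
    """
    # check=True raises CalledProcessError if the command exits non-zero.
    run(["net", "stop", name], check=True)
    # Arbitrary pause to let child processes exit before starting again.
    time.sleep(pause_seconds)
    run(["net", "start", name], check=True)
```

A Task Scheduler job (or a cron job on Linux, with the stop and start commands swapped for the local equivalents) could call `restart_service(SERVICE_NAME)` during the maintenance window.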

mjcotter

I agree with you that understanding the source of the problem would be ideal.  It's just not likely.  The "issue" occurred 3 times in April and that's it.  We opened a support case with IBM and the analyst suggested adding more RAM.  Our Dispatcher servers have 64GB currently.  Our Engineering team is pushing back on the RAM request because over a 7 week span, the average RAM usage on those boxes was 42% used (range was 4%-82%).  IBM said the issue is likely related to a report. 
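
As a side note on the average-versus-range numbers, here is a rough Python sketch of how the same samples can be summarized so that short spikes are visible next to the average. The sample values and the 80% threshold are made up for illustration; they are not our real measurements.

```python
from statistics import mean


def summarize(samples, threshold=80.0):
    """Return mean, min, max, and the share of samples at or above threshold."""
    if not samples:
        raise ValueError("need at least one sample")
    high = sum(1 for s in samples if s >= threshold)
    return {
        "mean": mean(samples),
        "min": min(samples),
        "max": max(samples),
        "share_high": high / len(samples),
    }


# Made-up RAM-usage percentages, not real measurements.
samples = [4, 35, 40, 42, 45, 41, 82, 47]
print(summarize(samples))
```

The point of the extra `share_high` figure is that a modest average can sit alongside occasional readings near the top of the range.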

We have been on 12.0.1 since mid-December.  We are an on-prem installation on Windows virtual servers in a distributed environment (2 gw servers, 2 CM servers, 2 dispatcher servers).  Nothing has changed in the Cognos environment.

The idea behind this thread is really to understand if there is a general consensus on how often to restart the Cognos services.  We are doing it monthly and have done it monthly through Cognos 10, 11, and now 12. 

Thanks.

dougp

tl;dr
I don't know about consensus, but what I'm doing seems to be working.


Life Story
12.0.1?

Cognos has enough intermittent problems without running a risky, half-baked version in production.  That's why I use only LTS releases.  Here's an example:

I upgraded from 10.2.1 to 11.0.4, an early release in version 11.0.  I was seeing spikes in RAM usage and saw that one reporting process was stuck.  There was a problem with one report that was holding up the queue for that process.  I have 4 reporting processes and the CPU usage would flatline at 25% per problem.  For the most part, it wasn't a huge deal.  Since I have 4 reporting processes, when one got stuck everything after would go around it.

One day our financial folks were presenting training -- the same training they had been doing for quite some time.  The entire class took the same action on the same report:  Filter | Create filter with a specific column selected.  It turns out that the Filter Condition dialog had trouble loading the available values box for a column that has millions of distinct values.  But the user experience was typically:  While the box is loading, add the value manually, OK, get on with my day and forget about it.  But Cognos would not kill the process after the OK or Cancel button was clicked.  It was still trying to load that Available Values list.  So when 10 people did this all at once, with only 4 processes available, my CPU usage hit 100%.  Cognos was unusable, but nobody bothered to tell me for almost an hour.  A little surprising with typically 250 users per day at that time.  I tried to investigate the problem, but nothing was responding.  The solution was to physically unplug the machine from the electrical grid.

That's the day I began monitoring resource usage at the OS level.
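
A minimal sketch of the kind of check that OS-level monitoring makes possible, in Python. It is not the monitoring setup described here. It assumes CPU (or RAM) percentages are already being collected at a fixed interval by some other tool, for example Windows performance counters or the third-party psutil package, and it only looks for a sustained plateau rather than a single spike. The threshold and run length are arbitrary.

```python
def sustained_breach(samples, threshold=90.0, run_length=5):
    """Return the index where the first run of `run_length` consecutive
    samples at or above `threshold` starts, or None if there is no such run."""
    streak = 0
    for i, value in enumerate(samples):
        if value >= threshold:
            streak += 1
            if streak == run_length:
                return i - run_length + 1
        else:
            # Any reading below the threshold resets the run.
            streak = 0
    return None
```

With a low threshold such as 25, the same function would also flag the "flatline at 25%" pattern of a single stuck reporting process.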

This problem persisted through 11.0.13 (LTS) because IBM was unable to reproduce the problem on their end.  At some point it got fixed.  It was no longer a problem in 11.1.7 (LTS).

So, I'm currently using 11.2.4FP3 and will remain on 11.2.4 until 12.0 reaches LTS.  Then I'll test thoroughly before deciding whether it's good to use.  And I'll probably still miss volume-related problems or edge cases that users could run into that I can't imagine.