CPM System Availability Issues and Mitigation

This page summarizes the outages and data load issues that have occurred since the March go-live date and the steps that have been taken to prevent them from happening again:

 

ID | Issue | Mitigation task | Outages caused by | Deployment date / status
ID: 564
Issue: Runaway processes, possibly initiated by a faulty report, a replication process, or another cause, have slowed the TM1 system severely and prevented the nightly load process from completing.
Mitigation task:
1. Install Fix Pack 3 for TM1 9.5.2. The release notes list several issues that refer to server crashes: http://www-01.ibm.com/support/docview.wss?uid=swg27036823
2. Created a method to kill any process that runs longer than 1 hour.
Outages caused by: May 31, 2013; June 20, 2013; July 24, 2013
Deployment date / status: All items completed as of 8/28/2013

ID: 581
Issue: A report could be causing a long-running process on TM1, slowing the TM1 system severely and preventing the nightly load process from completing.
Mitigation task: Set up Cognos BI audit logging so the report can be identified.
Deployment date / status: 9/17/2013

ID: 574, 580
Issue: Some questions about system stability relate to data entry errors that we cannot trace because logging was turning off at random in TM1.
Mitigation task: Investigated the root cause of auto-logging turning off on some cubes and fixed TM1 logging. At this time we are unable to prove that any data was lost. Please report any issues or concerns as soon as you identify them and we will investigate, starting with the system logs.
Deployment date / status: 8/28/2013

ID: 577
Issue: Data loads failed when CMM had a date that is outside the SQL Server minimum date range.
Mitigation task: Addressed in the load process.
Outages caused by: April 25, 2013
Deployment date / status: 9/9/2013

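A guard of the kind described for issue 577 might look like the sketch below. It assumes the target column uses the SQL Server `datetime` type, whose range starts at 1753-01-01; this page does not say which date type the load uses, so the bound may need adjusting. The function name `split_by_date_range` and the record layout are hypothetical and not taken from the actual load process.

```python
from datetime import date

# Lower bound of the SQL Server `datetime` type. Assumption: the load
# target uses `datetime`; adjust if it uses another date type.
SQL_DATETIME_MIN = date(1753, 1, 1)


def split_by_date_range(records, date_field, minimum=SQL_DATETIME_MIN):
    """Split records into (loadable, rejected) by a minimum date.

    `records` is a list of dicts and `date_field` names a key holding a
    `datetime.date`. Both are hypothetical stand-ins for the real feed.
    """
    loadable, rejected = [], []
    for record in records:
        if record[date_field] >= minimum:
            loadable.append(record)
        else:
            rejected.append(record)
    return loadable, rejected


if __name__ == "__main__":
    rows = [
        {"id": 1, "start": date(2013, 4, 25)},
        {"id": 2, "start": date(1, 1, 1)},  # below the datetime minimum
    ]
    ok, bad = split_by_date_range(rows, "start")
    print(len(ok), "loadable;", len(bad), "rejected")
```

Rejected rows can be written to an error report so that one bad source date does not stop the whole nightly load.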
ID: 578
Issue: Users reported that their data disappeared from one day to the next or after an outage.
Mitigation task: Verified that data is saved when TM1 crashes and that all logged changes are recalled.
Deployment date / status: 8/28/2013

ID: 579
Issue: Users were reporting slowness between 3pm and 5pm.
Mitigation task:
1. Added 96 GB of memory to the system on 8/13/13, bringing the total to 192 GB.
2. Monitored the TM1 server for 2 weeks between 3pm and 5pm and saw no sign on the server that it was experiencing any load issues. Users should report this behaviour when it occurs. Documentation of what they were doing at the time would help.
Deployment date / status: 9/9/2013

ID: n/a
Issue: Source system outages may result in CPM load processes being unable to complete. (On June 17 the iVantage system was not responding.)
Mitigation task: Please note that there is still a risk of data load failures due to issues with source systems or source system data. If the nightly load fails, IT will work with entity Budget Offices to determine whether to take the system offline during work hours to re-run the load process or to wait until the next day. This decision and its impacts will be communicated to CPM participants.
Outages caused by: June 17, 2013
Deployment date / status: 9/8/13

ID: n/a
Issue: The password was not changed on the SQL Server run credential, which caused a load job not to start.
Mitigation task: Switch to using a non-expiring password.
Outages caused by: May 13, 2013; May 30, 2013
Deployment date / status: May 30, 2013

ID: 575
Issue: Two employee IDs in iVantage were the same, causing a load issue.
Mitigation task:
1. Change the load to rely on the PEID rather than the employee name. This is a more far-reaching change than anticipated and extensive testing is needed to deploy it. Preliminary testing is complete. The change is ready for test now and is awaiting testing resources.
2. Explore whether the load process can proceed despite the issue (e.g. by ignoring the 2nd PEID), just reporting minor errors.
Outages caused by: March 26, 2013
Deployment date / status: Item 1 - Waiting for test
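
The second mitigation option for issue 575 (proceeding with the load while ignoring the 2nd PEID and reporting a minor error) could be sketched as follows. This is only an illustration under assumptions: the function name `drop_duplicate_peids`, the `peid` key, and the dict-per-row layout are hypothetical and do not reflect the real iVantage feed or load process.

```python
def drop_duplicate_peids(rows, key="peid"):
    """Keep the first row for each PEID and collect a warning for the rest.

    `rows` is a list of dicts; `key` names the field holding the PEID.
    Returns (kept_rows, warnings) so the load can continue and the
    duplicates can be reported as minor errors.
    """
    seen = set()
    kept, warnings = [], []
    for index, row in enumerate(rows):
        peid = row[key]
        if peid in seen:
            # Second and later occurrences are skipped, not fatal.
            warnings.append(f"row {index}: duplicate PEID {peid} ignored")
            continue
        seen.add(peid)
        kept.append(row)
    return kept, warnings


if __name__ == "__main__":
    feed = [
        {"peid": 100, "name": "A. Smith"},
        {"peid": 200, "name": "B. Jones"},
        {"peid": 100, "name": "A. Smith"},
    ]
    kept, warnings = drop_duplicate_peids(feed)
    print(len(kept), "rows kept")
    for message in warnings:
        print(message)
```

Keeping the first occurrence is an arbitrary choice here; whichever record is kept, the warnings give the Budget Offices a list of rows to correct in the source system.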