----------------------------------------------------------------
EGI BROADCAST TOOL : https://operations-portal.egi.eu/broadcast
----------------------------------------------------------------
Publication from : Christos Triantafyllidis ctria@grid.auth.gr
Targets : VO managers/vlemed vlemed-vo-managers@biggrid.nl
----------------------------------------------------------------
Dear all,

we have planned the update of the PROD broker network software for Tuesday of next week, 29/11, at 11:00 UTC.

The update procedure requires stopping the whole network (all the brokers) and restarting it on the new version. Although this operation should normally take a few minutes, we are scheduling a downtime (DT) of 2 hours to be on the safe side.
What will be upgraded?
We are upgrading from ActiveMQ 5.3 to ActiveMQ 5.5, which resolves many issues that we have faced and reported in the past. Major issues that will be solved by this release:
- The periodic need to clean the persistent data store of the brokers (removing messages that have not been consumed, either because no one needs them or because their consumer is dead) in order to be able to restart them.
- Wildcard subscriptions can be used instead of CAMEL (which was a single point of failure).
Why now?
We have tested the new version on the TEST broker network during the last month and we are happy with the results. Our plan was to have the new version of the ActiveMQ software deployed before the end of the year. Deploying it in November will allow us to monitor the service closely before the Christmas vacations.
Who is affected?
Every producer or consumer of the PROD broker network. SAM, as the key user of the broker network, uses it for the following functions:

a) Retrieval of gLite WN results from WNs to the sam-nagios box
This is the only operation that will actually be affected. Since WNs will fail to either find or connect to the broker network, they will not send results back to the sam-nagios box, so no alarm should be raised. On the other hand, the job that wraps the WN probes will probably time out, which may cause an alarm on the CE/CREAM-CE jobSubmit probes. Normally sam-nagios boxes schedule one job submission per hour and the alarm is raised on the second WARN/CRIT result, so this should not raise any alarm either.
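To illustrate the reasoning in item a), here is a rough sketch of the arithmetic. It assumes that every job submission scheduled during the outage ends in a WARN/CRIT result and that submissions run exactly once per hour; the function names and the offset parameter are hypothetical and are not part of SAM or Nagios.

```python
def failed_checks(outage_minutes, first_check_offset, interval_minutes=60):
    """Count scheduled submissions that start while the brokers are down.

    first_check_offset is the number of minutes between the start of the
    outage and the first scheduled submission (hypothetical parameter).
    """
    if first_check_offset >= outage_minutes:
        return 0  # the outage ends before the first scheduled submission
    return 1 + (outage_minutes - 1 - first_check_offset) // interval_minutes


def alarm_raised(consecutive_failures, threshold=2):
    """The alarm is raised on the second WARN/CRIT result."""
    return consecutive_failures >= threshold


# Expected case: the brokers are back after a few minutes.
short_outage = failed_checks(outage_minutes=10, first_check_offset=5)
# Pessimistic case: the whole 2-hour DT window is needed.
long_outage = failed_checks(outage_minutes=120, first_check_offset=5)
```

Under these assumptions a short outage hits at most one hourly submission, which stays below the alarm threshold, while an outage that used the whole scheduled window could reach it.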
b) Submission of SAM results from sam-nagios boxes to the central DBs
The sam-nagios boxes will fail to either find or connect to the broker network during the DT period. All results are cached locally in a directory queue on the sam-nagios boxes, so if sending fails, they will be retried later. Results are timestamped, so this should not create any issue apart from delays.
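The store-and-retry behaviour described in item b) can be sketched roughly as follows. This is only an illustration of the idea, not the actual SAM implementation: the file layout, the function names and the `send` callable are assumptions made for the example.

```python
import json
import os
import tempfile
import time
import uuid


def enqueue(queue_dir, result, now=None):
    """Store one result as a timestamped file and return its path."""
    timestamp = time.time() if now is None else now
    record = {"timestamp": timestamp, "result": result}
    # Zero-padded timestamp prefix so that sorted names are oldest first.
    name = "%017.6f-%s.json" % (timestamp, uuid.uuid4().hex)
    final_path = os.path.join(queue_dir, name)
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(record, handle)
    os.rename(tmp_path, final_path)  # publish the finished file in one step
    return final_path


def flush(queue_dir, send):
    """Try to send queued results oldest first; keep whatever fails."""
    sent = 0
    for name in sorted(os.listdir(queue_dir)):
        if not name.endswith(".json"):
            continue  # skip files that are still being written
        path = os.path.join(queue_dir, name)
        with open(path) as handle:
            record = json.load(handle)
        try:
            send(record)  # hypothetical callable that talks to the broker
        except ConnectionError:
            break  # broker unreachable: leave the rest for the next run
        os.remove(path)
        sent += 1
    return sent
```

Because each record carries its own timestamp, a result delivered after the downtime still reports when it was measured, so the only visible effect is the delay.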
c) Submission of SAM results from sam-nagios boxes to the operation portal(s)
As with the central DBs, SAM results will stay in the directory queue. Since a sam-nagios box may reconnect to the broker network before the operation portal(s) do after the upgrade, there may be a time window in which messages are not consumed. The operation portal and SAM teams have already implemented a synchronization procedure which will fix such issues.
d) Other applications
Other applications that might unofficially use the PROD network without registering could also be affected. As we don't know their usage patterns, we cannot perform a risk analysis for them.
Thank you for your understanding,
Christos Triantafyllidis
EGI message broker task co-ordinator
----------------------------------------------------------------
link to this broadcast : https://operations-portal.egi.eu/broadcast/archive/id/547
----------------------------------------------------------------