---------------------------------------------------------------------------------------------------------------
EGI BROADCAST TOOL : https://operations-portal.in2p3.fr/broadcast/send
---------------------------------------------------------------------------------------------------------------
Publication from : Frederic Schaer frederic.schaer@cea.fr
----------------------------------------------------------------------------------------------------------------
Dear VOs and users,
The CMS experiment found that a worker node (WN) at the GRIF/IRFU site was silently corrupting files (thanks, CMS). After investigation, it appears that a CPU on the machine corrupted files while they were being compressed, but only when the compression task ran on core #8 of CPU socket #0 or on its hyperthreaded sibling, core #28.
Unfortunately, this hardware issue went unnoticed because none of the hardware and software system checks caught it: neither the Dell nor the Intel diagnostic tools could find and report it. Unfortunately also, ROOT files seem to be affected, or at least files created by the CMS software, which includes ROOT and recompiled copies of various compression tools. We also found that files compressed with the system "bzip2" tool were corrupted, but not files created with the system lzma or gzip tools, for instance.
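For sites that want to run a similar per-core check on their own worker nodes, here is a minimal sketch in Python. It is not the procedure used at GRIF/IRFU: it assumes a Linux host (for os.sched_setaffinity) and uses Python's bz2 module as a stand-in for the system bzip2 tool. The function names roundtrip_ok and check_cores are made up for this example. A clean result does not prove a core is healthy, since the comparison itself runs on the core under test.

```python
import bz2
import os


def roundtrip_ok(payload: bytes) -> bool:
    """Compress then decompress payload with bz2; True if the bytes survive intact."""
    try:
        restored = bz2.decompress(bz2.compress(payload))
    except (OSError, ValueError):
        # bz2 reports a damaged or truncated stream with these exceptions.
        return False
    return restored == payload


def check_cores(payload: bytes, rounds: int = 3) -> dict:
    """Pin this process to each allowed core in turn and run the round trip.

    Returns a dict mapping core number to True (all rounds passed) or False.
    On platforms without sched_setaffinity the test runs unpinned under key None.
    """
    if not hasattr(os, "sched_setaffinity"):
        return {None: all(roundtrip_ok(payload) for _ in range(rounds))}
    allowed = os.sched_getaffinity(0)
    results = {}
    try:
        for core in sorted(allowed):
            os.sched_setaffinity(0, {core})  # run only on this logical core
            results[core] = all(roundtrip_ok(payload) for _ in range(rounds))
    finally:
        os.sched_setaffinity(0, allowed)  # restore the original CPU mask
    return results


if __name__ == "__main__":
    # Mix of incompressible and highly compressible data (arbitrary choice).
    data = os.urandom(1 << 20) + b"GRIF" * (1 << 18)
    for core, ok in check_cores(data).items():
        print(core, "ok" if ok else "CORRUPTION")
```

Logical core numbers as seen by the operating system may not match the socket/core numbering quoted above, so the sketch simply walks every core the process is allowed to use.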
Final bad news : we have no way to identify which files -your files- were produced on that machine.
We would therefore like to warn you about this problem, giving you as many details as possible.
The machine name is : wn328.datagrid.cea.fr
The Ethernet MAC address of the main network interface is : 00:8C:FA:F2:93:1E
The host IPs are : 192.54.205.14 (v4) and 2001:660:3031:110:10::328/64 (v6)
The host entered production on Sep. 21 @ 9H49.
The host is running an up-to-date SL 6.8.
Of course, the host was finally taken out of production (thanks again to CMS ;) ) on November 25 2016 @ 10H01 CET, and the bad CPU should be replaced this week.
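Since the files themselves cannot be traced back to the machine, one partial approach is to narrow the search to files created while the host was in production. The Python sketch below shows only that time filter. It assumes that the entry date refers to 2016 and to local time (CEST, UTC+2, on that date), neither of which is stated explicitly here. The function names and the sample file names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2))  # Central European Summer Time
CET = timezone(timedelta(hours=1))   # Central European Time

# Assumption: the host entered production on Sep. 21 *2016*, 09:49 local time.
WINDOW_START = datetime(2016, 9, 21, 9, 49, tzinfo=CEST)
# From the notice: taken out of production on November 25 2016 at 10:01 CET.
WINDOW_END = datetime(2016, 11, 25, 10, 1, tzinfo=CET)


def in_suspect_window(created: datetime) -> bool:
    """True if a timezone-aware creation time falls inside the production window."""
    if created.tzinfo is None:
        raise ValueError("creation time must be timezone-aware")
    return WINDOW_START <= created <= WINDOW_END


def suspect_files(records):
    """Given (name, creation_time) pairs, return the names inside the window."""
    return [name for name, created in records if in_suspect_window(created)]


if __name__ == "__main__":
    # Hypothetical catalogue entries, for illustration only.
    sample = [
        ("before.root", datetime(2016, 9, 1, 12, 0, tzinfo=timezone.utc)),
        ("during.root", datetime(2016, 10, 15, 12, 0, tzinfo=timezone.utc)),
        ("after.root", datetime(2016, 12, 1, 12, 0, tzinfo=timezone.utc)),
    ]
    print(suspect_files(sample))
```

A file inside the window is only a candidate: most files created at the site in that period were produced on other, healthy worker nodes.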
We would like to apologize for this unwelcome hardware failure, as we already know that finding the affected files will be hard work that you would all have preferred to avoid.
Best regards
The GRIF/IRFU admins
----------------------------------------------------------------------------------------------------------------
link to this broadcast : https://operations-portal.in2p3.fr/broadcast/archive/1591
----------------------------------------------------------------------------------------------------------------