Dear users of the NDPF,
The compute service at the NDPF is fully operational again. The inflow of cold air has been restored, which has allowed us to re-enable the sub-clusters that were switched off to conserve power and reduce heat.
As a result of this event, 325 jobs were terminated: approximately 200 Atlas jobs and 125 LHCb jobs. We expect that the community frameworks will re-submit, or have already re-submitted, these jobs either to Nikhef or to another site within the e-Infrastructure.
We apologize for the inconvenience caused by this event.
Regards, David Groep.
David Groep wrote:
Due to a failure of the cooling system, a large fraction of our worker nodes has been switched off. Jobs running on the subclusters "bulldozer" and "luilak" (both luilak-1 and luilak-2) have been forcibly terminated and are lost. This mainly affects Atlas and LHCb production jobs.
We will attempt to keep some nodes running, but may be forced to switch off more nodes to keep the temperature down...