Posted on: 06. 06. 2012 [21:47]
cochrane
Paul Cochrane
Thread starter
Member since: 14.09.2010
Posts: 145
Dear user,
the maintenance on the cluster system was completed much earlier than expected. The entire cluster system (including the login nodes) is available again, and the jobs which were waiting in the queue are now starting.
Many thanks for your patience!
Your Cluster Team
Posted on: 07. 06. 2012 [11:02]
ikmadmin
Jens Bsdok
Member since: 16.01.2012
Posts: 1
Hi Paul,
maybe you could explain what caused the trouble (if you know by now),
so that it might be prevented next time.
Best regards, Jens
Posted on: 07. 06. 2012 [11:47]
cochrane
Paul Cochrane
Thread starter
Member since: 14.09.2010
Posts: 145
Hi Jens,
To be totally honest, we don't know exactly. However, the file system check found and corrected several errors on the file system, and it looks as though this is what caused the problem this time. We found that one user's jobs seemed to be causing various nodes to lose contact with the Object Storage Targets (OSTs), but he wasn't doing anything wrong that would have caused the problem. Our guess (we can't find out for certain) is that the hardware errors on the underlying filesystem coincided with the directories and files that this user was accessing, and this caused the whole system to barf. There are some things that users can avoid so as not to put extra (unnecessary) stress on the filesystem (such as not using 'ls -l' in BIGWORK), and Patrick is currently writing an FAQ about this which will then be posted on the cluster system website.
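To give a rough idea of what the 'ls -l' advice is about (this is just my own illustration, not the FAQ Patrick is writing, and the path below is made up): a long listing has to stat() every file, and on Lustre the file sizes live on the OSTs, so each stat() means extra traffic to exactly the servers that were struggling. A plain name listing only asks the metadata server.

    import os

    BIGWORK_DIR = "/bigwork/username/results"  # hypothetical path, adjust to your own

    # Cheap: a plain name listing, roughly what "ls" does -- only the
    # metadata server has to answer.
    names = os.listdir(BIGWORK_DIR)
    print(len(names), "entries")

    # Expensive: one stat() per entry, which is what "ls -l" does under the
    # hood; on Lustre each stat() also has to contact the OSTs for the file
    # size, so in directories with many files this is the pattern to avoid
    # while the filesystem is under heavy load.
    sizes = {}
    for entry in os.scandir(BIGWORK_DIR):
        sizes[entry.name] = entry.stat().st_size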
Other than that, our InfiniBand network showed up bad links, and various things from that point of view weren't working as they should have. A restart of the InfiniBand switches (which we did as part of the maintenance while everything was turned off) corrected several of these problems, and as a result we now actually have more compute nodes available in the entire cluster system.
We know that there is a bottleneck in the network, and to fix this we need to move from DDR to QDR InfiniBand. We are waiting for the remaining quotes so that we can get the relevant paperwork done and purchase a couple of QDR switches, which should relieve the network bottleneck somewhat. Long term we plan to move completely to QDR, but this will take time and money. As soon as the first QDR switch is there we'll do a planned maintenance and reorganise the network structure so as to improve that side of things. We're also currently considering using two scratch filesystems, so that when we perform maintenance on one, users can use the other and we don't have to turn absolutely everything off. It'll be a while before we can implement this though...
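Just to put some rough numbers on the DDR versus QDR point (a back-of-the-envelope sketch with assumed link rates and a made-up file size, not a benchmark on our hardware):

    # A 4x DDR link carries roughly 16 Gbit/s of usable data and a 4x QDR
    # link roughly 32 Gbit/s (after encoding overhead) -- assumed figures,
    # not measurements from our cluster.
    DDR_4X_GBIT_PER_S = 16.0
    QDR_4X_GBIT_PER_S = 32.0

    def transfer_seconds(gigabytes, gbit_per_s):
        """Ideal time to push a file of the given size over a single link."""
        return gigabytes * 8.0 / gbit_per_s

    checkpoint_gb = 100.0  # hypothetical checkpoint written to scratch
    print("DDR: %.0f s" % transfer_seconds(checkpoint_gb, DDR_4X_GBIT_PER_S))
    print("QDR: %.0f s" % transfer_seconds(checkpoint_gb, QDR_4X_GBIT_PER_S))

In the ideal case that's the per-link transfer time halved, which is why the switch purchase is the first step we want to take against the bottleneck.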
Hopefully that explained some of what happened!
Cheers,
Paul