
Cluster maintenance successfully completed


Author	Message
Posted on: 06.06.2012 [21:47]
cochrane
Paul Cochrane
Thread starter
Member since: 14.09.2010
Posts: 139
Dear user,

maintenance on the cluster system was completed earlier than expected. The entire cluster system (including the login nodes) is now accessible again. The jobs that were waiting in the queue are starting as expected.

Many thanks for your patience!

Your Cluster Team
Posted on: 07.06.2012 [11:02]
ikmadmin
Jens Bsdok
Member since: 16.01.2012
Posts: 1
Hi Paul,

maybe you could explain what caused the trouble (if you know by now),
so that it might be prevented next time.

Best regards, Jens
Posted on: 07.06.2012 [11:47]
cochrane
Paul Cochrane
Thread starter
Member since: 14.09.2010
Posts: 139
Hi Jens,

to be totally honest, we don't know exactly. However, the file system check found and corrected several errors on the file system, and it looks as though this is what caused the problem this time. We found that one user's jobs seemed to be causing various nodes to lose contact with the Object Storage Targets (OSTs), but there was nothing wrong with what he was doing that would have caused the problem. Our guess (we can't find out for certain) is that the hardware errors on the underlying filesystem coincided with the directories and files this user was accessing, and this caused the whole system to barf.

There are some things users can avoid so as not to put extra (unnecessary) stress on the filesystem, such as not using 'ls -l' in BIGWORK. Patrick is currently writing an FAQ about this, which will then be posted on the cluster system website.
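As a rough illustration of why 'ls -l' is a problem on a Lustre filesystem like BIGWORK (the directory and file names below are made up for the demo): plain 'ls' only reads the directory entries, which on Lustre involves just the metadata server, whereas 'ls -l' additionally stat()s every file, which means contacting the OSTs that hold each file's data objects.

```shell
#!/bin/sh
# Demo in a throwaway directory (stands in for a large directory on BIGWORK).
dir=$(mktemp -d)
touch "$dir"/result_1.dat "$dir"/result_2.dat

# Cheap: lists directory entries only (metadata server only on Lustre).
ls "$dir"

# Expensive on Lustre: stat()s every file to get sizes/permissions,
# which queries the OSTs. Avoid this in large BIGWORK directories;
# plain 'ls' (or Lustre's 'lfs find') is usually enough.
ls -l "$dir"

rm -r "$dir"
```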

Other than that, our InfiniBand network was showing bad links, and various things on that front weren't working as they should have. A restart of the InfiniBand switches (which we did as part of the maintenance, while everything was turned off) corrected several of these problems, and as a result we now actually have more compute nodes available in the cluster system :-)

We know that there is a bottleneck in the network, and to fix it we need to move from DDR to QDR InfiniBand. We are waiting for the remaining quotes so that we can get the relevant paperwork done and purchase a couple of QDR switches, which should relieve the network bottleneck somewhat. Long term we plan to move completely to QDR, but this is something that will take time and money. As soon as the first QDR switch arrives, we'll do a planned maintenance and reorganise the network structure to improve that side of things.

We're also currently considering using two scratch filesystems, so that when we perform maintenance on one, users can use the other and we don't have to turn absolutely everything off. It'll be a while before we can implement this, though...

Hopefully that explained some of what happened!

Cheers,

Paul



Last modified: 12.04.2011

Responsible: RRZN