SA 256

Original text by D.Shin, revisions by S. Kirklin
A more in-depth, and commensurately more complicated, guide to system administration on our group clusters.

Job management

PBS/torque - pbs_server/pbs_mom

On the master node, pbs_server should be running to accept jobs, and pbs_mom should be running on all compute nodes. Please do NOT restart or kill a running pbs_server daemon unless it is really needed: doing so will reset all running jobs.

josquin ~ # ps aux|grep pbs_server
root 6728 0.0 0.0 15500 3560 ? Ss Jan02 16:39 pbs_server


If PBS-related commands such as qstat, qsub, and qalter are not working, use the command above to confirm that pbs_server is not running, then launch pbs_server manually. DO NOT restart the entire PBS service with /etc/init.d/pbs restart.

josquin ~ # /usr/local/sbin/pbs_server
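
Because restarting a live pbs_server resets running jobs, it may help to script the check so that the daemon is launched only when it is really absent. The following is a minimal sketch, not the group's established procedure: the helper name has_process is hypothetical, and the usage example assumes that ps -e -o comm= is available to list process names.

```shell
# has_process NAME: read a list of process names on stdin (one per line)
# and succeed only if NAME appears as an exact, whole-line match.
has_process() {
    grep -qxF -- "$1"
}

# Example, to be run as root on the master node. It is left commented out
# so that sourcing this file never starts a daemon by accident:
#
#   if ps -e -o comm= | has_process pbs_server; then
#       echo "pbs_server is already running; leaving it alone"
#   else
#       /usr/local/sbin/pbs_server
#   fi
```

Matching the whole line keeps pbs_mom, or the grep process itself, from being mistaken for pbs_server.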

Maui

Maui is the scheduler for job management on our clusters. It starts at boot along with pbs_server. If none of the Maui-related commands are working, such as showq, diagnose -f (aliased as fs), diagnose -p (aliased as p), and showstart, relaunch Maui with:

josquin ~ # /usr/local/maui/sbin/maui

Fairshare Scheme

A fairshare scheme is applied on all clusters; its settings can be found in the maui.cfg file in /usr/local/maui. Jobs are launched in order of priority, which is a weighted combination of several factors, such as fairshare and the resources requested.
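
To see which fairshare and weighting settings are currently active, one option is to filter maui.cfg for the relevant parameters. This is only a sketch: the function name show_fairshare is hypothetical, and the pattern assumes that the fairshare parameters start with FS and the priority weights end in WEIGHT, which may not cover every setting in our file.

```shell
# show_fairshare CFG: print the non-comment lines of a Maui config file
# that mention a parameter starting with FS or ending in WEIGHT.
# ASSUMPTION: this naming pattern covers the settings of interest.
show_fairshare() {
    grep -v '^[[:space:]]*#' "$1" |
        grep -E '(^|[^[:upper:]])FS[[:upper:]]+|[[:upper:]]+WEIGHT'
}
```

On a master node this would be called as show_fairshare /usr/local/maui/maui.cfg.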

NEED TO UPDATE THE FAIRSHARE SCHEME THEN PUT THAT INFORMATION HERE.


Package management

To install, remove, or upgrade a program on a cluster, use that cluster's package management system. Wikipedia has a good summary of the various package management systems on Linux.

Victoria

Its OS is CentOS 5.2, which uses rpm, the most common Linux package management system. To actually install packages, you can use yum:


victoria ~ # yum search scipy
Loading "fastestmirror" plugin
Loading "priorities" plugin
Loading "downloadonly" plugin
Loading mirror speeds from cached hostfile
 * rpmforge: fr2.rpmfind.net
 * base: yum.singlehop.com
 * updates: mirror.sanctuaryhost.com
 * addons: mirror.team-cymru.org
 * extras: pubmirrors.reflected.net
rpmforge                  100% |=========================| 1.1 kB    00:00
base                      100% |=========================| 2.1 kB    00:00
updates                   100% |=========================| 1.9 kB    00:00
addons                    100% |=========================|  951 B    00:00
extras                    100% |=========================| 2.1 kB    00:00
Excluding Packages in global exclude list
Finished
0 packages excluded due to repository priority protections
python-numpy.x86_64 : Fast multidimensional array facility for Python
python-numpy.x86_64 : Fast multidimensional array facility for Python
victoria ~ # yum install python-numpy

Josquin/Byrd/Palestrina

Gentoo is installed on palestrina, josquin, and byrd, and it uses Portage for package management. To search the available software and install a package:

palestrina ~ # emerge -s scipy
Searching...
[ Results for search key : scipy ]
[ Applications found : 1 ]

*  sci-libs/scipy
      Latest version available: 0.7.2-r1
      Latest version installed: 0.7.2-r1
      Size of files: 13,340 kB
      Homepage:      http://www.scipy.org/ http://pypi.python.org/pypi/scipy
      Description:   Scientific algorithms library for Python
      License:       BSD

palestrina ~ # emerge scipy

Encina

Ubuntu 8.04 server is installed on encina, and it uses APT for package management.
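
Since each cluster uses a different package manager, a small lookup can serve as a reminder of which command applies where. This is a sketch only: the function name pkg_install_cmd is hypothetical, it prints the command rather than running it, and python-scipy is an assumed package name for the Ubuntu repository.

```shell
# pkg_install_cmd HOST PACKAGE: print (do not run) the install command
# for the package manager used on HOST, as described in this section.
pkg_install_cmd() {
    case "$1" in
        victoria)                 echo "yum install $2" ;;
        josquin|byrd|palestrina)  echo "emerge $2" ;;
        encina)                   echo "apt-get install $2" ;;
        *)                        echo "unknown host: $1" >&2; return 1 ;;
    esac
}

pkg_install_cmd encina python-scipy   # prints: apt-get install python-scipy
```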

Accessibility

iptables

iptables is a kernel-level firewall that blocks access to any port that has not been explicitly opened.
On Wolverton clusters, all ports are closed except 22 (ssh), 25 (mail), 80 (http), 443 (https), and 3573 (DevMan[2]). The rule files are /etc/iptables.bak (josquin, byrd, palestrina) and /etc/sysconfig/iptable.save (victoria).

kaien@josquin ~$ sudo /sbin/iptables -L
Password:
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ssh
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:smtp
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:http
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:https
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:3573
DROP all -- anywhere anywhere
Chain FORWARD (policy ACCEPT)
target prot opt source destination
DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337
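
To check quickly which ports a node currently accepts, the listing above can be reduced to just the destination ports of the ACCEPT rules. A minimal sketch, assuming the listing format shown above; the function name open_ports is hypothetical.

```shell
# open_ports: read `iptables -L` output on stdin and print the destination
# port (or service name) of every ACCEPT rule that names one.
open_ports() {
    sed -n 's/^ACCEPT .*dpt:\([^ ]*\).*$/\1/p'
}

# Usage (as root): /sbin/iptables -L INPUT | open_ports
```

For the INPUT chain shown above, this should list ssh, smtp, http, https, and 3573, one per line.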

Fail2ban

Fail2ban is a program that bans an IP address once it makes more than a set number of malicious attempts; it works by adding rules to iptables. The configuration file, jail.conf, can be found in the /etc/fail2ban directory.

kaien@josquin /etc/fail2ban $ sudo /sbin/iptables -L 
Chain INPUT (policy ACCEPT) 
target prot opt source destination 
fail2ban-BadBots tcp -- anywhere anywhere multiport dports http,https 
fail2ban-SSH tcp -- anywhere anywhere tcp dpt:ssh 
ACCEPT all -- anywhere anywhere 
ACCEPT all -- anywhere anywhere 
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED 
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ssh 
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:smtp 
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:http 
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:https 
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:3573 
DROP all -- anywhere anywhere 

Chain FORWARD (policy ACCEPT) 
target prot opt source destination 
DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337 

Chain OUTPUT (policy ACCEPT) 
target prot opt source destination 
DROP tcp -- anywhere anywhere tcp spt:31337 dpt:31337 

Chain fail2ban-BadBots (1 references) 
target prot opt source destination 
RETURN all -- anywhere anywhere 

Chain fail2ban-SSH (1 references) 
target prot opt source destination 
RETURN all -- anywhere anywhere
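
In the listing above both fail2ban chains contain only a RETURN rule, meaning nothing is banned at the moment. When bans are active, the banned addresses can be pulled out of a chain as sketched below. The function name banned_ips is hypothetical, and the sketch assumes that each ban appears as a DROP rule in the chain, which depends on the action configured in jail.conf.

```shell
# banned_ips: read `iptables -L CHAIN -n` output on stdin and print the
# source address (4th column) of each DROP rule.
# ASSUMPTION: bans are recorded as DROP rules; other actions may differ.
banned_ips() {
    awk '$1 == "DROP" { print $4 }'
}

# Usage (as root): /sbin/iptables -L fail2ban-SSH -n | banned_ips
```

The -n flag keeps iptables from resolving addresses to host names, so the output is a plain list of IP addresses.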

/etc/hosts.allow, /etc/hosts.deny 

Access to Wolverton clusters is allowed only from the IP addresses listed in the /etc/hosts.allow files. A group member's IP address can be added to that file to grant access.

# 
# hosts.allow This file describes the names of the hosts which are 
# allowed to use the local INET services, as decided 
# by the '/usr/sbin/tcpd' server. 
# 
#sshd: *.northwestern.edu: allow 
#sshd: phasepusan.metsce.psu.edu: allow 
# 

# encina 
sshd: 129.105.92.49: allow 
# byrd 
sshd: 165.124.29.202: allow 
# victoria 
sshd: 165.124.29.204: allow 
# morales 
sshd: 129.105.12.20: allow 
# guerrero 
sshd: 129.105.12.19 : allow 
# tallis 
sshd: 165.124.29.197: allow 
# quest 
sshd: 165.124.130.5: allow 
sshd: 165.124.130.6: allow 
sshd: 165.124.130.7: allow 
sshd: 165.124.130.8: allow 
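
Adding an entry by hand is simple, but a helper can keep the file free of duplicates and keep the comment-then-rule layout used above. This is a sketch under stated assumptions: the function name allow_sshd is hypothetical, and the file to edit is passed explicitly so the helper can be tried on a copy first.

```shell
# allow_sshd FILE LABEL IP: append "# LABEL" and "sshd: IP: allow" to FILE
# unless an sshd entry for IP is already present.
allow_sshd() {
    file=$1 label=$2 ip=$3
    # Escape the dots so the address is matched literally.
    pattern=$(printf '%s' "$ip" | sed 's/\./\\./g')
    if grep -Eq "^sshd:[[:space:]]*${pattern}[[:space:]]*:" "$file" 2>/dev/null; then
        echo "$ip is already allowed in $file"
        return 0
    fi
    printf '# %s\nsshd: %s: allow\n' "$label" "$ip" >> "$file"
}
```

A call such as allow_sshd /etc/hosts.allow laptop 192.0.2.7 (a placeholder address) would need to be run as root; backing up the file first is a sensible precaution.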

Services

Linux provides certain services for users, such as web and ssh. They can be started, stopped, restarted, or queried with:
$ /etc/init.d/service_name [start/stop/restart/status] 
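
When checking several services at once, the general form above can be wrapped in a loop. A minimal sketch: the function name service_status and the INIT_DIR variable are hypothetical conveniences, and the sketch assumes each init script accepts a status argument.

```shell
# INIT_DIR can be overridden for testing; it defaults to the path used above.
INIT_DIR=${INIT_DIR:-/etc/init.d}

# service_status NAME...: run "status" for each named init script and
# report any script that is not present on this host.
service_status() {
    for svc in "$@"; do
        if [ -x "$INIT_DIR/$svc" ]; then
            "$INIT_DIR/$svc" status
        else
            echo "$svc: no init script in $INIT_DIR"
        fi
    done
}
```

For example, service_status apache2 ssh gmond would report on three of the services listed below, and flag any that are missing on the current host.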

Web via apache2 server 
$ /etc/init.d/apache2
(josquin/byrd/palestrina) 

$ /etc/init.d/httpd
(victoria) 

SSH (Secure shell) 
$ /etc/init.d/ssh 

Nodewatch 
$ /etc/init.d/ssh 

Ganglia 
$ /etc/init.d/gmond  
(nodes)

$ /etc/init.d/gmetad
(master)

Pathscale subscription server

There is only one seat for the PathScale compiler suite, and encina serves as the license server. The license file is /opt/pathscale/lib/3.2/pscsubscription-7104.xml.

$ /etc/init.d/pathsub
(only on encina)