Vallard's Tech Notes
Enterprise Datacenter Management Voodoo
Sep 18th
Just got my MacBook Pro. Here’s what I install on it:
- iphone SDK
- xcode
- mac ports
- Firefox!
- VMware Fusion
Will update this list as I get more acquainted with this guy.
Sep 17th
I’ve been playing on site with a customer’s x3650 M2, getting it ready to install GPFS. There are a few notes on this machine:
First impression is that this thing boots like a p-Series machine. In other words, it takes forever to boot! First, after you plug it in, you have to wait about 3 minutes for it to become active. Then, once you power it on, you need to wait a long time for it to boot up. (time this)
I found myself staring at the boot screen for a long time.
Since the machine was plugged into a QLogic HBA and had all sorts of paths, it took even longer to boot. I’m not talking about the Linux part where init starts walking through everything. I’m talking about before we get there. AND, when I got there, the local hard drive could not be found!!!
Answer: It was a bug in the QLogic firmware. So we had to update it. That was not easy nor was it fun.
The package you download to do it is here:
http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/Product_detail.aspx?oemid=376
We got 2.04af at this time.
But how to update? It would be nice if they had a bin image that ran under Linux, but that wasn’t the case. The solution was to install SANsurfer and update it that way.
Looking for the uEFI shell? It’s not there. Too dangerous, so IBM doesn’t ship it. This is probably a good idea given the support nightmare you could cause yourself, but not when 3rd party tools tell you to boot up into the uEFI shell. That just causes more confusion.
After uEFI was updated, boot time improved. But I think we could do better here.
The silver lining? This is one solid box. Performance is outstanding and a good quality piece of work.
Sep 16th
I have a cluster where GPFS is running over the Gb Ethernet. To make GPFS go over InfiniBand I do the following:
mmshutdown -a
mmchnode --daemon-interface=compute001-ib0 -N compute001
That did it for one node. To do it for all nodes:
for i in $(nodels compute); do mmchnode --daemon-interface=$i-ib0 -N $i; done
Alternatively, I could have just made a spec file called /tmp/foo with the contents:
s01 --daemon-interface=s01-ib0
s02 --daemon-interface=s02-ib0
Then run:
mmchnode -S /tmp/foo
Once finished, I run mmlscluster and see that all the nodes that I wanted are now communicating over the InfiniBand. GPFS is so easy!
Then, start things back up:
mmstartup -a
And you’re off to the races.
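For a big cluster, typing that spec file by hand gets old. Here is a rough sketch of generating it, assuming every node's InfiniBand hostname is the node name with -ib0 appended, as in the examples above. The function name make_ib_spec is made up for illustration.

```shell
#!/bin/sh
# Sketch: print one mmchnode spec line per node name given as an argument.
# Assumes the InfiniBand hostname is "<node>-ib0" (the convention used above).
make_ib_spec() {
    for node in "$@"; do
        printf '%s --daemon-interface=%s-ib0\n' "$node" "$node"
    done
}

# Example (nodels is the xCAT command used in the loop above):
#   make_ib_spec $(nodels compute) > /tmp/foo
#   mmchnode -S /tmp/foo
```

This only builds the file; you still shut GPFS down first and run mmchnode -S yourself.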
Sep 14th
For some reason I’ve had to do a lot of OS updates lately…
SLES 10 SP1 is 2.6.16.53-0.16 kernel for my ppc nodes.
SLES 10 SP2 is 2.6.16.60-0.21
I did my update via yast.
This URL had some helpful info:
http://www.novell.com/support/viewContent.do?externalId=7000387
Trick: You have to create a symbolic link:
cd /install/sles10.2/ppc/1/patches
ln -s ../suse .
Then you put this directory in yast:
http://mgmt/install/sles10.2/ppc/1/patches
Here is the yast configuration portion on the nodes you are updating:
Software -> Installation Source -> Add -> specify URL -> http://<mgmt>/install/sles10.2/ppc64/1/patches
You’ll also need to add the SLES 10 SP2 base DVD in there as well. Do the above, but put in http://<mgmt>/install/sles10.2/ppc64/1/ (no patches)
If you get that then you’ll have to accept the license.
Since mine had the old SP1 in the installation source, I got rid of that. The easiest way was to use zypper:
c670e2p6:~ # zypper sl
# | Enabled | Refresh | Type | Name | URI
--+---------+---------+------+---------------------------------------+------------------------------------------------
1 | Yes | Yes | YUM | SUSE_SLES_SP2-10.2-18-20090915-023741 | http://c670ep1/install/sles10.2/ppc64/1/patches
2 | Yes | Yes | YaST | SUSE Linux Enterprise Server 10 SP1 | http://9.114.95.67/install/sles10.1/ppc64/1
3 | Yes | Yes | YaST | SUSE Linux Enterprise Server 10 SP2 | http://c670ep1/install/sles10.2/ppc64/1
I removed the #2 entry: zypper sd 2
From there I went back into yast and did Software -> Online Update
After stumbling through install directories a few times I finally was able to update everything. I think yast is non-optimal and I’m a much bigger fan of Yum.
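If you have to do this on a lot of nodes, you can pick the stale source number out of the zypper sl table instead of eyeballing it. A minimal sketch, assuming the pipe-separated column layout shown above (other zypper versions may print something different). The function name source_ids_matching is made up.

```shell
#!/bin/sh
# Sketch: read a `zypper sl` table on stdin and print the "#" column of every
# row whose Name column (5th field) contains the given text.
source_ids_matching() {
    awk -F'|' -v pat="$1" '
        # Only data rows start with a number; this skips header and separator.
        $1 ~ /^ *[0-9]+ *$/ && index($5, pat) > 0 {
            gsub(/ /, "", $1)   # strip the padding around the number
            print $1
        }'
}

# Example: zypper sl | source_ids_matching "SP1"
# Check the result by hand before passing it to `zypper sd`.
```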
Sep 11th
Today I compiled NWChem on my IBM Intel InfiniBand cluster. I have no idea if my performance is optimal, but I do know that it works.
Here is the secret to my success:
First, you have to set some path variables to get things running. This is set in my home directory:
cat ~/.bashrc
INTELCCROOT=/home/appls/compilers/intel/11.0/083
INTELFCROOT=/home/appls/compilers/intel/11.0/081
PGROOT=/home/appls/compilers/pgi
PGCC=$PGROOT/linux86-64/8.0-5/bin
PGFLEXLM=$PGROOT/linux86-64/8.0/bin
LM_LICENSE_FILE=$LM_LICENSE_FILE:$PGROOT/license.dat
#MPI_HOME=/home/appls/openmpi/gcc
#MPI_HOME=/home/appls/openmpi/pgi
MPI_HOME=/home/appls/openmpi/intel
NWCHEM=/home/appls/QChem/NWChem/intel/bin
PATH=$PATH:$MPI_HOME/bin:$HOME/bin:$INTELCCROOT/bin/intel64:$INTELFCROOT/bin/intel64:$PGCC:$PGFLEXLM:$NWCHEM
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$MPI_HOME/lib:$INTELCCROOT/lib/intel64:$INTELFCROOT/lib/intel64
Unfortunately, I’m leaving out a lot of the details here, but I used the Intel compilers and had previously built openmpi. The way I built openmpi was:
./configure --prefix /home/appls/openmpi/intel CC=icc CXX=icpc F77=ifort FC=ifort
make -j8
make install
http://wiki.cse.ucdavis.edu/support:hpc:software:nwchem
http://www.mcsr.olemiss.edu/appssubpage.php?pagename=nwchem.inc
So, after I got that set up, I did the normal:
tar zxvf nwchem-5.1.tar.gz
Then I made a script that pretty much did all the work. The script is called makeit.sh:
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=/home/vallard/qchem/nwchem-5.1/
export NWCHEM_TARGET=LINUX64
export USE_MPI=y
export USE_MPIF=y
export MPI_LOC=/home/appls/openmpi/intel
export MPI_LIB=$MPI_LOC/lib
export LIBMPI="-L $MPI_LIB -lmpi -lopen-pal -lopen-rte -lmpi_f90 -lmpi_f77"
export MPI_INCLUDE=$MPI_LOC/include
export ARMCI_NETWORK=OPENIB
export LARGE_FILES
export NWCHEM_MODULES=all
export FC=ifort
export CC=icc
cd $NWCHEM_TOP/src
make CC=icc FC=ifort -j4
After kicking that off, it ran for almost 20 minutes compiling! Forever! I saw a lot of vector loop messages, but gave them no heed, and fearlessly pressed forward.
After it was done compiling, I did the following as root:
[root@mgt vallard]# export NWCHEM_TOP=/home/vallard/qchem/nwchem-5.1/
[root@mgt vallard]# mkdir $NWCHEM/bin
[root@mgt vallard]# mkdir $NWCHEM/data
[root@mgt vallard]# cp /home/vallard/qchem/nwchem-5.1/bin/LINUX64/nwchem $NWCHEM/bin
[root@mgt vallard]# cp /home/vallard/qchem/nwchem-5.1/bin/LINUX64/depend.x $NWCHEM/bin/
[root@mgt vallard]# cd $NWCHEM_TOP/src/basis
[root@mgt basis]# cp -r libraries $NWCHEM/data/
[root@mgt basis]# cd $NWCHEM_TOP/src/
[root@mgt src]# cp -r data $NWCHEM
[root@mgt src]# cd $NWCHEM_TOP/src/nwpw/libraryps
[root@mgt libraryps]# cp -r pspw_default $NWCHEM/data/
[root@mgt libraryps]# cp -r paw_default/ $NWCHEM/data/
[root@mgt libraryps]# cp -r TM $NWCHEM/data/
[root@mgt libraryps]# cp -r HGH_LDA $NWCHEM/data/
That got everything in place. When done, I went to my compute node and tried one of the examples:
cd /home/vallard/qchem/nwchem-5.1/examples/dirdyvtst/h3
mpirun -np 32 -machinefile machinefile nwchem h3tr2.nw
After that it all seemed to work. Any optimization info would be great! Thanks
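I didn't show the machinefile that mpirun reads above. A minimal sketch of building one, assuming the Open MPI hostfile format of one "host slots=N" line per node. The node names and the 8 slots per node in the example are made up (4 nodes x 8 slots would cover -np 32), and make_machinefile is just an illustrative name.

```shell
#!/bin/sh
# Sketch: print an Open MPI style hostfile.
# First argument: slots per node. Remaining arguments: node names.
make_machinefile() {
    slots=$1
    shift
    for node in "$@"; do
        printf '%s slots=%s\n' "$node" "$slots"
    done
}

# Example with hypothetical node names:
#   make_machinefile 8 node01 node02 node03 node04 > machinefile
```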
Sep 11th
One of the issues I’ve run into many times: when you kickstart a storage node attached to Fibre Channel, the installer can pick a SAN LUN as /dev/sda and start installing on it. I blew away a huge storage partition by doing this by accident. Yikes! When I took my RHCE class the instructor just looked bewildered and had no idea how to solve it. The solution is obvious: remove the HBA drivers during the %pre script. Here is how to do it:
%pre
#!/bin/sh
# This will remove the loaded HBA modules from the kernel
remove_qla(){
for i in $(lsmod | grep qla | awk '{print $1}'); do
echo Will remove: $i >> /dev/tty1
rmmod $i
sleep 1
done
}
remove_lpfc(){
for i in $(lsmod | grep lpfc | awk '{print $1}'); do
echo Will remove: $i >> /dev/tty1
rmmod $i
sleep 1
done
}
remove_qla
sleep 2
remove_qla
remove_lpfc
This script comes from http://communities.vmware.com/message/1272854#1272854
5 years later, a solution emerges. It’s been there all along, but nobody I knew could solve it.
Many people are paranoid anyway and will not use this because like the thread states: There’s nothing worse than blowing away a LUN. And yes, I’ve done it. At the US Army Laboratories of all places!
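One thing to watch in the script above: grep qla matches anywhere on the lsmod line, so a module that merely lists qla2xxx in its "Used by" column gets picked up too. Here is a rough sketch of a stricter filter that only looks at the module name column. The function name hba_modules is made up, and this is untested inside a real %pre environment.

```shell
#!/bin/sh
# Sketch: read lsmod-style output on stdin and print the names of modules
# whose name starts with one of the prefixes given as arguments.
hba_modules() {
    awk -v prefixes="$*" '
        BEGIN { n = split(prefixes, p, " ") }
        NR > 1 {                      # skip the "Module Size Used by" header
            for (i = 1; i <= n; i++)
                if (index($1, p[i]) == 1) { print $1; break }
        }'
}

# Possible use in %pre:
#   for m in $(lsmod | hba_modules qla lpfc); do rmmod "$m"; done
```

Modules with dependencies may still need a second pass, which is why the original script calls remove_qla twice.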
Sep 8th
After wasting two days trying to figure out why my x3650 M2 would NOT boot off the hard drive when the fiber connections were on my QLogic HBA, I searched my IBM help list. As usual, the internal labs were not very helpful because they didn’t have the equipment. The technical community of IT specialists was. The issue is that the QLogic HBAs need a firmware upgrade to deal with uEFI. After updating this, the machine boots. I had two accounts with this issue.
Sep 4th
Many times I see a cluster set up with xCAT with a split brain idea of how the /etc/resolv.conf file is to be set up on the head node.
People say: I want the head node to connect externally, so I need to have the nameservers in that file point to the external name servers.
This is sound logic, but then the node can’t resolve the IP addresses in the internal network of the cluster. Sometimes, I see people say the way to get around this is to put the head node in /etc/resolv.conf as well. But this just doesn’t work quite right.
The way that works best is to do the following:
1. In /etc/resolv.conf place ONLY the management server and the cluster domain:
search cluster.net
nameserver 10.0.0.1
Note: Make sure that this domain ‘cluster.net’ matches what’s in your site table!!!
2. In the site table, add the EXTERNAL name servers to the forwarders:
'forwarders','9.0.2.1,9.0.3.1'
3. Run makedns
4. Run service named restart
Huzzah! You will now be able to resolve everywhere!
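A quick sanity check for step 1, as a sketch: confirm that the file lists the management server and nothing else as a nameserver. The function name only_nameserver_is is made up, and 10.0.0.1 is just the example address from above.

```shell
#!/bin/sh
# Sketch: succeed only if the resolv.conf-style file ($1) lists exactly one
# nameserver and it equals the expected address ($2).
only_nameserver_is() {
    file=$1
    expected=$2
    # Collect every nameserver entry; more than one yields a multi-line string.
    found=$(awk '$1 == "nameserver" { print $2 }' "$file")
    [ "$found" = "$expected" ]
}

# Example:
#   only_nameserver_is /etc/resolv.conf 10.0.0.1 || echo "unexpected nameservers"
```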
Sep 1st
The state of my iPhone is good. I’m super happy with it, and it’s by far the best phone I’ve ever had. I can’t say I’ve ever had a BlackBerry, but from what I’ve seen from the competition I have not been impressed. So I remain an Apple fan, and I don’t think the Android phone is ever going to beat it. The iPhone is to smart phones as Disneyland is to amusement parks. There are some that have compelling offerings, like open source, an open market, or a faster/bigger roller coaster. But having the business model where the entire experience is completely controlled is what makes Apple work, and probably the reason they are cringing at AT&T every day.
Since I’ve gone to the iPhone 3.0.1 I’ve stopped jail breaking my phone. The only reason I had jailbroken the phone in the past was because I wanted tethering, video, and Mike Tyson’s Punch-Out.
The video now comes with the iPhone 3GS, but on my iPhone 3G, I don’t have such luxury. But it turns out I didn’t use it all that much. So no big loss. Also the footage I came out with wasn’t all that great. I guess I don’t have that exciting of a life. So no big deal losing that.
Tethering is the act of using your iPhone as a modem for your PC to get you online. I used to use PdaNet from the jailbreak stuff. But my friend recently showed me the link below:
http://help.BenM.at
From here it was easy to get tethering. I opened the above URL from my iPhone, followed the prompts, and now my iPhone can get my computer online and I can do work.
As far as Mike Tyson’s Punch-Out, I think I’ll be content after having made it to Bald Bull 2. So I no longer have an NES emulator on my iPhone.
My top apps:
The last 3 are ones that my friend just introduced me to.
Aug 29th
In CSS I always forget how to make a box expand with the contents inside of it, floated elements included. The secret, which I found out last year, is to put overflow: auto on the box.
Example:
content {
background-color: #fff;
width: 960px;
margin: 0px auto;
overflow: auto;
text-align: left;
}
The other trick is margin: 0px auto. Combined with a fixed width, that centers the element horizontally on the page. This is the standard container CSS that I use.