Gridengine and CentOS 7

… there’s life in the old dog yet!

We are still using the Gridengine on some of our high performance clusters, and getting that thing running isn’t really a piece of cake. Since Oracle bought Sun, things have changed a little bit: first of all, the good old (TM) Sun Grid Engine doesn’t exist anymore. There are some clones of it, the most promising candidate probably being the Son of Grid Engine project. This is also what I will refer to as gridengine henceforth. Noteworthy, but not covered here, are the Open Grid Scheduler and the commercial Univa Grid Engine (I’m not linking there), which is just the old Sun Grid Engine, but sold to Univa and commercially distributed.

In the Debian world, there is a gridengine deb package, which just works as it should. There was an el6 port for CentOS 6, but there is nothing official for CentOS 7 (yet?). I’ve built the packages myself and everyone is free to use them. They are provided as-is, so there is no support or warranty of any kind. That said, they should work just fine as they are.

Building the Son of Grid Engine

The process was difficult enough to make me fork the repository and set up my own GitHub project. My fork contains fixes for two bugs that prevented the original source from building.
The project also contains build instructions in the README.md for OpenSUSE 15 and CentOS 7, and pre-compiled rpms in the releases section.

Short notes about building

The Gridengine comes with its own build tool, called aimk. One can say a lot about it, but if treated correctly it works okayish. The list of requirements is long and is given in the README.md for CentOS 7 and OpenSUSE 15. Hopefully it also works for other versions.

SGE uses a lot of different libraries. Mixing architectures within a single cluster is in general a bad idea: SGE might work, but you really don’t want the white hairs that come with all the unpredictable and sometimes hard-to-understand voodoo errors that follow. Just … don’t do that!

I never used the Hadoop build, so all binaries are built and tested with -no-herd.

For the impatient (not commented)

git clone https://github.com/grisu48/gridengine.git
cd gridengine/sge-8.1.9/source
./scripts/bootstrap.sh
./aimk -no-herd -no-java
# Create /opt/sge first if it does not exist yet (mkdir /opt/sge)
sudo SGE_ROOT="/opt/sge" scripts/distinst -local -allall -noexit       # asks for confirmation
export SGE_ROOT="/opt/sge"
cd $SGE_ROOT

./install_qmaster        # On the Master Host
./install_execd          # On the execution host (Compute node)

Notes about installing the Gridengine

I’ve tried to automate the install with ansible, but the install_execd -auto script proved to be quite unreliable. After several failed attempts, I decided to install the Gridengine manually from a shared NFS directory.

This is in general a good idea, as the spool directory needs to be on an NFS share anyway. To prevent trouble I have separated the binaries (read-only NFS) from the spool directory (read-write access for all nodes).
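The read-only/read-write split could be expressed on the NFS server roughly as follows. This is only a sketch: the helper name sge_exports, the paths /opt/sge and /srv/sge-spool and the network 10.0.0.0/24 are assumptions for illustration, not the values from my setup. The export options (ro, rw, sync, no_subtree_check) are standard options of the Linux NFS server.

```shell
#!/bin/sh
# Print two /etc/exports entries: binaries read-only, spool read-write.
# Arguments: <binary dir> <spool dir> <client network>
# All three values are placeholders and must be adapted to the cluster.
sge_exports() {
    if [ "$#" -ne 3 ]; then
        echo "usage: sge_exports <bin_dir> <spool_dir> <clients>" >&2
        return 2
    fi
    bin_dir="$1"
    spool_dir="$2"
    clients="$3"
    # Binaries: no node should be able to modify them.
    printf '%s %s(ro,sync,no_subtree_check)\n' "$bin_dir" "$clients"
    # Spool: every node needs write access.
    printf '%s %s(rw,sync,no_subtree_check)\n' "$spool_dir" "$clients"
}
```

Calling `sge_exports /opt/sge /srv/sge-spool 10.0.0.0/24` prints the two lines, which can then be reviewed and appended to /etc/exports by hand, followed by `exportfs -ra` to reload the export table.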

I’ve tried to mix CentOS and OpenSUSE. The Gridengine daemons talk to each other just fine, but you will run into other problems because the execution environments differ. Don’t do that!

Running SGE over NFS is the way I recommend. Be aware of the hassle when the master node becomes unresponsive. In that case, don’t try magic tricks, just reboot the nodes. Everything else is fishy.

Known problems with Son of Grid engine

This section is dedicated to documenting two bugs and making them appear on Google, so that other unfortunate beings who encounter the same problems can find a solution. I’ve encountered two errors when trying to build the original 8.1.9 version.

../sh.proc.c:153:16: error: storage size of ‘w’ isn’t known
     union wait w

This problem was the reason for me to fork it. Comment out line 51 in sge-8.1.9/source/3rdparty/qtcsh/sh.proc.c

#if defined(_BSD) || (defined(IRIS4D) && __STDC__) || defined(__lucid) || defined(linux) || defined(__GNU__) || defined(__GLIBC__)
  //# define BSDWAIT   // Fix for "../sh.proc.c:153:16: error: storage size of ‘w’ isn’t known"
#endif /* _BSD || (IRIS4D && __STDC__) || __lucid || glibc */
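If you would rather patch the original 8.1.9 sources yourself instead of using my fork, commenting out the define can be scripted. The following is a minimal sketch, not the way the fix was applied in the fork: the function name disable_bsdwait is made up for this example, and it matches the define by its text rather than by line number, on the assumption that `# define BSDWAIT` appears only in that one place in the file.

```shell
#!/bin/sh
# Comment out the "# define BSDWAIT" line in the given C source file.
# Running it a second time changes nothing, because an already
# commented line no longer starts with '#'.
disable_bsdwait() {
    file="$1"
    tmp="$(mktemp)" || return 1
    # Prefix the preprocessor line with '//' and keep everything else.
    if sed 's|^\([[:space:]]*#[[:space:]]*define[[:space:]][[:space:]]*BSDWAIT\)|// \1|' \
        "$file" > "$tmp"; then
        # Write back through cat so the file keeps its permissions.
        cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Usage from the repository root would be `disable_bsdwait sge-8.1.9/source/3rdparty/qtcsh/sh.proc.c`. Check the result with `grep -n BSDWAIT` on that file before starting the build.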

undefined reference to tputs, tgoto, etc.

I encountered this error when building as root. Try building as an unprivileged user (which you should do anyway!)
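To avoid running into this by accident, a build wrapper can refuse to start as root. This is just a small sketch; the function name require_unprivileged is made up for this example, and the optional argument exists only so the check can be tried out without actually being root.

```shell
#!/bin/sh
# Return non-zero when the (given or current) user id is 0, i.e. root.
# Argument: optional numeric uid, defaults to the current user's uid.
require_unprivileged() {
    uid="${1:-$(id -u)}"
    if [ "$uid" -eq 0 ]; then
        echo "refusing to build as root, switch to a regular user" >&2
        return 1
    fi
    return 0
}
```

In a build script this could be used as `require_unprivileged && ./aimk -no-herd -no-java`, so that aimk only runs for a regular user.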

Mirrors

I am mirroring the current version of Son of Grid engine on my ftp-server. My own fork is in the GitHub repository gridengine.