diff --git a/doc/jsspp/abstract.tex b/doc/jsspp/abstract.tex
new file mode 100644
index 0000000000000000000000000000000000000000..2cfcbfec0c413f83fcaab014215d56840bcd0e8d
--- /dev/null
+++ b/doc/jsspp/abstract.tex
@@ -0,0 +1,12 @@
+\begin{abstract}
+This paper presents a new cluster resource management system called
+Simple Linux Utility for Resource Management (SLURM). SLURM, initially
+developed for large Linux clusters at the Lawrence Livermore National
+Laboratory (LLNL), is a simple cluster manager that can scale to
+thousands of processors. SLURM is designed to be flexible and
+fault-tolerant and can be ported to clusters of other sizes and
+architectures with minimal effort. We expect SLURM to benefit both
+users and system architects by providing them with a simple, robust,
+and highly scalable parallel job execution environment for their
+cluster system.
+\end{abstract}
diff --git a/doc/jsspp/architecture.tex b/doc/jsspp/architecture.tex
new file mode 100644
index 0000000000000000000000000000000000000000..bc84b2e616d83b9518c81ab63071c97294874961
--- /dev/null
+++ b/doc/jsspp/architecture.tex
@@ -0,0 +1,170 @@
+\section{SLURM Architecture}
+
+As a cluster resource manager, SLURM has three key functions. First,
+it allocates exclusive and/or non-exclusive access to resources to
+users for some duration of time so they can perform work. Second, it
+provides a framework for starting, executing, and monitoring work
+on the set of allocated nodes. Finally, it arbitrates
+conflicting requests for resources by managing a queue of pending work.
+Users and system administrators interact with SLURM using simple commands.
+
+%Users interact with SLURM through four command line utilities:
+%\srun\ for submitting a job for execution and optionally controlling it
+%interactively,
+%\scancel\ for early termination of a pending or running job,
+%\squeue\ for monitoring job queues, and
+%\sinfo\ for monitoring partition and overall system state.
+%System administrators perform privileged operations through an additional
+%command line utility: {\tt scontrol}.
+%
+%The central controller daemon, {\tt slurmctld}, maintains the global state
+%and directs operations.
+%Compute nodes simply run a \slurmd\ daemon (similar to a remote shell
+%daemon) to export control to SLURM.
+%
+%SLURM is not a sophisticated batch system.
+%In fact, it was expressly designed to provide high-performance
+%parallel job management while leaving scheduling decisions to an
+%external entity as will be described later.
+
+\begin{figure}[tb]
+\centerline{\epsfig{file=figures/arch.eps,scale=0.40}}
+\caption{SLURM Architecture}
+\label{arch}
+\end{figure}
+
+Figure~\ref{arch} depicts the key components of SLURM:
+a \slurmd\ daemon running on each compute node, a central \slurmctld\
+daemon running on a management node (with an optional fail-over twin),
+and five command line utilities,
+% {\tt srun}, {\tt scancel}, {\tt sinfo}, {\tt squeue}, and {\tt scontrol},
+which can run anywhere in the cluster.
+
+The entities managed by these SLURM daemons are {\em nodes}, the
+compute resource in SLURM; {\em partitions}, which group nodes into
+disjoint logical sets; {\em jobs}, or allocations of resources assigned
+to a user for a specified amount of time; and {\em job steps}, which
+are sets of tasks within a job.
+Each job is allocated nodes within a single partition.
+Once a job is assigned a set of nodes, the user is able to initiate
+parallel work in the form of job steps in any configuration within
+the allocation. For instance, a single job step may be started that
+utilizes all nodes allocated to the job, or several job steps may
+independently use a portion of the allocation.
+
+%\begin{figure}[tcb]
+%\centerline{\epsfig{file=figures/entities.eps,scale=0.7}}
+%\caption{SLURM Entities}
+%\label{entities}
+%\end{figure}
+%
+%Figure~\ref{entities} further illustrates the interrelation of these
+%entities as they are managed by SLURM. The diagram shows a group of
+%compute nodes split into two partitions. Partition 1 is running one
+%job, with one job step utilizing the full allocation of that job.
+%The job in Partition 2 has only one job step using half of the original
+%job allocation.
+%That job might initiate additional job step(s) to utilize
+%the remaining nodes of its allocation.
+
+\begin{figure}[tb]
+\centerline{\epsfig{file=figures/slurm-arch.eps,scale=0.5}}
+\caption{SLURM Architecture - Subsystems}
+\label{archdetail}
+\end{figure}
+
+Figure~\ref{archdetail} exposes the subsystems that are implemented
+within the \slurmd\ and \slurmctld\ daemons. These subsystems
+are explained in more detail below.
+
+\subsection{SLURM Local Daemon (Slurmd)}
+
+The \slurmd\ is a multi-threaded daemon running on each compute node.
+It reads the common SLURM configuration file,
+notifies the controller that it is active, waits for work,
+executes the work, returns status, and then waits for more work.
+Since it initiates jobs for other users, it must run with root privilege.
+It also asynchronously exchanges node and job status information with
+{\tt slurmctld}.
+The only job information it has at any given time pertains to its
+currently executing jobs.
+The \slurmd\ performs four major tasks.
+
+\begin{itemize}
+\item {\em Machine and Job Status Services}: Respond to controller
+requests for machine and job state information, and send asynchronous
+reports of some state changes (e.g. \slurmd\ startup) to the controller.
+
+\item {\em Remote Execution}: Start, monitor, and clean up after a set
+of processes (typically belonging to a parallel job) as dictated by the
+\slurmctld\ daemon or an \srun\ or \scancel\ command. Starting a process
+may include executing a prolog program, setting process limits, setting
+real and effective user IDs, establishing environment variables, setting
+the working directory, allocating interconnect resources, setting core
+file paths, initializing the Stream Copy Service, and managing
+process groups. Terminating a process may include terminating all members
+of a process group and executing an epilog program.
+
+\item {\em Stream Copy Service}: Allow handling of stderr, stdout, and
+stdin of remote tasks. Job input may be redirected from a file or files,
+an \srun\ process, or /dev/null. Job output may be saved into local files
+or sent back to the \srun\ command. Regardless of the location of stdout
+or stderr, all job output is locally buffered to avoid blocking local
+tasks.
+
+\item {\em Job Control}: Allow asynchronous interaction with the
+Remote Execution environment by propagating signals or explicit job
+termination requests to any set of locally managed processes.
+
+\end{itemize}
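+
+The control flow described above amounts to a simple service loop. The
+C sketch below summarizes it; every function name is a descriptive
+placeholder for the behavior named in the text, not the actual
+\slurmd\ internals:
+
+\begin{verbatim}
+/* Schematic slurmd service loop.  Every function here is a
+ * descriptive placeholder for behavior described in the text,
+ * not SLURM's real internals. */
+void read_configuration(void);        /* common SLURM config file */
+void register_with_controller(void);  /* tell slurmctld we're up  */
+int  wait_for_work(void);             /* blocks for a request     */
+int  execute_work(int request);       /* e.g. start a job step    */
+void report_status(int status);       /* send result to slurmctld */
+
+int main(void)
+{
+    read_configuration();
+    register_with_controller();
+    for (;;) {                        /* wait, execute, report    */
+        int request = wait_for_work();
+        int status  = execute_work(request);
+        report_status(status);
+    }
+}
+\end{verbatim}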
+
+\subsection{SLURM Central Daemon (Slurmctld)}
+
+Most SLURM state information is maintained by the controller,
+{\tt slurmctld}.
+The \slurmctld\ is multi-threaded with independent read and write locks
+for the various data structures to enhance scalability.
+When \slurmctld\ starts, it reads the SLURM configuration file.
+It also can read additional state information
+from a checkpoint file generated by a previous execution of {\tt slurmctld}.
+Full controller state information is written to
+disk periodically, with incremental changes written to disk immediately
+for fault tolerance.
+The \slurmctld\ runs in either master or standby mode, depending on the
+state of its fail-over twin, if any.
+The \slurmctld\ need not execute with root privilege.
+%In fact, it is recommended that a unique user entry be created for
+%executing \slurmctld\ and that user must be identified in the SLURM
+%configuration file as {\tt SlurmUser}.
+The \slurmctld\ consists of three major components:
+
+\begin{itemize}
+\item {\em Node Manager}: Monitors the state of each node in
+the cluster. It polls {\tt slurmd}'s for status periodically and
+receives state change notifications from \slurmd\ daemons asynchronously.
+It ensures that nodes have the prescribed configuration before being
+considered available for use.
+
+\item {\em Partition Manager}: Groups nodes into non-overlapping sets
+called {\em partitions}. Each partition can have associated with it
+various job limits and access controls. The partition manager also
+allocates nodes to jobs based upon node and partition states and
+configurations. Requests to initiate jobs come from the Job Manager.
+The \scontrol\ command may be used to administratively alter node and
+partition configurations.
+
+\item {\em Job Manager}: Accepts user job requests and places pending
+jobs in a priority-ordered queue.
+The Job Manager is awakened on a periodic basis and whenever there
+is a change in state that might permit a job to begin running, such
+as job completion, job submission, partition-up transition,
+node-up transition, etc. The Job Manager then makes a pass
+through the priority-ordered job queue. The highest priority jobs
+for each partition are allocated resources if possible. As soon as an
+allocation failure occurs for any partition, no lower-priority jobs for
+that partition are considered for initiation.
+After completing the scheduling cycle, the Job Manager's scheduling
+thread sleeps. Once a job has been allocated resources, the Job Manager
+transfers necessary state information to those nodes, permitting the
+job to commence execution. When the Job Manager detects that
+all nodes associated with a job have completed their work, it initiates
+clean-up and performs another scheduling cycle as described above.
+
+\end{itemize}
diff --git a/doc/jsspp/interaction.tex b/doc/jsspp/interaction.tex
new file mode 100644
index 0000000000000000000000000000000000000000..9f0246293d54b29b6efd4f9c810d67110df171c2
--- /dev/null
+++ b/doc/jsspp/interaction.tex
@@ -0,0 +1,231 @@
+\section{Scheduling Infrastructure}
+
+Scheduling parallel computers is a very complex matter.
+Several good public domain schedulers exist, the most
+popular being the Maui Scheduler\cite{Jackson2001,Maui2002}.
+The scheduler used at our site, DPCS\cite{DPCS2002}, is quite
+sophisticated and has over 150,000 lines of code.
+We felt no need to address scheduling issues within SLURM, but
+have instead developed a resource manager with a rich set of
+application programming interfaces (APIs) and the flexibility
+to satisfy the needs of others working on scheduling issues.
+SLURM's default scheduler implements First-In First-Out (FIFO).
+An external entity can establish a job's initial priority
+through a plugin.
+An external scheduler may also submit, signal, hold, reorder,
+and terminate jobs via the API.
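+
+As a rough illustration, the C sketch below shows the shape of an
+external scheduler pass built on such an API. The {\tt sched\_*}
+names and structures are illustrative assumptions for this paper,
+not SLURM's published interface:
+
+\begin{verbatim}
+/* Sketch of one external scheduling pass: raise the priority
+ * of pending jobs in proportion to their queue wait time.
+ * All sched_* names stand in for SLURM's job query/control
+ * API; their exact names and signatures are assumptions. */
+#include <time.h>
+
+struct sched_job {
+    unsigned int job_id;
+    time_t       submit_time;
+    int          pending;   /* non-zero if awaiting resources */
+};
+
+extern int sched_load_jobs(struct sched_job **jobs, int *count);
+extern int sched_set_priority(unsigned int job_id,
+                              unsigned int priority);
+
+int schedule_pass(void)
+{
+    struct sched_job *jobs;
+    int i, count;
+    time_t now = time(NULL);
+
+    if (sched_load_jobs(&jobs, &count) != 0)
+        return -1;          /* controller unreachable */
+    for (i = 0; i < count; i++)
+        if (jobs[i].pending)    /* favor long-waiting jobs */
+            sched_set_priority(jobs[i].job_id,
+                (unsigned int)(now - jobs[i].submit_time));
+    return 0;
+}
+\end{verbatim}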
+
+\subsection{Resource Specification}
+
+The \srun\ command and corresponding API have a wide range of resource
+specifications available. The \srun\ resource specification options
+are described below.
+
+\subsubsection{Geometry Specification}
+
+These options describe how many nodes and tasks are needed as
+well as the distribution of tasks across the nodes.
+
+\begin{itemize}
+\item {\tt cpus-per-task=<number>}:
+Specifies the number of processors (cpus) required for each task
+(or process) to run.
+This may be useful if the job is multithreaded and requires more
+than one cpu per task for optimal performance.
+The default is one cpu per process.
+
+\item {\tt nodes=<number>[-<number>]}:
+Specifies the number of nodes required by this job.
+The node count may be either a specific value or a minimum and maximum
+node count separated by a hyphen.
+The partition's node limits supersede those of the job.
+If a job's node limits are completely outside of the range permitted
+for its associated partition, the job will be left in a PENDING state.
+The default is to allocate one cpu per process, such that nodes with
+one cpu will run one task, nodes with two cpus will run two tasks, etc.
+The distribution of processes across nodes may be controlled using
+this option along with the {\tt nprocs} and {\tt cpus-per-task} options.
+
+\item {\tt nprocs=<number>}:
+Specifies the number of processes to run.
+Specification of the number of processes per node may be achieved
+with the {\tt cpus-per-task} and {\tt nodes} options.
+The default is one process per node unless {\tt cpus-per-task}
+explicitly specifies otherwise.
+
+\end{itemize}
+
+\subsubsection{Constraint Specification}
+
+These options describe the configuration requirements of the nodes
+that can be used.
+
+\begin{itemize}
+
+\item {\tt constraint=list}:
+Specify a list of constraints. The list of constraints is
+a comma-separated list of features that have been assigned to the
+nodes by the SLURM administrator. If no nodes have the requested
+feature, then the job will be rejected.
+
+\item {\tt contiguous=[yes|no]}:
+Demand a contiguous range of nodes. The default is "yes".
+
+\item {\tt mem=<number>}:
+Specify a minimum amount of real memory per node (in megabytes).
+
+\item {\tt mincpus=<number>}:
+Specify a minimum number of cpus per node.
+
+\item {\tt partition=name}:
+Specifies the partition to be used.
+There will be a default partition specified in the SLURM
+configuration file.
+
+\item {\tt tmp=<number>}:
+Specify a minimum amount of temporary disk space per node
+(in megabytes).
+
+\item {\tt vmem=<number>}:
+Specify a minimum amount of virtual memory per node (in megabytes).
+
+\end{itemize}
+
+\subsubsection{Other Resource Specification}
+
+\begin{itemize}
+
+\item {\tt batch}:
+Submit in "batch mode."
+\srun\ will make a copy of the executable file (a script) and submit
+the request for execution when resources are available.
+\srun\ will terminate after the request has been submitted.
+The executable file will run on the first node allocated to the
+job and must contain \srun\ commands to initiate parallel tasks.
+
+\item {\tt exclude=[filename|node\_list]}:
+Request that a specific list of hosts not be included in the resources
+allocated to this job. The host list will be assumed to be a filename
+if it contains a "/" character. If some nodes are suspect, this option
+may be used to avoid using them.
+
+\item {\tt immediate}:
+Exit if resources are not immediately available.
+By default, the request will block until resources become available.
+
+\item {\tt nodelist=[filename|node\_list]}:
+Request a specific list of hosts. The job will contain at least
+these hosts. The list may be specified as a comma-separated list of
+hosts, a range of hosts (host[1-5,7,...] for example), or a filename.
+The host list will be assumed to be a filename if it contains a "/"
+character.
+
+\item {\tt overcommit}:
+Overcommit resources.
+Normally the job will not be allocated more than one process per cpu.
+By specifying this option, you are explicitly allowing more than one
+process per cpu.
+
+\item {\tt share}:
+The job can share nodes with other running jobs. This may result in
+faster job initiation and higher system utilization, but lower
+application performance.
+
+\item {\tt time=<number>}:
+Establish a time limit to terminate the job after the specified number
+of minutes. If the job's time limit exceeds the partition's time limit,
+the job will be left in a PENDING state. The default value is the
+partition's time limit. When the time limit is reached, the job's
+processes are sent SIGXCPU followed by SIGKILL. The interval between
+signals is configurable.
+
+\end{itemize}
+
+All parameters may be specified using single-letter abbreviations
+("-n 4" instead of "--nprocs=4").
+Environment variables can also be used to specify many parameters.
+Environment variables will be set to the actual number of nodes and
+processors allocated.
+In the event that the node count specification is a range, the
+application can inspect these environment variables to scale the
+problem appropriately.
+To request four processes with one cpu per task, the command line would
+look like this: {\em srun --nprocs=4 --cpus-per-task=1 hostname}.
+Note that if multiple resource specifications are provided, resources
+will be allocated so as to satisfy all specifications.
+For example, a request with the specification {\tt nodelist=dev[0-1]}
+and {\tt nodes=4} may be satisfied with nodes {\tt dev[0-3]}.
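+
+To make these options concrete, a few example \srun\ invocations
+composed from the options above are shown below. The program names,
+node names, and the "bigtmp" feature are arbitrary examples:
+
+\begin{verbatim}
+# 8 processes with 2 cpus per task on 4 to 8 nodes,
+# 30 minute time limit
+srun --nprocs=8 --cpus-per-task=2 --nodes=4-8 --time=30 a.out
+
+# nodes must have 1024 MB of real memory and the
+# administrator-assigned "bigtmp" feature
+srun --nodes=16 --mem=1024 --constraint=bigtmp a.out
+
+# avoid two suspect nodes; fail unless resources are free now
+srun --nodes=4 --exclude=dev[3,5] --immediate a.out
+
+# copy a script and queue it for batch execution
+srun --batch --nodes=64 my_script
+\end{verbatim}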
+
+\subsection{The Maui Scheduler and SLURM}
+
+{\em The integration of the Maui Scheduler with SLURM was
+just beginning at the time this paper was written. Full
+integration is anticipated by the time of the conference.
+This section will be modified as needed based upon that
+experience.}
+
+The Maui Scheduler is integrated with SLURM through the
+previously described plugin mechanism.
+The previously described SLURM commands are used for
+all job submissions and interactions.
+When a job is submitted to SLURM, a Maui Scheduler module
+is called to establish its initial priority.
+Another Maui Scheduler module is called at the beginning
+of each SLURM scheduling cycle.
+Maui can use this opportunity to change priorities of
+pending jobs or take other actions.
+
+\subsection{DPCS and SLURM}
+
+DPCS is a meta-batch system designed for use within a single
+administrative domain (all computers have a common user ID
+space and exist behind a firewall).
+DPCS presents users with a uniform set of commands for a wide
+variety of computers and underlying resource managers (e.g.
+LoadLeveler on IBM SP systems, SLURM on Linux clusters, NQS,
+etc.).
+It was developed in 1991 and has been in production use since
+1992.
+While DPCS is confined to a single administrative domain and
+Globus\cite{Globus2002} has the ability to span administrative
+domains, both systems could interface with SLURM in a similar
+fashion.
+
+Users submit jobs directly to DPCS.
+The job consists of a script and an assortment of constraints.
+Unless restricted by constraints, the script can execute on
+a variety of different computers with various architectures
+and resource managers.
+DPCS monitors the state of these computers and performs backfill
+scheduling across the computers with jobs under its management.
+When DPCS decides that resources are available to immediately
+initiate some job of its choice, it takes the following
+actions:
+\begin{itemize}
+\item Transfers the job script and assorted state information to
+the computer upon which the job is to execute.
+
+\item Allocates resources for the job.
+The resource allocation is performed as user {\em root}, and SLURM
+is configured to restrict resource allocations in the relevant
+partitions to user {\em root}.
+This prevents user resource allocations to those partitions
+except through DPCS, which has complete control over job
+scheduling there.
+The allocation request specifies the target user ID, job ID
+(to match DPCS' own numbering scheme), and specific nodes to use.
+
+\item Spawns the job script as the desired user.
+This script may contain multiple instantiations of \srun\
+to initiate multiple job steps.
+
+\item Monitors the job's state and resource consumption.
+This is performed using DPCS daemons on each compute node
+recording CPU time, real memory, and virtual memory consumed.
+
+\item Cancels the job as needed when it has reached its time limit.
+The SLURM job is initiated with an infinite time limit.
+DPCS mechanisms are used exclusively to manage job time limits.
+
+\end{itemize}
+
+Much of the SLURM functionality is left unused in the DPCS
+controlled environment.
+It should be noted that DPCS is typically configured not to
+control all partitions.
+A small (debug) partition is usually set aside for smaller
+jobs, and users may directly use SLURM commands to access that
+partition.
diff --git a/doc/jsspp/intro.tex b/doc/jsspp/intro.tex
new file mode 100644
index 0000000000000000000000000000000000000000..73153ba0ca1880a006a53cddc9d5b4f5bb043300
--- /dev/null
+++ b/doc/jsspp/intro.tex
@@ -0,0 +1,108 @@
+\section{Introduction}
+Linux clusters, often constructed from commodity off-the-shelf (COTS)
+components, have become increasingly popular as a computing platform
+for parallel computation in recent years, mainly due to their ability
+to deliver a high performance-to-cost ratio.
+Researchers have built and used small to medium size clusters for
+developing and debugging applications~\cite{BeowulfWeb,LokiWeb}.
+Now, the continuing decrease in the price of COTS parts, in
+conjunction with the good scalability of the cluster architecture,
+has made it feasible to economically build large-scale clusters with
+thousands of processors~\cite{MCRWeb,PCRWeb}.
+
+An essential component that is needed to
+run user applications on a cluster is a cluster management system.
+A cluster management system (or cluster manager) performs such crucial
+tasks as scheduling user jobs, monitoring machine and job status,
+launching user applications, and managing machine configuration.
+An ideal cluster manager should be simple, efficient, scalable, and
+fault-tolerant.
+
+Unfortunately, no cluster management system currently available,
+either open source or proprietary, satisfies these requirements
+while being immediately usable on clusters of any architecture.
+A survey~\cite{Jette02} has revealed that existing cluster managers
+have poor scalability and hence are not suitable for large clusters
+with thousands of processors~\cite{LoadLevelerWeb,LoadLevelerManual}.
+Many of these cluster managers also cannot be easily ported to other
+clusters, since they are tied to specific components of the clusters
+for which they were designed~\cite{RMS,LoadLevelerWeb,LoadLevelerManual}.
+Furthermore, their cluster management functionality is usually provided
+as part of a specific job scheduling system package.
+This mandates the use of the given scheduler just to manage a cluster,
+even though the scheduler does not necessarily meet the needs of the
+organization that hosts the cluster.
+A clear separation of the cluster management functionality from
+scheduling policy is desired.
+
+This observation led us to set out to design a simple, highly scalable,
+and portable cluster management system that performs the core cluster
+resource management functions.
+The result of this effort is the Simple Linux Utility for Resource
+Management (SLURM\footnote{A tip of the hat to Matt Groening and the
+creators of {\em Futurama}, where Slurm is the most popular carbonated
+beverage in the universe.}).
+SLURM was developed with the following design goals:
+
+\begin{itemize}
+\item {\em Simplicity}: SLURM is simple enough to allow motivated
+end-users to understand its source code and add functionality. The
+authors will avoid the temptation to add features unless they are of
+general appeal.
+
+\item {\em Open Source}: SLURM is available to everyone and will
+remain free.
+Its source code is distributed under the GNU General Public
+License~\cite{GPLWeb}.
+
+\item {\em Portability}: SLURM is written in the C language, with a GNU
+{\em autoconf} configuration engine.
+While initially written for Linux, other UNIX-like operating systems
+should be easy porting targets.
+SLURM also supports a general purpose {\em plugin} mechanism, which
+permits a variety of different infrastructures to be easily supported.
+The SLURM configuration file specifies which set of plugin modules
+should be used.
+
+\item {\em Interconnect independence}: SLURM supports UDP/IP-based
+communication as well as the Quadrics Elan3 and Myrinet interconnects.
+Adding support for other interconnects is straightforward and utilizes
+the plugin mechanism described above.
+
+\item {\em Scalability}: SLURM is designed for scalability to clusters
+of thousands of nodes.
+Jobs may specify their resource requirements in a variety of ways,
+including requirement options and ranges, potentially permitting
+faster initiation than otherwise possible.
+
+\item {\em Robustness}: SLURM can handle a variety of failure modes
+without terminating workloads, including crashes of the node running
+the SLURM controller.
+User jobs may be configured to continue execution despite the failure
+of one or more nodes on which they are executing.
+Nodes allocated to a job are available for reuse as soon as the job(s)
+allocated to them terminate.
+If some nodes fail to complete job termination
+in a timely fashion due to hardware or software problems, only the
+scheduling of those tardy nodes will be affected.
+
+\item {\em Secure}: SLURM employs crypto technology to authenticate
+users to services and services to each other with a variety of options
+available through the plugin mechanism.
+SLURM does not assume that its networks are physically secure,
+but it does assume that the entire cluster is within a single
+administrative domain with a common user base across the
+entire cluster.
+
+\item {\em System administrator friendly}: SLURM is configured via a
+simple configuration file and minimizes distributed state.
+Its configuration may be changed at any time without impacting running
+jobs. Heterogeneous nodes within a cluster may be easily managed.
+SLURM interfaces are usable by scripts and its behavior is highly
+deterministic.
+
+\end{itemize}
+
+The main contribution of our work is that we have provided a readily
+available and inexpensive tool that anybody can use to efficiently
+manage clusters of different sizes and architectures.
+SLURM is highly scalable\footnote{It was observed that it took less
+than two seconds for SLURM to launch a thousand-task job on
+a large cluster currently being built for Lawrence Livermore National
+Laboratory.}.
+SLURM can be ported to any cluster system with minimal effort thanks
+to its plugin capability, and it can be used with any meta-batch
+scheduler or even a Grid resource broker~\cite{Gridbook} through its
+well-defined interfaces.
+
+The rest of the paper is organized as follows.
+Section 2 describes the architecture of SLURM in detail. Section 3
+discusses the services provided by SLURM, followed by a performance
+study of SLURM in Section 4. A brief survey of existing cluster
+management systems is presented in Section 5.
+%Section 6 describes how the SLURM can be used with more sophisticated
+%external schedulers.
+Concluding remarks and future development plans for SLURM are given
+in Section 6.
diff --git a/doc/jsspp/perf.tex b/doc/jsspp/perf.tex
new file mode 100644
index 0000000000000000000000000000000000000000..d269816855bf040ca8c56c49c415484ac0e1e1ca
--- /dev/null
+++ b/doc/jsspp/perf.tex
@@ -0,0 +1,16 @@
+\section{Performance Study}
+
+\begin{figure}[htb]
+\centerline{\epsfig{file=figures/times.eps}}
+\caption{Time to execute /bin/hostname with various node counts}
+\label{timing}
+\end{figure}
+
+We were able to perform some SLURM tests on a 1000-node cluster at
+LLNL. Some development was still underway at that time and
+tuning had not been performed. The results for executing the simple
+{\tt hostname} program with two tasks per node at various node counts
+are shown in Figure~\ref{timing}. We found SLURM performance to be
+comparable to the Quadrics Resource Management System (RMS)~\cite{RMS}
+for all job sizes and about 80 times faster than IBM
+LoadLeveler~\cite{LoadLevelerWeb,LoadLevelerManual} at tested job sizes.
diff --git a/doc/jsspp/services.tex b/doc/jsspp/services.tex
new file mode 100644
index 0000000000000000000000000000000000000000..18ded4b7a7d7a6e86b6c81fc3b2bb463a8e5a019
--- /dev/null
+++ b/doc/jsspp/services.tex
@@ -0,0 +1,320 @@
+\section{SLURM Operation and Services}
+\subsection{Command Line Utilities}
+
+The command line utilities are the user interface to SLURM
+functionality. They offer users access to remote execution and job
+control. They also permit administrators to dynamically change the
+system configuration. These commands all use SLURM APIs, which are
+directly available for more sophisticated applications.
+
+\begin{itemize}
+\item {\tt scancel}: Cancel a running or a pending job or job step,
+subject to authentication and authorization. This command can also
+be used to send an arbitrary signal to all processes on all nodes
+associated with a job or job step.
+
+\item {\tt scontrol}: Perform privileged administrative commands
+such as draining a node or partition in preparation for maintenance.
+Many \scontrol\ functions can only be executed by privileged users.
+
+\item {\tt sinfo}: Display a summary of partition and node information.
+An assortment of filtering and output format options are available.
+
+\item {\tt squeue}: Display the queue of running and waiting jobs
+and/or job steps. A wide assortment of filtering, sorting, and output
+format options are available.
+
+\item {\tt srun}: Allocate resources, submit jobs to the SLURM queue,
+and initiate parallel tasks (job steps).
+Every set of executing parallel tasks has an associated \srun\ that
+initiated it and, if the \srun\ persists, manages it.
+Jobs may be submitted for batch execution, in which case
+\srun\ terminates after job submission.
+Jobs may also be submitted for interactive execution, where \srun\
+keeps running to shepherd the running job. In this case,
+\srun\ negotiates connections with remote {\tt slurmd}'s
+for job initiation and to
+get stdout and stderr, forward stdin, and respond to signals from the
+user. The \srun\ may also be instructed to allocate a set of resources
+and spawn a shell with access to those resources.
+
+\end{itemize}
+
+\subsection{Plugins}
+
+In order to make the use of different infrastructures possible,
+SLURM uses a general purpose plugin mechanism.
+A SLURM plugin is a dynamically linked code object that is
+loaded explicitly at run time by the SLURM libraries.
+A plugin provides a customized implementation of a well-defined
+API connected to tasks such as authentication, interconnect fabric,
+and task scheduling.
+A set of functions is defined for use by all of the different
+infrastructures of a particular variety.
+For example, the authentication plugin must define functions
+such as
+{\tt slurm\_auth\_activate} to create a credential,
+{\tt slurm\_auth\_verify} to verify a credential to
+approve or deny authentication, and
+{\tt slurm\_auth\_get\_uid} to get the user ID associated with
+a specific credential.
+It must also define the data structure used, a plugin type, and
+a plugin version number.
+The available plugins are defined in the configuration file.
+%When a slurm daemon is initiated, it reads the configuration
+%file to determine which of the available plugins should be used.
+%For example {\em AuthType=auth/authd} says to use the plugin for
+%authd based authentication and {\em PluginDir=/usr/local/lib}
+%identifies the directory in which to find the plugin.
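+
+The C sketch below outlines the shape of such an authentication
+plugin. Only the three function names above come from the text; the
+credential structure, signatures, and the type and version symbols
+are simplified assumptions for illustration:
+
+\begin{verbatim}
+/* Sketch of an authentication plugin.  All types and
+ * signatures below are illustrative assumptions, not
+ * SLURM's actual plugin API. */
+#include <stdlib.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+
+const char     plugin_type[]  = "auth/example";  /* assumed */
+const unsigned plugin_version = 1;               /* assumed */
+
+typedef struct {
+    uid_t  uid;      /* user this credential identifies */
+    gid_t  gid;
+    time_t expires;  /* end of credential lifetime */
+} example_cred_t;
+
+/* Create a credential for the calling user. */
+example_cred_t *slurm_auth_activate(int lifetime)
+{
+    example_cred_t *cred = malloc(sizeof(*cred));
+    if (cred == NULL)
+        return NULL;
+    cred->uid     = getuid();
+    cred->gid     = getgid();
+    cred->expires = time(NULL) + lifetime;
+    return cred;   /* a real plugin would also MAC/encrypt it */
+}
+
+/* Approve (0) or deny (-1) authentication. */
+int slurm_auth_verify(example_cred_t *cred)
+{
+    if (cred == NULL || time(NULL) > cred->expires)
+        return -1; /* missing or expired credential */
+    return 0;      /* a real plugin would check the MAC here */
+}
+
+/* Report the user ID bound to a verified credential. */
+uid_t slurm_auth_get_uid(example_cred_t *cred)
+{
+    return cred->uid;
+}
+\end{verbatim}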
+
+\subsection{Communications Layer}
+
+SLURM presently uses Berkeley sockets for communications.
+However, we anticipate using the plugin mechanism to easily
+permit use of other communications layers.
+At LLNL we use Ethernet for SLURM communications and
+the Quadrics Elan switch exclusively for user applications.
+The SLURM configuration file permits the identification of each
+node's hostname as well as the name by which it is addressed for
+communications.
+%In the case of a control machine known as {\em mcri} to be
+%communicated with using the name {\em emcri} (say to indicate
+%an ethernet communications path), this is represented in the
+%configuration file as {\em ControlMachine=mcri ControlAddr=emcri}.
+%The name used for communication is the same as the hostname unless
+%otherwise specified.
+
+While SLURM is able to manage 1000 nodes without difficulty using
+sockets and Ethernet, we are reviewing other communication
+mechanisms that may offer improved scalability.
+One possible alternative is STORM\cite{STORM2001}.
+STORM uses the cluster interconnect and Network Interface Cards to
+provide high-speed communications, including a broadcast capability.
+STORM only supports the Quadrics Elan interconnect at present,
+but does offer the promise of improved performance and scalability.
+
+\subsection{Security}
+
+SLURM has a simple security model:
+Any user of the cluster may submit parallel jobs to execute and cancel
+his own jobs. Any user may view SLURM configuration and state
+information.
+Only privileged users may modify the SLURM configuration,
+cancel any job, or perform other restricted activities.
+Privileged users in SLURM include the users {\em root}
+and {\tt SlurmUser} (as defined in the SLURM configuration file).
+If permission to modify SLURM configuration is
+required by others, set-uid programs may be used to grant specific
+permissions to specific users.
+
+We presently support three authentication mechanisms via plugins:
+{\tt authd}\cite{Authd2002}, {\tt munged}, and {\tt none}.
+A plugin can easily be developed for Kerberos or other authentication
+mechanisms as desired.
+The \munged\ implementation is described below.
+A \munged\ daemon running as user {\em root} on each node confirms the
+identity of the user making the request using the {\tt getpeername}
+function and generates a credential.
+The credential contains a user ID,
+group ID, time-stamp, lifetime, some pseudo-random information, and
+any user supplied information. The \munged\ uses a private key to
+generate a Message Authentication Code (MAC) for the credential.
+The \munged\ then uses a public key to symmetrically encrypt
+the credential including the MAC.
+SLURM daemons and programs transmit this encrypted
+credential with communications. The SLURM daemon receiving the message
+sends the credential to \munged\ on that node.
+The \munged\ decrypts the credential using its private key, validates
+it, and returns the user ID and group ID of the user originating the
+credential.
+The \munged\ prevents replay of a credential on any single node
+by recording credentials that have already been authenticated.
+In SLURM's case, the user supplied information includes node
+identification information to prevent a credential from being
+used on nodes it is not destined for.
+
+When resources are allocated to a user by the controller, a
+{\em job step credential} is generated by combining the user ID, job
+ID, step ID, the list of resources allocated (nodes), and the
+credential lifetime. This job step credential is encrypted with
+a \slurmctld\ private key. This credential
+is returned to the requesting agent ({\tt srun}) along with the
+allocation response, and must be forwarded to the remote
+{\tt slurmd}'s upon job step initiation. \slurmd\ decrypts this
+credential with the \slurmctld 's public key to verify that the user
+may access resources on the local node. \slurmd\ also uses this job
+step credential to authenticate standard input, output, and error
+communication streams.
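+
+Schematically, the job step credential can be pictured as a small
+signed structure. The C sketch below is an illustrative layout only;
+the field list comes from the text, but the field names, sizes, and
+encoding are assumptions:
+
+\begin{verbatim}
+/* Schematic layout of a job step credential.  Only the field
+ * list (user ID, job ID, step ID, node list, lifetime) comes
+ * from the text; names, sizes, and encoding are assumptions. */
+#include <stdint.h>
+#include <sys/types.h>
+#include <time.h>
+
+typedef struct job_step_cred {
+    uid_t    uid;           /* user owning the allocation        */
+    uint32_t job_id;        /* job the step belongs to           */
+    uint32_t step_id;       /* step within the job               */
+    char     nodes[1024];   /* allocated nodes, e.g. "dev[0-3]"  */
+    time_t   expires;       /* end of credential lifetime        */
+    uint8_t  sig[128];      /* signed with slurmctld private key */
+} job_step_cred_t;
+\end{verbatim}
+
+A \slurmd\ that verifies {\tt sig} with the \slurmctld\ public key and
+finds its own name in {\tt nodes} can accept the job step without
+consulting the controller.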
+
+%Access to partitions may be restricted via a {\em RootOnly} flag.
+%If this flag is set, job submit or allocation requests to this
+%partition are only accepted if the effective user ID originating
+%the request is a privileged user.
+%The request from such a user may submit a job as any other user.
+%This may be used, for example, to provide specific external schedulers
+%with exclusive access to partitions. Individual users will not be
+%permitted to directly submit jobs to such a partition, which would
+%prevent the external scheduler from effectively managing it.
+%Access to partitions may also be restricted to users who are
+%members of specific Unix groups using a {\em AllowGroups} specification.
+
+\subsection{Job Initiation}
+
+There are three modes in which jobs may be run by users under SLURM.
+The first and simplest is {\em interactive} mode, in which stdout and
+stderr are displayed on the user's terminal in real time, and stdin
+and signals may be forwarded from the terminal transparently to the
+remote tasks. The second is {\em batch} mode, in which the job is
+queued until the request for resources can be satisfied, at which time
+the job is run by SLURM as the submitting user. In {\em allocate}
+mode, resources are allocated to the requesting user, and the user may
+then manually run job steps via a script or in a sub-shell spawned by
+\srun .
+
+\begin{figure}[tb]
+\centerline{\epsfig{file=figures/connections.eps,scale=0.5}}
+\caption{\small Job initiation connections overview. 1. The \srun\
+ connects to \slurmctld\ requesting resources. 2. \slurmctld\ issues
+ a response, with a list of nodes and a job credential. 3. The \srun\
+ opens a listen port for every task in the job step, then sends a run
+ job step request to \slurmd . 4. \slurmd 's initiate the job step and
+ connect back to \srun\ for stdout/err. }
+\label{connections}
+\end{figure}
+
+Figure~\ref{connections} gives a high-level depiction of the
+connections that occur between SLURM components during a general
+interactive job startup.
+The \srun\ requests a resource allocation and job step initiation from
+the {\tt slurmctld}, which, if the request is granted, responds with
+the job ID, the list of allocated nodes, and a job credential.
+The \srun\ then initializes listen ports for each
+task and sends a message to the {\tt slurmd}'s on the allocated nodes
+requesting that the remote processes be initiated. The {\tt slurmd}'s
+begin execution of the tasks and connect back to \srun\ for stdout and
+stderr. This process and the other initiation modes are described in
+more detail below.
+
+\subsubsection{Interactive mode initiation}
+
+\begin{figure}[tb]
+\centerline{\epsfig{file=figures/interactive-job-init.eps,scale=0.5} }
+\caption{\small Interactive job initiation. \srun\ simultaneously
+ allocates nodes and a job step from \slurmctld\ then sends a run
+ request to all \slurmd 's in the job. Dashed arrows indicate a
+ periodic request that may or may not occur during the lifetime of
+ the job.}
+\label{init-interactive}
+\end{figure}
+
+Interactive job initiation is illustrated in
+Figure~\ref{init-interactive}.
+The process begins with a user invoking \srun\ in interactive mode.
+In Figure~\ref{init-interactive}, the user has requested an interactive
+run of the executable ``{\tt cmd}'' in the default partition.
+
+After processing command line options, \srun\ sends a message to
+\slurmctld\ requesting a resource allocation and a job step initiation.
+This message simultaneously requests an allocation (or job) and a job
+step. The \srun\ waits for a reply from {\tt slurmctld}, which may not
+come instantly if the user has requested that \srun\ block until
+resources are available.
+When resources are available
+for the user's job, \slurmctld\ replies with a job step credential,
+the list of nodes that were allocated, cpus per node, and so on. The
+\srun\ then sends a message to each \slurmd\ on the allocated nodes
+requesting that a job step be initiated. The \slurmd 's verify that
+the job is valid using the forwarded job step credential and then
+respond to \srun .
+
+Each \slurmd\ invokes a job thread to handle the request, which in turn
+invokes a task thread for each requested task. The task thread connects
+back to a port opened by \srun\ for stdout and stderr. The host and
+port for this connection are contained in the run request message sent
+to this machine by \srun . Once stdout and stderr have successfully
+been connected, the task thread takes the necessary steps to initiate
+the user's executable on the node, initializing environment, current
+working directory, and interconnect resources if needed.
+
+Once the user process exits, the task thread records the exit status
+and sends a task exit message back to \srun . When all local processes
+terminate, the job thread exits. The \srun\ process either waits
+for all tasks to exit, or attempts to clean up the remaining processes
+some time after the first task exits.
+Regardless, once all
+tasks are finished, \srun\ sends a message to the \slurmctld\ releasing
+the allocated nodes, then exits with an appropriate exit status.
+
+When the \slurmctld\ receives notification that \srun\ no longer needs
+the allocated nodes, it issues a request for the epilog to be run on
+each of the \slurmd 's in the allocation. As \slurmd 's report that the
+epilog ran successfully, the nodes are returned to the partition.
+
+
+\subsubsection{Batch mode initiation}
+
+\begin{figure}[tb]
+\centerline{\epsfig{file=figures/queued-job-init.eps,scale=0.5} }
+\caption{\small Queued job initiation.
+ \slurmctld\ initiates the user's job as a batch script on one node.
+ The batch script contains an \srun\ call that initiates parallel
+ tasks after instantiating a job step with the controller. The shaded
+ region is a compressed representation and is illustrated in more
+ detail in the interactive diagram (Figure~\ref{init-interactive}).}
+\label{init-batch}
+\end{figure}
+
+Figure~\ref{init-batch} illustrates the initiation of a batch job in
+SLURM. Once a batch job is submitted, \srun\ sends a batch job request
+to \slurmctld\ that contains the input/output location for the job,
+current working directory, environment, and requested number of nodes.
+The \slurmctld\ queues the request in its priority-ordered queue.
+
+Once the resources are available and the job has a high enough
+priority, \slurmctld\ allocates the resources to the job and contacts
+the first node of the allocation requesting that the user job be
+started. In this case, the job may either be another invocation of
+\srun\ or a {\em job script} which may have multiple invocations of
+\srun\ within it. The \slurmd\ on the remote node responds to the run
+request, initiating the job thread, task thread, and user script. An
+\srun\ executed from within the script detects that it has access to
+an allocation and initiates a job step on some or all of the nodes
+within the job.
+
+Once the job step is complete, the \srun\ in the job script notifies
+the \slurmctld\ and terminates. The job script continues executing and
+may initiate further job steps. Once the job script completes, the
+task thread running the job script collects the exit status and sends
+a task exit message to the \slurmctld .
+The \slurmctld\ notes that the job is complete
+and requests that the job epilog be run on all nodes that were
+allocated. As the \slurmd 's respond with successful completion of the
+epilog, the nodes are returned to the partition.
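+
+As an illustration, a job script of the kind described above might
+look as follows; the script and program names are arbitrary:
+
+\begin{verbatim}
+#!/bin/sh
+# my_script: runs on the first node of the allocation.
+# Each srun below starts a job step within the allocation.
+
+srun --nprocs=16 ./compute_phase   # step 1: all 16 processes
+srun --nodes=2 ./collect_phase     # step 2: subset of the nodes
+\end{verbatim}
+
+Such a script would be submitted with a command like
+{\em srun --batch --nprocs=16 my\_script}; each inner \srun\ detects
+the enclosing allocation and initiates its job step within it.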
+
+\subsubsection{Allocate mode initiation}
+
+\begin{figure}[tb]
+\centerline{\epsfig{file=figures/allocate-init.eps,scale=0.5} }
+\caption{\small Job initiation in allocate mode. Resources are
+ allocated and \srun\ spawns a shell with access to the resources.
+ When the user runs an \srun\ from within the shell, a job step is
+ initiated under the allocation.}
+\label{init-allocate}
+\end{figure}
+
+In allocate mode, the user wishes to allocate a job and interactively
+run job steps under that allocation. The process of initiation in this
+mode is illustrated in Figure~\ref{init-allocate}. The invoked \srun\
+sends an allocate request to \slurmctld , which, if resources are
+available, responds with a list of allocated nodes, a job ID, and so
+on. The \srun\ process spawns a shell on the user's terminal with
+access to the allocation, then waits for the shell to exit, at which
+time the job is considered complete.
+
+An \srun\ initiated within the allocate sub-shell recognizes that it
+is running under an allocation and therefore already within a job.
+Provided with no other arguments, \srun\ started in this manner
+initiates a job step on all nodes within the current job. However,
+the user may also run a job step on a subset of these nodes.
+
+An \srun\ executed from the sub-shell reads the environment and
+user options, then notifies the controller that it is starting a job
+step under the current job. The \slurmctld\ registers the job step and
+responds with a job credential. The \srun\ then initiates the job step
+using the same general method as described in the section on
+interactive job initiation.
+
+When the user exits the allocate sub-shell, the original \srun\
+receives the exit status, notifies \slurmctld\ that the job is
+complete, and exits. The controller runs the epilog on each of the
+allocated nodes, returning nodes to the partition as they complete
+the epilog.
diff --git a/doc/jsspp/survey.tex b/doc/jsspp/survey.tex
new file mode 100644
index 0000000000000000000000000000000000000000..2c4c0c922622afdb05a06ab8d7089756665a851c
--- /dev/null
+++ b/doc/jsspp/survey.tex
@@ -0,0 +1,319 @@
+\section{Related Work}
+\subsection*{Portable Batch System (PBS)}
+
+The Portable Batch System (PBS)~\cite{PBS}
+is a flexible batch queuing and
+workload management system originally developed by Veridian Systems
+for NASA. It operates on networked, multi-platform UNIX environments,
+including heterogeneous clusters of workstations, supercomputers, and
+massively parallel systems. PBS was developed as a replacement for
+NQS (Network Queuing System) by many of the same people.
+
+PBS supports sophisticated scheduling logic (via the Maui
+Scheduler\footnote{http://superclustergroup.org/maui}).
+PBS spawns daemons on each
+machine to shepherd the job's tasks.
+It provides an interface for administrators to easily
+plug in their own scheduling modules. PBS can support
+long delays in file staging with retry. Host
+authentication is provided by checking port numbers (low port numbers
+are accessible only to user root). A credential service is used for
+user authentication.
+It has the job prolog and epilog feature.
+PBS supports a high priority queue for smaller "interactive" jobs.
+A signal to the daemons causes the current log file to be closed,
+renamed with a time-stamp, and a new log file created.
+
+Although PBS is portable and has a broad user base, it has the
+following drawbacks. First, PBS is implemented as a single thread and
+hence exhibits poor performance, especially when a compute node in the
+system dies: PBS repeatedly tries to contact the down node while other
+activities must wait. Second, PBS has a weak mechanism for starting
+and cleaning up parallel jobs. Finally, PBS has very poor scalability
+and is not suitable for large clusters.
+%Specific complaints about PBS from members of the OSCAR group (Jeremy Enos,
+%Jeff Squyres, Tim Mattson):
+%\begin{itemize}
+%\item Sensitivity to hostname configuration on the server; improper
+% configuration results in hard to diagnose failure modes. Once
+% configuration is correct, this issue disappears.
+%\item When a compute node in the system dies, everything slows down.
+% PBS is single-threaded and continues to try to contact down nodes,
+% while other activities like scheduling jobs, answering qsub/qstat
+% requests, etc., have to wait for a complete timeout cycle before being
+% processed.
+%\item Default scheduler is just FIFO, but Maui can be plugged in so this
+% is not a big issue.
+%\item Weak mechanism for starting/cleaning up parallel jobs (pbsdsh).
+% When a job is killed, pbsdsh kills the processes it started, but
+% if the process doesn't die on the first shot it may continue on.
+%\item PBS server continues to mark specific nodes offline, even though they
+% are healthy. Restarting the server fixes this.
+%\item Lingering jobs. Jobs assigned to nodes, and then bounced back to the
+% queue for any reason, maintain their assignment to those nodes, even
+% if another job had already started on them. This is a poor clean up
+% issue.
+%\item When the PBS server process is restarted, it puts running jobs at risk.
+%\item Poor diagnostic messages. This problem can be as serious as ANY other
+% problem. This problem makes small, simple problems turn into huge
+% turmoil occasionally. For example, the variety of symptoms that arise
+% from improper hostname configuration. All the symptoms that result are
+% very misleading to the real problem.
+%\item Rumored to have problems when the number of jobs in the queues gets
+% large.
+%\item Scalability problems on large systems.
+%\item Non-portable to Windows
+%\item Source code is a mess and difficult for others (e.g. the open source
+% community) to improve/expand.
+%\item Licensing problems (see below).
+%\end{itemize}
+%The one strength mentioned is PBS's portability and broad user base.
+%
+%PBS is owned by Veridian and is released as three separate products with
+%different licenses: {\em PBS Pro} is a commercial product sold by Veridian;
+%{\em OpenPBS} is an pseudo open source version of PBS that requires
+%registration; and
+%{\em PBS} is a GPL-like, true open source version of PBS.
+%
+%Bug fixes go into PBS Pro. When a major revision of PBS Pro comes out,
+%the previous version of PBS Pro becomes OpenPBS, and the previous version
+%of OpenPBS becomes PBS. The delay getting bug fixes (some reported by the
+%open source community) into the true open source version of PBS is the source
+%of some frustration.
+
+\subsection*{Maui Scheduler}
+
+Maui Scheduler~\cite{Maui}
+is an advance reservation HPC batch scheduler for use with SP,
+O2K, and UNIX/Linux clusters.
+It is widely used to extend the
+functionality of PBS and LoadLeveler.
+
+\subsection*{Distributed Production Control System (DPCS)}
+
+The Distributed Production Control System (DPCS)~\cite{DPCS}
+is a resource manager developed at Lawrence Livermore National
+Laboratory (LLNL).
+The DPCS provides basic data collection and reporting
+mechanisms for project-level, near real-time accounting and resource
+allocation to customers with established limits per customers'
+organization budgets.
+In addition, the DPCS evenly distributes workload across available
+computers and supports dynamic reconfiguration and graceful
+degradation of service to prevent overuse of a computer where not
+authorized.
+%DPCS is (or will soon be) open source, although its use is presently
+%confined to LLNL. The development of DPCS began in 1990 and it has
+%evolved into a highly scalable and fault-tolerant meta-scheduler
+%operating on top of LoadLeveler, RMS, and NQS. DPCS provides:
+%\begin{itemize}
+%\item Basic data collection and reporting mechanisms for project-level,
+% near real-time accounting.
+%\item Resource allocation to customers with established limits per
+% customers' organizational budgets.
+%\item Proactive delivery of services to organizations that are relatively
+% underserviced using a fair-share resource allocation scheme.
+%\item Automated, highly flexible system with feedback for proactive delivery
+% of resources.
+%\item Even distribution of the workload across available computers.
+%\item Flexible prioritization of production workload, including "run on demand."
+%\item Dynamic reconfiguration and re-tuning.
+%\item Graceful degradation in service to prevent overuse of a computer where
+% not authorized.
+%\end{itemize}
+
+While DPCS does have these attractive characteristics, it supports
+only a limited number of computer systems: IBM RS/6000 and SP, Linux
+with RMS, Sun Solaris, and Compaq Alpha. DPCS also lacks commercial
+support.
+
+\subsection*{LoadLeveler}
+
+LoadLeveler~\cite{LoadLevelerManual,LoadLevelerWeb}
+is a proprietary batch system and parallel job manager by
+IBM. LoadLeveler supports few non-IBM systems. Its scheduling
+software is very primitive, and other software such as the Maui
+Scheduler or DPCS is required for reasonable performance.
+LoadLeveler offers a simple and very flexible queue and job class
+structure, operating in a "matrix" fashion.
+The biggest problem of LoadLeveler is its extremely poor scalability.
+It was observed that it took more than 20 minutes to start a 512-task
+job using LoadLeveler on one of the IBM SP machines at LLNL.
+In addition, all jobs must be initiated through LoadLeveler, and a
+special version of MPI is required to run a parallel job.
+%Many configuration files exist with signals to
+%daemons used to update configuration (like LSF, good). All jobs must
+%be initiated through LoadLeveler (no real "interactive" jobs, just
+%high priority queue for smaller jobs). Job accounting is only available
+%on termination (very bad for long-running jobs). Good status
+%information on nodes and LoadLeveler daemons is available. LoadLeveler
+%allocates jobs either entire nodes or shared nodes, depending upon configuration.
+%
+%A special version of MPI is required. LoadLeveler allocates
+%interconnect resources, spawns the user's processes, and manages the
+%job afterwards. Daemons also monitor the switch and node health using
+%a "heart-beat monitor."
+%One fundamental problem is that when the
+%"Central Manager" restarts, it forgets about all nodes and jobs. They
+%appear in the database only after checking in via the heartbeat. It
+%needs to periodically write state to disk instead of doing
+%"cold-starts" after the daemon fails, which is rare. It has the job
+%prolog and epilog feature, which permits us to enable/disable logins
+%and remove stray processes.
+%
+%LoadLeveler evolved from Condor, or what was Condor a decade ago.
+%While I am less familiar with LSF and Condor than LoadLeveler, they
+%all appear very similar with LSF having the far more sophisticated
+%scheduler. We should carefully review their data structures and
+%daemons before designing our own.
+%
+\subsection*{Load Sharing Facility (LSF)}
+
+LSF~\cite{LSF}
+is a proprietary batch system and parallel job manager by
+Platform Computing. Widely deployed on a wide variety of computer
+architectures, it has sophisticated scheduling software including
+fair-share, backfill, consumable resources, and job preemption, as
+well as a very flexible queue structure.
+It also provides good status information on nodes and LSF daemons.
+LSF shares many of its shortcomings with LoadLeveler: job initiation
+only through LSF, the requirement of a special MPI library, etc.
+%Limits are available on both a per process bs per-job
+%basis. Time limits include CPU time and wall-clock time. Many
+%configuration files with signals to daemons used to update
+%configuration (like LoadLeveler, good). All jobs must be initiated
+%through LSF to be accounted for and managed by LSF ("interactive"
+%jobs can be executed through a high priority queue for
+%smaller jobs). Job accounting only available in near real-time (important
+%for long-running jobs). Jobs initiated from same directory as
+%submitted from (not good for computer centers with diverse systems
+%under LSF control). Good status information on nodes and LSF daemons.
+%Allocates jobs either entire nodes or shared nodes depending upon
+%configuration.
+%
+%A special version of MPI is required. LSF allocates interconnect
+%resources, spawns the user's processes, and manages the job
+%afterwards. While I am less familiar with LSF than LoadLeveler, they
+%appear very similar with LSF having the far more sophisticated
+%scheduler. We should carefully review their data structures and
+%daemons before designing our own.
+
+
+\subsection*{Condor}
+
+Condor~\cite{Condor,Litkow88,Basney97}
+is a batch system and parallel job manager
+developed by the University of Wisconsin.
+Condor was the basis for IBM's LoadLeveler, and both share very
+similar underlying infrastructure. Condor has a very sophisticated
+checkpoint/restart service that does not rely upon kernel changes,
+but rather a variety of library changes (which prevent it from being
+completely general). The Condor checkpoint/restart service has been
+integrated into LSF, Codine, and DPCS. Condor is designed to operate
+across a heterogeneous environment, mostly to harness the compute
+resources of workstations and PCs. It has an interesting
+"advertising" service. Servers advertise their available resources
+and consumers advertise their requirements for a broker to perform
+matches. The checkpoint mechanism is used to relocate work on demand
+(when the "owner" of a desktop machine wants to resume work).
+
+
+
+\subsection*{Memory Channel}
+
+Memory Channel is a high-speed interconnect developed by
+Digital/Compaq with related software for parallel job execution.
+A special version of MPI is required.
+The application spawns tasks on
+other nodes. These tasks connect themselves to the high speed
+interconnect. No system-level tool spawns the tasks, allocates
+interconnect resources, or otherwise manages the parallel job. (Note:
+this is sometimes a problem when jobs fail, requiring system
+administrators to release interconnect resources. There are also
+performance problems related to resource sharing.)
+
+\subsection*{Linux PAGG Process Aggregates}
+
+
+PAGG~\cite{PAGG}
+consists of modifications to the Linux kernel that allow
+developers to implement Process AGGregates as loadable kernel modules.
+A process aggregate is defined as a collection of processes that are
+all members of the same set. A set would be implemented as a container
+for the member processes. For instance, process sessions and groups
+could have been implemented as process aggregates.
+
+\subsection*{Beowulf Distributed Process Space (BPROC)}
+
+
+The Beowulf Distributed Process Space
+(BPROC)
+is a set of kernel
+modifications, utilities, and libraries which allow a user to start
+processes on other machines in a Beowulf-style cluster~\cite{BProc}.
+Remote processes started with this mechanism appear in the process
+table of the front end machine in a cluster. This allows remote
+process management using the normal UNIX process control facilities.
+Signals are transparently forwarded to remote processes and exit
+status is received using the usual wait() mechanisms.
+
+%\subsection{xcat}
+%
+%Presumably IBM's suite of cluster management software
+%(xcat\footnote{http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246041.html})
+%includes a batch system. Look into this.
+%
+%\subsection{CPLANT}
+%
+%CPLANT\footnote{http://www.cs.sandia.gov/cplant/} includes
+%Parallel Job Launcher, Compute Node Daemon Process,
+%Compute Node Allocator, Compute Node Status Tool.
+%
+%\subsection{NQS}
+%
+%NQS\footnote{http://umbc7.umbc.edu/nqs/nqsmain.html},
+%the Network Queueing System, is a serial batch system.
+%
+\subsection*{LAM / MPI}
+
+Local Area Multicomputer (LAM)~\cite{LAM}
+is an MPI programming environment and development system for
+heterogeneous computers on a network.
+With LAM, a dedicated cluster or an existing network
+computing infrastructure can act as one parallel computer solving
+one problem. LAM features extensive debugging support in the
+application development cycle and peak performance for production
+applications, and it includes a full implementation of the MPI
+communication standard.
+
+%\subsection{MPICH}
+%
+%MPICH\footnote{http://www-unix.mcs.anl.gov/mpi/mpich/}
+%is a freely available, portable implementation of MPI,
+%the Standard for message-passing libraries.
+%
+%\subsection{Quadrics RMS}
+%
+%Quadrics
+%RMS\footnote{http://www.quadrics.com/downloads/documentation/}
+%(Resource Management System) is a cluster management system for
+%Linux and Tru64 which supports the
+%Elan3 interconnect.
+%
+%\subsection{Sun Grid Engine}
+%
+%SGE\footnote{http://www.sun.com/gridware/} is now proprietary.
+%
+%
+%\subsection{SCIDAC}
+%
+%The Scientific Discovery through Advanced Computing (SciDAC)
+%project\footnote{http://www.scidac.org/ScalableSystems}
+%has a Resource Management and Accounting working group
+%and a white paper\cite{Res2000}. Deployment of a system with
+%the required fault-tolerance and scalability is scheduled
+%for June 2006.
+%
+%\subsection{GNU Queue}
+%
+%GNU Queue\footnote{http://www.gnuqueue.org/home.html}.
+%
+%\subsection{Clubmask}
+%Clubmask\footnote{http://clubmask.sourceforge.net} is based on bproc.
+%Separate queueing system?
+%
+%\subsection{SQMX}
+%Part of the SCE Project\footnote{http://www.opensce.org/},
+%SQMX\footnote{http://www.beowulf.org/pipermail/beowulf-announce/2001-January/000086.html} is worth taking a look at.