This page describes a proprietary framework and supporting scripts that I built while employed as a Linux system administrator and Linux architect. It allows batch jobs to be fully defined by a shell script placed in a directory, with supporting functions for logging, status reporting, and failure alerting built in.
| Key details | |
|---|---|
| Brief description: | A framework to easily extend existing shell scripts to support flexible scheduling, with dependencies, conflict avoidance, centralised logging, and status reporting built in. |
| Consumer: | ERP development team, Linux system administration team |
| Impact to consumer: | |
| Technical features: | |
| Technologies used: | Bash |
When packaging and deploying internal software, there is a division between the "application" side and the "operating system" side. For some internal development teams, particularly the ERP developers working on the core Cobol application, this can make it difficult to manage scheduled jobs - granting the developers direct access to the cron scheduler would put the whole system at risk of accidental compromise, and is difficult to justify from an internal controls perspective. This means that developers have to co-ordinate with the Linux system admin team to set up or modify scheduled tasks.
Many scheduled jobs relate to processing of financially relevant information, and so from an internal controls perspective, there is a need to ensure that all jobs have consistent logging and failure alerting mechanisms - so that, for example, if a critical invoice processing job fails, an alert will definitely be raised.
I developed the batch framework to close these gaps without having to delegate an unsafe amount of responsibility to developers, and without having to introduce a large, complex, maintenance-heavy ecosystem such as Control-M or UC4.
Once the framework has been set up by the system administration team, new scheduled jobs can be created by placing a shell script in a particular directory - usually via deployment of an OS-native package along with the rest of the internal application. The job's shell script contains the job itself, plus all information necessary for scheduling and alerting. The framework provides supporting functions for the script to call for logging, status reporting, and failure alerting.
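A job script along these lines could be dropped into the job directory - a minimal sketch, assuming a `SCHEDULE` directive and a `log_info` helper, both of which are illustrative names rather than the framework's actual API:

```shell
#!/bin/bash
# Hypothetical job script for the batch framework. The comment block is
# read by the framework's scheduler, not by bash; "SCHEDULE" is an
# invented directive name used here for illustration.
#
# SCHEDULE: 30 2 * * *

# Illustrative logging helper; the real framework supplies its own functions.
log_info() {
    printf '%s INFO %s\n' "$(date -u +%FT%TZ)" "$*"
}

log_info "invoice processing started"
# ... the job's actual work would go here ...
log_info "invoice processing finished"
```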
Each job is run by the framework rather than directly from cron, so all output is automatically captured to local log files regardless of whether the script uses the framework's functions. Job status, output, the ETA of running jobs, performance relative to previous run times, and more are provided by a batch status viewer tool I developed separately, giving IT service management, developers, and other teams greater visibility of these processes without having to grant them access to the servers on which the jobs run.
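The mechanism for capturing output unconditionally is straightforward: the framework, not cron, invokes the job script with its streams redirected. A minimal sketch of such a runner (the function name and log format are assumptions, not the framework's own):

```shell
#!/bin/bash
# Illustrative runner: executes a job script with stdout and stderr
# redirected to a log file, so output is captured even if the job
# never calls any framework function. Names here are invented.
run_job() {
    local job_script=$1
    local log_file=$2
    bash "$job_script" >>"$log_file" 2>&1
    local rc=$?
    # Record the exit status alongside the captured output.
    printf '%s exit=%d\n' "$(date -u +%FT%TZ)" "$rc" >>"$log_file"
    return "$rc"
}
```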
The framework generates a low-level discovery file for the Zabbix monitoring system so that the template I created for Zabbix can automatically start monitoring any new jobs. Metrics are tracked for job running time, time since last successful run, success or failure of the last run, and so on; and alerts are automatically generated based on the parameters built into each job, reducing the likelihood of failures being missed because operators forgot to set up alerting.
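Zabbix low-level discovery expects a JSON document of macro/value pairs, so generating the discovery file amounts to enumerating the job directory. A sketch under assumed names (the `{#JOBNAME}` macro and directory layout are illustrative choices, not necessarily what the framework uses):

```shell
#!/bin/bash
# Illustrative Zabbix LLD generation: emit one {#JOBNAME} entry per job
# script found in the job directory. Macro name and layout are assumptions.
make_lld() {
    local job_dir=$1
    local first=1
    printf '{"data":['
    for script in "$job_dir"/*.sh; do
        [ -e "$script" ] || continue
        [ "$first" -eq 1 ] || printf ','
        first=0
        printf '{"{#JOBNAME}":"%s"}' "$(basename "$script" .sh)"
    done
    printf ']}\n'
}
```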
The following parameters can be set for each job, in the comment block at the top of the job's script:
- When to run the job, written as the first five fields of a crontab line; this can be specified multiple times. It is the only mandatory parameter.
- The maximum run time for the job, after which it will be terminated.
- The maximum amount of time permitted between successful runs of this job before an alert will be raised.
- Whether to raise an alert if this job is scheduled to start but a previous instance is already running. The default is to silently exit - concurrency locking is built-in.
- Email addresses to send a copy of the logs to, if the job fails.
- The log file to write to, instead of the default provided by the framework.
- Whether to transmit the start, end, and status information to the central log server for the batch status viewer to use (the default is to always do so).
- Any other jobs that this job conflicts with, such that this job should not start if those jobs are already running.
- In the case of a conflict, whether to wait for the conflicting job to end, and if so, how long to wait before giving up and exiting.
- In the case of a conflict, whether to treat an exit due to the conflict as a failure of this job (which will raise an alert), or to treat it as if this job was not scheduled and so fail silently. The "maximum time between successful runs" alert will still operate regardless of this setting.
- Any other jobs that must have completed successfully before this job will be allowed to run (its prerequisites).
- In the case of an incomplete prerequisite that is still running, how long to wait for the prerequisites to finish before giving up and abandoning this run of this job.
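A job's comment block might combine several of these parameters as follows. This is a sketch only: the directive names are invented for illustration and may not match the framework's actual keywords.

```shell
#!/bin/bash
# Hypothetical parameter block; all directive names below are invented
# for illustration and may not match the framework's real keywords.
#
# SCHEDULE: 15 1 * * 1-5        # run at 01:15 on weekdays (mandatory)
# SCHEDULE: 15 13 * * 6         # a second schedule line is also allowed
# MAX_RUNTIME: 7200             # terminate the job after two hours
# MAX_INTERVAL: 93600           # alert if no successful run within 26 hours
# CONFLICTS: invoice-export     # do not start while that job is running
# CONFLICT_WAIT: 1800           # wait up to 30 minutes, then give up
# MAILTO: erp-dev@example.com   # send a copy of the logs here on failure

echo "job body starts here"
```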
When administrators set up the batch framework, they can define multiple job locations, each with their own set of parameters. These parameters include which user to run the jobs as, which central host to transmit logs to, checks to run before starting any of the jobs (such as checking that this is the active node of an active/passive cluster), and commands to include as if they were at the start of every job (such as setting up Cobol runtime environment variables).
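A per-location configuration along these lines could express those parameters - variable names and paths here are invented for illustration, not the framework's actual configuration format:

```shell
# Hypothetical per-location configuration; names and paths are illustrative.
JOB_DIR=/opt/erp/batch/jobs.d       # where deployed job scripts land
RUN_AS_USER=erpbatch                # user the jobs run as
LOG_HOST=loghost.example.com        # central host for status/log transmission
# A command that must succeed before any job in this location starts,
# e.g. confirming this is the active node of an active/passive cluster.
PRE_CHECK='test -f /var/run/cluster/active'
# Commands included as if at the start of every job, e.g. Cobol runtime setup.
JOB_PROLOGUE='. /opt/erp/cobol-env.sh'
```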
Introducing the batch framework to the core ERP system immediately improved the monitoring and alerting of the jobs that were moved over to it. I then started using it even on systems where no delegation of control was required, wherever jobs had more complex needs than a very basic maintenance cron job, and the system admin team are now comfortable using it for any scheduled task that would benefit from automatic monitoring and alerting.