In the software development life cycle (SDLC), creating batch jobs is usually the last step. As an application is maintained over time and new components are added to the overall process, batch support can become complicated. This article provides guidelines for creating batch jobs that can be supported well in the future. Although the article and its examples focus on an investment management application, the framework can be applied to other domains.
Nomenclature of the job name
Organizations have dedicated support teams that monitor batch jobs, and each support team has many application batches to monitor. To make their work easier, it is recommended to follow a naming convention for jobs. There are two ways to name a batch job.
Short form – ABC12345
First 3 characters
Denote your system
TRD – Trading
REP – Reporting
REC – Reconciliation
4th character
Denotes the schedule of your job
1 – Daily
2 – Biweekly
3 – Once in 3 weeks
5 – Yearly
9 – On demand/Adhoc
5th & 6th characters
Denote your interface/internal process
Use your own sequence numbers for your interfaces
7th & 8th characters
Denote the purpose of your batch job
Use your own sequence numbers for your jobs
01 – File watcher
02 – Validation job
03 – ETL job
04 – FTP job
In the example TRD10101:
The first 3 characters, TRD, denote the Trading system
The 4th character, 1, denotes a daily job
The 5th & 6th characters, 01, denote Ratings data
The 7th & 8th characters, 01, denote that the job is a file watcher
Overall, TRD10101 denotes a daily job that belongs to the Trading system and is a file watcher job for the ratings file.
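The short-form convention is regular enough to be decoded mechanically, which is handy for monitoring dashboards or for validating new job names. The following is a minimal sketch in Python, not part of any scheduler: the lookup tables are copied from the lists above, and the names `JobName` and `parse_job_name` are hypothetical.

```python
from dataclasses import dataclass

# Code tables copied from the convention above; extend them for your own systems.
SYSTEMS = {"TRD": "Trading", "REP": "Reporting", "REC": "Reconciliation"}
SCHEDULES = {"1": "Daily", "2": "Biweekly", "3": "Once in 3 weeks",
             "5": "Yearly", "9": "On demand/Adhoc"}
PURPOSES = {"01": "File watcher", "02": "Validation job",
            "03": "ETL job", "04": "FTP job"}


@dataclass(frozen=True)
class JobName:
    system: str
    schedule: str
    interface: str  # site-specific sequence number, kept as-is
    purpose: str


def parse_job_name(name: str) -> JobName:
    """Split an 8-character short-form job name into its four parts."""
    if len(name) != 8:
        raise ValueError(f"expected 8 characters, got {len(name)}: {name!r}")
    system, schedule, interface, purpose = name[:3], name[3], name[4:6], name[6:8]
    try:
        return JobName(SYSTEMS[system], SCHEDULES[schedule], interface, PURPOSES[purpose])
    except KeyError as exc:
        raise ValueError(f"unknown code {exc.args[0]!r} in job name {name!r}") from None
```

With these tables, `parse_job_name("TRD10101")` yields the Trading system, a Daily schedule, interface `01`, and the File watcher purpose. The interface number is left undecoded because its meaning is defined by each team.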
If your organization doesn’t restrict the length of job names, you can follow a long-form naming convention instead. The job name contains the system name followed by the schedule, then the interface and the process name.
This example denotes a Ratings file watcher job that runs daily for the investment system.
Structuring of batch job
Now that we have defined the naming convention for individual jobs, we have to group them appropriately.
Grouping of jobs based on Process flow/Events
Group individual jobs based on process, such as Inbound, Critical, Outbound, and FTP.
Fig 1. Grouping of jobs based on process flow/events
Inbound – Include all jobs that feed data to your application. For example, in an investment management application batch, this group contains the batch jobs that process security master, trade files, ratings, foreign exchange, factors, coupons, etc.
Critical – Group all the jobs that do core processing for your system. For example, in an investment management system, the core jobs include nightly processing, accounting updates, amortization calculation, market value calculation, etc.
Outbound – Group all the jobs that send files to downstream systems or produce reporting extracts.
Grouping of jobs based on timelines
Group the batch jobs based on timeframe. If the batch cycle for your system starts at 6pm, you can group your jobs into windows such as 6pm-9pm, 9pm-2am, and 2am-6am.
Fig 2. Grouping of jobs based on timelines
Designing of individual batch job
- Part 1:
Make sure that the job is triggered by one of these events
- by time dependency
- by file watcher
- by dependency on an existing job
Create a late shout (an alert) if the job does not start before a certain time
- Part 2:
The next part is the actual job to run. This is the core design of the system.
- Part 3:
The last part is housekeeping. Rename the files produced by the process with a date-time stamp.
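The three parts above can be sketched as a small wrapper around the core step. This is an illustrative Python sketch under stated assumptions: in practice the trigger and the late shout are usually configured in the scheduler, and the names `is_late`, `stamp_file`, and `run_job` are hypothetical, as is the `_YYYYMMDD_HHMMSS` stamp format.

```python
from datetime import datetime, time
from pathlib import Path
from typing import Callable


def is_late(now: datetime, cutoff: time) -> bool:
    """Part 1: True when the job is starting after its cutoff time."""
    return now.time() > cutoff


def stamp_file(path: Path, now: datetime) -> Path:
    """Part 3: rename an output file so its name carries a date-time stamp."""
    stamped = path.with_name(f"{path.stem}_{now:%Y%m%d_%H%M%S}{path.suffix}")
    path.rename(stamped)
    return stamped


def run_job(core: Callable[[], Path], cutoff: time, now: datetime) -> Path:
    """Run the three parts in order: late check, core step, housekeeping."""
    if is_late(now, cutoff):
        print(f"LATE: job started after {cutoff}")  # stand-in for a real alert
    output = core()  # Part 2: the actual processing, returns the file it produced
    return stamp_file(output, now)
```

The late check compares clock times only, so a batch window that crosses midnight would need a date-aware comparison. Passing `now` in as a parameter, rather than reading the clock inside the functions, keeps the sketch easy to test.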
Ability to rerun the job during failure
During the SDLC phase, the developer should design a process that, on failure, can either be scheduled to rerun or be aborted for manual intervention, without causing data loss or duplicate data. For example, if your job runs an ETL process and one of the records fails, the system should be designed either to skip the record, continue the batch, and notify the support/business team about the skipped record, or to fail the job. This choice should be made as part of system design.
Most schedulers have a feature to rerun a job automatically, after a time lag, when it fails the first time. Batches sometimes fail because of network connectivity, and enabling this feature can resolve such failures automatically.
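Where the scheduler does not offer an automatic rerun, the same idea can be approximated inside the job itself. The sketch below is an assumption-laden illustration, not a scheduler feature: `run_with_retry` is a hypothetical helper, and treating `ConnectionError` and `TimeoutError` as the transient errors is a choice made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(job: Callable[[], T], attempts: int = 2,
                   delay_seconds: float = 300.0,
                   retry_on: tuple = (ConnectionError, TimeoutError),
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Run job; on a transient error wait and try again, up to `attempts` runs."""
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except retry_on as exc:
            if attempt == attempts:
                raise  # out of retries: let the scheduler mark the job failed
            print(f"attempt {attempt} failed ({exc}); retrying in {delay_seconds}s")
            sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")
```

Only errors listed in `retry_on` are retried; anything else, such as a bad record, fails immediately so that it gets human attention. An automatic rerun is only safe if the job can run twice without creating duplicate data, which is the design requirement described above.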
Fail a job with the right attitude
Although the objective of the batch framework is to run jobs without manual intervention, we also want to capture the errors generated during the batch run. Scheduler applications such as BMC Control-M have a feature to fail a job based on a keyword in the output log returned to the scheduler.
Say you have a file watcher job and a processing job that depends on it. If the file received from upstream is empty, the processing job will complete without any issue. If the log contains a line such as "0 records processed", you can pro-actively fail the job.
This depends on the pre-defined business rules.
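If the scheduler in use cannot match keywords in the log, a small wrapper can do the same check and return a non-zero exit code. The following Python sketch only illustrates the idea: the patterns are hypothetical, and `find_failures` and `exit_code` are names made up for this example.

```python
import re

# Hypothetical patterns; use whatever your jobs actually write to their logs.
FAIL_PATTERNS = [
    re.compile(r"\b0 records processed\b", re.IGNORECASE),
    re.compile(r"\bERROR\b"),
]


def find_failures(log_text: str) -> list[str]:
    """Return the log lines that match any failure pattern."""
    return [line for line in log_text.splitlines()
            if any(p.search(line) for p in FAIL_PATTERNS)]


def exit_code(log_text: str) -> int:
    """Map a job log to the exit code reported back to the scheduler."""
    return 1 if find_failures(log_text) else 0
```

The word boundary in the first pattern keeps a line such as "120 records processed" from being treated as a failure. A wrapper script would pass the job's log text to `exit_code` and hand the result to `sys.exit`, so that the scheduler sees the job as failed.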
Let’s prepare for a bad day
Run book for individual jobs
The run book is the bible for the application support team, so it should capture detailed instructions for each job. If the job is a file watcher, document the point of contact for the job, including the phone number and email of the support team responsible for transmitting the file. For a processing job, capture the procedure for logging in to the server, navigating to the application folder, the steps to fix the problem (such as running a script to exclude a bad record that is causing the batch to fail), the mandatory files required to rerun the job, and the procedure for communicating the failure to downstream systems or business users. The run book should also contain the escalation procedure. An application support run book is a live document: it cannot be perfect initially, and the support team should keep updating it as the team gains more knowledge.
Recovering batch to a critical point
Take a backup of the database at the start of the batch, in the middle of the batch, and at the end of the batch. If your system has a non-recoverable process, it is better to take a backup before and after that process. For example, an investment management application has a process that rolls the accounting system date forward to the next date. This process is critical, and a system failure during it can at times make the system non-recoverable. It is therefore recommended to take a database backup before and after the critical process.
Another advantage of taking backups during the batch is troubleshooting. Because application data changes over the nightly batch, finding the root cause of a production issue can be really challenging. With this additional backup, the application support team can restore a test database from a backup taken during the batch and proceed with the investigation.
- Archiving/backup of files
After the daily batch, archive the files that were received and processed. Zip them and name the archive with the date.
For example – InboundArch_12082015
At the end of the month, move the individual zipped files to the corresponding monthly folder. Similarly, at the end of the year, archive all monthly folders to the year folder.
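The daily and monthly archiving steps could be sketched as follows. This is a minimal Python illustration with several assumptions: the date in `InboundArch_12082015` is read as MMDDYYYY, the monthly folders are created directly under a year folder (which folds the year-end step into the folder layout), and the function names are hypothetical.

```python
import shutil
import zipfile
from datetime import date
from pathlib import Path


def archive_day(inbound_dir: Path, archive_dir: Path, run_date: date) -> Path:
    """Zip the files in inbound_dir into archive_dir/InboundArch_<MMDDYYYY>.zip."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    zip_path = archive_dir / f"InboundArch_{run_date:%m%d%Y}.zip"
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(inbound_dir.iterdir()):
            if f.is_file():
                zf.write(f, arcname=f.name)
    return zip_path


def roll_up_month(archive_dir: Path, year: int, month: int) -> Path:
    """Move that month's daily zips into archive_dir/<YYYY>/<MM>/."""
    month_dir = archive_dir / f"{year:04d}" / f"{month:02d}"
    month_dir.mkdir(parents=True, exist_ok=True)
    pattern = f"InboundArch_{month:02d}??{year:04d}.zip"
    for zip_path in list(archive_dir.glob(pattern)):
        shutil.move(str(zip_path), str(month_dir / zip_path.name))
    return month_dir
```

The sketch copies files into the zip and leaves the originals in place; deleting them after a verified archive is a separate decision. A year-first stamp such as YYYYMMDD would sort chronologically in a directory listing, which may be worth considering if the naming is not already fixed.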
- Scheduling calendar
It is recommended to align batch jobs to the enterprise scheduling calendar. Create a custom calendar for scheduling criteria that do not align with the enterprise scheduling calendar.
- Disabling a job vs Decommissioning a job
As part of maintenance, you will sometimes need to stop running a job. Do this by documenting and validating all the dependencies and making sure that no upstream or downstream application is waiting for this job. Once this is validated and signed off by all the teams, the first step is to disable the job. Let the batch run with the job disabled for a month. Finally, retire the job.
Reactive vs Pro-active
Application support is reactive, but the additional batch steps below can turn it from reactive to pro-active.
- Create checkpoint jobs at regular intervals or after a critical flow. A checkpoint helps you know how far the batch has run and, in case of delay, how long the batch cycle will take to complete.
- Identify the long-running jobs and update the run book for known errors.
- Track batch performance on a daily basis and evaluate the trends at regular intervals. This evaluation helps determine the patterns/trends of the overall batch.
For example, an accounting application usually runs longer during month end and on the first business day. Batch support can be informed about this trend pro-actively, and downstream systems can be informed about a possible delay in file delivery.
- As application support evolves over time, the team can pro-actively identify bad data in the system that could cause a failure in the overnight batch.
For example, we don’t want bad foreign exchange data in the system. The support team can have a batch job that identifies bad foreign exchange data before the start of the critical process.
- Monthly vs Quarterly vs Yearly vs Special holiday
Accounting/finance-related applications have special processes or reports to run at month end, quarter end, and year end, or on special holidays. The support team can pro-actively create a checklist to verify that batch jobs are scheduled appropriately.
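The pre-batch check for bad foreign exchange data mentioned above could look like the sketch below. What counts as "bad" is a business rule; here it is assumed to mean a non-positive rate or a move of more than 10% against the prior day. The `FxRate` type, the `find_bad_fx` function, and the threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FxRate:
    pair: str     # e.g. "EURUSD"
    rate: float   # today's rate
    prior: float  # previous business day's rate


def find_bad_fx(rates: list[FxRate], max_move: float = 0.10) -> list[str]:
    """Return one message per rate that is non-positive or moved too far."""
    problems = []
    for r in rates:
        if r.rate <= 0:
            problems.append(f"{r.pair}: non-positive rate {r.rate}")
        elif r.prior > 0 and abs(r.rate / r.prior - 1) > max_move:
            problems.append(f"{r.pair}: moved {r.rate / r.prior - 1:+.1%} vs prior day")
    return problems
```

A checkpoint job placed before the critical process could call `find_bad_fx` and fail when the returned list is not empty, so that the rate is corrected before accounting runs on it.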
Batch framework from a software development life cycle (SDLC) perspective
Now that we have seen the different components of the batch framework, let’s see how to incorporate these features into the software development life cycle (both waterfall and agile).
The waterfall methodology is the traditional process; it consists of requirement gathering, analysis, design, coding, testing, and production release.
As part of requirement gathering, it is essential to answer the questions below.
- Availability of the system/application to end users. This helps identify the window available for the batch to run. Based on this window, we can decide whether to run some jobs in parallel to gain time. When gathering requirements for an end-user report or an extract to a downstream system, identify the hard target time for delivering the report or extract. Also ask whether, in case of an error during the batch, the user or downstream system will accept an extract or report that has missing records.
- Schedule of the process/extract. For an extract or report, identify whether it is daily or monthly.
Analysis & Design:
- As developers/programmers perform analysis and design for the core requirement, it is necessary to design the job for failure and rerun scenarios. It is also essential to design processes that can run in parallel without causing locks on files or database tables. If a job cannot run in parallel with other processes, it should be made sequential. Most scheduling applications have a concept of resources allocated to the scheduler. If we decide to run all the jobs sequentially, the allocated resource count should be only 1 at any point in time; the scheduler will then wait for a job to complete before starting the next one.
- The test plan should cover batch stress testing, batch regression testing, and simulated failure scenarios.
- Batch stress testing
- Some upstream systems send a large number of transactions on certain days, so it is critical to perform stress testing. For example, the pay date for mortgage-backed securities is the 15th of the month, and a lot of transactions are expected on that day.
- Batch regression testing
- It is recommended to run the complete batch for at least a week in the test environment. If a system modification affects the monthly process, it is necessary to run the complete set of month-end scheduled jobs.
- Simulated failure scenarios
- Stress testing and batch regression testing cannot cover every failure scenario, so it is the project team's responsibility to simulate the failure scenarios specific to the batch and test them.
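The resource allocation idea from the design step (a resource count of 1 forces jobs to run one at a time) can be illustrated with a semaphore. This Python sketch is an analogy for how a scheduler resource pool behaves, not the API of any scheduling product; `run_with_resource_limit` is a hypothetical name.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_with_resource_limit(jobs: list[Callable[[], None]], resources: int) -> None:
    """Start all jobs, but let at most `resources` of them run at the same time."""
    slots = threading.Semaphore(resources)

    def guarded(job: Callable[[], None]) -> None:
        with slots:  # wait until a resource slot is free
            job()

    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [pool.submit(guarded, j) for j in jobs]
        for future in futures:
            future.result()  # re-raise any job failure
```

With `resources=1` the jobs never overlap, although their order is not guaranteed; a strict order still has to be expressed as job dependencies. Raising the count allows that many jobs in parallel, which is only safe for jobs that do not lock the same files or tables.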
SDLC – Agile methodology
The agile methodology follows an iterative development approach, so planning, development, prototyping, and other software development phases may occur more than once. Implementing the batch framework should therefore be part of the overall process. If possible, sprint planning should include a separate task for batch testing.
Teams often think that batch is just calling a process from a scheduling tool, but bad design and an inconsistent approach can make application support complicated. Implementing a batch framework creates consistency across applications.