
WLM determines batch job run destination ?

PostPosted: Tue Apr 11, 2017 4:03 pm
by ocjohnh
Hi, I have a layman's question that I need help with.

Assume we have jobs running in a sysplex that contains multiple images.

Jobs from the different images are put into the common JES queue for execution,
and I know WLM controls the initiator count,
but does WLM also determine which sysplex image a job executes on?

Re: WLM determines batch job run destination ?

PostPosted: Tue Apr 11, 2017 5:28 pm
by Robert Sample
The answer is different for JES2 and JES3. With JES2, the processor the job executes on is, essentially, random. Each processor attempts to schedule the next available job in the queue when an initiator becomes available. With JES3, processing is centrally controlled by the global machine and as initiators become available the global machine schedules the next job in the queue. WLM may start initiators on a system should the workload back up; which processor(s) get the additional initiators depends upon WLM.

In all cases, JES control statements (such as /*ROUTE or //*PROCESS or //*MAIN) can impact which processor a given job executes on.
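As a rough sketch, such statements might look like this (the node name NODEB and system name SY1 are made-up examples; actual names depend on the site's configuration):

```jcl
//* JES2: route execution to another node (hypothetical node NODEB)
//JOB2     JOB (ACCT),'ROUTE DEMO',CLASS=A
/*ROUTE XEQ NODEB
//STEP1    EXEC PGM=IEFBR14
//*
//* JES3: direct the job to a specific main (hypothetical system SY1)
//JOB3     JOB (ACCT),'MAIN DEMO',CLASS=A
//*MAIN SYSTEM=SY1
//STEP1    EXEC PGM=IEFBR14
```

A job would use one or the other depending on whether the site runs JES2 or JES3; see the JCL Reference for the full parameter lists.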

Re: WLM determines batch job run destination ?

PostPosted: Tue Apr 11, 2017 10:47 pm
by ocjohnh
Hi Robert,

Based on your reply, it seems to me that WLM doesn't determine which image a job runs on, does it?
I might not be able to distinguish between processors and sysplex images.

In our case, we have a sysplex containing two images that reside on two separate z/OS CPU boxes. So if our
jobs (coded only with SCHENV, but those scheduling environment resources are enabled on all images by default) are put into the common JES queue by a job scheduler, then you are saying that, to begin with, the jobs run randomly on either of the two images as long as both images have free initiators.

However, in the longer term, because WLM controls the initiator count on any particular image, won't WLM eventually have a say in the job run destination? When no more new WLM initiators are started on a particular image, any new jobs will be forced to run on the other image. Am I right?

I am so confused, because I cannot find any IBM manual that talks about the job selection criteria.

Re: WLM determines batch job run destination ?

PostPosted: Tue Apr 11, 2017 11:51 pm
by Robert Sample
I might not be able to distinguish the difference between processors and sysplex images.
I think one issue is terminology. A processor is a machine and may have anywhere from 1 to 141 LPARS on it. A sysplex is a set of LPARS that communicate via cross-system coupling facility (XCF). It is easy to distinguish between processors and sysplex images -- a processor is a physical box while a sysplex image is software running on one (or more) boxes.

WLM has little, if anything, to do with where jobs execute. WLM is used to manage workloads -- as in relative performance of address spaces against each other and against the defined goals -- not to execute jobs. From the z/OS Basics manual at https://www.ibm.com/support/knowledgece ... hjeses.htm :
The job entry subsystem (JES) helps z/OS® receive jobs, schedule them for processing, and determine how job output is processed.

Batch processing is the most fundamental function of z/OS. Many batch jobs are run in parallel and JCL is used to control the operation of each job. Correct use of JCL parameters (especially the DISP parameter in DD statements) allows parallel, asynchronous execution of jobs that may need access to the same data sets.

An initiator is a system program that processes JCL, sets up the necessary environment in an address space, and runs a batch job in the same address space. Multiple initiators (each in an address space) permit the parallel execution of batch jobs.
One goal of an operating system is to process work while making the best use of system resources. To achieve this goal, resource management is needed during key phases to do the following:

Before job processing, reserve input and output resources for jobs.
During job processing, manage spooled SYSIN and SYSOUT data.
After job processing, free all resources used by the completed jobs, making the resources available to other jobs.

z/OS shares with the job entry subsystem (JES) the management of jobs and resources. JES receives jobs into the system, schedules them for processing by z/OS, and controls their output processing. JES is the manager of the jobs waiting in a queue. It manages the priority of the jobs and their associated input data and output results. The initiator uses the statements in the JCL records to specify the resources required of each individual job after it is released (dispatched) by JES.

IBM® provides two kinds of job entry subsystems: JES2 and JES3. In many cases, JES2 and JES3 perform similar functions.

During the life of a job, both JES and the z/OS base control program control different phases of the overall processing. Jobs are managed in queues: Jobs that are waiting to run (conversion queue), currently running (execution queue), waiting for their output to be produced (output queue), having their output produced (hard-copy queue), and waiting to be purged from the system (purge queue).
JES and z/OS do job scheduling; WLM does not.
because WLM can control the initiators count in any particular image
This depends upon the site -- the systems programmers may have JES controlling the initiators (which means the operators and system programmers in practical terms) or WLM controlling the initiators. Note that when WLM is controlling the initiators, it is managing the initiator count to improve system throughput; it is not doing anything to actually handle the jobs.
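In JES2, for example, this choice is made per job class in the initialization parameters; a minimal sketch (the class names are illustrative):

```jcl
/* JES2 initialization stream: WLM manages initiators for class A */
JOBCLASS(A) MODE=WLM
/* JES manages class B; operators start and stop those initiators */
JOBCLASS(B) MODE=JES
```

Which classes run in which mode is a site decision made by the systems programmers.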

then you are saying at the first place the jobs are randomly run in any of the two images as long as those two images also have free initiators.
If you are saying "image" to mean "LPAR" then yes. If you are saying "image" to mean "jesplex" or "sysplex" then largely no. And note that jobs are transient -- so initiators are ALWAYS becoming available as jobs complete. One of the tuning tasks system programmers perform is to set the number of address spaces for each LPAR; this is what sets the limit on initiators and system programmers do NOT want to run out of "free initiators" as you call them. Systems do not respond well to a lack of initiators since they are used for started tasks, TSO users, batch jobs, and OMVS processes.

To some degree, you will remain forever confused. The only ones who know exactly how JES and z/OS handle jobs are those working for IBM in the development groups for JES and z/OS. Unless you go to work for IBM and get into one of these groups, all you can do is read the IBM manuals and hope they provide enough illumination upon the topic(s) you're confused about.

Re: WLM determines batch job run destination ?

PostPosted: Wed Apr 12, 2017 1:00 am
by ocjohnh
Hi Robert,

first of all, thank you so much for your help.

Yes, by "image" I actually meant LPAR. So to put my question in short:
if WLM is not good at controlling which LPAR a job executes on, are there any practical ways you know of to evenly distribute the batch jobs (i.e. more or less randomly) between the two LPARs of a single sysplex in our case?

We are not trying to distribute the jobs in a controlled manner. We don't want to assign specific SYSAFF values to specific jobs,
and we don't want to make jobs always run on a particular LPAR using the SCHENV approach.
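For reference, the two mechanisms mentioned would look roughly like this in JCL (the member name LPARB and scheduling environment DB2PROD are made-up examples):

```jcl
//* Explicit affinity: job may run only on JES2 MAS member LPARB
//AFFJOB   JOB (ACCT),'SYSAFF DEMO',CLASS=A
/*JOBPARM SYSAFF=LPARB
//STEP1    EXEC PGM=IEFBR14
//*
//* Scheduling environment: job runs only where DB2PROD is available
//SCHJOB   JOB (ACCT),'SCHENV DEMO',CLASS=A,SCHENV=DB2PROD
//STEP1    EXEC PGM=IEFBR14
```

Removing both restrictions leaves the job eligible for any member of the MAS, which is the behavior being asked about.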

Re: WLM determines batch job run destination ?

PostPosted: Wed Apr 12, 2017 1:31 am
by Robert Sample
are there any practical ways you know of to evenly distribute the batch jobs (i.e. more or less randomly) between the two LPARs of a single sysplex in our case?
I am unclear about why you want to do this. Are you experiencing problems with workload distribution? There are some fairly high-level ways to manage the distribution of work (LPAR weighting, for example) but unless you're seeing some really extreme discrepancies between the LPARS you're usually better off not doing anything. Our production LPAR runs on 2 processors and it's not unusual to see a 5 or 10% utilization difference between the processors.

What are the WLM workload PI (performance index) numbers? If they're all running 1.0 or less, then no matter how skewed the job distribution the work is being completed -- in which case you're worrying about nothing. If one service class has a PI of 4 or more while the rest are less than 1, then you may have something to look at (but if the service class with the PI of 4 is running something like a job scheduler, then the PI of 4 may be normal since there's probably not enough work for WLM to properly assess that service class). If all the PI are 2 or higher, then you may need to review your WLM policy (or get a bigger machine, or both). Remember WLM is looking at the workload 4 times a second so you've got to have fairly consistent work to get accurate PI numbers; spotty intermittent work where nothing happens for a bit and then a lot happens for a short time is probably the worst workload for WLM since it will repeatedly under / over allocate resources to that service class.
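As a rough worked example of how PI is derived (the numbers are made up; a PI of 1.0 means the goal is exactly met, below 1.0 means it is over-achieved):

```
Response-time goal:  PI = actual response time / goal response time
                     goal 0.5 s, actual 0.4 s  ->  PI = 0.4 / 0.5 = 0.8  (goal met)

Velocity goal:       PI = goal velocity / actual velocity
                     goal 40, actual 10        ->  PI = 40 / 10 = 4.0    (goal badly missed)
```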

Re: WLM determines batch job run destination ?

PostPosted: Wed Apr 12, 2017 2:02 am
by ocjohnh
Hi Robert,

Yes, we are experiencing problems with workload distribution. At the moment the majority of the batch jobs in our sysplex run under JES initiators with SYSAFF pointing at one particular LPAR, which leaves the batch workload out of balance between the two LPARs in the sysplex. So we figured that using WLM initiators and getting rid of SYSAFF could more or less correct this imbalance. Yet from what I have learned here, it might not be as simple as that. I can go to our systems support people to learn about the WLM PI numbers if required.

Re: WLM determines batch job run destination ?

PostPosted: Wed Apr 12, 2017 2:28 am
by Robert Sample
I'd say start by talking to your site support group. The imbalance in batch jobs may be deliberate -- LPAR A, for example, could be running all the CICS regions so that balances LPAR B which runs most of the batch jobs. If you start changing things without understanding the overall picture of the LPAR workloads, you could drastically impact performance without realizing it.

One of the system programmer guidelines that is important to follow: don't change two things at once; only change one thing at a time. If you change from JES to WLM initiators while removing the SYSAFF, for example, and you have performance issues then which change caused the performance problem? Answer: you don't know since you changed two things at once.

And I repeat: are you sure you have a problem? Merely having an imbalance in batch jobs does not indicate any kind of problem. Look at the SMF type 70 and 72 records to see how WLM says the work is doing before changing anything. If your site has MXG or another SMF tool (or RMF III), review the WLM reports.