I am totally not agreeing with your comments
Well, OK . . . You don't have to agree.
However, most of us have worked on systems that are most likely far larger than yours. It would be well for you to consider how we have successfully done this kind of thing without getting to where you are now.
For every release we need to process thru the job(100 steps) in the test environments
Yup, it would be a very bad thing if each release was not completely tested. But there is NO business (or technical) reason to run 100-step jobs.
Indeed, if something smaller than a complete release is to be promoted (problem fix, legal requirement, etc.) it would be far more efficient to test/promote only what was changed. If "everything" is being released, by all means promote all of the individual processes.
And, yes, most of us "old guys" were implementing "monster" batch jobs when there was no scheduling software. All of the restarts had to be accommodated either in the job itself and/or with a special restart process. Which is why any of us who can, use the scheduling systems whenever possible. This also reduces/eliminates the calls in the middle of the night because the "monthly" went down.
That being said, my main concern is still the number of "things" that "go bump in the night" and require all of this crisis intervention.