Discussion: SORTOUT different for same i/p over diff runs





Post by Aki88 » Tue Mar 25, 2014 11:30 pm

Hello,

Was just going through my old mails when I came across an issue (more aptly, an observation) we'd run into while tweaking the DFSORT installation defaults for optimum memory usage during the batch window, for a particular program using an internal sort at one of the sites; this was a year or so back, though. Posting here out of curiosity; this might just help straighten those curves in my understanding of DFSORT. ;) :)
We had a batch COBOL program that would take a KSDS as input and perform an internal sort on a set of fields from the file; based on the sorted data, it would segregate the records and then build a report. The KSDS key was the first 22 characters. The record length was around 300, all records being fixed-length. The output report had a record length of 150 (the o/p being a sequential file). The sort was on the KSDS key plus a few other fields at positions after the 100th column (I don't remember the exact specs, my apologies). The report could have duplicate keys; by key, I mean the entire chunk of data on which the SORT has been performed. For example, with:
SORT FIELDS=(1,36,CH,A)
the o/p file will have the first 36 bytes sorted after the rearrangement done by the internal sort [keeping in mind that the actual sort was performed on different positions, and that the data was rearranged at the time of writing the report].

The number of records in the i/p file would usually vary from 1 million to 6 million on peak days.

Coming to the observation now: we noticed on several occasions that whenever the i/p was in the range of 1 million to 2.5 million records, the sorted records would appear in the same order no matter how many times the program was rerun (reruns happen when the business requires the reports to be regenerated). When the i/p record count was large, however, the records were still sorted on the SORT key, but the remaining data (read: the data in the remaining columns) would be arranged differently on every run. Please refer to the sample o/p below; assuming the sort has been run on the first 6 columns, it shows the variation in record arrangement. (This is only a sample scenario, as I do not have the actual production file with me right now; the keys in the i/p have been duplicated with the rearrangement scenario in mind.)

I/P
123456FGDABCE
123456ABCDEFG
123456EFGDABC
123456CDEFGAB

Outputs:
----------
RUN1
123456ABCDEFG
123456CDEFGAB
123456EFGDABC
123456FGDABCE

RUN2
123456CDEFGAB
123456ABCDEFG
123456EFGDABC
123456FGDABCE

RUN3
123456ABCDEFG
123456EFGDABC
123456CDEFGAB
123456FGDABCE


Notice the change in record arrangement after the keys on which the records were sorted; this would happen only when the number of records in the i/p file was fairly large, on the order of 3 million or more.
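
A quick way to see that none of these runs is wrong: each output holds the same records as the i/p and is in ascending order on the key. The Python sketch below checks exactly that for the sample above. It assumes the key is columns 1 to 6, as stated in the post; the helper name `is_valid_sort` is made up for illustration and is not part of DFSORT.

```python
# Sample records from the post; the sort key is assumed to be columns 1-6.
KEY_LEN = 6

INPUT = ["123456FGDABCE", "123456ABCDEFG", "123456EFGDABC", "123456CDEFGAB"]
RUN1 = ["123456ABCDEFG", "123456CDEFGAB", "123456EFGDABC", "123456FGDABCE"]
RUN2 = ["123456CDEFGAB", "123456ABCDEFG", "123456EFGDABC", "123456FGDABCE"]
RUN3 = ["123456ABCDEFG", "123456EFGDABC", "123456CDEFGAB", "123456FGDABCE"]


def is_valid_sort(source, output, key_len=KEY_LEN):
    """True if output holds the same records as source, ascending by key only."""
    keys = [rec[:key_len] for rec in output]
    in_key_order = all(a <= b for a, b in zip(keys, keys[1:]))
    same_records = sorted(source) == sorted(output)
    return in_key_order and same_records


# Every run satisfies the only ordering the sort key asks for.
for name, run in (("RUN1", RUN1), ("RUN2", RUN2), ("RUN3", RUN3)):
    print(name, is_valid_sort(INPUT, run))
```

All three runs pass the check, because the key says nothing about the order of records whose first 6 columns are equal.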

Curious as to why would SORT result in different record arrangements for same i/p, same sorting parameters, if SORT is 'in the end' running the same algorithm for sorting??

Unsure if anyone else has had similar observations, but if you have seen this happen before and can share some insight, it would help my understanding.

Cheers! :)
Aki88
 

Re: Discussion: SORTOUT different for same i/p over diff run

Post by dick scherrer » Tue Mar 25, 2014 11:53 pm

Hello,

You need to specify EQUALS in your sort control. What you are seeing is what can happen when only the "key" is used to sort. There is no guarantee the file will be in order by more than the "sort key".

When you specify EQUALS, records with the same "key" are kept in the order they were processed.
Hope this helps,
d.sch.

Re: Discussion: SORTOUT different for same i/p over diff run

Post by Aki88 » Wed Mar 26, 2014 12:07 am

Hey Dick,

Thank you; I agree with you on that one, and we'd usually use this whenever we write a control card outside COBOL code :P .
The application program in question is part of the global installation, which is upgraded only during Compliance releases (and then only if the product designers feel the need for an enhancement); we can definitely get business approval and tweak the parm, though. I was wondering why this would happen, personally speaking. SORT being such a robust product (a big fan here, thumbs up to Frank and team for such an awesome product), why would a record be rearranged in sequence of appearance, what would make this happen; remembering that our source code hasn't changed between the runs ideally o/p should be same.
Please correct me if I have it wrong there.

Thanks.
Aki88
 

Re: Discussion: SORTOUT different for same i/p over diff run

Post by dick scherrer » Wed Mar 26, 2014 12:29 am

Aki88 wrote: why would a record be rearranged in sequence of appearance, what would make this happen; remembering that our source code hasn't changed between the runs ideally o/p should be same.
Please correct me if I have it wrong there.

Yup, you have a misunderstanding.

This (most likely) has nothing to do with source code. If it is ideal to retain the incoming sequence, EQUALS should be specified.

What you see is the normal result of a sort with duplicate keys when EQUALS is not in effect. Using EQUALS adds an invisible sequence number to each record, which is sorted along with the key, preserving the original order of records with the same key.
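
The sequence-number idea can be sketched in a few lines of Python. This is an illustration of the concept only, not DFSORT's implementation: the function names, the 6-byte key, and the shuffle used to mimic an arbitrary processing order are all assumptions made for the example.

```python
import random

KEY_LEN = 6  # assumed key length, matching the sample in the first post

INPUT = ["123456FGDABCE", "123456ABCDEFG", "123456EFGDABC", "123456CDEFGAB"]


def sort_with_equals(records, key_len=KEY_LEN, seed=None):
    """Sort on the key, breaking ties by arrival order (the EQUALS idea)."""
    # Tag each record with its arrival sequence number.
    tagged = list(enumerate(records))
    # Scramble to mimic a sort that handles records in an arbitrary order.
    random.Random(seed).shuffle(tagged)
    # The sequence number acts as the lowest-order part of the comparison.
    tagged.sort(key=lambda item: (item[1][:key_len], item[0]))
    # Drop the tag again: it never shows up in the output.
    return [rec for _, rec in tagged]


def sort_without_equals(records, key_len=KEY_LEN, seed=None):
    """Sort on the key only; order among equal keys is whatever falls out."""
    scrambled = list(records)
    random.Random(seed).shuffle(scrambled)
    scrambled.sort(key=lambda rec: rec[:key_len])
    return scrambled


for seed in range(3):
    print(seed, sort_with_equals(INPUT, seed=seed))
    print(seed, sort_without_equals(INPUT, seed=seed))
```

With the sample i/p from the first post, where every record shares the key 123456, `sort_with_equals` gives back the records in their original order for any seed, while `sort_without_equals` can give a different arrangement from one seed to the next.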
Hope this helps,
d.sch.

Re: Discussion: SORTOUT different for same i/p over diff run

Post by skolusu » Wed Mar 26, 2014 12:43 am

Aki88 wrote: Curious as to why would SORT result in different record arrangements for same i/p, same sorting parameters, if SORT is 'in the end' running the same algorithm for sorting??


Well, let me try to explain with an example.

You are given a bin of legos and asked to sort and line them up by color in the fastest time possible. You quickly run through the legos and arrange them by color. Now imagine you are asked to line up the like-colored legos in the order in which they were dumped into the bin. Unless you have a way of identifying them, you cannot line them up in the same sequence in which they were dumped into the bin.

Now do you understand why the records differ even though the algorithm is to arrange the legos based on color? Records are read into storage/memory/disk and then sorting is performed.

What do you do to preserve the order in which the legos are dropped in? You need to identify them uniquely, so that you can use that identifier to arrange them in order. That is where the parm EQUALS comes into the picture. So what does EQUALS do? It adds a sequence number to each lego as it is dropped in, which is later used to arrange the legos in the order they were dropped.
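
One way to picture why the amount of data can matter is a toy model in Python where the i/p is split into runs that are sorted separately and then merged. This is purely a hypothetical model, not how DFSORT works internally: `toy_sort`, the `run_size` parameter, and the choice to merge the runs newest-first are assumptions picked only to make the tie order depend on how the input happened to be split.

```python
import heapq

KEY_LEN = 6  # assumed key length, as in the sample from the first post

INPUT = ["123456FGDABCE", "123456ABCDEFG", "123456EFGDABC", "123456CDEFGAB"]


def toy_sort(records, run_size, equals=False, key_len=KEY_LEN):
    """Toy model only: sort fixed-size runs, then merge them.

    Runs are merged newest-first purely so that the order of equal-keyed
    records depends on how the input was split into runs.
    """
    tagged = list(enumerate(records))  # (arrival number, record)

    def compare_key(item):
        seq, rec = item
        # With equals=True the arrival number breaks ties between equal keys.
        return (rec[:key_len], seq) if equals else rec[:key_len]

    runs = [
        sorted(tagged[start:start + run_size], key=compare_key)
        for start in range(0, len(tagged), run_size)
    ]
    merged = heapq.merge(*reversed(runs), key=compare_key)
    return [rec for _, rec in merged]


for size in (4, 2):
    print(size, toy_sort(INPUT, size), toy_sort(INPUT, size, equals=True))
```

In this model, a small i/p that fits in one run comes out in arrival order even without the sequence number, while the same records split across two runs come out in a different order; with `equals=True` the result is the same for every run size.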
Kolusu - DFSORT Development Team (IBM)
DFSORT is on the Web at:
www.ibm.com/storage/dfsort

Re: Discussion: SORTOUT different for same i/p over diff run

Post by Aki88 » Wed Mar 26, 2014 12:50 am

Thanks 'gain Dick.
Will raise this with the product development team to see if they can get this done in the coming releases, and will keep in mind (a refresher note to self :) ) to code EQUALS wherever the sequence is to be retained.

Though it really did seem a tad weird to me that the records are correctly sequenced (all records in ascending/descending order across all columns, depending on the SORT card) for a file with fewer records, but not for a relatively larger file; wondering what might be causing it, while agreeing that EQUALS is a solution to this problem.

Thanks.
Aki88
 

Re: Discussion: SORTOUT different for same i/p over diff run

Post by Aki88 » Wed Mar 26, 2014 1:03 am

skolusu wrote:
Aki88 wrote: Curious as to why would SORT result in different record arrangements for same i/p, same sorting parameters, if SORT is 'in the end' running the same algorithm for sorting??


You are given a bin of legos and asked to sort and line them up by color in the fastest time possible. You quickly run through the legos and arrange them by color. Now imagine you are asked to line up the like-colored legos in the order in which they were dumped into the bin. Unless you have a way of identifying them, you cannot line them up in the same sequence in which they were dumped into the bin.


Hello Kolusu,

Thank you for the explanation; I'd missed this post of yours, hence my last reply (addressed to Dick). Yup, that is how I'd summarized it when we first saw this happening: how would SORT know which records came in what order, aside from the rules already coded in the SORT card? That explains the shuffling in the sequence of records outside the sort keys. I was still curious why this would happen, given that SORT would process the records one by one, loading them into the storage area and arranging them in order of the SORT keys; the catch being that the order in which the records were read from the i/p isn't retained unless EQUALS is in effect.

Thanks.
Aki88
 

