  OPTION SKIPREC=60000
  OMIT COND=(5,14,CH,EQ,C'00000000000000',OR,
             5,14,CH,EQ,C'99999999999999')
  INREC IFTHEN=(WHEN=(79,4,CH,EQ,C'XXXX'),
                PARSE=(%01=(STARTAFT=C'<idrequest>',
                            ENDBEFR=C'</idrequest>',FIXLEN=8)),
                BUILD=(1:1,4,5:%01,13:C'¦',14:5,4000)),
        IFTHEN=(WHEN=(79,4,CH,EQ,C'YYYY'),
                PARSE=(%02=(STARTAFT=C'?><',FIXLEN=4)),
                BUILD=(1:1,4,5:259,8,13:C'¦',14:5,4000))
  OUTFIL REMOVECC,VTOF,BUILD=(1:5,4013),VLFILL=C' '
  SORT FIELDS=(5,8,CH,A)
  SUM FIELDS=NONE
We have a VB input file (relevant attributes are given below) containing over 60 million records in XML layout; the dataset is a combined, unsorted pool of more than 50 datasets from different sources. The records are a mix of fixed-layout and variable-layout records. In a fixed-layout record, data sits at specific positions defined by record-layout copybooks; in a variable-layout record, by contrast, the data is free-form. Variable-layout record data is built by reference modification and padding of XML tags.
Organization . . . : PS
Record format . . . : VB
Record length . . . : 30018
Block size . . . . : 30022
Both record types in this dataset carry an identifier: at 259,8 for the fixed layout, and immediately preceding the </idrequest> tag for the variable layout. The aim is to extract the records with unique identifiers from the input file, i.e. one record per identifier.
Current approach:
a. Since the input is variable length, temporarily extend each record on the left.
b. Place the identifier in this extension field.
c. Once INREC processing is complete, SORT the data and apply SUM FIELDS=NONE on the extended key.
Problems with this approach:
a. I feel that this code can be made better, much better; it is very clunky at the moment.
b. If the record count increases by even a million, the sort goes haywire.
c. To work around the problem in point b, a two-step approach is taken: the first step builds the INREC data, and a second step SORTs and applies SUM FIELDS=NONE on the key position.
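For reference, the two-step variant from point c can be sketched roughly as below. This is only a sketch under assumptions, not the exact production cards: it assumes the OUTFIL VTOF conversion stays in the first step, so the second step reads fixed-length records with the key at 1,8. If the first step kept VB output instead, the key in the second step would be at 5,8.

```
* Step 1 (sketch): copy pass only; build the 8-byte key, write FB output
  OPTION COPY,SKIPREC=60000
  OMIT COND=(5,14,CH,EQ,C'00000000000000',OR,
             5,14,CH,EQ,C'99999999999999')
  INREC IFTHEN=(WHEN=(79,4,CH,EQ,C'XXXX'),
                PARSE=(%01=(STARTAFT=C'<idrequest>',
                            ENDBEFR=C'</idrequest>',FIXLEN=8)),
                BUILD=(1:1,4,5:%01,13:C'¦',14:5,4000)),
        IFTHEN=(WHEN=(79,4,CH,EQ,C'YYYY'),
                BUILD=(1:1,4,5:259,8,13:C'¦',14:5,4000))
  OUTFIL REMOVECC,VTOF,BUILD=(1:5,4013),VLFILL=C' '

* Step 2 (sketch): input is the FB output of step 1 (assumed key at 1,8)
  SORT FIELDS=(1,8,CH,A)
  SUM FIELDS=NONE
```

The split keeps the parsing and record building out of the sort pass, so the second step only has to sort and de-duplicate on a fixed 8-byte key.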
Any suggestions on trimming the above SORT card are much appreciated.
PS: The SKIPREC is there to skip a junk chunk of data and can be ignored. 'XXXX' and 'YYYY' (at 79,4) are identifiers that distinguish the two record types: 'XXXX' marks variable records, 'YYYY' marks fixed records.
Thank you.