
Sort a file with duplicate keys into multiple unique files

PostPosted: Sat Sep 05, 2009 2:50 am
by cbrio
I have a file with multiple records per key value. I want to split that file into multiple files, each of which contains a single copy of the duplicated records, so that each file contains only unique keys. One problem is that the number of records with the same key varies, so I don't know how many output files are needed.


input
key: 1-5, all the rest of the data: 6ff

11111
11111
11111
22222
22222
33333
33333
33333
33333
44444
44444
55555
55555
55555

file 1
11111
22222
33333
44444
55555


file 2
11111
22222
33333
44444
55555

file 3
11111
33333
55555

etc., all the way to file N

Re: Sort a file with duplicate keys into multiple unique files

PostPosted: Sat Sep 05, 2009 3:35 am
by dick scherrer
Hello,

What if there are 978 duplicates for some key value?

Re: Sort a file with duplicate keys into multiple unique files

PostPosted: Sat Sep 05, 2009 5:28 am
by Frank Yaeger
"I don't know how many output files are needed"


Do you know the maximum number of output files that could be needed? Is it something like 10, 100, 1000 or 10000?

What would a DD statement for one of the output data sets look like?

Are you allowed to submit a job to the internal reader?

What is the RECFM and LRECL of the input file?

"all the rest of the data: 6ff"


I have no idea what you mean by this.

Re: Sort a file with duplicate keys into multiple unique files

PostPosted: Wed Sep 09, 2009 10:30 pm
by cbrio
By 6ff, I meant that the rest of the data starts in column 6 and continues to the end of the record.

I have numerous files with this issue; here is an example of the layout of one of them (field name, start column, end column, format):
KEY 1 17 A17
SEQNO 18 22 A5
BOROUGH 23 24 A2
ZIP 25 29 A5
ADDRESS 30 59 A30
TRACT00 60 72 A13
LPDATE_1 73 80 A8
DDATE_P 81 88 A8
DDATE_S1 89 96 A8
TM2NXTLP 97 109 A13
RES_UNIT 110 122 A13
ADD_NUM 123 135 A13
ADD_STRT 136 160 A25
LP_OUTCM 161 170 A10
BORO 171 171 A1
SORTDATE 172 177 A6
ALTKEY 178 220 A43

I am working with RECFM=FB; LRECL in this case is 220.

Here are a couple of actual records from this file (sorry, they wrap here because they are so long):

100031IRVIN 63853MN100031 IRVING PLACE 02/20/0312/23/9707/05/05 1 1 IRVING PLACE 5-Arms le M030220100031IRVING PLACE
100031IRVIN 63854MN100031 IRVING PLACE 03/21/0311/15/89 363 1 1 IRVING PLACE 6-Subsequ M030321100031IRVING PLACE


I do not know the maximum number of output files (it will vary depending on the data), but I believe the maximum would be on the order of tens, not hundreds or thousands (e.g., perhaps 15 repetitions of a key). This is just a guess; I have no way of knowing beforehand.

Here is a sample DD statement (it does not include RECFM and LRECL because those will replicate the input file: FB and LRECL=220 in this case):

//UNIQ1    DD  DSN=$ZJI.ADDRESS.UNIQKEY1.Y0809,
//             DISP=(NEW,CATLG),UNIT=SYSDA,
//             SPACE=(TRK,(125,10),RLSE)


I believe that I am not allowed to submit a job to the internal reader.

Thank you for your help.

Re: Sort a file with duplicate keys into multiple unique files

PostPosted: Wed Sep 09, 2009 11:04 pm
by Frank Yaeger
You can use a DFSORT job like the following to do what you want. You'll need OUT01-OUTnn DD statements for the maximum number of possible output files (nn).

//S1    EXEC  PGM=SORT
//SYSOUT    DD  SYSOUT=*
//SORTIN DD DSN=...  input file (FB/220)
//OUT01 DD DSN=...   output file 01 (FB/220)
//OUT02 DD DSN=...   output file 02 (FB/220)
...
//OUTnn DD DSN=...   output file nn (FB/220)
//SYSIN    DD    *
  OPTION COPY
  INREC OVERLAY=(221:SEQNUM,2,ZD,RESTART=(1,5))
  OUTFIL FNAMES=OUT01,INCLUDE=(221,2,ZD,EQ,1),BUILD=(1,220)
  OUTFIL FNAMES=OUT02,INCLUDE=(221,2,ZD,EQ,2),BUILD=(1,220)
  ...
  OUTFIL FNAMES=OUTnn,INCLUDE=(221,2,ZD,EQ,nn),BUILD=(1,220)
/*

Re: Sort a file with duplicate keys into multiple unique files

PostPosted: Thu Sep 10, 2009 1:36 am
by cbrio
Thank you for this.

Will you please explain what RESTART=(1,5) is doing?

Re: Sort a file with duplicate keys into multiple unique files

PostPosted: Thu Sep 10, 2009 3:02 am
by Frank Yaeger
It restarts the sequence number at 1 each time the value in positions 1-5 changes. So the INREC records would look like this:

1            221
11111  ...   01
11111        02
11111        03
22222        01
22222        02
33333        01
33333        02
33333        03
33333        04
44444        01
44444        02
55555        01
55555        02
55555        03

Re: Sort a file with duplicate keys into multiple unique files

PostPosted: Thu Sep 10, 2009 3:16 am
by cbrio
Thank you for your help. I really appreciate it.
That helps me know how to adapt the code for the other files that I need to sort.