Sort a file with duplicate keys into multiple unique files
Posted:
Sat Sep 05, 2009 2:50 am
by cbrio
I have a file with multiple records per key value. I want to split that file into multiple files, each of which contains a single copy of the duplicated records, so that each file contains only unique keys. One problem is that the number of records with the same key varies, so I don't know in advance how many output files are needed.
input
key: columns 1-5, all the rest of the data: 6ff
11111
11111
11111
22222
22222
33333
33333
33333
33333
44444
44444
55555
55555
55555
file 1
11111
22222
33333
44444
55555
file 2
11111
22222
33333
44444
55555
file 3
11111
33333
55555
etc....all the way to file N
Re: Sort a file with duplicate keys into multiple unique files
Posted:
Sat Sep 05, 2009 3:35 am
by dick scherrer
Hello,
What if there are 978 duplicates for some key value?
Re: Sort a file with duplicate keys into multiple unique files
Posted:
Sat Sep 05, 2009 5:28 am
by Frank Yaeger
"I don't know how many output files are needed"
Do you know the maximum number of output files that could be needed? Is it something like 10, 100, 1000 or 10000?
What would a DD statement for one of the output data sets look like?
Are you allowed to submit a job to the internal reader?
What is the RECFM and LRECL of the input file?
"all the rest of the data: 6ff"
I have no idea what you mean by this.
Re: Sort a file with duplicate keys into multiple unique files
Posted:
Wed Sep 09, 2009 10:30 pm
by cbrio
By 6ff, I meant that all of the rest of the data starts in column 6 and continues to the end of the record.
I have numerous files with this issue, here is an example of the file layout of one of them:
FIELD     START  END  FORMAT
KEY           1   17  A17
SEQNO        18   22  A5
BOROUGH      23   24  A2
ZIP          25   29  A5
ADDRESS      30   59  A30
TRACT00      60   72  A13
LPDATE_1     73   80  A8
DDATE_P      81   88  A8
DDATE_S1     89   96  A8
TM2NXTLP     97  109  A13
RES_UNIT    110  122  A13
ADD_NUM     123  135  A13
ADD_STRT    136  160  A25
LP_OUTCM    161  170  A10
BORO        171  171  A1
SORTDATE    172  177  A6
ALTKEY      178  220  A43
I am working with RECFM=FB; the LRECL in this case is 220.
Here are a couple of actual records from this file (sorry, they are wrapping here because they are so long):
100031IRVIN 63853MN100031 IRVING PLACE 02/20/0312/23/9707/05/05 1 1 IRVING PLACE 5-Arms le M030220100031IRVING PLACE
100031IRVIN 63854MN100031 IRVING PLACE 03/21/0311/15/89 363 1 1 IRVING PLACE 6-Subsequ M030321100031IRVING PLACE
I do not know the maximum number of output files (it will vary depending on the data), but I believe the maximum would be on the order of tens, not hundreds or thousands (e.g., perhaps 15 repetitions of a key). This is just a guess; I have no way of knowing beforehand.
Here is a sample DD statement (it does not include RECFM and LRECL because those will replicate the input file: FB and LRECL=220 in this case):
UNIQ1 DD DSN=$ZJI.ADDRESS.UNIQKEY1.Y0809,
DISP=(NEW,CATLG),UNIT=SYSDA,
SPACE=(TRK,(125,10),RLSE)
I believe that I am not allowed to submit a job to the internal reader.
Thank you for your help.
Re: Sort a file with duplicate keys into multiple unique files
Posted:
Wed Sep 09, 2009 11:04 pm
by Frank Yaeger
You can use a DFSORT job like the following to do what you want. You'll need OUT01-OUTnn DD statements for the maximum number of possible output files (nn).
//S1 EXEC PGM=SORT
//SYSOUT DD SYSOUT=*
//SORTIN DD DSN=... input file (FB/220)
//OUT01 DD DSN=... output file 01 (FB/220)
//OUT02 DD DSN=... output file 02 (FB/220)
...
//OUTnn DD DSN=... output file nn (FB/220)
//SYSIN DD *
OPTION COPY
INREC OVERLAY=(221:SEQNUM,2,ZD,RESTART=(1,5))
OUTFIL FNAMES=OUT01,INCLUDE=(221,2,ZD,EQ,1),BUILD=(1,220)
OUTFIL FNAMES=OUT02,INCLUDE=(221,2,ZD,EQ,2),BUILD=(1,220)
...
OUTFIL FNAMES=OUTnn,INCLUDE=(221,2,ZD,EQ,nn),BUILD=(1,220)
/*
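For anyone who wants to see the logic outside of DFSORT, here is a minimal Python sketch of the same idea: number each record's occurrence within its key, then route the nth occurrence of every key to output file n. The function name and sample records are hypothetical; the key is assumed to occupy the first 5 bytes, as in the example above. (Because this version counts occurrences in a dictionary, it does not strictly require the input to be grouped by key, whereas the DFSORT RESTART technique does.)

```python
from collections import defaultdict

def split_unique_keys(records, key_len=5):
    """Distribute records so that each output list holds at most one
    record per key: outputs[0] gets the 1st occurrence of every key,
    outputs[1] the 2nd occurrence, and so on."""
    seen = defaultdict(int)   # occurrence count per key (like SEQNUM RESTART)
    outputs = []              # outputs[n] collects the (n+1)th occurrences
    for rec in records:
        key = rec[:key_len]
        seen[key] += 1
        n = seen[key] - 1     # zero-based occurrence number
        if n == len(outputs): # first time we need output file n+1
            outputs.append([])
        outputs[n].append(rec)
    return outputs

records = ["11111 a", "11111 b", "22222 c", "22222 d", "22222 e", "33333 f"]
files = split_unique_keys(records)
# files[0] holds one record per key; files[1] the second occurrences; etc.
```

Note that, unlike the DFSORT job, this sketch creates exactly as many output lists as the largest duplicate count requires, so you would not need to guess the maximum number of files in advance.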
Re: Sort a file with duplicate keys into multiple unique files
Posted:
Thu Sep 10, 2009 1:36 am
by cbrio
Thank you for this.
Will you please explain what RESTART=(1,5) is doing?
Re: Sort a file with duplicate keys into multiple unique files
Posted:
Thu Sep 10, 2009 3:02 am
by Frank Yaeger
It restarts the sequence number at 1 each time the value in positions 1-5 changes. So the INREC records would look like this:
1         221
11111 ... 01
11111 ... 02
11111 ... 03
22222 ... 01
22222 ... 02
33333 ... 01
33333 ... 02
33333 ... 03
33333 ... 04
44444 ... 01
44444 ... 02
55555 ... 01
55555 ... 02
55555 ... 03
Re: Sort a file with duplicate keys into multiple unique files
Posted:
Thu Sep 10, 2009 3:16 am
by cbrio
Thank you for your help. I really appreciate it.
That helps me know how to adapt the code for the other files that I need to sort.