Sort-join without disturbing the driver

Posted: Wed Jul 03, 2013 2:37 pm
by thegadgetfreak
Hi,

I have two files, each with a 45-byte key. The full key identifies a single record, but the records are grouped in lots, and the first 22 bytes of the key are enough to identify a lot.

Sample 45-byte keys:
AAA XXXXXXXXXYYYYYYYYY10120120910120910X12102
AAA XXXXXXXXXYYYYYYYYY10120120910120910X12103
AAA XXXXXXXXXYYYYYYYYY10120130110130110X12104
AAA XXXXXXXXXYYYYYYYYY10120130110130110X12105


I want to match the F1 records against F2 on the first 22 bytes only, not the full 45 bytes, to find out whether the lot exists in F2. A record may be missing from F2 while its lot is still there, so the match has to be on 22 bytes.

Then, using OUTFIL, I separate the records:
> if there is a match and a field is zero, or if there is no match and a date field is more than 5 days old.
> if there is a match and the field is non-zero, or if there is no match and the date field is no more than 5 days old.

Please find my sort card below:

  JOINKEYS FILES=F1,FIELDS=(1,22,A)               
  JOINKEYS FILES=F2,FIELDS=(1,22,A)     
**                                                                   
  JOIN UNPAIRED,F1                                                   
**                                                                   
  REFORMAT FIELDS=(?,                                               
                   F1:1,596)             
**                                                                   
  SORT FIELDS=(2,45,CH,A)                                           
  DUPKEYS FIELDS=NONE                                               
**                                                                   
  OUTFIL FILES=01,INCLUDE=((1,1,CH,EQ,C'B',AND,                     
                            108,8,ZD,EQ,0),OR,                       
                           (1,1,CH,EQ,C'1',AND,                     
                            35,6,Y2T,LT,Y'DATE1'-5)),               
    OUTREC=(2,596)                                                   
**                                                                   
  OUTFIL FILES=02,INCLUDE=((1,1,CH,EQ,C'B',AND,                     
                            108,8,ZD,NE,0),OR,                       
                           (1,1,CH,EQ,C'1',AND,       
                            35,6,Y2T,GE,Y'DATE1'-5)),
    OUTREC=(2,596)                                   
**                   



The problem I have is that the input (in both files) may go up to billions, yes billions, of records, and this card is very inefficient, though it works.

The join creates a huge number of dupes and then removes them. I want to know if there is a way around this.

Bottom line: I want all the F1 records, with their count unchanged, to be split across the outputs. The driver (file F1) records should not come out of the join with duplicates (on the 45 bytes), nor do I want any driver records to go missing. The input file F2 will also have dupes on the 22-byte key.

Could you please help me achieve this with Syncsort in the most efficient manner?

I have searched everywhere but have not found any clue.

Thanks in advance for your help.

Re: Sort-join without disturbing the driver

Posted: Wed Jul 03, 2013 10:16 pm
by dick scherrer
Hello,

Are the files already in the needed sequence?

One possibility might be to sort both files, removing the duplicates before the JOINKEYS.

Then the JOIN would not have to deal with duplicates.
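
For F2, something like the following as a separate step might do it. This is an untested sketch using the key position from your card, and it assumes that only the existence of a lot matters, so it does not matter which of the duplicate records is kept:

* Keep one record per 22-byte lot key
  SORT FIELDS=(1,22,CH,A)
  SUM FIELDS=NONE

F1 would need different treatment, since removing duplicates on 22 bytes there would drop driver records.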

Re: Sort-join without disturbing the driver

Posted: Thu Jul 04, 2013 5:07 am
by BillyBoyo
How often does this run? What is the answer to Dick's question?

You are doing three SORTs; if you do none, it will be considerably faster. There are other things that could be improved in what you already have.

There is likely a huge saving of resources possible.

Why doesn't your company/client pay a consultant with good knowledge of SORT to do this? This website is a "Beginners and Students" site, not a "free consulting" site, and I don't think you are a beginner or a student.

For a consultancy, you'd need to provide much fuller information, including complete information about both the 22- and 45-byte keys, plus full information about what the process is aiming to achieve, a full "life history" of the input files, etc, etc. You might think that a consultancy would cost a lot, but I'd really expect substantial savings (CPU, DASD, Elapsed) from the process.

You can also consider the fact that you are paying Support Fees to SyncSort. I'm not sure how much "consultancy" they'd provide over the wider processing, but I'm sure they would advise you on processing these large datasets and give you Sort Control Cards which are more efficient than those you have now.

Since I don't want to do myself out of a job, don't expect exact solutions here from me, please :-)

Re: Sort-join without disturbing the driver

Posted: Thu Jul 04, 2013 10:30 am
by thegadgetfreak
Dick and Bill thanks for your replies...

Dick,
Yes, I could remove the 22-byte dupes from file 2 and do this, but just out of curiosity I was wondering if it could be done in a single sort step. The inputs will be sorted on the 45-byte key and not the 22-byte key.

Bill,
I am not representing a company. This is of my own personal interest and I am asking out of curiosity. If anyone else were in my place, they would have written a COBOL program and been done with it.

Rephrasing my question:

File1:

AAA XXXXXXXXXYYYYYYYYY10120120910120910X12102
AAA XXXXXXXXXYYYYYYYYY10120120910120910X12103
AAA XXXXXXXXXYYYYYYYYY10120130110130110X12104
AAA XXXXXXXXXYYYYYYYYY10120130110130110X12105
DDD XXXXXXXXXYYYYYYYYY10120130110130110X12101

File2:

AAA XXXXXXXXXYYYYYYYYY10120120910120910X12102VAR1
BBB XXXXXXXXXYYYYYYYYY10120120910120910X12103VAR1
BBB XXXXXXXXXYYYYYYYYY10120130110130110X12105VAR1
CCC XXXXXXXXXYYYYYYYYY10120130110130110X12104VAR1

Output:

AAA XXXXXXXXXYYYYYYYYY10120120910120910X12102VAR1
AAA XXXXXXXXXYYYYYYYYY10120120910120910X12103VAR1
AAA XXXXXXXXXYYYYYYYYY10120130110130110X12104VAR1
AAA XXXXXXXXXYYYYYYYYY10120130110130110X12105VAR1
DDD XXXXXXXXXYYYYYYYYY10120130110130110X12101

1) Join should be on 22 bytes.
2) Records in File1 should come in the output without duplicates and without any deletions.
3) The output - I can later split it as per my requirement. My problem is with the join.

Thanks,
Ajit

Re: Sort-join without disturbing the driver

Posted: Thu Jul 04, 2013 12:32 pm
by BillyBoyo
If your data is already sorted on the 45 bytes, then it is already sorted on the 22 bytes, since the 22 bytes are the leading part of the 45-byte key.

You can make some very simple test data, with shorter keys even, and demonstrate to yourself that this is true.

Then you can put SORTED on the JOINKEYS statements. That's two SORTs gone.
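
As an untested sketch, using the file names and key positions from your card, and assuming both inputs really are in sequence on the leading 22 bytes:

  JOINKEYS FILES=F1,FIELDS=(1,22,A),SORTED
  JOINKEYS FILES=F2,FIELDS=(1,22,A),SORTED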

If the data is already sorted on the 45 bytes, you don't need to do it again. DUPKEYS/SUM won't work without a SORT, however. So look at how to use OUTFIL SECTIONS, along with REMOVECC and NODETAIL.

Your current conditions for the two OUTFILs are mutually exclusive. Have one OUTFIL with the conditions on and one with SAVE.
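
Something along these lines, as an untested sketch. The first OUTFIL is copied from your card, and the second simply takes every record that the first did not select:

  OUTFIL FILES=01,INCLUDE=((1,1,CH,EQ,C'B',AND,
                            108,8,ZD,EQ,0),OR,
                           (1,1,CH,EQ,C'1',AND,
                            35,6,Y2T,LT,Y'DATE1'-5)),
    OUTREC=(2,596)
**
  OUTFIL FILES=02,SAVE,
    OUTREC=(2,596)

That leaves only one set of conditions to evaluate and to maintain.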

If you genuinely had billions of records, then there'd be more you'd need to do.

Re: Sort-join without disturbing the driver

Posted: Thu Jul 04, 2013 1:38 pm
by thegadgetfreak
Thank you so much Bill. I'll add SORTED.

I'm trying to see if I can eliminate the dupes some way in the join itself, so that I need not filter using an OUTFIL (or anything else after the join).

I think I'll end up coding a COBOL program in the interest of time. :(

Re: Sort-join without disturbing the driver

Posted: Thu Jul 04, 2013 10:46 pm
by BillyBoyo
Well, when you have a duplicate, do you want the first from the set, or the last, or some other? Bearing in mind the 22 vs 45 bytes.

If you want the first, you can look to use WHEN=GROUP. I don't think SyncSort has KEYBEGIN, but you can "emulate" it by using RESTART on a sequence number you generate. Then if you add a sequence number with the PUSH (from the WHEN=GROUP) you can INCLUDE only those that have a sequence of one.
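
A rough, untested sketch of the idea, run as a separate copy step over F2 before the join. The LRECL of 100 and the position 101 are made up for illustration, and SEQNUM with RESTART should be checked against your SyncSort level. With a restarting sequence number, the first record of each lot is simply the one numbered 1, so this sketch leaves out the WHEN=GROUP and PUSH part:

* Assumption: F2 is fixed-length, LRECL=100 (hypothetical value),
* so bytes 101-108 can hold a temporary sequence number.
  OPTION COPY
  INREC OVERLAY=(101:SEQNUM,8,ZD,RESTART=(1,22))
* Keep only the first record of each 22-byte lot and drop the
* temporary sequence number again.
  OUTFIL INCLUDE=(101,8,ZD,EQ,1),
    OUTREC=(1,100)

This relies on F2 already being in sequence on the first 22 bytes, otherwise the numbering restarts in the wrong places.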

If you want the last duplicate, you can look at the SECTIONS on OUTFIL. If you want some other one, then there is more thinking.

Re: Sort-join without disturbing the driver

Posted: Fri Jul 05, 2013 8:05 pm
by thegadgetfreak
Thank you Bill. I will definitely let you know when I try this. For now our COBOL program is done and my team went with it.

The join is just an existence check, so the first is enough. But using WHEN=GROUP etc. still happens after the actual join, doesn't it?

Re: Sort-join without disturbing the driver

Posted: Fri Jul 05, 2013 10:11 pm
by BillyBoyo
Well, you seem to have the "match marker", the ? in the REFORMAT statement. For SyncSort that would imply that you can use the JNFnCNTL files, which in turn may mean that you can do something with WHEN=GROUP prior to the JOINKEYS. But again, it all depends on what you actually want from the data.

Re: Sort-join without disturbing the driver

Posted: Tue Jul 16, 2013 12:12 pm
by thegadgetfreak
Thanks Bill, I tried all this out. It's improving the efficiency... I guess a lot more could be done, but we had to go for a program for now owing to time constraints... Thanks Bill.