Removing duplicates!!!



Support for NetApp SyncSort for z/OS, Visual SyncSort, SYNCINIT, SYNCLIST and SYNCTOOL

Post by ibmmf4u » Mon Jul 30, 2012 10:33 pm

Hi Everyone,

My requirement is as follows: I would like to remove duplicates from a file. Below is an example of the input file.

Input file:-
00001234    TEST1      TESTING
00001234               DESCRIPTION
00003456    TEST2      TESTER2
00003456               DESC
00001234    TEST1      TESTER1
00001234               EXAMPLE1
00003456    TEST2      TESTER2
00003456               EXAMPLE2
00001234    TEST1      TESTSAMP
00001234               SAMPLE1
     


The first 8 positions are the key field and positions 13 through 17 are the descriptive field. I would like to remove the later occurrences of a record whenever the same key field and the same descriptive field are encountered again.

The output file should contain the following records, with the other duplicate occurrences of the key field + descriptive field eliminated.

Output file:-
00001234    TEST1      TESTING
00001234               DESCRIPTION
00003456    TEST2      TESTER2
00003456               DESC


I would like to remove the other occurrences of the fields.

Post by dick scherrer » Mon Jul 30, 2012 11:45 pm

Hello,

What business requirement will be met by this? Why are duplicates (that aren't really duplicates, as the data other than the key is different) being discarded?

Why is the data not in sequence?

If you post what is really going on, someone may have a suggestion.
Hope this helps,
d.sch.

Post by BillyBoyo » Tue Jul 31, 2012 4:02 am

The method is quite simple, but as Dick has said, I'm not sure that you have explained enough about what is happening.

You have groups of data again. You need to sort on the groups and decide which group to retain. There must be a business requirement which details how you decide. That may well be crucial to the solution.

Post by bodatrinadh » Tue Jul 31, 2012 2:44 pm

Hi ibmmf4u,

You can try this code:

//STEP1     EXEC PGM=SORT
//SORTIN   DD *
00001234    TEST1      TESTING
00001234               DESCRIPTION
00003456    TEST2      TESTER2
00003456               DESC
00001234    TEST1      TESTER1
00001234               EXAMPLE1
00003456    TEST2      TESTER2
00003456               EXAMPLE2
00001234    TEST1      TESTSAMP
00001234               SAMPLE1
//SORTOUT  DD SYSOUT=*
//SYSOUT   DD SYSOUT=*
//SYSIN    DD *
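* Sort on the key; EQUALS keeps records with equal keys in input order
  SORT FIELDS=(1,8,CH,A),EQUALS
* Number the records within each key, restarting at 1 on a key change
  OUTREC IFTHEN=(WHEN=INIT,OVERLAY=(61:SEQNUM,4,ZD,RESTART=(1,8)))
* Drop the third and later records of each key
  OUTFIL BUILD=(1,60,20X),OMIT=(61,4,ZD,GE,+3)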


Output :-

00001234    TEST1      TESTING
00001234               DESCRIPTION
00003456    TEST2      TESTER2
00003456               DESC
Thanks
-3nadh
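
For anyone who wants to check the logic of the control statements above away from the mainframe, here is a rough Python sketch of what they appear to do: sort on the 8-byte key, number the records within each key, and drop everything numbered 3 or higher. The function name `first_two_per_key` is made up for illustration, and the sketch assumes that records with equal keys keep their input order during the sort.

```python
from itertools import groupby


def first_two_per_key(records, key_len=8):
    """Keep only the first two records of each key after a stable sort."""
    # Python's sort is stable, so equal keys stay in input order.
    ordered = sorted(records, key=lambda r: r[:key_len])
    kept = []
    for _, group in groupby(ordered, key=lambda r: r[:key_len]):
        # seq plays the role of the SEQNUM that restarts on each key change.
        for seq, rec in enumerate(group, start=1):
            if seq < 3:  # mirrors OMIT=(61,4,ZD,GE,+3)
                kept.append(rec)
    return kept


sample = [
    "00001234    TEST1      TESTING",
    "00001234               DESCRIPTION",
    "00003456    TEST2      TESTER2",
    "00003456               DESC",
    "00001234    TEST1      TESTER1",
    "00001234               EXAMPLE1",
]
result = first_two_per_key(sample)
for line in result:
    print(line)
```

Because the cut-off is a fixed count of two records per key, this only gives the wanted result when every transaction has exactly one header line and one description line.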

Post by dick scherrer » Tue Jul 31, 2012 7:48 pm

Hello,

Thank you for the contribution, but I believe TS needs to provide some more clarification before taking some code and running with it.

In addition to getting what was asked for, I believe more needs to be understood (by TS's organization) about the real intent of this process. I'm scratching my head as to how it might be all right to arbitrarily toss away "stuff" from some unsorted input file. . .
Hope this helps,
d.sch.

Post by ibmmf4u » Tue Jul 31, 2012 7:57 pm

Hello Dick/Bill,

It's mainly for reporting purposes. Every day we send out reports containing the transactions that were invalid or failed, along with some description of each transaction. If a transaction is already in the report with a description showing that it is invalid/failed, we need not send out the other occurrences of the same transaction with their descriptions, so we eliminate them. That's the reason behind it.

Bill, Thanks for the inputs.

Hi bodatrinadh,

Thanks for the above piece of code, I will test it and let you know the outcome.

Thank you All!!!!!!!!!!!

Post by ibmmf4u » Tue Jul 31, 2012 8:20 pm

Hi bodatrinadh,

Thanks a lot. The above piece of code works fine if the description contains only two lines, but in practice we aren't sure of the number of descriptor lines.

Pasted below is an example of the Input file.
00001234    TEST1      TESTING     
00001234               DESCRIPTION
00001234               TESTDUPL   
00003456    TYPE2      TESTER2     
00003456               DESC1       
00003456               DESC2       
00003456               DESC3       
00004567    ERROR1     ERROR DESC1
00004567               DESC2       
00004567               DESC3       
00001234    TEST1      TESTER1     
00001234               EXAMPLE1   
00003456    TYPE2      TESTER2     
00003456               EXAMPLE2   
00004567    ERROR1     DESC LINE1 
00004567               DESC LINE2 
00001234    TEST1      TESTSAMP   
00001234               SAMPLE1     


Expected output file:-

00001234    TEST1      TESTING     
00001234               DESCRIPTION
00001234               TESTDUPL   
00003456    TYPE2      TESTER2     
00003456               DESC1       
00003456               DESC2       
00003456               DESC3       
00004567    ERROR1     ERROR DESC1
00004567               DESC2       
00004567               DESC3       


Please help me in achieving the above.

Thanks in advance!!!

Post by BillyBoyo » Tue Jul 31, 2012 8:37 pm

If you are certain that this is what is needed, then OK. However, your users are going to be annoyed if you show them there is one error, which they correct, and then the following day there is another that was previously known.

If you do a GROUP on the INREC, PUSHing an ID, then a GROUP on the OUTREC, PUSHing that existing ID to a new position, then on OUTFIL you can INCLUDE those where the two IDs are equal (or OMIT when they are not equal).

The GROUPs have to be done with the RESTART as before, because there is no KEYBEGIN in Syncsort.

Note that contiguous error messages on the input file will be included as one GROUP (you'd find that in your testing, I hope). If that is not what you want, then you'd need to include the error message presence in the initial GROUP identification.
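
To make the intended result concrete, here is a minimal Python sketch of the outcome the GROUP technique is aiming for: each run of consecutive records with the same key is one group, and only the first group seen for each key is kept. This is only a model of the logic, not Syncsort syntax, and the name `keep_first_group_per_key` is invented for the example.

```python
from itertools import groupby


def keep_first_group_per_key(records, key_len=8):
    """Keep the first contiguous group of records for each key."""
    seen = set()
    kept = []
    # groupby splits the input into runs of consecutive equal keys.
    for key, group in groupby(records, key=lambda r: r[:key_len]):
        if key not in seen:
            seen.add(key)
            kept.extend(group)
    return kept


records = [
    "00001234    TEST1      TESTING",
    "00001234               DESCRIPTION",
    "00001234               TESTDUPL",
    "00003456    TYPE2      TESTER2",
    "00003456               DESC1",
    "00003456               DESC2",
    "00003456               DESC3",
    "00004567    ERROR1     ERROR DESC1",
    "00004567               DESC2",
    "00004567               DESC3",
    "00001234    TEST1      TESTER1",
    "00001234               EXAMPLE1",
    "00003456    TYPE2      TESTER2",
    "00003456               EXAMPLE2",
    "00004567    ERROR1     DESC LINE1",
    "00004567               DESC LINE2",
    "00001234    TEST1      TESTSAMP",
    "00001234               SAMPLE1",
]
kept = keep_first_group_per_key(records)
for line in kept:
    print(line)
```

The sketch shares the caveat mentioned above: two error messages for the same key that sit next to each other in the input form a single run, so both would be kept.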

Post by ibmmf4u » Fri Aug 10, 2012 12:22 pm

Hi Bill,

Sorry for the delayed reply.

I tried coding per your instructions but got stuck at the third and fourth steps:

"a GROUP on the OUTREC, PUSHing that existing ID to a new position,"

"The GROUPs have to be done with the RESTART as before"

Pasted below is the code and its output at each stage.

Source code:-
//SYSIN      DD   *                           
  INREC IFTHEN=(WHEN=INIT,                   
                  OVERLAY=(41:SEQNUM,8,ZD,   
                              RESTART=(1,8)))
                                             
  SORT FIELDS=COPY                           


Output:-
00001234    TEST1      TESTING          00000001
00001234               DESCRIPTION      00000002
00001234               TESTDUPL         00000003
00003456    TYPE2      TESTER2          00000001
00003456               DESC1            00000002
00003456               DESC2            00000003
00003456               DESC3            00000004
00004567    ERROR1     ERROR DESC1      00000001
00004567               DESC2            00000002
00004567               DESC3            00000003
00001234    TEST1      TESTER1          00000001
00001234               EXAMPLE1         00000002
00003456    TYPE2      TESTER2          00000001
00003456               EXAMPLE2         00000002
00004567    ERROR1     DESC LINE1       00000001
00004567               DESC LINE2       00000002
00001234    TEST1      TESTSAMP         00000001
00001234               SAMPLE1          00000002


Source code:-
//SYSIN      DD   *                                     
  INREC IFTHEN=(WHEN=INIT,                             
                  OVERLAY=(41:SEQNUM,8,ZD,             
                              RESTART=(1,8))),         
                                                       
          IFTHEN=(WHEN=GROUP,                           
                       BEGIN=(41,8,CH,EQ,C'00000001'), 
                         PUSH=(51:1,8,60:ID=1))         
                                                       
  SORT FIELDS=COPY                                     
//*                                                     


Output:-
00001234    TEST1      TESTING          00000001  00001234 1
00001234               DESCRIPTION      00000002  00001234 1
00001234               TESTDUPL         00000003  00001234 1
00003456    TYPE2      TESTER2          00000001  00003456 2
00003456               DESC1            00000002  00003456 2
00003456               DESC2            00000003  00003456 2
00003456               DESC3            00000004  00003456 2
00004567    ERROR1     ERROR DESC1      00000001  00004567 3
00004567               DESC2            00000002  00004567 3
00004567               DESC3            00000003  00004567 3
00001234    TEST1      TESTER1          00000001  00001234 4
00001234               EXAMPLE1         00000002  00001234 4
00003456    TYPE2      TESTER2          00000001  00003456 5
00003456               EXAMPLE2         00000002  00003456 5
00004567    ERROR1     DESC LINE1       00000001  00004567 6
00004567               DESC LINE2       00000002  00004567 6
00001234    TEST1      TESTSAMP         00000001  00001234 7
00001234               SAMPLE1          00000002  00001234 7


I am not sure whether I interpreted your statements correctly. Could you kindly help me achieve this?

I am sorry once again for the delayed response.

Thanks in advance!!!

Post by BillyBoyo » Fri Aug 10, 2012 1:49 pm

That's OK as far as it goes. You do not need to have the 1,8 PUSHed, as it is already on each record.

Now you SORT, with EQUALS, on 1,8 and the ID.

On OUTFIL, define another GROUP (same technique) and PUSH the ID to another position, so you have two of them:


11111111  1 1
11111111  1 1
11111111  2 1
11111111  2 1


Then in INCLUDE/OMIT on the OUTFIL, you can test first ID against second. If they are equal, that is the group you let through.

If you use an ID of length 1, you will get trouble with more than 10 groups of errors.

You need to look at the "contiguous" error messages on your input file, which will be treated as one group.
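
The steps described in this post can be modelled in Python to see why comparing the two IDs picks out the first group for each key. This is a sketch of the logic only, not Syncsort control statements. The helper names `tag_group_ids` and `keep_lowest_id_groups` and the short sample records are hypothetical, and the sort is assumed to be stable (the role EQUALS plays in the real job).

```python
def tag_group_ids(records, key_len=8):
    """Step 1: give each run of consecutive equal keys an ascending group ID."""
    tagged, group_id, prev_key = [], 0, None
    for rec in records:
        key = rec[:key_len]
        if key != prev_key:  # key break starts a new group
            group_id += 1
            prev_key = key
        tagged.append((key, group_id, rec))
    return tagged


def keep_lowest_id_groups(tagged):
    """Steps 2-4: sort on key and ID, push the first ID per key, keep matches."""
    # Step 2: stable sort on key, then group ID.
    ordered = sorted(tagged, key=lambda t: (t[0], t[1]))
    kept, prev_key, pushed_id = [], None, None
    for key, group_id, rec in ordered:
        if key != prev_key:
            # Step 3: at each key break, remember the ID of the first record.
            prev_key, pushed_id = key, group_id
        # Step 4: let the record through only when both IDs agree.
        if group_id == pushed_id:
            kept.append(rec)
    return kept


records = [
    "11111111 A1",
    "11111111 A2",
    "22222222 B1",
    "11111111 C1",
    "11111111 C2",
    "22222222 D1",
]
result = keep_lowest_id_groups(tag_group_ids(records))
for line in result:
    print(line)
```

The group ID here is an unbounded integer. In the real job the ID field has a fixed width, so it needs enough digits for the largest number of groups the input can contain.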