Remove Duplicate string(s) from data set




Post by azhar » Sun Jun 25, 2023 7:40 pm

I have an input file like the one below:

A->B->C->E
A->B->C->D
B->C->D
C->D
D->E ........ ........
My requirement is to write only unique strings to the output file. If a record's string is already contained as a substring in another record, that record should not be written to the output file.

The output file should look like this:

A->B->C->E
A->B->C->D
D->E
The 3rd and 4th records are skipped because their strings are already present in the 2nd record.

How can I achieve this with COBOL or a utility program?

Re: Remove Duplicate string(s) from data set

Post by sergeyken » Sun Jun 25, 2023 8:50 pm

Before using COBOL, a utility, or any other tool, you must first work out an algorithm that performs the required operations. In 99.99% of all cases the algorithm design has nothing to do with COBOL or any other specific tool.

Do you have any idea what sequence of actions is required to achieve your goal?
Javas and Pythons come and go, but JCL and SORT stay forever.

Re: Remove Duplicate string(s) from data set

Post by enrico-sorichetti » Sun Jun 25, 2023 9:31 pm

it would be wiser to review the application logic and the requirement

for each record it would be necessary to compute all the continuous subsequences
( A continuous subsequence is one in which no elements are missing between the first and last elements of the subsequence )

for the first record
A->B->C->E

they would be

A->B
B->C
C->E
A->B->C
B->C->E
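
To make the enumeration concrete, here is a minimal Python sketch (not COBOL or SORT). It assumes the elements are separated by "->" and, like the list above, skips single elements and the full record itself. The function name is only illustrative.

```python
def contiguous_subsequences(record, sep="->"):
    """Return every contiguous run of 2..n-1 elements, shortest runs first."""
    items = record.split(sep)
    n = len(items)
    result = []
    for length in range(2, n):              # run length: 2 up to n-1
        for start in range(n - length + 1):  # every possible start position
            result.append(sep.join(items[start:start + length]))
    return result


print(contiguous_subsequences("A->B->C->E"))
# ['A->B', 'B->C', 'C->E', 'A->B->C', 'B->C->E']
```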
 


collect all the subsequences and find a way to delete the matching records
( I am no sort expert, but I guess it could be done with a JOINKEYS between the original dataset and a dataset containing the subsequences )

naturally, building the dataset with the subsequences might be more or less complicated depending on the REAL pattern in each record
still complicated anyway
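
As a rough illustration of that idea, the Python sketch below plays the role of the join with an in-memory set. It is not SORT syntax. It assumes "->" as the separator and records without trailing blanks; the function names are made up for the example. Single elements are included here so that a one-element record is also dropped when it occurs inside a longer one.

```python
def proper_subsequences(record, sep="->"):
    """All contiguous runs of the record that are shorter than the record."""
    items = record.split(sep)
    n = len(items)
    return {
        sep.join(items[start:start + length])
        for length in range(1, n)
        for start in range(n - length + 1)
    }


def drop_contained(records, sep="->"):
    """Keep only records that are not a proper subsequence of any record."""
    covered = set()                  # stands in for the "subsequences" dataset
    for rec in records:
        covered |= proper_subsequences(rec, sep)
    # The "join": a record found in the subsequence set is deleted.
    return [rec for rec in records if rec not in covered]


data = ["A->B->C->E", "A->B->C->D", "B->C->D", "C->D", "D->E"]
print(drop_contained(data))
# ['A->B->C->E', 'A->B->C->D', 'D->E']
```

Note that exact duplicate records are all kept by this sketch; whether they should be is a requirement question.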
cheers
enrico
When I tell somebody to RTFM or STFW I usually have the page open in another tab/window of my browser,
so that I am sure that the information requested can be reached with a very small effort

Re: Remove Duplicate string(s) from data set

Post by sergeyken » Sun Jun 25, 2023 10:00 pm

Generally speaking, this process involves a “Cartesian product”: the records must be analyzed in pairs, each against every other.

If the input data size is in the range of 10-1000, or at most 10,000 records, a straightforward brute-force pairwise comparison can be used.

If the input size is 100,000-100,000,000 records or more, then a more sophisticated approach (aka algorithm) may be required.

Details like these must be clarified before starting any work.
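
For the small-input case, the pairwise comparison could look like the Python sketch below. It is an illustration only: the "->" separator, the element-by-element comparison, and the function names are assumptions, and a record is treated as contained only in a strictly longer record.

```python
def is_contained(short, long):
    """True if the element list `short` occurs as a contiguous run in `long`."""
    k = len(short)
    return any(long[i:i + k] == short for i in range(len(long) - k + 1))


def unique_by_pairs(records, sep="->"):
    """Compare every record with every other one: O(n^2) pairs."""
    tokens = [rec.split(sep) for rec in records]
    kept = []
    for i, rec in enumerate(records):
        contained = any(
            i != j
            and len(tokens[i]) < len(tokens[j])
            and is_contained(tokens[i], tokens[j])
            for j in range(len(records))
        )
        if not contained:
            kept.append(rec)
    return kept


data = ["A->B->C->E", "A->B->C->D", "B->C->D", "C->D", "D->E"]
print(unique_by_pairs(data))
# ['A->B->C->E', 'A->B->C->D', 'D->E']
```

With 10,000 records this is on the order of 100 million comparisons, which is why the larger ranges call for something smarter.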

Re: Remove Duplicate string(s) from data set

Post by sergeyken » Mon Jun 26, 2023 12:56 am

Here is one of many possible ways to do it.

This must be designed before selecting the most suitable tool to implement it!

Step 1.
Create a modified copy of the source data:
- determine the length of the “meaningful” part of each record (without trailing blanks),
- re-order the records so that records with the same “meaningful length” are grouped together.

Step 2.
Create a so-called “full outer join” of the two files, that is, their Cartesian product: all pairs of records, excluding each record paired with its own copy.

Step 3.
Process the (huge!) joined file, rejecting every pair in which the string in the left part of the record is also a substring of the right part.
If the joined file is sorted by both left-part length and right-part length, then each pass of Step 3 can stop as soon as the left part becomes longer than the right part (to optimize this time-consuming process).

Step 4.
Produce the output set of records from the left parts that were not rejected in any pair in Step 3.

That’s it. As for myself, I could implement it:
- using COBOL, PL/I, Assembler, C/C++, and some other compiled languages,
- using REXX, or other interpreted language,
- using SORT facility, or maybe other file-processing tool (like FileAid? not sure about the details),
- highly likely there are also other available tools to implement the desired algorithm.
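
Purely as an illustration of Steps 1 to 4, and in none of the tools listed above, a compact Python sketch might look like this. It assumes "->" separates the elements, measures length in elements rather than bytes, and pairs the records in memory instead of writing a joined file. The function name is hypothetical.

```python
def filter_by_length_order(records, sep="->"):
    # Step 1: drop trailing blanks and order the record numbers by length.
    stripped = [rec.rstrip() for rec in records]
    tokens = [rec.split(sep) for rec in stripped]
    order = sorted(range(len(tokens)), key=lambda i: len(tokens[i]))

    rejected = set()
    # Steps 2-3: pair each "left" record only with strictly longer "right" ones.
    for pos, i in enumerate(order):
        left = tokens[i]
        for j in reversed(order[pos + 1:]):      # longest candidates first
            right = tokens[j]
            if len(right) <= len(left):
                break                            # nothing longer remains
            span = len(right) - len(left) + 1
            if any(right[s:s + len(left)] == left for s in range(span)):
                rejected.add(i)                  # left is contained in right
                break

    # Step 4: write the non-rejected records in their original order.
    return [stripped[i] for i in range(len(stripped)) if i not in rejected]


data = ["A->B->C->E", "A->B->C->D", "B->C->D", "C->D", "D->E"]
print(filter_by_length_order(data))
# ['A->B->C->E', 'A->B->C->D', 'D->E']
```

The early `break` corresponds to the stop condition described in Step 3: once the remaining right-hand records are no longer than the left one, no further match is possible.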