
Remove Duplicate string(s) from data set

Posted: Sun Jun 25, 2023 7:40 pm
by azhar

I have an input file like the one below:

A->B->C->E
A->B->C->D
B->C->D
C->D
D->E
........ ........

My requirement is to write only unique strings to the output file. If a record's string already appears as a substring of another record, that record should not be written to the output file.

The output file should look like this:

A->B->C->E
A->B->C->D
D->E
The 3rd and 4th records are skipped because their strings are already present in the 2nd record.

How can I achieve this with COBOL or a utility program?

Re: Remove Duplicate string(s) from data set

Posted: Sun Jun 25, 2023 8:50 pm
by sergeyken
Before using COBOL, a utility, or any other tool, you must first prepare an algorithm that performs the required operations. In 99.99% of all cases the algorithm design has nothing to do with COBOL or any other specific tool.

Do you have any idea what sequence of actions is required to achieve your goal?

Re: Remove Duplicate string(s) from data set

Posted: Sun Jun 25, 2023 9:31 pm
by enrico-sorichetti
It would be wiser to review the application logic and the requirement.

For each record it would be necessary to compute all the continuous subsequences.
(A continuous subsequence is one in which no elements are missing between the first and last elements of the subsequence.)

for the first record
A->B->C->E

they would be

A->B
B->C
C->E
A->B->C
B->C->E
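
As a rough illustration, the enumeration above could be sketched in Python as follows. The function name `contiguous_subsequences` and the `sep` parameter are hypothetical, and the sketch assumes that the nodes of a record are separated by `->` and that only proper runs of at least two nodes are of interest, as in the list above.

```python
def contiguous_subsequences(record, sep="->"):
    """Return every proper contiguous run of at least two nodes in `record`."""
    nodes = record.split(sep)
    n = len(nodes)
    result = []
    # Lengths 2 .. n-1: single nodes and the whole record are left out,
    # matching the list written out by hand above (an assumption).
    for length in range(2, n):
        for start in range(n - length + 1):
            result.append(sep.join(nodes[start:start + length]))
    return result


print(contiguous_subsequences("A->B->C->E"))
# ['A->B', 'B->C', 'C->E', 'A->B->C', 'B->C->E']
```

A record with n nodes yields on the order of n*n/2 subsequences, so the derived dataset can grow much larger than the input when records are long.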
 


Collect all the subsequences and find a way to delete the matching records.
(I am no sort expert, but I guess it could be done with a JOINKEYS between the original dataset and a dataset containing the subsequences.)

Naturally, building the dataset with the subsequences might be more or less complicated depending on the REAL pattern in each record.
It is still complicated anyway.
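
Outside of SORT, the same idea (collect every subsequence, then drop the records that match one) might look like the Python sketch below. It is only an illustration: `drop_contained_records` is a hypothetical name, the `->` separator is assumed, and single nodes are counted as subsequences here so that a one-node record such as `D` would also be dropped. Because only proper subsequences are collected, exact duplicate records are not removed by this sketch.

```python
def drop_contained_records(records, sep="->"):
    """Keep only the records that are not a proper contiguous part of any record."""
    covered = set()
    for rec in records:
        nodes = rec.split(sep)
        n = len(nodes)
        # Proper runs only (lengths 1 .. n-1), so a record never covers itself.
        for length in range(1, n):
            for start in range(n - length + 1):
                covered.add(sep.join(nodes[start:start + length]))
    # The set lookup plays the role of the join against the subsequence dataset.
    return [rec for rec in records if rec not in covered]


sample = ["A->B->C->E", "A->B->C->D", "B->C->D", "C->D", "D->E"]
print(drop_contained_records(sample))
# ['A->B->C->E', 'A->B->C->D', 'D->E']
```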

Re: Remove Duplicate string(s) from data set

Posted: Sun Jun 25, 2023 10:00 pm
by sergeyken
Generally speaking, this process involves a “Cartesian product”: the records must be analyzed in pairs, each one against every other.

If the input size is in the range of 10-1000 records, or at most 10,000, a straightforward brute-force one-to-one comparison can be used.

If the input size is 100,000-100,000,000 records or more, then a more sophisticated approach (algorithm) may be required.

Details like these must be clarified before starting any work.
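
For small inputs, the brute-force comparison could be sketched roughly like this in Python. The names `is_contained` and `unique_records` are hypothetical, and the sketch assumes `->` as the separator and compares whole nodes rather than raw characters, so that `B->C` is not mistaken for a part of `AB->C`.

```python
def is_contained(short, long_, sep="->"):
    """True if the nodes of `short` form a contiguous run inside a longer `long_`."""
    a = short.split(sep)
    b = long_.split(sep)
    if len(a) >= len(b):
        return False  # equal or longer records cannot be a proper part
    return any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))


def unique_records(records, sep="->"):
    """Compare every record with every other one: O(n^2) pairs."""
    kept = []
    for i, rec in enumerate(records):
        if not any(is_contained(rec, other, sep)
                   for j, other in enumerate(records) if j != i):
            kept.append(rec)
    return kept


sample = ["A->B->C->E", "A->B->C->D", "B->C->D", "C->D", "D->E"]
print(unique_records(sample))
# ['A->B->C->E', 'A->B->C->D', 'D->E']
```

With 10,000 records this is about 100 million pair checks, which is why the approach only suits the smaller size range.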

Re: Remove Duplicate string(s) from data set

Posted: Mon Jun 26, 2023 12:56 am
by sergeyken
Here is one of many possible ways to do it.

This must be designed even before selecting the most suitable tool to implement it!

Step 1.
Create a modified copy of the source data:
- determine the size of the “meaningful” part of each record (without trailing blanks),
- re-order the records so that records with the same “meaningful size” are grouped together.

Step 2.
Create a so-called “full outer join” of the two files (all pairs of records, excluding pairs where a record is joined to its own copy).

Step 3.
Process the (huge!) joined file, eliminating those pairs where the string from the first (left) part of the record is also a substring of its second (right) part.
If the joined file is sorted by both the left part size and the right part size, then each pass of Step 3 may be stopped as soon as the left part is no shorter than the right part (to optimize this time-consuming process).

Step 4.
Produce the resulting output set of records, using the left parts of all records not rejected in Step 3.

That’s it. As for myself, I could implement it:
- using COBOL, PL/I, Assembler, C/C++, and some other compiled languages,
- using REXX, or other interpreted language,
- using SORT facility, or maybe other file-processing tool (like FileAid? not sure about the details),
- highly likely there are also other available tools to implement the desired algorithm.
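
To make Steps 1 to 4 a little more concrete, here is a minimal Python sketch of one possible reading of them. It is not tied to any mainframe tool: `remove_contained` is a hypothetical name, the `->` separator is assumed, the record size is measured in nodes rather than bytes, and the join is done in memory instead of writing a joined file.

```python
def remove_contained(records, sep="->"):
    """Drop every record whose nodes form a contiguous run inside a longer record."""
    # Step 1: strip trailing blanks, split into nodes, order by meaningful size.
    items = [(i, rec.rstrip().split(sep)) for i, rec in enumerate(records)]
    by_size = sorted(items, key=lambda item: len(item[1]))

    rejected = set()
    # Steps 2 and 3: pair each (left) record with the other (right) records,
    # longest first, and stop as soon as the right part is no longer than
    # the left part, since it can then no longer contain it.
    for i, left in by_size:
        for _, right in reversed(by_size):
            if len(right) <= len(left):
                break
            width = len(left)
            if any(right[k:k + width] == left
                   for k in range(len(right) - width + 1)):
                rejected.add(i)
                break

    # Step 4: output the non-rejected records in their original order.
    return [records[i].rstrip() for i in range(len(records)) if i not in rejected]


sample = ["A->B->C->E", "A->B->C->D", "B->C->D   ", "C->D", "D->E"]
print(remove_contained(sample))
# ['A->B->C->E', 'A->B->C->D', 'D->E']
```

The early `break` corresponds to the optimization described in Step 3. The worst case is still quadratic in the number of records.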