how to remove duplicates



Support for OS/VS COBOL, VS COBOL II, COBOL for OS/390 & VM and Enterprise COBOL for z/OS

Re: how to remove duplicates

Postby santlou » Fri Dec 12, 2008 5:07 am

Hi Guys,

If I may, can I suggest some alternative to this issue?

My suggestion is something that I've used very successfully several times in the past.

Basically, the solution is to store the data in the table as a Linked List data structure. This way, there is no need for a "Sort" process. Using an Internal Sort for 20 items is not very efficient, and not possible in a CICS environment.

The basic idea behind using a Singly-Linked-List data structure is that each occurrance in the table contains a "Pointer" to the Next Higher entry. The actual entries are stored in an Entry-Sequenced (Not Sorted) mode.

To support this process, you simply need to add one more field to the table:

10 WS-T1-POINTER PIC s9(3) comp-3.

Also you need to define a Starting Pointer, or "Root" in Working Storage. This Root pointer will point to the First entry in sequence.

01 ROOT-PTR PIC s9(3) comp-3 VALUE +0.

There are other support fields that you need to define. I won't go through them here, but the idea is that you keep adding new entries to the end of the entries, and just switch the pointers at the "Point of Insertion".

You would write a simple routine to ADD an entry into the table. Each segment of the program that needs to add data to the table will Perform the ADD-ENTRY routine. So, no matter where the entry is added, the ADD-ENTRY Routine will always insure that the pointers are adjusted to maintain a "Sorted" list based on the Pointer field of each entry. Note that at most, only two pointers (current and prev, if required) will need to be adjusted with each entry added to the table.

So whenever the program needs to add an entry to the table, it always adds the new entry at the end of the current entries by Performing the ADD-ENTRY routine to adjust the pointers and place the new entry at the end of the table.

The ADD-ENTRY routine will Traverse the Table, using the ROOT-PTR as the Starting location, and the WS-T1-POINTER to navigate to the NEXT entry, to find the place to "Insert" the new entry. Remember though, we aren't Really Inserting the entry there. Instead, we just flip the pointers. The new entry is in fact always added at the end of the table.

There are several kinds of Linked Lists, the one that I described is called a Singly Linked List. It's really an efficient way of maintaining an ordered list without the need to SORT the table. If you want to list the table entries in sorted sequence, simply traverse the table using the pointers from the root to the last entry.

Since your table is small (occurs 20), a singly linked list will do fine. If you had many entries, I would suggest a Binary Tree data structure, but for your purposes, I think the linked list is the way to go. Basically, Linked Lists are good for a small, finite set of entries.

Developing a Linked List data structure in COBOL is not very complicated, and is a great way of maintaining ordered lists.
santlou
 
Posts: 15
Joined: Sat Aug 23, 2008 8:06 pm
Has thanked: 0 time
Been thanked: 0 time

Re: how to remove duplicates

Postby dick scherrer » Fri Dec 12, 2008 5:59 am

Hello,

Keeping in mind that the goal is not to put all of the entries into the new array, but rather to flag and elinmiate duplicates.

Also, if there needs to be much searching within the table, the linked list is rather inefficient. Given that the suggested approach is "one way linked" each search would need to start at the beginning and roll forward. An array stored in sequence could be scanned "up" or "down" depending on the value of the next value to be found. While the sort is a bit of overkill for only 20 entries, it would perform quite well if the population of table entries is expected to be large once the prototype is working. Also, if the population of the array becomes large, a SEARCH ALL would work far better than chasing the linked list or going up/down the array in code.

I have used linked lists quite a few times (even on dasd) but in this case i believe it may be a solution in search of a requirement.

There is also the alternative of sorting table 2 "by hand" - not using the SORT/SD approach.

Thanks for trying to help.
You're welcome. As you progress with either direction, post any questions here. Before going much further, i'd suggest you verify whether you can use SORT in an IMS online program.
Hope this helps,
d.sch.
User avatar
dick scherrer
Global moderator
 
Posts: 6268
Joined: Sat Jun 09, 2007 8:58 am
Has thanked: 3 times
Been thanked: 93 times

Re: how to remove duplicates

Postby pmerc8888 » Fri Dec 12, 2008 12:10 pm

Greetings santlou and Dick,

Thank you both for your replies.

Kindly allow me to add that the array will not grow. Chances are, there won't even be more than 10 entries since the items/"records" that will be put there have gone thru several "screening" and some formatting processes.

I don't believe I've ever used a linked list. I will try to do some research so I can be more specific with my questions but if someone can already provide some detailed code, that would be most helpful.

I will also try to find out if I can use SORT in IMS online program.

Lastly, about
the alternative of sorting table 2 "by hand"
... would you kindly elaborate on this, please.


Many thanks for your time.

pmerc
pmerc8888
 
Posts: 16
Joined: Thu Aug 14, 2008 4:06 pm
Location: China
Has thanked: 0 time
Been thanked: 0 time

Re: how to remove duplicates

Postby santlou » Fri Dec 12, 2008 8:13 pm

I do agree with Dick that a Linked List array is not an effiicent data structure for large volumes of data and for searching. If the real purpose is to eventually grow the table into thousands of entries requiring massive searching, inserting, updating, and deleting of table entries, then a singly linked list is definitely not the solution.

However, I suggested this based on the information provided. There was no mention of searching, which would not be efficient in a linked list. My solution will work very efficiently if all you need to do is flag and eliminate duplicates in an array that contains no more than 20 occurrences. Even if there were a couple of thousand occurrences, this would still be efficient if all you need to do is maintain an ordered list to flag and eliminate duplicates.

Based on pmerc's last reply, I think that a linked list will work just fine. Remember that different data structures have properties that make it specifically efficient or inefficient for specific functions. If you simply need to order a list of values, a singly linked list is an efficient data structure. For searching massive tables, a Binary Tree structure would be better, but that's not what I'm suggesting here.

I've used a linked list structure in several programs where an internal table simply needed to be sorted in a specific sequence. Even with up to 1000 entries, the process worked very efficiently when I monitored the process with tools like Strobe and Omegamon, show absolutely No degradation in performance. If coded correctly, such data structures can be a valuable alternative to such solutions.

Of course there are trade offs and limitations to using data structures in COBOL when dealing with large volumes of data and heavy insert/update/delete activity on the array. The use of data structures is just one atlernative solution and the decision is up to the technician's own testing and experience. The singly Linked List data structure is relatively simple to develop, not require any additional resources, and will be an effective and efficient solution for this specific purpose.

There are many other alternatives to this issue, this is simply just one alternative...

Good luck..
santlou
 
Posts: 15
Joined: Sat Aug 23, 2008 8:06 pm
Has thanked: 0 time
Been thanked: 0 time

Re: how to remove duplicates

Postby dick scherrer » Sat Dec 13, 2008 7:21 am

Hello,

Lastly, about
the alternative of sorting table 2 "by hand"

... would you kindly elaborate on this, please.


There are a couple of ways to do this "by hand". One is to write your own "bubble sort". There are many examples of code and pseudo-code on the internet - probably more work than is worthwhile for this, but might be educational when you have some time.

The other (IMHO easier) is to simply put the new entries in the table in sequence, flagging duplicatges as you go. Once the first table is complete, point back to the "front" of that table as well as the "front" of table 2. The code would compare the table 1 entry to the table 2 entries. If the table 2 entry is blank, move the table 1 entry to the blank table 2 entry. If the table 1 entry is a duplicate, set the dup indicator in the table 2 entry. If the table 1 entry is less than the table 2 entry, "slide" the table 2 entries over to make a blank entry and move the table 1 entry to this place in the table. There would be to "loops" - one to move thru the table 1 entries and the other to "slide" the table 2 entries when an insert was needed.

As with the linked-list, just another alternative.

Initially, i understood that learning SORT input/output procedure was a goal. For the amount of data to be handled, SORT is overkill, but at some point you want to learn and be comfortable with input/output procedure.

Good luck :)
Hope this helps,
d.sch.
User avatar
dick scherrer
Global moderator
 
Posts: 6268
Joined: Sat Jun 09, 2007 8:58 am
Has thanked: 3 times
Been thanked: 93 times

Previous

Return to IBM Cobol

 


  • Related topics
    Replies
    Views
    Last post