The issue of finding duplicate records, record linkage and updating dependent tables is a very complex, critical subject for an enterprise database. I do not claim to understand it but some issues that I have come across and can recall are:
1. You cannot rely on one column of data. For example Last Name, Social Security Number, even address fields. etc. For example Smith might be Symith, etc. can you rely of a Soundex, Metaphone, etc stored procedure? How about First and Last Name? William or John Smith are enter and validated by validation rules and are persisted to the database as different persons when in fact they are the same. Add address fields to the match algorithm...does an old address produce a different person compared to the newly entered address for a person of the same name? add SSN...the user is off by 1 number....certainly you can use a regular expression to filter/validate SSN's which valid i.e. do not break rules of the Social Security agency which has more rules than just requiring a 9 digit number. SSN do not have a check digit to exclude SSN that do not break any rules. Mother's maiden name and other fields my be tie breaker. Nevertheless, a robust sytem should have a validation algorithm that to prevent duplicate person or other entries which returns a probability of possible duplication and appropriate alerts to the end user. The Census Bureau and others use such a matching algorithm. Anyway, a robust master person index relying probalistic algorithms is the way to go.
See these open source probalistic algorithms. Perhaps, IdeaBlade can reverse engineer one of them into managed code as an add-in module.
2. You should have a methods of discovering and cleaning your database of duplicate or inappropriately merged records subsequent to persistance. Discovery of duplicate/and incorrectly merged records and acting upon them by merging, unmerging incorrect merges, deleting, etc.
3. Off the subject, aside from prevention of duplicate records and database scrubbing verifying the data is correct is another set of issues. I have gleaned from my reading is that Validating is a filter based upon business rules. Whereas Verification is the process that the entered data is correct. John Smith with SSN 999-99-9999 actually lives at 123 Hogs Hollow or even if the address exists. This Verification may come from one or more third parties such as US postal service, Melissa Data, Axicom, even credit agencies, etc. via a web service. What if the person has moved you will need some form of Skip Tracing usually provided by credit agencies. Once Verified the records should be updated with the correct information.
I would like to hear others comment on their ideas of this problem.
Edited by WildHog - 15-Apr-2008 at 6:19pm