I apologize in advance as this question is likely to be more on the theoretical side, with (hopefully) a fair bit of discussion.
This is the scenario: We have Salesforce and multiple local provisioning systems (short history - multiple companies were merged). As part of an effort to standardize, we set up an ETL that pulls the data from all of the local systems, transforms it and uploads it to Salesforce. Currently we use Salesforce mainly as a Ticketing system and there is no write-back to the local provisioning systems. The data is refreshed in Salesforce every 2 hours.
The function that the data in Salesforce provides is that as customer calls or emails come in, they are matched against Contact information (pulled from the local provisioning systems).
Our Marketing team now wants to implement Salesforce Marketing Cloud, which pulls data from the Contacts table.
Now you have the background - here is the issue:
We have a lot of 'duplicate' data. The 2 most common scenarios are:
Person A has an account with Brand X and Brand Y. The ETL sees 2 unique contact objects, assigned to 2 different brands and 2 different accounts with 2 different products, and imports them as 2 separate contacts, even though the relevant details (Name, Email, Phone number) are the same.
Person A has multiple accounts within a Brand. Sometimes this is because they have multiple businesses, or they are a reseller but aren't using a Reseller login, or whatever. Within the Provisioning system each of these contacts has its own unique primary key, so, as above, the ETL sees multiple unique contacts belonging to different accounts and having different product sets.
This causes an issue with the Match rates for case generation and it also potentially causes an issue with the Marketing tool that we want to use - in particular an Individual with Multiple accounts might receive multiple copies of Marketing material.
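To make the two scenarios concrete, here is a rough Python sketch of how the duplicates could be detected without merging or deleting anything. The field names (`source_id`, `brand`, `name`, `email`, `phone`), the sample records and the matching rule (exact match on normalized name, email and phone digits) are all assumptions for illustration, not our actual ETL or Salesforce schema.

```python
from collections import defaultdict


def match_key(contact):
    """Build a normalized key from the fields that identify a person.

    Assumption: name, email and phone are enough to call two records
    'the same person'. A real rule would likely need fuzzy matching.
    """
    name = " ".join(contact["name"].lower().split())
    email = contact["email"].strip().lower()
    phone = "".join(ch for ch in contact["phone"] if ch.isdigit())
    return (name, email, phone)


def find_duplicate_groups(contacts):
    """Group source IDs by match key and keep only keys seen more than once.

    The source records themselves are never modified or removed.
    """
    groups = defaultdict(list)
    for contact in contacts:
        groups[match_key(contact)].append(contact["source_id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical sample data: the same person in Brand X and Brand Y,
# plus one unrelated contact.
contacts = [
    {"source_id": "X-101", "brand": "X", "name": "Pat Smith",
     "email": "Pat@Example.com", "phone": "+1 555 0100"},
    {"source_id": "Y-202", "brand": "Y", "name": "pat  smith",
     "email": "pat@example.com ", "phone": "15550100"},
    {"source_id": "X-303", "brand": "X", "name": "Lee Wong",
     "email": "lee@example.com", "phone": "555-0199"},
]

duplicate_groups = find_duplicate_groups(contacts)
```

Here `duplicate_groups` ends up with a single entry that links `X-101` and `Y-202`, which is exactly the cross-brand case described above.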
The Marketing team wants to just De-Dupe the Contact data. I have a few theories as to how to do that, but each of them would either:
1: Lose relationships between the local systems and Salesforce - There are some ancillary issues that may be generated here since Products are bound to contact information and when a case is created, the associated products are presented to the Agent.
2: Create some form of relationship mapping to try and preserve things - this may get extremely messy and may end up crossing Brand/Account boundaries, which IMO could create Legal/Privacy issues if anything went wrong. Not to mention it creates something that is not pulled from the Source provisioning system.
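As a rough illustration of what I mean by option 2, the sketch below builds a mapping (a "crosswalk") from one marketing recipient to the source contact IDs behind it, leaving the replicated contacts untouched. Everything here is hypothetical: the function name `build_crosswalk`, the field names, the sample records, and the choice of email as the matching field. The `cross_brand` flag is only there to show where the Brand boundary question comes in.

```python
def build_crosswalk(contacts, cross_brand=False):
    """Map each marketing recipient to the source contact IDs behind it.

    By default a recipient is one email address *within one brand*, so the
    mapping never links records across brands. With cross_brand=True the
    brand is ignored and one email address becomes one recipient overall.
    """
    crosswalk = {}
    for contact in contacts:
        email = contact["email"].strip().lower()
        key = (email,) if cross_brand else (contact["brand"], email)
        crosswalk.setdefault(key, []).append(contact["source_id"])
    return crosswalk


# Hypothetical sample data: one person with two Brand X accounts
# and one Brand Y account.
sample_contacts = [
    {"source_id": "X-101", "brand": "X", "email": "pat@example.com"},
    {"source_id": "X-102", "brand": "X", "email": "Pat@example.com"},
    {"source_id": "Y-202", "brand": "Y", "email": "pat@example.com"},
]

per_brand = build_crosswalk(sample_contacts)
across_brands = build_crosswalk(sample_contacts, cross_brand=True)
```

With the per-brand default this person would get one mailing per brand (two in total); with `cross_brand=True` they would get one, at the cost of linking Brand X and Brand Y records, which is the part that worries me.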
Currently, I've designed this process with the mantra that what is in the source system must be replicated to Salesforce, but what I'm being asked to do will break that.
So - to sum up and ask a question: Are there any 'rules' about De-duplication and Data Integrity I should be aware of or reference? How would you architect such a requirement? Any Pitfalls or 'gotchas' or similar I should be aware of? Should I even pursue this course of action?
Edit:
I'm looking more for some guiding principles as opposed to a specific answer. I appreciate that it's a bit broad, however breadth is what I need right now: I should be able to work out the detail myself, but I need some help on the big picture.
The first answer's advice to find out what the business needs is a good one - so I'm going to sit down with the Marketing team and other stakeholders and see what we get.
More information and general principles/advice like that is appreciated.