Founder @ Zingg.AI
September 25, 2021
Need for data mastering
As data practitioners and business leaders, we deal with the unfortunate truth of Garbage In, Garbage Out (GIGO) every day. To be truly data-driven, access to clean, consistent, and unified data is mission-critical. Our data management systems and processes have to be reliable so that our business decisions are based on an accurate view of the activities happening in our organization.
Unfortunately, because of how our core data management systems are structured, the data about core entities like customers or suppliers gets fragmented. The data silos, each containing its own version of the truth, are not unified, leading to a loss in business agility, reduced trust in analytics, poor operational efficiency, and a limited ability to plan budgets and forecast different business scenarios. What we need instead is reliable and unified data that we can leverage for each business activity: customer outreach and personalization, sales and marketing, supplier relationships, diversity, and risk planning in the supply chain, to name a few.
Data mastering, or master data management, is the process of establishing a unified view of core business entities, and it is fundamental to achieving this.
Challenges in data mastering
Unfortunately, despite using solutions that analysts rate highly and spending millions of dollars on design and implementation, most master data management (MDM) systems do not provide a return on investment unless supplemented with substantial human capital. It is estimated that, on average, the people costs of an MDM project are four times the software license cost, indicating that system integrators, freelance MDM consultants, and internal staff are needed to get value from MDM solutions.
While some cost for full-time employees or experts is to be expected, since different stakeholders and business units must collaborate, human costs amounting to 4x the software license costs indicate that something is amiss with the base MDM package. Existing MDM products have failed to deliver on the promise of self-serve configuration, lightweight architecture, and easy deployments. They have also failed to leverage newer technology such as containers and the cloud to ease data management.
Traditional data mastering has been predominantly rule-based: data preprocessing, cleansing, profiling, standardization, fuzzy rule creation and tuning, and trust and validation rules take up a large share of the project time. Rule creation and tuning are labor-intensive and time-consuming, and they require specialists in the data mastering tool, who must first understand the business context and requirements and then craft the rules. Because the results can only be validated once the rules are deployed, the business expert's feedback comes much later in the implementation cycle, and any minor change pushes the timelines out further. Traditional data mastering tools, by design, force deployments into long implementation cycles.
In addition, rules are difficult to generalize to newer sources of data. As a result, data mastering systems often master only a few key systems, and the promise of unified data across the entire enterprise dissipates as new sources are added.
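To make the generalization problem concrete, here is a minimal sketch of a hand-written fuzzy rule. The field names, the 0.85 threshold, and the sample records are all hypothetical, chosen only for illustration; they do not come from any particular MDM product.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase the value and collapse repeated whitespace."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def rule_match(rec_a: dict, rec_b: dict, name_threshold: float = 0.85) -> bool:
    """Hypothetical rule: same email, OR similar name AND identical postcode."""
    if rec_a["email"] and rec_a["email"].lower() == rec_b["email"].lower():
        return True
    return (
        similarity(rec_a["name"], rec_b["name"]) >= name_threshold
        and rec_a["postcode"] == rec_b["postcode"]
    )


# Two sources the rule was tuned for (illustrative data).
crm = {"name": "Jonathan Smith", "email": "jon.smith@example.com", "postcode": "94105"}
billing = {"name": "Jonathon Smith", "email": "", "postcode": "94105"}

# A newly added source that writes names as "Last, First" and uses ZIP+4.
new_source = {"name": "Smith, Jonathan", "email": "", "postcode": "94105-1234"}

print(rule_match(crm, billing))     # True: the rule fits these two sources
print(rule_match(crm, new_source))  # False: same person, but the rule misses it
```

The rule works for the two sources it was written against and silently fails on the third, even though all three records describe the same person. Fixing it means another round of rule writing, tuning, and redeployment for every new format.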
An open-source and agile approach to data mastering
Instead of this waterfall approach to data mastering, it is time to adopt the agile methodology and an open-source stack, both of which have seen huge success in software development.
Agile data mastering, an iterative and incremental approach to data mastering, allows us to successfully engage domain experts and business users directly in the data mastering process, using their feedback to continuously learn and scale the system. This approach towards data mastering ensures that we get timely access to data, prioritizing core domains and systems and incrementally adding more to the data mastering process. Instead of spending months collecting requirements from different business units, agile data mastering enables us to quickly roll out mastered data.
Leveraging open source is critical too. Open source gives us control over our architecture and is far cheaper than a full-blown proprietary MDM product, half of whose features never get used.
Introducing Zingg: Open source data mastering at scale
Zingg is an open-source tool that leverages AI to quickly master different records belonging to the same party. Zingg scales to millions of records with ease and can be used as the key building block for a custom data mastering solution.
Other modern open-source data tools like Apache NiFi and Airbyte already enable us to quickly extract and load data from source systems. When we process this extracted data through Zingg, we can quickly build the relationships between the records. Zingg lets you choose the data store of your liking, RDBMS or NoSQL, to become the single source of truth. A lightweight, custom data stewardship layer can then be built atop this, catering to your business flow.
Built with open-source technologies like Apache Spark, Zingg is a modern take on the fundamental building block of data mastering: data matching. Zingg is built to scale and quickly enables data mastering for pipelines with millions of records.
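As a rough illustration of what "building the relationships between the records" means, the sketch below groups records into entities once pairwise matches are known. This is not Zingg's API or its algorithm; the record IDs, the matched pairs, and the `cluster_matches` helper are hypothetical, and the grouping uses a plain union-find.

```python
from collections import defaultdict


def cluster_matches(record_ids, matched_pairs):
    """Group record IDs into entities, given pairs already judged to match."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the root, shortening the path along the way.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    clusters = defaultdict(list)
    for rid in record_ids:
        clusters[find(rid)].append(rid)
    # Sort for a stable, readable result.
    return sorted(sorted(members) for members in clusters.values())


# Illustrative records loaded from three source systems.
record_ids = ["crm-1", "billing-7", "support-3", "crm-2"]
# Pairs a matching step (rules or a learned model) has flagged as the same party.
matched_pairs = [("crm-1", "billing-7"), ("billing-7", "support-3")]

entities = cluster_matches(record_ids, matched_pairs)
print(entities)
```

Here `crm-1` and `support-3` end up in the same entity even though they were never compared directly, because both match `billing-7`. Each resulting group can then be written to the chosen data store as one unified party.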
You can check out Zingg here.
For enterprises that have already invested in an MDM tool, agile data mastering and open source can be a quick way to augment their existing MDM tools and extend the data mastering process to the scale and variety of their source systems and entities. For all others, it can be the simplest way to leverage their data assets and make informed decisions.
Search engines showed us that information can be accessible to each of us. It is time our data mastering systems gave us the same access to enterprise data. Instead of a standard product with years of rule definitions and customisation, what is needed is a fresh approach to data mastering, where we build and deploy iteratively. It is time to build a modern MDM using open source, one that fits in with your stack, your data, and your way of doing things.
What do you think?