Now that the whirlwind that was 2016 is behind us, forecasting what’s next for data seems like a tall order: many of the supposedly sure-fire predictions from this past year proved embarrassingly incorrect. So in looking ahead to 2017, we must start our predictions on safe, solid ground. Given recent, steady developments across industries, it is a reasonably safe bet that the importance of comprehensive data governance will continue to gain recognition in a growing number of them through 2017.
Data Governance in Banking
Thanks to comprehensive regulatory requirements originating from the Basel Committee on Banking Supervision in 2013, the banking sector generated a range of challenging requirements for data governance capabilities that many software vendors have sought to meet. In addition, the modern data landscape within broader enterprises has brought into focus how the benefits of developments such as data lakes and end-to-end business-led initiatives (encompassing data wrangling, analysis and visualization, often with minimal IT oversight) are frequently countered by reduced central visibility and an associated perceived loss of control and confidence. There is neither the capability nor widespread demand to put the genie back in the bottle when it comes to these developments; rather, new capabilities are sorely needed to maintain and even improve confidence in the veracity, provenance and quality of data powering the analytical maelstrom within modern enterprises, and between them and their customers.
Centralised visibility of metadata, and the means to augment this metadata with additional concepts, is integral to each key vendor's solution approach. The distinction between technical metadata (for example, table and column definitions including names and data types, and the lineage of data flows from source systems through to end points such as reports and dashboards) and business metadata (for example, business concepts and terms, glossaries and associated owners) remains absolute. The primary challenge is bringing these domains together in a usable, intuitive manner that ideally satisfies the needs of both a Chief Compliance Officer who is seeking the definitive definitions of his bank's products and a software architect who needs to know which downstream visualizations may be affected if she implements a change to the structure of a Teradata view.
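To make the architect's question concrete, the sketch below shows one way technical lineage and a business glossary might be joined for impact analysis. It is a minimal illustration under stated assumptions, not any vendor's implementation: the asset names, the `LINEAGE` and `GLOSSARY` structures and both helper functions are hypothetical.

```python
from collections import deque

# Hypothetical technical metadata: each asset maps to the assets that read from it.
LINEAGE = {
    "teradata.sales_view": ["etl.daily_sales_load"],
    "etl.daily_sales_load": ["mart.sales_summary"],
    "mart.sales_summary": ["dashboard.regional_sales", "report.monthly_revenue"],
    "dashboard.regional_sales": [],
    "report.monthly_revenue": [],
}

# Hypothetical business metadata: glossary terms with an owner and linked assets.
GLOSSARY = {
    "Net Sales": {"owner": "Finance", "assets": ["mart.sales_summary"]},
    "Monthly Revenue": {"owner": "Finance", "assets": ["report.monthly_revenue"]},
}


def downstream_assets(lineage, start):
    """Return every asset reachable from `start`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque(lineage.get(start, []))
    while queue:
        asset = queue.popleft()
        if asset in seen:
            continue
        seen.add(asset)
        order.append(asset)
        queue.extend(lineage.get(asset, []))
    return order


def affected_terms(glossary, assets):
    """Return glossary terms linked to any of the given assets, sorted by name."""
    hit = set(assets)
    return sorted(t for t, meta in glossary.items() if hit & set(meta["assets"]))


# Which assets and business terms does a change to the view touch?
impacted = downstream_assets(LINEAGE, "teradata.sales_view")
terms = affected_terms(GLOSSARY, impacted)
```

Here the traversal answers the architect's question (which dashboards and reports sit downstream of the view), while the glossary lookup answers the compliance officer's (which defined business terms depend on those assets).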
The inherent advantages of a Data Governance approach which provides a clear metadata landscape are long established, but the advent of Data Lakes, the Hadoop ecosystem more broadly and the proliferation of cloud-hosted architecture have brought renewed focus on the implications of this degree of distribution and the ever-increasing variety of data assets. Simultaneously, the Hadoop ecosystem lends itself well as the foundation of a modern data governance architecture, with its storage, compute and machine learning capabilities. Hortonworks and Cloudera offer data governance capabilities through Apache Atlas and Cloudera Navigator respectively, while all the significant players in the data management space are now hosting their capabilities on Hadoop, with the obvious intent that the Data Governance of a Data Lake would live in the very same infrastructure.
Major Players in Data Governance
Waterline Data leverages automated data 'fingerprinting', employing machine learning capabilities to attempt to match technical metadata with the corresponding business term. As with all mature data governance/quality solutions, dedicated human task workflows are embedded for verification and for collaboration around unmatched entities.
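The general idea of fingerprinting can be illustrated with a deliberately simple sketch. It substitutes regular-expression value profiling for machine learning and does not represent Waterline Data's actual method. The `FINGERPRINTS` table, the `suggest_term` function and the 0.8 threshold are all hypothetical.

```python
import re

# Hypothetical fingerprints: business term -> pattern its values should match.
FINGERPRINTS = {
    "Email Address": re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.IGNORECASE),
    "ISO Date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "UK Postcode": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}", re.IGNORECASE),
}


def suggest_term(values, threshold=0.8):
    """Return (term, score) for the best-matching fingerprint.

    The term is None when no fingerprint reaches the threshold, signalling
    that the column should be routed to a human for review.
    """
    best_term, best_score = None, 0.0
    for term, pattern in FINGERPRINTS.items():
        # Score is the share of sampled values that fully match the pattern.
        matches = sum(1 for v in values if pattern.fullmatch(v.strip()))
        score = matches / len(values) if values else 0.0
        if score > best_score:
            best_term, best_score = term, score
    if best_score >= threshold:
        return best_term, best_score
    return None, best_score


# A column of sampled values with an uninformative technical name, e.g. "col_17".
email_guess = suggest_term(["a@example.com", "b@example.org", "c@test.co.uk"])
unclear_guess = suggest_term(["2017-01-05", "n/a", "2017-02-11"])
```

Columns that fall below the threshold return no term, which is where the human verification workflow described above would take over.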
Another solution, from Attivio, capitalizes on the company's strengths through embedded text mining analytics and semantic search capabilities, bundling a Hortonworks HDP instance for those customers without a major Hadoop distribution already present.
Similarly, Collibra Data Governance Center’s flexibility, artificial intelligence-assisted automation and configurable operating model are tailored for the new big data world. The data catalog is linked to business terms, and machine learning technology recommends appropriate data. It also covers analytical models, map/reduce jobs and queries, and provides visualization of any type of relationship, including lineage relationships and context.
An expansive approach has been employed by Informatica, who have historically been very focused on data quality and governance capabilities. This continues with their relatively recent range of data governance solutions employing the Informatica Platform and Hadoop native services. Focus on new capabilities has shifted from the existing Business Glossary and Metadata Manager products under the venerable PowerCenter brand to distinct products under the Intelligent Data Platform architecture. Capabilities start with a modern cataloguing solution (Enterprise Information Catalog), with shared priority in initial releases given to supporting both Data Lake housed assets and more traditional data assets such as on-premise or cloud-hosted databases and data integration hubs. This forms the foundation of additional capability around identification and tracking of secure data (Secure@Source), more critical than ever, in Europe in particular, with the advent of the demanding GDPR, which features highly in any risk analysis of organizations processing customer data.
In addition to meeting governance challenges, Informatica have considered some of the current-state complexity of exploring and working with Hadoop-hosted data, which remains very much dependent on coding. Data analysis and preparation capabilities have been brought into the solution, making use of functionality previously seen in the cloud-based Data Preparation/Rev solution. This is aimed at analysts and data scientists, allowing them to apply standardisation and other DQ rules to the data in real time and to generate 'recipes' that can be productionised within the data integration space.
While Informatica are establishing ambitious pillars of capability, it is impossible to ignore Microsoft's entry to the market with the Azure Data Catalog. Hosted exclusively on the Azure platform, this solution complements the disruptive Power BI data visualization offering and is provided on a subscription basis with, of course, no infrastructure requirement. Supported data sources include cloud and on-premise assets, and the development of this offering is certainly one to watch.
As the range of data governance capabilities continues to grow and mature, we expect more (and increasingly smaller) organizations to seek the confidence and assurance that come with a comprehensive data governance strategy, achieved through dedicated tooling. A dynamic and competitive marketplace will continue to evolve through 2017, and suppliers may well be confident that, for organizations in an environment where regulatory pressures are not acute, competitive pressures will fill the void and focus attention on the benefits of navigating the modern enterprise data landscape.
Senior Manager, Eccella