Big Data doesn’t always mean Good Data

Big data analytics is at the forefront of several of our most important industries. From healthcare to manufacturing, supply-chain management to finance, analyzing huge quantities of data to produce valuable business insight has become the norm. Data analytics can yield essential information to predict financial trends, streamline a company’s operations, understand target sales markets, and even help cure diseases. But with countless terabytes of information flowing through today’s networks, an important question arises: How much of this data is actually good?

Good data is clean data: data free of crucial errors that render it useless or, even worse, harmful to the organization itself. Data that does contain such errors is referred to as, not surprisingly, dirty data. And although the term may seem playful, its effect is decidedly not; it is estimated that bad data costs US businesses $600 billion annually, and it also decreases productivity, wastes resources and can damage a company’s credibility. Dirty data is a serious problem in today’s market. An important goal for any business using big data analytics is to understand where dirty data comes from and how to prevent it.

Dirty data takes many forms and comes from a wide variety of sources:

  • Incomplete data is the most common occurrence. When important fields on master data records are left blank, the records cannot be segmented by those fields.
  • Duplicate data is another very large culprit. Duplication can occur in many situations, and repeated customer records or repeated values skew the analytics built on them.
  • Natural decay occurs in some data sets, such as those used in marketing. Customer information degrades through normal, expected change: people move, leave jobs and change addresses. Most data is said to degrade at about 2% per month.
  • Incorrect data occurs when field values fall outside their valid range. For example, a month field should hold a value from 1 to 12; any number outside that range is incorrect.
  • Inconsistent data is found when the same field values are stored in several different places. When those copies fall out of sync, inconsistencies follow.
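
To make these categories concrete, here is a minimal Python sketch of how each one might be flagged. Everything specific in it is an assumption made for illustration: the field names (customer_id, email, signup_month), the choice to treat a repeated customer_id as a duplicate, and the reading of the 2% monthly figure as compounding from month to month. It is a rough starting point, not a definitive audit tool.

```python
from collections import Counter

# Hypothetical schema: these field names are illustrative only.
REQUIRED_FIELDS = ("customer_id", "email", "signup_month")


def audit_records(records):
    """Return a dict mapping each issue type to the indexes of offending records."""
    issues = {"incomplete": [], "duplicate": [], "incorrect": []}

    # Count how often each customer_id appears so repeats can be flagged.
    id_counts = Counter(r.get("customer_id") for r in records)

    for i, record in enumerate(records):
        # Incomplete: a required field is missing, None, or blank.
        if any(record.get(f) in (None, "") for f in REQUIRED_FIELDS):
            issues["incomplete"].append(i)

        # Duplicate: the same customer_id appears on more than one record.
        cid = record.get("customer_id")
        if cid not in (None, "") and id_counts[cid] > 1:
            issues["duplicate"].append(i)

        # Incorrect: a month must fall in the valid range 1 to 12.
        month = record.get("signup_month")
        if isinstance(month, int) and not 1 <= month <= 12:
            issues["incorrect"].append(i)

    return issues


def find_inconsistent(source_a, source_b, field):
    """Return customer_ids whose value for `field` differs between two stores."""
    b_by_id = {r["customer_id"]: r for r in source_b}
    return [
        r["customer_id"]
        for r in source_a
        if r["customer_id"] in b_by_id
        and r.get(field) != b_by_id[r["customer_id"]].get(field)
    ]


def fraction_still_valid(months, monthly_decay=0.02):
    """Share of records still accurate after `months`, assuming compounding decay."""
    return (1 - monthly_decay) ** months
```

Under the compounding assumption, fraction_still_valid(12) comes out a little under 0.79, which would mean roughly a fifth of a customer list has gone stale after a single year.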

A sobering fact is that dirty data is inevitable in big data analytics. It is impossible to remove or avoid completely. Dirty data is here to stay, but there are ways to make sure your big data is as clean as can be.

Once your company has determined the source(s) of its dirty data, the daunting task of data cleaning begins. You must apply meticulous data cleaning algorithms to your data sets, algorithms that look for duplicates, correct errors and update information. Once previously obtained data has been cleansed, the focus should shift to how data is collected and maintained. Introduce technology to ensure reliable and secure data collection, with numerous checkpoints to promote accuracy. Create data purity standards that are shared with all relevant parties on your staff. It is important to remember that data-driven business is in a constant state of renewal; data collection and application are growing exponentially, which means that your organization must be sure that its data management platforms align with future plans for growth.
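
The two halves of that advice, cleaning what you already hold and adding checkpoints at collection, might look something like the Python sketch below. It is only an illustration under stated assumptions: the field names (customer_id, email, signup_month, updated), the rule that the most recently updated record wins, and the use of ISO-formatted date strings are all hypothetical choices, and a real cleaning pipeline would need rules drawn from your own data purity standards.

```python
def validate_at_entry(record):
    """Checkpoint for new data: return a list of problems; empty means accept."""
    problems = []
    if not str(record.get("email") or "").strip():
        problems.append("email is required")
    month = record.get("signup_month")
    if not isinstance(month, int) or not 1 <= month <= 12:
        problems.append("signup_month must be an integer from 1 to 12")
    return problems


def clean_records(records):
    """Normalize emails, then keep the most recently updated record per customer."""
    latest = {}
    for record in records:
        cleaned = dict(record)  # copy so the caller's data is left untouched
        # Correct formatting errors: trim whitespace and lowercase the email.
        cleaned["email"] = str(cleaned.get("email") or "").strip().lower()
        cid = cleaned["customer_id"]
        # Assumes ISO dates (YYYY-MM-DD), which compare correctly as strings.
        if cid not in latest or cleaned["updated"] > latest[cid]["updated"]:
            latest[cid] = cleaned
    return list(latest.values())
```

Running validate_at_entry on every incoming record, and rejecting or flagging any record that returns a non-empty list, is one simple form of checkpoint. Running clean_records periodically over stored data covers the duplicate and formatting problems that slip through anyway.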

Dealing with dirty data is one of the most important challenges facing a data-driven company. Big data analytics is one of the most exciting and impactful fields in business, and data cleansing is the key that makes it all work. So scrub up, and make sure that data is spotless.