Solving the dark data problem

Vishal Krishna

A decade ago, organisations had just discovered the possibilities of crunching data at scale. Pundits called it the era of big data, and so began the narrative of crunching structured and unstructured data.

As the narrative shifted to artificial intelligence and machine learning, organisations began to grapple with a larger problem: the explosion of information. They stored information in digital archives without knowing what to do with it. Today we are living in the era of dark data.

Think of it like Frank Zappa's "Vault", which holds so much music that musicians and musicologists are still discovering new material to this day.

Today, organisations generate so much data, which eventually ends up in a vault, waiting for someone to make sense of it. It is one of the largest business opportunities now that big data and AI technologies have become mainstream globally.

For those who want to know what the technical meaning of dark data is:

Dark data is all of the unused, unknown and untapped data across an organization, generated as a result of users’ daily interactions online with countless devices and systems — everything from machine data to server log files to unstructured data derived from social media.

Organisations use AI and machine learning to crunch dark data and put it to good use in business.

IBM's Datacap, Google's Cloud Vision and AutoML, and Microsoft's Azure Cognitive Services are some of the technologies used by dark data practitioners.

Remember that data you don't use is still exposed to security risks and theft, which means you have to invest in securing assets you have no plan for. What if sensitive data is stolen? Today, data stored away in public clouds can be at higher risk of attack.

All data now needs to comply with the local laws of each country and with regulations such as GDPR, so dark data can expose a company during an audit by regulators. Be aware of the data you store, and put it to good use.

Who is putting dark data to good use?

Genpact and Envision Racing have created a novel way to use data received in the form of audio streams from racing tracks.

Envision Racing, a leading e-racing team, was able to draw insights from alternative data sources, such as GPS and radio. By cleansing and analyzing publicly available GPS race data, for example, Envision Racing creates heat maps that reveal rival drivers' tendencies on the track.

Race strategists can then identify the drivers who are likely to over-consume energy and how they might behave on different parts of the circuit, giving the Envision Racing drivers a decision-making edge.
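To make the heat map idea concrete, here is a minimal sketch in Python. It is not Envision Racing's or Genpact's method, which the source does not describe. It assumes the GPS samples have already been cleansed and converted to track coordinates in metres, and the function name, the 10-metre cell size and the sample points are all hypothetical. It simply counts how many position samples fall into each grid cell, so cells with higher counts mark where a car spent more time.

```python
import math
from collections import Counter


def build_heat_map(points, cell_size=10.0):
    """Count position samples per square grid cell.

    points: iterable of (x, y) positions in metres (assumed already cleansed).
    cell_size: edge length of a grid cell in metres (hypothetical default).
    Returns a Counter mapping (column, row) cell indices to sample counts.
    """
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    counts = Counter()
    for x, y in points:
        # floor() keeps negative coordinates in the correct cell
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts


# Hypothetical samples for one rival car; real data would come from a GPS feed.
samples = [(1.0, 1.0), (2.0, 3.0), (12.0, 1.0), (-1.0, 0.0)]
heat_map = build_heat_map(samples)
for cell, count in heat_map.most_common():
    print(cell, count)
```

A real analysis would weight each cell by something more telling than a raw count, such as speed or estimated energy use, but the binning step would look much the same.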

Beyond AI and machine learning, the Envision and Genpact teams needed skills in Docker, Kubernetes, Java, Python and NLP. On top of all this, the teams added an MLOps architect to manage the complete process.

The first use case covered the future of racing, which is electric. The second delves into an industry that defined the last 150 years: oil and gas.

The oil and gas industry has been recording subsurface data on tape for more than 40 years.

Oil and gas companies collect data at various scales, from a few miles to hundreds of miles, down to tiny samples of rock being drilled. This data is stored offsite on tapes, so it yields none of the benefits of digital use.

According to AWS, storing data in tape vaults can be expensive on a cost-per-GB basis.

Tape Ark, an IT company, realised that the industry needs to use the data locked on legacy tape as it goes through digital transformation. It worked with AWS to use the cloud to make tape data available digitally.

The partnership produced a high-level workflow that starts by receiving media and performing a detailed tape media audit. The audit lets oil and gas companies predict, at a granular level, the cloud footprint their data will create, seek out duplicates, and exclude from ingest any data that may be part of a joint venture (JV) or belong to a third party. This approach lets the oil and gas industry deal with its dark data at scale.
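The audit step can be illustrated with a short Python sketch. This is not Tape Ark's or AWS's actual tooling, which the source does not detail. It assumes the audit has already produced a manifest with a size, a content checksum and an ownership label for each file. The `TapeFile` record, the `plan_ingest` function and the owner labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TapeFile:
    """One manifest entry from a tape audit (hypothetical record layout)."""
    tape_id: str
    path: str
    size_bytes: int
    checksum: str  # content hash, assumed to be computed during the audit
    owner: str     # assumed labels: "own", "jv" or "third_party"


def plan_ingest(files):
    """Split a manifest into files to ingest, duplicates and exclusions.

    Returns (to_ingest, duplicates, excluded, footprint_bytes), where
    footprint_bytes is the predicted cloud footprint of the ingest set.
    """
    seen_checksums = set()
    to_ingest, duplicates, excluded = [], [], []
    for f in files:
        if f.owner != "own":
            # JV or third-party data is held back from ingest
            excluded.append(f)
        elif f.checksum in seen_checksums:
            # same content already queued from another tape or path
            duplicates.append(f)
        else:
            seen_checksums.add(f.checksum)
            to_ingest.append(f)
    footprint_bytes = sum(f.size_bytes for f in to_ingest)
    return to_ingest, duplicates, excluded, footprint_bytes


# Hypothetical manifest; real entries would come from the media audit.
manifest = [
    TapeFile("T1", "survey_a.dat", 100, "abc", "own"),
    TapeFile("T2", "survey_a_copy.dat", 100, "abc", "own"),
    TapeFile("T3", "partner.dat", 50, "def", "jv"),
    TapeFile("T4", "survey_b.dat", 200, "ghi", "own"),
]
to_ingest, duplicates, excluded, footprint = plan_ingest(manifest)
print(len(to_ingest), len(duplicates), len(excluded), footprint)
```

Keeping the duplicates and exclusions as separate lists, rather than silently dropping them, leaves a record that can be reviewed before any data is discarded.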

Coming to India

Now let's look at the fintech opportunity in India as the third use case, because Indian banks will spend on IT transformation in which crunching data will be of paramount importance.

According to Gartner, Indian banks will spend heavily on IT to manage core banking, loans and consumer experience. On average, a bank has more than 200 applications, and its data is stored in several formats. Today, banks do not recognise their most loyal customers digitally, and customers receive so many calls from their banks that the experience breeds animosity.

Gartner forecast that IT spending in India would total $101.8 billion in 2022, an increase of 7 percent compared with 2021.

Going by this forecast, fintech companies in India will have to set policies for using technology that can crunch all forms of data at scale before discarding what is not necessary.

To sum up: dark data just lying in a data centre somewhere is not good for the enterprise, because of the security and business risks.

Neither is it good for the planet because of the energy being consumed to manage unused data.

Either way, it is clear that organisations today need to use every bit of their data, be it for reporting and compliance or for scaling up their business. It is no longer prudent to hoard data; one must use it to scale the business.

According to CB Insights, with over 175 zettabytes of data expected by 2025, data centers will play a vital role in the ingestion, computation, storage, and management of information.

An even larger role will then fall to platforms that enable seamless data flow between enterprises and vendors, using data that everyone previously thought had to be shelved.

Don't be in the dark anymore; lead your data into the light.