Does Machine Learning Require a Data Lake?

As the marketing machine cranks up for Industry 4.0 and the industrial internet of things (IIoT), leaders in many organizations face a barrage of buzzwords and the high-cost solutions that come with them. For example, I was asked the other day whether an organization needs a data lake in order to do machine learning. Machine learning offers many benefits, so let’s explain these concepts and how they fit together.

 

By: Tim White

The term data lake describes a place, usually off premises, used to store raw, uncleansed, or unorganized data; it is a fundamental concept in data management. In contrast, a data warehouse is a repository of structured data that has already been processed and organized. With the continued rise of big data, data lakes have become very common. While they are often thought of as synonymous with cloud storage, data lakes can also reside on premises in a company’s own data center. What makes data lakes unorganized is the fact that multiple data sources are all stored in the same “place”, allowing for easier access for the end users.

"What makes data lakes unorganized is the fact that multiple data sources are all stored in the same “place”, allowing for easier access for the end users."

Defined at a high level, machine learning is a subset of artificial intelligence: a technique for “teaching” a computer to make decisions directly from data, without a predetermined calculation or algorithm. Machine learning should be considered whenever you face a complex problem involving large amounts of data but no existing formula or algorithm to solve it. The key word here is “data”, and generally the more you have, the better the performance. This is why machine learning is so often assumed to require a data lake.
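
To make that concrete, here is a minimal sketch (not from the original article) of learning a decision directly from data instead of hand-coding a rule. It uses the open-source scikit-learn library, and the sensor readings, labels, and feature names are invented purely for illustration.

```python
# A minimal sketch of "learning from data" rather than hand-coding a rule.
# The sensor readings and labels below are invented for illustration only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row: [vibration (mm/s), bearing temperature (deg C)]
readings = np.array([
    [1.2, 55], [1.5, 58], [1.1, 52], [4.8, 80],
    [5.2, 85], [4.5, 78], [1.3, 56], [5.0, 83],
])
# 0 = healthy, 1 = needs maintenance (labels taken from maintenance history)
labels = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# The model infers its own decision boundaries from the examples;
# no engineer has written an explicit "if vibration > X" rule.
model = DecisionTreeClassifier(max_depth=2).fit(readings, labels)

print(model.predict([[4.9, 81], [1.4, 57]]))  # e.g. [1 0]
```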

You can get real benefit from machine learning today without a data lake. If analytics are being performed on a single data set, then you probably do not need a data lake for machine learning. A single computer or server, with the right resources, would be able to store data and perform the algorithms necessary to return the information requested. A good example that many organizations could quickly benefit from is using control system data collected by a data historian to determine the remaining useful life of an asset; high-powered desktops are enough to train and test models with exactly this capability, as sketched below. There is far more opportunity in the data you are already collecting than you may realize. Additionally, starting with a simpler use case like this helps build interest and enthusiasm in the organization for supporting larger efforts.
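
As a hedged sketch of that single-machine workflow: the example below assumes historian data has already been exported to a CSV file (the file name, tag names, and the labeled remaining-useful-life column are hypothetical placeholders) and trains a simple regression model on an ordinary desktop, with no data lake involved.

```python
# Sketch: estimating remaining useful life (RUL) from historian data on one machine.
# "historian_export.csv" and its column names are hypothetical placeholders for
# whatever your data historian actually exports.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("historian_export.csv")          # tag values exported from the historian
features = df[["vibration", "temperature", "pressure", "run_hours"]]
target = df["remaining_useful_life_hours"]        # labeled from past failure records

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("Mean absolute error (hours):", mean_absolute_error(y_test, preds))
```

A random forest is used here simply as a reasonable default; the point is only that nothing in this workflow requires more than one machine.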

"A single computer or server, with the right resources, would be able to store data and perform the algorithms necessary to return the information requested."

Often, information is far more enlightening when data is correlated across multiple applications in an enterprise, and possibly even from the internet. With multiple data platforms and formats in play, a data lake may be required to store the information. Remember, though, that a data lake does not necessarily mean a large hosted cloud or data center solution unless you are pursuing a large-scale, enterprise-wide initiative.

In the end, “data lake” is just a term for an approach to data management. Most IT departments already have policies for how data is consumed, stored, and backed up across the organization. Meeting with them to understand those policies, communicate your business requirements, and work out a solution collaboratively will go a long way toward long-term success.

So, do you need a data lake in order to have machine learning? Not always. As described above, you can get started with some very beneficial use cases, such as analyzing equipment health, without one. What is more, starting small is a great way for the organization to learn valuable lessons and build the enthusiasm and support that will make larger implementations, such as creating a data lake, much more successful.

 

 

About

Tim White, Senior Manager

Tim White is a Senior Manager at T. A. Cook focused on providing services related to Digital Asset Performance Management. He previously worked in industry as a Global Director for Asset Management, responsible for 83 sites around the globe. Tim brings this real-world experience to the clients he engages, helping them shape their asset management and maintenance strategies.
