Fishing for Information in a Data Lake

How does fishing for information in a data lake differ from searching for information in a data warehouse?

Data fishing, also referred to as data dredging, uses data mining tools to search for statistically significant relationships between data items. Data warehouses organize data based on known relationships between data items. A data warehouse is organized to support the type of analysis that will be performed on the data and to optimize searching for specific data. A data lake, by contrast, stores data in a flat architecture in which each data item is assigned a unique identifier, rather than using a schema to organize the storage of the data.
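The contrast between the two storage approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular warehouse or lake product works: the names `SALES_SCHEMA`, `load_into_warehouse` and `load_into_lake` are hypothetical, and the `tags` attached to each item in the lake are an assumption added to make the example readable.

```python
import uuid

# Warehouse-style storage: the schema is fixed before any data are loaded.
# SALES_SCHEMA is a hypothetical example schema.
SALES_SCHEMA = ("order_id", "customer_id", "amount")


def load_into_warehouse(table, record):
    """Append a record only if its fields match the predefined schema exactly."""
    if set(record) != set(SALES_SCHEMA):
        raise ValueError(f"record does not fit schema {SALES_SCHEMA}")
    table.append(tuple(record[column] for column in SALES_SCHEMA))


# Lake-style storage: a flat store keyed by unique identifier.
# The content is kept as it arrives; tags are an illustrative assumption.
def load_into_lake(lake, raw, tags=()):
    """Store any item under a fresh unique identifier and return that identifier."""
    object_id = str(uuid.uuid4())
    lake[object_id] = {"raw": raw, "tags": set(tags)}
    return object_id


warehouse_sales = []
lake = {}

order = {"order_id": 1, "customer_id": 7, "amount": 19.99}
load_into_warehouse(warehouse_sales, order)

# The lake accepts the structured order and an unstructured note side by side.
order_key = load_into_lake(lake, order, tags=["sales"])
email_key = load_into_lake(lake, "Customer 7 complained about late delivery", tags=["support"])

# The warehouse rejects anything that was not anticipated by its schema.
try:
    load_into_warehouse(warehouse_sales, {"note": "free text"})
except ValueError as error:
    print("Warehouse rejected record:", error)

print("Items in lake:", len(lake))
```

The warehouse function encodes a prediction about what the data will look like, while the lake function makes no such prediction and simply hands back an identifier for whatever it is given.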

The data lake contains data from different functional areas and allows the data to be explored freely, without conceptual boundaries or preconceptions. Unlike a data warehouse, the formation of a data lake is not based on a prediction of how the data will be analysed. A data lake has the advantage that it removes the physical barriers of applications and schemas that help to contain data in silos and so hinder access to data. However, it is not only physical barriers that keep data in disparate silos; a range of cultural factors and problems relating to data definition also need to be addressed.

The power of the data lake is that it brings together data that have traditionally been stored and processed separately. This means that:

When fishing, the larger the lake the larger the fish.

More data from diverse sources provide the opportunity for unexpected relationships between data items to be identified. However, fishing for data only catches data; further analysis of the data relationships hooked is needed to derive meaningful information.

Tips for fishing for information in a data lake are:

  1. Know where to look for the type of fish that you want. The size and diversity of the data lake will affect the range of information that can be netted.
  2. Invest in refreshing the lake with good-quality, healthy data.
  3. Choose the right equipment and bait to tempt the data relationships to your hook. Tools are needed to explore the breadth and depth of data in the lake.
  4. Check licensing requirements. Ensure that issues such as privacy, confidentiality and data ownership are not breached when preparing or using the lake.
  5. Approach with an open mind; do not prejudge what you might catch.
  6. Review your catch carefully and throw back the red herrings.
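
The last tip matters because data dredging tests so many pairs of data items that some will look related purely by chance. The following Python sketch illustrates this under loud assumptions: the "lake" is filled with random noise (so every relationship hooked is a red herring by construction), the helper names are hypothetical, and the threshold of 0.361 is approximately the two-tailed 5% critical value of Pearson's correlation coefficient for 30 observations. Checking the catch against a fresh sample is one simple way of reviewing it, not the only one.

```python
import itertools
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def make_noise_lake(rng, n_columns, n_rows):
    """Build columns of independent random noise: no real relationships exist."""
    return {
        f"col_{i}": [rng.gauss(0, 1) for _ in range(n_rows)]
        for i in range(n_columns)
    }


def hooked_pairs(lake, threshold):
    """Return every pair of columns whose correlation exceeds the threshold."""
    return [
        (a, b)
        for a, b in itertools.combinations(lake, 2)
        if abs(pearson_r(lake[a], lake[b])) > threshold
    ]


N_COLUMNS = 20    # 20 columns give 190 distinct pairs to test
N_ROWS = 30
THRESHOLD = 0.361  # approx. 5% two-tailed critical value of r for 30 rows

rng = random.Random(2014)  # fixed seed so the run is repeatable
first_cast = make_noise_lake(rng, N_COLUMNS, N_ROWS)
second_cast = make_noise_lake(rng, N_COLUMNS, N_ROWS)

# Fishing: keep every pair that looks significant in the first sample.
hooked = hooked_pairs(first_cast, THRESHOLD)

# Reviewing the catch: keep only pairs that also look significant in fresh data.
confirmed = set(hooked_pairs(second_cast, THRESHOLD))
survivors = [pair for pair in hooked if pair in confirmed]

print("Pairs tested:", N_COLUMNS * (N_COLUMNS - 1) // 2)
print("Hooked in first sample:", len(hooked))
print("Still hooked in second sample:", len(survivors))
```

With a 5% threshold, roughly one pair in twenty would be expected to pass by chance alone even though the columns are unrelated, which is why a relationship hooked once needs further analysis before it is treated as information.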

The rewards of fishing are both the catch and the experience of being with nature. Fishing for information in a data lake allows you to freely explore the organization’s data in an open environment away from organizational structures and policies. Enjoy the experience of exploring your data in a different way. Happy fishing!

Further Reading: the challenges of information silos are discussed in Chapter 7, and legal and ethical issues in information management are discussed in Chapter 5.

Please use the following to reference this blog post in your own work:

Cox, S. A., (2014), ‘Fishing for Information in a Data Lake’, 20 June 2014, http://www.managinginformation.org/fishing-data-lake/, [Date accessed: dd:mm:yy]

 

© 2014 Sharon A Cox