Big Data - a harvest disaster or the holy grail? Practical methods for harnessing the potential of data

How to overcome difficulties in collecting and using data?

More and more companies on the market appreciate the importance of information and excel at collecting very large amounts of data. Unfortunately, putting that data to proper use surprisingly often proves a challenge.

Big data is a concept that has entered the consciousness of almost every industry in recent years, and using data to increase a company's competitiveness and productivity increasingly appears on lists of strategic goals. Many companies have taken data collection very seriously and now find themselves in possession of massive collections of valuable data - but possession is not synonymous with the ability to harness its potential.

Impediments to data exploitation

Today's organizations that have already amassed sizable data records often face a wall that makes it difficult to turn that information into helpful insights or guidelines for business growth - usually because the data lacks structure and organization at various levels.

A harvest disaster 

One of the problems that businesses face is what could be called a harvest failure: the business has multiple data sources, all of which contain a similar range of information, but the data is so inconsistent that it cannot be used correctly. A related problem that significantly impacts the usability of the data is undocumented or inaccessible logic hidden in the raw data, or the lack of consistent terminology and definitions for the characteristics that describe the core business metrics.

Culture of the organization 

Another aspect that significantly reduces the potential profits from data lies in the culture of the organization. If a company does not enforce validation of decisions and actions through sound data analysis, or if the decision-makers do not have sufficient skills and proper 'hygiene' in working with data, then you can be almost certain that the business is not taking full advantage of the opportunities that lie in its databases.

Each of these problems separately can cause serious difficulties, and sometimes companies even struggle with several of them at the same time. The longer they remain unresolved, the more they can result in unpleasant consequences and even the loss of market advantage to companies that effectively and efficiently manage and use the information they collect. Lack of data organization also leads to an uncoordinated or mutually contradictory understanding of the effectiveness and correctness of decisions, customer needs, or market direction, resulting in inconsistent actions taken by independent units of the company, which in extreme cases may work against each other. All these difficulties ultimately lead to a breakdown in the organization's culture and a vicious circle - problems result in a loss of trust in data, which in turn leads to an exacerbation of the severity of the aforementioned problems.

Addressing data management issues

A fundamental step that significantly improves the way a company uses its collected information is to create and share a comprehensive and easily accessible data model. Creating business growth strategies, expanding or improving delivered product functionality, and discovering new customer acquisition paths are just some of the areas that can benefit from efficient and effective management of collected information. Moreover, supporting decision-making processes by providing consistent business insights is essential for creating an effective strategy and synchronizing activities across different areas of the company’s operations. 

Organizations that want to fully leverage the potential of collected information, reduce the time required to build business cases based on data, and strive to be fully data-driven will need to focus their attention on modeling the available information. The process of collecting and structuring information can be divided into four key stages, which together help remedy an ineffective organization of data flow.

  1. Data Source: Depending on the type of organization and services provided, as well as the defined analytical needs, a variety of data sources may be required. Organizations very often obtain a key part of their data from a product or transactional database (e.g. MySQL, PostgreSQL). There is often also a need to integrate behavioral data (e.g. Amplitude, Mixpanel) or to pull raw data from the SaaS (Software as a Service) tools in use via their APIs.
  2. Data Lake: In order to effectively manage data acquired from disparate sources with inconsistent characteristics (i.e. scope and purpose of data, different temporal resolution, level of aggregation, variability of data over time, etc.), a space that allows data to be collected in one place is essential. The flexibility of the environment (i.e. easily increasing the amount of available space) is important, as is the ability to pre-structure and partition the data. It is worth noting that the Data Lake stores raw data exactly as extracted from the source, without applying any transformations - a 'schema-on-read' approach.
  3. Data Warehouse: Gathering all data in the Data Lake stage allows you to access the necessary information from a unified interface. Still, the data requires structuring schemas, making variable and characteristic names consistent, resolving data quality issues, and building a unified and consistent model. A key goal of this stage is to centralize assumptions and unify the method of counting the most commonly used characteristics. The most important role of this stage is to facilitate access to data for business users without significant experience in working with SQL - reducing the need to combine, filter, and aggregate data - ultimately building 'self-service' analytics. In the case of changing business definitions, we gain the ability to adjust assumptions in one place and update and distribute consistent information to all users and applications. 
  4. Data Mart: In many cases, a well-structured and documented Data Warehouse that is a 'single source of truth' is able to meet the analytical needs of all teams within a company. Often, however, as data availability, demand, and awareness increase, the size of the Data Warehouse can be overwhelming to the end-users. Additionally, in many units of the company, there may be a need to use already available data in a different form (specific to a particular unit) or in combination with a different set of data (not applicable to other departments). This may result in the need to create dedicated analytical spaces for the needs of individual business units.  
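The four stages above can be sketched end to end. The following is a minimal illustration, not a production pipeline: it uses Python's built-in sqlite3 as a stand-in for a real warehouse, and all table, column, and field names are invented for the example.

```python
import json
import sqlite3

# Stand-in for a real warehouse (in practice: Snowflake, BigQuery, etc.).
con = sqlite3.connect(":memory:")

# Stages 1-2, Data Source -> Data Lake: land raw payloads untouched
# ('schema-on-read' - no transformations are applied on the way in).
con.execute("CREATE TABLE lake_events (raw TEXT)")
raw_events = [
    {"user": "a1", "amount_usd": "19.99", "ts": "2023-01-05"},
    {"usr": "b2", "amountUSD": "5.00", "ts": "2023-01-06"},  # inconsistent source schema
]
con.executemany("INSERT INTO lake_events VALUES (?)",
                [(json.dumps(e),) for e in raw_events])

# Stage 3, Data Warehouse: apply a schema and unify names, types,
# and definitions in one central place.
con.execute("CREATE TABLE wh_orders (user_id TEXT, amount_usd REAL, order_date TEXT)")
for (raw,) in con.execute("SELECT raw FROM lake_events"):
    e = json.loads(raw)
    con.execute(
        "INSERT INTO wh_orders VALUES (?, ?, ?)",
        (e.get("user") or e.get("usr"),
         float(e.get("amount_usd") or e.get("amountUSD")),
         e["ts"]),
    )

# Stage 4, Data Mart: a view tailored to one team's reporting need.
con.execute("""
    CREATE VIEW mart_daily_revenue AS
    SELECT order_date, ROUND(SUM(amount_usd), 2) AS revenue_usd
    FROM wh_orders GROUP BY order_date
""")

print(con.execute("SELECT * FROM mart_daily_revenue ORDER BY order_date").fetchall())
# -> [('2023-01-05', 19.99), ('2023-01-06', 5.0)]
```

Note how the two inconsistent source schemas are reconciled only once, in the warehouse step; every consumer downstream of `wh_orders` sees a single set of names and types.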

Each of the above-mentioned stages requires appropriately selected tools that facilitate processing, structuring, testing, or documentation of the collected data. An efficient implementation of the Data Warehouse deserves special attention. The key issues to consider during its design are:

  • Choosing a tool that allows effective development, documentation, and testing of the data model you are creating. dbt, for example, is worth considering in such a case (https://www.getdbt.com/).
  • Standardizing conventions for project structure, object naming, and SQL formatting. A valuable resource on this topic is: https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355
  • Ensuring that a development environment is set up to test the correctness of changes made without unnecessary interaction with production data.
  • Choosing an efficient and convenient form of data structuring for a Business Intelligence tool. A model of dimension and fact tables, often with an OLAP structure built on top of it, is very common. When business requirements and analytical needs change dynamically, however, investing in a complex final data structure may not always pay off; in such cases, the 'wide table' approach may prove to be the solution.
  • Due to the diverse needs of end-users for BI analytics, BI analysts often require the freedom to make final modifications to the data. The BI tools in use are naturally capable of implementing such modifications, but modifying data at the SQL level is very often much less labor-intensive. For BI tools, it is therefore worthwhile to expose data in view form.
  • Views used for BI analytics that are built on previously prepared Data Warehouse tables do not usually have significant performance problems. For very large data sets, it can be helpful to materialize the view as a materialized view or a table.
  • There may be particular concerns about the performance of queries generated by BI tools when using the 'wide table' approach compared to an OLAP structure. On increasingly popular cloud databases (Snowflake, BigQuery), however, the 'wide table' approach can prove to be 25-50% faster than the standard model of dimension and fact tables (https://www.fivetran.com/blog/star-schema-vs-obt).
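To make the 'wide table' idea concrete, here is a minimal sketch (again with sqlite3 as a stand-in warehouse and hypothetical table names): instead of joining a fact table to its dimensions at query time, the joins are performed once and materialized into a single denormalized table that BI tools can filter and aggregate directly.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# A minimal star schema: one fact table plus one dimension table.
con.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE fct_orders (order_id INTEGER, customer_id INTEGER, amount_usd REAL);
    INSERT INTO dim_customer VALUES (1, 'PL'), (2, 'DE');
    INSERT INTO fct_orders VALUES (10, 1, 40.0), (11, 1, 10.0), (12, 2, 25.0);

    -- 'Wide table': the join is done once, ahead of time, and materialized.
    CREATE TABLE wide_orders AS
    SELECT o.order_id, o.amount_usd, c.country
    FROM fct_orders o JOIN dim_customer c USING (customer_id);
""")

# A BI-style query now aggregates a single table, with no joins at query time.
print(con.execute(
    "SELECT country, SUM(amount_usd) FROM wide_orders GROUP BY country ORDER BY country"
).fetchall())
# -> [('DE', 25.0), ('PL', 50.0)]
```

The trade-off is the usual one: the wide table duplicates dimension attributes on every row and must be rebuilt when its sources change, in exchange for simpler and often faster BI queries.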

Data modeling and organization is a vast, somewhat overwhelming topic that requires a lot of planning and work - but ignoring this process is very often a major source of the problems organizations face on their path to full maturity. That is why investing resources in good, thoughtful structuring and modeling of available data will almost always pay for itself.
