In our previous blog post, we explored the concept of the "data deluge" - the overwhelming amount of data organizations collect today - and discussed the data lake as a central repository for that data. You have probably heard the frequently quoted saying "Data is the new oil" from the British mathematician Clive Humby. However, merely having oil reserves in the ocean or under the ground doesn't create any value. It's the process of exploration, refining, and transportation that creates value and converts the oil into useful products.
Similarly, merely having a data lake isn't sufficient. To genuinely harness the power of your data and convert it into actionable insights, you require a comprehensive, organization-wide data strategy.
From Dusty Records to Data-Driven Hits: How Reliable Records Found New Harmony – A short story.
In the bustling city of Tec-haven, 'Reliable Records,' led by the music enthusiast Michael, enjoyed its status as the go-to place for music lovers, offering an expertly curated selection of CDs and vinyl records. However, the rise of 'SpinCity,' a cutting-edge digital music platform spearheaded by the innovative Sarah, threatened their reign. SpinCity's secret weapon was its use of data to personalize music recommendations, a strategy that began to lure away Reliable Records' clientele.
Feeling the heat, Michael convened a meeting to tackle the challenge head-on. A suggestion from David, a perceptive intern, sparked a transformation. David pointed out the untapped potential in analyzing the store's own customer data to refine their offerings. Taking this advice, Reliable Records embarked on a data-driven journey, uncovering a demand for niche music genres and adjusting their marketing and product strategies accordingly.
In their transformative journey, Reliable Records harnessed a variety of data to make their turnaround. They delved into purchase histories, meticulously tracking which albums and genres were selling and which were not. In-store browsing habits provided another layer of insight, revealing what customers were interested in, even if they didn’t make a purchase. They also looked into the frequency of customer visits, identifying patterns that suggested when and why certain customers returned. Social media interactions and online search trends offered a window into the broader musical interests of their customer base, helping them predict emerging trends.
This pivot not only rejuvenated sales but also propelled Reliable Records into the digital realm with an online music store that leveraged data insights for personalized recommendations. The shift was monumental, marking Reliable Records' evolution from a traditional music store into a data-savvy market leader. Their success story became a beacon in the industry, illustrating the transformative power of embracing data in navigating the digital landscape.
This fictional story reflects a common pitfall in real-world businesses: focusing solely on the "how" and neglecting the "why." Getting entrenched in the "how" (like selling CDs or cartridges) can blind companies to emerging opportunities. Look at Kodak and Nintendo, both founded a century ago. While Kodak became fixated on the "how" of film photography, Nintendo constantly innovated and adapted, focusing on the "why" of entertainment. As a result, Kodak was swept away by the digital tsunami, while Nintendo thrived.
Mindset shift
For decades, applications have been built with their own data models, locking data within the application itself. The teams maintaining these applications would also handle the associated data and data models. Integration between applications is typically achieved using data warehouses, lakes, and APIs. This approach, known as the "application-centric" approach, leads to:
Silos: Data trapped in separate applications.
Slow Changes: Updates take months due to rigid structures.
High Costs: Integration efforts add complexity and expense.
Most large enterprises have hundreds or thousands of application silos, resulting in rigidity and requiring users to switch between applications to interact with their data. For example, emails may be trapped in one application, documents in another, browsing history and chat history in yet another, and transactional data in various other applications.
A solution to this issue is to decouple the data from the applications. This new paradigm is called the "data-centric" approach. While this is a broad and rapidly evolving topic, the key idea is that data is not restricted to an application boundary, but rather collected and stored based on the principles of F.A.I.R. (Findable, Accessible, Interoperable, Reusable) as outlined in the previous blog.
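To make the data-centric idea concrete, here is a minimal, hypothetical sketch of a catalog in Python: datasets are registered with F.A.I.R.-style metadata and discovered by tag, independently of any owning application. All class names, fields, and the example URI are illustrative assumptions, not a real catalog API.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """A catalog record describing a dataset independently of any application."""
    identifier: str                 # Findable: a stable, unique ID
    uri: str                        # Accessible: where the data can be retrieved
    format: str                     # Interoperable: a standard, open format
    license: str                    # Reusable: clear terms of reuse
    tags: list = field(default_factory=list)

class DataCatalog:
    """A minimal in-memory catalog; real systems would persist and govern this."""
    def __init__(self):
        self._entries = {}

    def register(self, entry: DatasetEntry):
        self._entries[entry.identifier] = entry

    def find(self, tag: str):
        """Findability: discover datasets by tag rather than by application silo."""
        return [e for e in self._entries.values() if tag in e.tags]

catalog = DataCatalog()
catalog.register(DatasetEntry(
    identifier="sales-2024",
    uri="s3://example-bucket/sales/2024.parquet",
    format="parquet",
    license="internal-use",
    tags=["sales", "finance"],
))
print([e.identifier for e in catalog.find("sales")])  # ['sales-2024']
```

The point of the sketch is the inversion of ownership: consumers ask the catalog, not the application, where data lives and how it may be reused.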
If you have the time, it's worth watching this DCAF conference, as it delves into the data-centric approach in depth.
The Four Pillars of an Effective Data Strategy
"Data is a team sport. To get the most out of data, you need to have the right people, processes, and infrastructure in place." - DJ Patil (former Chief Data Scientist of the United States)
Building a successful data strategy requires a balanced approach across four key pillars: People, Culture, Technology, and Operations. Let's look at each of these pillars:

People:
- Skills and Expertise: Having the right talent with data literacy, analytical skills, and the ability to translate insights into actionable business decisions. Training and development programs are crucial to equip your workforce with these skills.
- Leadership Support: Executives need to champion the data strategy and provide the resources and encouragement necessary for its successful implementation.
- Cross-functional Collaboration: Data shouldn't be siloed. Encourage communication and collaboration between data analysts, business leaders, and other departments to ensure everyone utilizes data effectively.

Culture:
- Data-Driven Decision Making: Encourage a culture where data informs decision-making at all levels.
- Data Transparency: Promote open communication around data and encourage employees to ask questions and share insights freely.
- Data Democratization: Make data accessible to relevant users across the organization by providing tools and training that empower them to utilize data for their specific tasks.

Technology:
- Data Infrastructure: The tools and platforms needed to collect, store, manage, and analyze data, such as data warehouses, data lakes, cloud storage solutions, and data analysis tools.
- Data Security: Protecting sensitive data is paramount. Implement robust security measures, access controls, and data governance policies to ensure data privacy and compliance with regulations.
- Data Integration: Data often resides in various systems and applications. Invest in data integration solutions to break down silos and create a unified view of your data for seamless analysis.

Operations:
- Data Governance: Establish clear guidelines and processes for data ownership, access, quality, and security. This ensures data integrity and consistency across the organization.
- Data Collection and Management: Define clear processes for capturing data from various sources, cleaning it for accuracy, and organizing it for efficient analysis.
- Data Monitoring and Reporting: Regularly monitor data quality and analyze key metrics to identify trends and track progress towards data-driven goals.
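As one concrete example of data monitoring, a minimal quality check might compute completeness and duplicate counts over a batch of records. This is an illustrative Python sketch under assumed field names; real monitoring would run on a schedule against production data and alert on thresholds.

```python
def check_quality(records, required_fields):
    """Return simple quality metrics: field completeness and duplicate IDs."""
    total = len(records)
    # Completeness: fraction of records where every required field is filled in.
    complete = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    # Duplicates: count repeated primary keys ("id" is an assumed key field).
    seen, duplicates = set(), 0
    for r in records:
        key = r.get("id")
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {
        "completeness": complete / total if total else 1.0,
        "duplicates": duplicates,
    }

records = [
    {"id": 1, "genre": "jazz", "price": 12.99},
    {"id": 2, "genre": "", "price": 9.99},       # incomplete: empty genre
    {"id": 1, "genre": "jazz", "price": 12.99},  # duplicate id
]
print(check_quality(records, ["genre", "price"]))
```

Even metrics this simple, tracked over time, make quality regressions visible before they corrupt downstream reports.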
Let’s see how these four pillars play key roles in bringing the desired transformation.
Developing Data Products
Data product: A self-contained unit that encapsulates the processing and storage of specific domain data. This data is prepared for analytical or data-intensive use cases and made readily available to other teams through designated access points (output ports).
This definition emphasizes these key aspects:
Self-contained: A data product functions as a complete unit, housing all necessary components for processing and storing data.
Domain-specific: It focuses on a particular area of expertise or business function within the organization.
Actionable insights: The data is processed and organized specifically for analytical or in-depth data analysis purposes.
Accessibility: The processed data is readily available to other teams or users through designated access points.
This shifts the perspective on data from being a byproduct to a valuable deliverable.
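The definition above can be sketched in code: a hypothetical, self-contained data product that owns its domain data internally and exposes it only through a designated output port. All names here are illustrative assumptions, not a standard API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OutputPort:
    """A designated access point: consumers read here, never from internals."""
    name: str
    schema: tuple  # column names the consumer can rely on

class SalesDataProduct:
    """Self-contained: owns its raw domain data, processing, and output port."""
    output = OutputPort(name="monthly_revenue", schema=("month", "revenue"))

    def __init__(self, transactions):
        self._transactions = transactions  # internal state; never exposed directly

    def read(self, port: OutputPort):
        """Serve processed, analysis-ready data through the declared port only."""
        if port is not self.output:
            raise ValueError("unknown output port")
        totals = {}
        for t in self._transactions:
            totals[t["month"]] = totals.get(t["month"], 0) + t["amount"]
        return [{"month": m, "revenue": r} for m, r in sorted(totals.items())]

product = SalesDataProduct([
    {"month": "2024-01", "amount": 100},
    {"month": "2024-01", "amount": 50},
    {"month": "2024-02", "amount": 75},
])
print(product.read(SalesDataProduct.output))
```

The contract is the port's schema: the owning team is free to change internal storage and processing as long as consumers keep getting `("month", "revenue")` rows.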
Data Product Teams
Data product teams are the driving force behind creating valuable tools and services that unlock the power of data.
What they build: Data products can take many forms, but some common examples include:
Dashboards and reports: These provide real-time or historical data visualizations, allowing users to monitor key metrics and trends.
Machine learning models: These are algorithms that learn from data and can be used for tasks like fraud detection, product recommendations, or churn prediction.
Data analysis platforms: These tools empower users to explore and analyze data independently, fostering self-service insights.
Data pipelines: These automate the process of collecting, transforming, and loading data for analysis.
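The pipeline category above can be illustrated with a tiny extract-transform-load sketch in plain Python. Real pipelines would use orchestration and storage tools; this only shows the shape, and all record fields are made up for the example.

```python
def extract(rows):
    """Extract: pull raw records from a source (here, an in-memory list)."""
    return list(rows)

def transform(rows):
    """Transform: drop incomplete rows and normalize the genre field."""
    return [
        {**r, "genre": r["genre"].strip().lower()}
        for r in rows
        if r.get("genre")
    ]

def load(rows, store):
    """Load: write the transformed rows into the target store."""
    store.extend(rows)
    return store

warehouse = []
raw = [{"genre": " Jazz "}, {"genre": None}, {"genre": "ROCK"}]
load(transform(extract(raw)), warehouse)
print(warehouse)  # [{'genre': 'jazz'}, {'genre': 'rock'}]
```

Automating this extract-transform-load chain, with scheduling and failure handling around it, is exactly what pipeline tooling provides.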
Who's on the team:
Data product teams are cross-functional, bringing together a diverse set of skills:
Data Analysts: Clean, analyze, and interpret data to identify trends and patterns.
Data Engineers: Build and maintain the infrastructure for data collection, storage, and processing.
Data Scientists: Develop models and algorithms to extract valuable knowledge from data.
Product Managers: Lead the product development process, ensuring the data product meets user needs and delivers value.
Software Engineers: Develop the user interface and functionalities of the data product.
(Optional) Designers: Create user-friendly interfaces and visualizations for data products.
Data Architectures and Technologies
| Architecture | Description | Pros | Cons |
| --- | --- | --- | --- |
| Cloud Native Data Lake | A scalable and secure platform that allows enterprises to ingest any data from any system at any speed - whether from on-premises, cloud, or edge-computing systems - and store any type or volume of data. | Lower TCO; simpler management; faster analytics; improved security and governance; optimized for very large-scale data marts. | Flexible foundation, but beginning to be seen as legacy. |
| Cloud Native Data Warehouse | An enterprise system used for the analysis and reporting of structured and semi-structured data from multiple sources, such as point-of-sale transactions, marketing automation, customer relationship management, and more. | Implements data transformation via SQL or UI-centric tools like dbt and Matillion; very good performance on the vast majority of enterprise analytical workloads. | Can be expensive with high data volumes or complex queries; may lock you into a specific cloud provider. |
| Lakehouse | A hybrid architecture combining features of data lakes and data warehouses to support both analytics and ML/AI. | Flexibility of a data lake with the management capabilities of a data warehouse; uses next-gen storage technology like Delta Lake; supports ACID transactions; handles complex batch data (e.g., IoT). | Less mature tooling, though developing at a rapid pace. |
| Data Mesh | Emerging type. A decentralized approach to data architecture, treating data as a product with domain-oriented ownership. | Promotes domain expertise, autonomy, and innovation; improves data discoverability and quality. | Requires cultural change and strong governance; coordination overhead between decentralized teams. |
| Data Fabric | Emerging type. A unified data environment across the enterprise's data. | Designed to handle multi-cloud, heterogeneous sources and infrastructure; relies on data virtualization rather than physically moving data. | Less tooling; an evolving landscape. |
The data lake is the simplest of these architectures and has well-documented reference architectures available. Cloud providers supply the building blocks in the form of object storage (e.g., S3, ADLS), Spark, and distributed query engines (e.g., Amazon Athena, BigQuery).
Cloud native data warehouses (Snowflake, Synapse, BigQuery) are the dominant players, used for operational and custom reporting. This usually puts SQL at the center of data engineering work.
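To show what SQL-centric transformation looks like in practice, here is a small sketch that builds a derived table purely in SQL, in the style a tool like dbt would run. SQLite stands in for a cloud warehouse, and the table and column names are illustrative assumptions.

```python
import sqlite3

# SQLite as a local stand-in for a cloud warehouse (Snowflake, Synapse, BigQuery).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (genre TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("jazz", 12.99), ("rock", 9.99), ("jazz", 14.50)],
)

# The "model": a derived table built entirely in SQL, no application code needed.
conn.execute("""
    CREATE TABLE revenue_by_genre AS
    SELECT genre, ROUND(SUM(amount), 2) AS revenue
    FROM sales
    GROUP BY genre
""")
print(conn.execute(
    "SELECT genre, revenue FROM revenue_by_genre ORDER BY genre"
).fetchall())  # [('jazz', 27.49), ('rock', 9.99)]
```

Because the transformation is just SQL, it is portable across warehouses and can be versioned, tested, and reviewed like any other code.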
The lakehouse is a Databricks innovation that combines the capabilities of data lakes and data warehouses. It is a big step forward compared with data lakes, supporting ACID transactions, data versioning, and lineage, and it makes the most sense for large data sets.
Data mesh is a decentralized approach in which data is delivered as curated, reusable data products.
Data fabric is a more modern, centralized approach that solves the problem of data spread across multi-cloud environments.
Case Studies
These are some interesting case studies that offer valuable insight into how companies have leveraged data to make a significant impact on their businesses:
Airbnb’s dynamic pricing:
JP Morgan Chase Fraud Detection:
Conclusion
The path we've explored - setting precise objectives, nurturing a culture rooted in data, choosing appropriate technologies, and upholding data governance - might resemble a complex musical piece. Yet, with the right strategy, it can become a harmonious symphony that propels your organization forward.
Connect with us for help with this transition and to unlock the hidden power of your data, so you can stay competitive in the fast-moving tech landscape.