<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
        <copyright>Copyright TechTarget - All rights reserved</copyright>
        <description></description>
        <docs>https://cyber.law.harvard.edu/rss/rss.html</docs>
        <generator>Techtarget Feed Generator</generator>
        <language>en</language>
        <lastBuildDate>Tue, 28 Apr 2026 03:49:37 GMT</lastBuildDate>
        <link>https://www.techtarget.com/searchdatamanagement</link>
        <managingEditor>editor@techtarget.com</managingEditor>
        <item>
            <body>&lt;p&gt;Without effective &lt;a href="https://www.techtarget.com/searchdatamanagement/definition/data-governance"&gt;data governance&lt;/a&gt;, growing volumes of data in IT systems are likely to become a disorganized morass, limiting their potential use. The risk of data misuse also increases due to lax controls.&lt;/p&gt; 
&lt;p&gt;Conversely, well-governed data is consistent and accessible across the enterprise, enabling better-informed business decisions and more accurate analytics and AI applications. Companies are also less likely to experience serious data breaches or &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Top-3-data-privacy-challenges-and-how-to-address-them"&gt;data privacy issues&lt;/a&gt;, reducing their exposure to regulatory compliance problems, legal liabilities and reputational damage.&lt;/p&gt; 
&lt;p&gt;As a result, developing a data governance strategy is a &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Data-governance-responsibilities-now-belong-in-the-C-suite"&gt;high-priority item on the C-suite agenda&lt;/a&gt; in well-run companies. The chief data officer and other data leaders play central roles in that process, typically managing it and working closely with their business counterparts to create and then implement the governance strategy.&lt;/p&gt; 
&lt;p&gt;Getting started with data governance is a big undertaking that commonly requires a substantial budget and significant resource commitments. It might be tempting to buy a strategy from a consultancy or a &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/15-top-data-governance-tools-to-know-about"&gt;data governance software vendor&lt;/a&gt; that promises a packaged set of policies and tools. But for optimal alignment with your organization's business operations and processes, it's best to develop a tailored one in-house, using these seven steps.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="1. Document existing data governance processes"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;1. Document existing data governance processes&lt;/h2&gt;
 &lt;p&gt;Your company likely already has some data governance processes that should be incorporated into a formal strategy or replaced with new ones. Various people manage and oversee corporate data -- database administrators, backup admins, data architects and data quality analysts, for example. Document these existing arrangements by creating a directory of data assets and a corresponding list of managers and staff who are responsible or accountable for data.&lt;/p&gt;
 &lt;p&gt;Don't be surprised if this exercise reveals some sobering, even shocking, oversights and gaps. The existing informal approach might reflect a messy reality, but getting a picture of current processes sets the stage for establishing a more strategic data governance program.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="2. Secure executive sponsorship for the data governance program"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;2. Secure executive sponsorship for the data governance program&lt;/h2&gt;
 &lt;p&gt;Enlist senior business executives to sponsor, fund and promote the governance program. Their buy-in and top-down influence are critical because effective data governance requires participation and cooperation by departments and business units across the enterprise. But how can you win executive support for an initiative that might not show clear bottom-line benefits, at least not right away?&lt;/p&gt;
 &lt;p&gt;Invoking fear, uncertainty and doubt is the usual default method. Horror stories of inaccurate data leading to bad business decisions, or of fines for failing to comply with &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Data-governance-regulations-that-executives-should-know"&gt;data privacy and protection laws&lt;/a&gt;, might be enough to convince business leaders to back a governance initiative. By itself, however, this defensive approach isn't the optimal way to secure long-term data governance commitments.&lt;/p&gt;
 &lt;p&gt;Instead, combine it with a more forward-looking approach. Explain that data governance is largely informal now and that the company needs a framework with more defined processes. Emphasize that implementing one will not only help &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Proactive-practices-for-data-quality-improvement"&gt;improve data quality&lt;/a&gt; and meet regulatory requirements but also make the organization more functional and resilient.&lt;/p&gt;
 &lt;p&gt;Also, address upfront an issue that often causes resentment among business stakeholders and users: the perception that data governance stifles creative uses of data. Data governance policies don't restrict innovation. It's quite the opposite: By creating a more reliable data foundation, effective governance enables new ideas to flourish while reducing the risk of improper data use.&lt;/p&gt;
 &lt;figure class="main-article-image full-col" data-img-fullsize="https://www.techtarget.com/rms/onlineImages/data_management-need_to_govern_data-f.png"&gt;
  &lt;img data-src="https://www.techtarget.com/rms/onlineImages/data_management-need_to_govern_data-f_mobile.png" class="lazy" data-srcset="https://www.techtarget.com/rms/onlineImages/data_management-need_to_govern_data-f_mobile.png 960w,https://www.techtarget.com/rms/onlineImages/data_management-need_to_govern_data-f.png 1280w" alt="Visual that lists key reasons why organizations need a data governance program." height="283" width="560"&gt;
  &lt;figcaption&gt;
   &lt;i class="icon pictures" data-icon="z"&gt;&lt;/i&gt;These are key reasons why organizations need to develop and implement an effective data governance strategy.
  &lt;/figcaption&gt;
  &lt;div class="main-article-image-enlarge"&gt;
   &lt;i class="icon" data-icon="w"&gt;&lt;/i&gt;
  &lt;/div&gt;
 &lt;/figure&gt;
&lt;/section&gt;      
&lt;section class="section main-article-chapter" data-menu-title="3. Improve data literacy and skills across the organization"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;3. Improve data literacy and skills across the organization&lt;/h2&gt;
 &lt;p&gt;End users who understand the potential value of data and how to use it effectively are more likely to recognize the need to protect data assets and prevent misuse. To foster that understanding across the enterprise, develop &lt;a href="https://www.techtarget.com/searchbusinessanalytics/tip/Develop-a-data-literacy-program-to-fit-your-company-needs"&gt;training to improve data literacy and skills&lt;/a&gt; as part of the governance strategy.&lt;/p&gt;
 &lt;p&gt;Enhanced data literacy also bolsters data governance efforts in another way. End users often create duplicate reports, dashboards, spreadsheets and even entire databases because they don't know how to find existing ones. A &lt;a href="https://www.techtarget.com/searchbusinessanalytics/feature/How-business-leaders-can-make-a-data-literate-culture-stick"&gt;data-literate culture&lt;/a&gt; is better equipped to discover and reuse such assets, increasing efficiency and consistency and reducing the risk of data errors. This, in turn, helps streamline governance tasks.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="4. Create a virtual governance team at first, then formalize roles"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;4. Create a virtual governance team at first, then formalize roles&lt;/h2&gt;
 &lt;p&gt;It's too much to ask a company to reorganize upfront to improve data governance. Instead, start by constructing some organizational scaffolding around the existing ad hoc data structures. Identify the key roles currently involved in governance processes and create a virtual team to improve coordination and collaboration.&lt;/p&gt;
 &lt;p&gt;As governance becomes more formalized, new roles will emerge. A data governance manager or vice president commonly leads a team of data governance specialists who coordinate the program. Some business or IT workers will become data stewards, with direct responsibility for implementing governance policies in particular data sets. That can be a full-time or part-time role, depending on the organization's size and the complexity of its governance needs.&lt;/p&gt;
 &lt;p&gt;Establishing a data governance council or committee is also a must. It typically includes the following members:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Representatives from all departments and business units.&lt;/li&gt; 
  &lt;li&gt;IT, legal and compliance executives.&lt;/li&gt; 
  &lt;li&gt;Data stewards or others with data ownership responsibilities.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;The council sets data governance policies, creates common data standards, prioritizes governance projects and resolves data-related disputes, among other responsibilities. Having one ensures there's broad input on data governance controls and helps pave the way for enterprise-wide adoption of the governance strategy.&lt;/p&gt;
&lt;/section&gt;      
&lt;section class="section main-article-chapter" data-menu-title="5. Decide how to measure the governance program's effectiveness"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;5. Decide how to measure the governance program's effectiveness&lt;/h2&gt;
 &lt;p&gt;For a data governance program to gain and maintain support in an organization, it's crucial to measure its effectiveness -- and show how it benefits the company. But effective data governance might not have a direct, tangible effect on business performance, at least in the short term. It also isn't easy to calculate KPIs for reduced business risks, such as avoiding regulatory fines or reputational damage.&lt;/p&gt;
 &lt;p&gt;Instead, identify &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Data-governance-metrics-Data-quality-data-literacy-and-more"&gt;key data governance metrics to track&lt;/a&gt; and tie them to business benefits such as improved decision-making, optimized business processes and stronger privacy protections. For example, use metrics on data accuracy, completeness, consistency, timeliness and duplication to &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/6-dimensions-of-data-quality-boost-data-performance"&gt;monitor data quality levels and document improvements&lt;/a&gt; that make data more reliable for analytics and AI applications.&lt;/p&gt;
 &lt;p&gt;Metrics also help identify governance issues. Tracking how often users access data provides insight into whether it's being used effectively. Low usage might indicate a lack of awareness or accessibility. An increase in the overall number of analytics users is a marker of the governance program's success, but further training might be required if metrics also show new users are creating reports and dashboards that duplicate existing ones.&lt;/p&gt;
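 &lt;p&gt;As a rough illustration, some of the data quality metrics described above can be computed with simple logic. The following Python sketch uses invented field names and sample customer records -- it isn't tied to any particular governance tool.&lt;/p&gt;

```python
# Hedged sketch: computing two of the data quality metrics named above
# (completeness and duplication) over plain Python records. The field
# names and sample data are illustrative assumptions, not a real system.

def completeness(records, required_fields):
    """Share of required field values that are actually populated."""
    total = len(records) * len(required_fields)
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total if total else 1.0

def duplication(records, key_fields):
    """Share of records whose key fields repeat an earlier record."""
    seen = set()
    duplicates = 0
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates / len(records) if records else 0.0

customers = [
    {"id": 1, "email": "a@example.com", "country": "US"},
    {"id": 2, "email": "", "country": "US"},               # incomplete record
    {"id": 3, "email": "a@example.com", "country": "US"},  # duplicate key
]
print(completeness(customers, ["id", "email", "country"]))  # 8 of 9 values filled
print(duplication(customers, ["email", "country"]))         # 1 of 3 records repeats a key
```

 &lt;p&gt;Tracked over time, such ratios give the governance team a concrete baseline for documenting data quality improvements.&lt;/p&gt;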
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="6. Prioritize data governance for new AI applications"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;6. Prioritize data governance for new AI applications&lt;/h2&gt;
 &lt;p&gt;Governing data for AI applications is now a key consideration for data leaders and governance teams. Data readiness is &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Experts-share-practices-to-overcome-AI-data-readiness"&gt;critical to successful deployments&lt;/a&gt; of machine learning, generative AI and agentic AI tools. Effective data governance ensures that AI models are built on a solid foundation of high-quality data and don't use it in ways that violate privacy and ethics policies.&lt;/p&gt;
 &lt;p&gt;For example, retrieval-augmented generation (RAG) frameworks pose specific data governance challenges in enterprise AI applications. RAG enables large language models (LLMs) to &lt;a href="https://www.techtarget.com/searchenterpriseai/opinion/How-RAG-unlocks-the-power-of-enterprise-data"&gt;directly draw from enterprise data&lt;/a&gt; -- it retrieves relevant documents or records from internal knowledge bases and uses them to generate responses to user queries.&lt;/p&gt;
 &lt;p&gt;From a governance perspective, successful RAG use requires not only accurate, up-to-date data but also &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Data-lineage-documentation-imperative-to-data-quality"&gt;data lineage documentation&lt;/a&gt; that traces an LLM's output back to the original data sources for explainability and performance auditing. Well-managed access control is also necessary. Define and enforce user permissions in the RAG framework to prevent end users from inadvertently seeing data they can't access in conventional analytics applications.&lt;/p&gt;
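 &lt;p&gt;One common way to enforce such access control is to filter retrieved documents against the user's permissions before they reach the model. The following Python sketch shows the idea under stated assumptions: the document IDs, roles and keyword-based "retrieval" are all hypothetical, not a real framework's API.&lt;/p&gt;

```python
# Hedged sketch of permission-aware retrieval in a RAG pipeline: documents
# returned by retrieval are filtered against the user's roles before they
# reach the language model. All document IDs, roles and the keyword-based
# retrieval step are illustrative assumptions, not a real framework's API.

from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    text: str
    allowed_roles: set = field(default_factory=set)

def retrieve(query, index):
    """Stand-in for a vector search: naive keyword match over the index."""
    return [doc for doc in index if query.lower() in doc.text.lower()]

def filter_by_permissions(docs, user_roles):
    """Drop documents the user couldn't open in conventional applications."""
    return [doc for doc in docs if doc.allowed_roles.intersection(user_roles)]

index = [
    Document("hr-001", "Salary bands for 2026", {"hr"}),
    Document("kb-042", "How to request a salary review", {"hr", "staff"}),
]

candidates = retrieve("salary", index)          # both documents match the query
visible = filter_by_permissions(candidates, {"staff"})
print([doc.doc_id for doc in visible])          # only the staff-readable document
```

 &lt;p&gt;Filtering before generation, rather than after, ensures restricted content never enters the model's context in the first place.&lt;/p&gt;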
 &lt;p&gt;In addition to incorporating AI-related governance processes into your data governance strategy, &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Data-and-AI-governance-must-team-up-for-AI-to-succeed"&gt;align them with an AI governance program&lt;/a&gt; that monitors and controls AI deployments more broadly. Data governance and AI governance are separate functions, but they go hand in hand and should be tightly integrated.&lt;/p&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="7. Select technologies that fit the data governance strategy"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;7. Select technologies that fit the data governance strategy&lt;/h2&gt;
 &lt;p&gt;Various technologies can be used in data governance initiatives. Data governance software automates program management tasks, such as policy development, process documentation, data classification and workflow management. Data catalogs provide a &lt;a href="https://www.techtarget.com/searchdatamanagement/answer/What-steps-are-key-to-building-a-data-catalog"&gt;unified inventory of data assets&lt;/a&gt;, with built-in governance, data lineage and data curation features. Analytics catalogs help users find relevant dashboards, reports and data&amp;nbsp;visualizations and provide guidance on how to use them appropriately.&lt;/p&gt;
 &lt;p&gt;For data processing, newer data lakehouse architectures combine the raw data storage of a data lake with the structured, governed repository of a data warehouse. Collapsing the separation between those two platforms streamlines data management and governance work and provides a single system that supports BI, advanced analytics and AI applications.&lt;/p&gt;
 &lt;p&gt;But don't build a data governance strategy around specific technologies. Selecting tools that align with the strategy and support its goals puts the governance program on track to deliver the expected benefits, rather than steering it into a technology dead end that undermines the program.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Editor's note:&lt;/b&gt; &lt;i&gt;This article was updated in April 2026 for timeliness and to add new information.&lt;/i&gt;&lt;/p&gt;
 &lt;p&gt;&lt;em&gt;Donald Farmer is a data strategist with 30-plus years of experience, including as a product team leader at Microsoft and Qlik. He advises global clients on data, analytics, AI and innovation strategy, with expertise spanning from tech giants to startups.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>A strong data governance strategy enables more effective data use and helps prevent financial, legal and reputational problems. Follow these steps to develop one.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/legal_g1169668297.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/tip/6-key-steps-to-develop-a-data-governance-strategy</link>
            <pubDate>Fri, 24 Apr 2026 17:01:00 GMT</pubDate>
            <title>How to develop a data governance strategy: 7 key steps</title>
        </item>
        <item>
            <body>&lt;p&gt;Even with a governance program in place, organizations can still fall short.&lt;/p&gt; 
&lt;p&gt;Lack of data ownership is a widespread issue: in many organizations, vast amounts of data have no assigned owner. According to the &lt;a target="_blank" href="https://cpl.thalesgroup.com/about-us/newsroom/ai-the-new-insider-threat-facing-organizations" rel="noopener"&gt;2026 Data Threat Report&lt;/a&gt; from global technology company Thales, only 34% of organizations know where all their data is stored, and only 39% can fully classify their data.&lt;/p&gt; 
&lt;p&gt;Executives must identify unowned datasets within their organizations and assign ownership to ensure enterprise data is appropriately governed and secured. Not doing so puts the organization at financial and reputational risk.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="Common data ownership oversights"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Common data ownership oversights&lt;/h2&gt;
 &lt;p&gt;Quality data has become a critical asset for organizations, especially for &lt;a href="https://www.techtarget.com/searchenterpriseai/feature/9-data-quality-issues-that-can-sideline-AI-projects"&gt;automation, analytics and AI&lt;/a&gt;. Data owners, data stewards and chief data officers are responsible for ensuring that data meets established standards and &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Data-lineage-documentation-imperative-to-data-quality"&gt;has trackable lineage&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;But there are still types of data that organizations fail to capture and govern effectively, including:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;&lt;b&gt;Unstructured content from communication channels.&lt;/b&gt; Transcripts from messaging and collaboration apps, meeting transcripts and recordings, and content from emails and social media exchanges all lack ownership but might contain critical business information.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Data generated by shadow IT systems.&lt;/b&gt; About 70% of CIOs believe that the business units in their organizations deploy unsanctioned tech, according to Flexera's &lt;a href="https://www.flexera.com/resources/reports/ITV-REPORT-IT-Priorities" target="_blank" rel="noopener"&gt;2026 IT Priorities Report&lt;/a&gt;. That creates shadow data -- that is, data generated and stored outside the purview of the organization's IT and security controls as well as its data governance program.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Data developed in sandboxes and for temporary projects.&lt;/b&gt; This spans from developmental databases and test environments to intermediate datasets to &lt;a href="https://www.techtarget.com/searchenterpriseai/tip/Explore-the-role-of-training-data-in-AI-and-machine-learning"&gt;training data for AI and machine learning&lt;/a&gt; models.&lt;/li&gt; 
   &lt;li&gt;&lt;b&gt;Siloed data.&lt;/b&gt; Many data types fall into this category, including data stored in individual desktop folders, temporary reports, disconnected spreadsheets and data in legacy systems. Some also include vendor and third-party data here, as well as dark data -- information the business collects but no longer uses.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Operational and machine data.&lt;/b&gt; Various data types belong here, ranging from systems logs, IoT sensor data and API payloads to metadata. These datasets often lack a single owner and are overlooked in management and governance.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="Unowned data governance challenges"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Unowned data governance challenges&lt;/h2&gt;
  &lt;p&gt;Establishing governance and ownership over unowned data is challenging. Experts note that these data types are hard to classify, easy to overlook and often span multiple business functions. Consequently, no single business leader claims or is assigned ownership -- or ownership nominally lies with many leaders at once, which dilutes accountability.&lt;/p&gt;
 &lt;p&gt;Regardless, unowned data puts the organization at risk of costly data breaches, &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Data-governance-regulations-that-executives-should-know"&gt;regulatory violations&lt;/a&gt; and inaccurate outputs from analytics, automation and AI initiatives due to poor data quality.&lt;/p&gt;
  &lt;p&gt;With such consequences in mind, many executives seek to improve their organization's overall data management and governance. According to Workiva's &lt;a target="_blank" href="https://www.workiva.com/sites/workiva/files/pdfs/workiva-2026-exec-benchmark-survey-en.pdf" rel="noopener"&gt;2026 Executive Benchmark Survey&lt;/a&gt;, business leaders ranked "strengthening data governance" as the second-highest priority for digital transformation projects, behind automating data collection and validation.&lt;/p&gt;
 &lt;p&gt;Similarly, Informatica's &lt;a href="https://www.informatica.com/about-us/news/news-releases/2026/01/20260127-new-global-cdo-report-reveals-data-governance-and-ai-literacy-as-key-accelerators-in-ai-adoption.html" target="_blank" rel="noopener"&gt;CDO Insights 2026 Report&lt;/a&gt; found that 86% of 600 surveyed data leaders planned to increase data management investments -- and 41% seek to boost data and AI governance. This is in response to concerns about poor data quality affecting business objectives.&lt;/p&gt;
  &lt;p&gt;Ultimately, data management and governance won't succeed if data assets can't be accounted for. Assigning ownership to unowned or overlooked data will help organizations strengthen data governance and boost data quality -- and their own trustworthiness.&lt;/p&gt;
 &lt;p&gt;&lt;em&gt;Mary K. Pratt is an award-winning freelance journalist with a focus on covering enterprise IT and cybersecurity management.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Organizations can't claim to have good data governance when they still have unowned data. Assigning ownership to siloed and dark data is critical to enterprise success.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/folder-files07.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/tip/The-data-ownership-blind-spots-putting-organizations-at-risk</link>
            <pubDate>Fri, 24 Apr 2026 14:20:00 GMT</pubDate>
            <title>The data ownership blind spots putting organizations at risk</title>
        </item>
        <item>
            <body>&lt;p&gt;Data is one of an organization's most valuable assets. But without a comprehensive data strategy as a foundation, it often becomes fragmented, inconsistent and difficult to access or trust for business decision-making.&lt;/p&gt; 
&lt;p&gt;An effective enterprise data strategy establishes a structured approach to managing, governing and using data in alignment with business objectives. That enables companies to &lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/Turning-data-into-a-strategic-advantage"&gt;unlock greater value from their data assets&lt;/a&gt; through improved decision-making, optimized business processes and increased operational efficiency. It also helps them boost innovation and gain a sustainable competitive advantage over less data-driven rivals.&lt;/p&gt; 
&lt;p&gt;The data strategy shouldn't focus on implementing new technologies. That comes later, driven by the strategy. Instead, it should set the direction for data management processes, address common data-related challenges and build the capabilities needed to &lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/Data-management-and-governance-key-to-successful-AI-use"&gt;support planned data use&lt;/a&gt; across the enterprise.&lt;/p&gt; 
&lt;p&gt;Follow these 12 steps to develop a data strategy that accomplishes those things and positions your organization to realize long-term business benefits.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="1. Define clear business objectives for data initiatives"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;1. Define clear business objectives for data initiatives&lt;/h2&gt;
 &lt;p&gt;A successful enterprise data strategy is grounded in close alignment between data initiatives and business goals. Data management and analytics efforts should directly support priorities such as enabling better-informed decision-making, enhancing customer experience, improving business operations, &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Modern-data-architectures-as-a-risk-management-strategy"&gt;reducing risks&lt;/a&gt; and fostering innovation.&lt;/p&gt;
 &lt;p&gt;To achieve this alignment, work closely with senior executives and business managers to identify critical objectives that depend on effective data use. Engaging with key stakeholders at the outset ensures the data strategy addresses real business needs and guides appropriate technology choices to help meet them. Data initiatives tied to measurable business outcomes are more likely to gain executive support and sustained investment in the resources required for long-term success.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="2. Assess the existing data landscape in your organization"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;2. Assess the existing data landscape in your organization&lt;/h2&gt;
 &lt;p&gt;Next, get a complete understanding of the organization's current data environment. A comprehensive assessment documents existing technologies, capabilities, challenges and opportunities for improvement. The data management team should conduct it with clear visibility across data domains and business processes enterprise-wide.&lt;/p&gt;
 &lt;p&gt;As part of the assessment, review source systems, data platforms, integration processes, governance structures and analytics applications, as well as how data flows between IT systems in different departments or business units. This uncovers issues such as data silos, inconsistent data definitions, limited metadata visibility and restricted access to relevant data. Identifying these gaps enables data leaders to prioritize initiatives and create a realistic roadmap for implementing the data strategy.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="3. Specify the desired state for data management and analytics"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;3. Specify the desired state for data management and analytics&lt;/h2&gt;
 &lt;p&gt;Once the current-state assessment is complete and the results have been evaluated, articulate what works well and where changes are needed in data management and analytics processes. Defining the desired state clarifies what the organization can achieve through those changes. This vision should be based on the previously identified business imperatives for each data domain and function.&lt;/p&gt;
 &lt;p&gt;As part of this step, set data quality expectations and outline plans to harmonize core data management processes, such as data integration, metadata management and master data management. Doing so ensures consistency across systems and reliable access to &lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/Trusted-data-is-the-foundation-of-data-driven-decisions-GenAI"&gt;relevant and trustworthy data&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="4. Identify and prioritize critical data domains"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;4. Identify and prioritize critical data domains&lt;/h2&gt;
 &lt;p&gt;The strategic value of data varies. While an enterprise data strategy by definition should address all data domains, focus initial implementation efforts on the domains and associated data sets that are most critical to business operations and decision-making.&lt;/p&gt;
 &lt;p&gt;Identifying and prioritizing the highest‑value data domains enables data leaders to direct resources to areas where data management and analytics improvements will have the greatest business impact. In a retailer, for example, improving customer data quality enables more accurate analytics for targeted marketing and better customer service. Focusing on high‑value areas also helps demonstrate the data strategy's value and build momentum toward a more &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Use-these-steps-to-successfully-build-your-data-culture"&gt;data‑centric culture&lt;/a&gt;.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="5. Create an implementation roadmap"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;5. Create an implementation roadmap&lt;/h2&gt;
 &lt;p&gt;After defining what your organization aims to achieve with data to support business priorities and what's required to do so, create an implementation roadmap that details how it will get there. A well-designed roadmap sequences data initiatives over time in a way that's achievable and measurable.&lt;/p&gt;
 &lt;p&gt;That requires balancing ambition with realism to enable sustained, disciplined progress on the data strategy rather than a series of disconnected short-term projects -- or, worse, overpromising on planned deployments. The roadmap should also connect long-term goals, such as becoming more data-driven or AI-enabled, to concrete steps across data management and analytics processes.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="6. Develop data principles and strategic guardrails"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;6. Develop data principles and strategic guardrails&lt;/h2&gt;
 &lt;p&gt;Incorporate data principles and strategic guardrails into the data strategy so they actively shape decisions on data management and use, rather than being abstract guidelines. Foundational principles -- such as treating data as an enterprise asset, ensuring it's accurate and accessible, and establishing a single source of truth through transparent data management practices -- should directly inform the data operating model and architecture. This drives data consistency, reuse and trust across the organization.&lt;/p&gt;
 &lt;p&gt;Strategic guardrails are operational constraints and requirements in areas such as privacy, security, &lt;a href="https://www.techtarget.com/searchbusinessanalytics/feature/Why-ethical-use-of-data-is-so-important-to-enterprises"&gt;ethical data use&lt;/a&gt;, data quality and data platform design. Embed them in the data strategy as part of data governance policies and the implementation roadmap. Aligning suitable guardrails with the execution of data initiatives provides clear direction on appropriate data use, reduces data-related risks and enables BI, data science and business teams to innovate confidently within well-defined boundaries.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="7. Build a data governance framework and assign data ownership"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;7. Build a data governance framework and assign data ownership&lt;/h2&gt;
 &lt;p&gt;A strong data governance program is a &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/6-key-components-of-a-successful-data-strategy"&gt;critical component of a data strategy&lt;/a&gt;. Effective data governance ensures that data remains consistent and reliable and that it's managed and used properly. Without it, various problems can arise. For example, different departments might create conflicting data definitions or data quality might deteriorate, compromising business decisions due to incomplete or inaccurate information.&lt;/p&gt;
 &lt;p&gt;Include &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/5-benefits-of-building-a-strong-data-governance-strategy"&gt;implementing the data governance framework&lt;/a&gt; as a foundational item in the data strategy's roadmap. The strategy should also detail expectations for managing data throughout its lifecycle and the role of data governance in supporting business objectives. Additionally, work with business stakeholders to assign ownership of data assets to appropriate individuals or teams and task them with ensuring the data they oversee is managed and used in accordance with governance policies.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="8. Design an enterprise data architecture"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;8. Design an enterprise data architecture&lt;/h2&gt;
 &lt;p&gt;A &lt;a href="https://www.techtarget.com/searchdatamanagement/definition/What-is-data-architecture-A-data-management-blueprint"&gt;data architecture&lt;/a&gt; provides the technical foundation for managing and delivering data. It defines and visualizes how data is processed, integrated, stored and accessed across systems. However, in many organizations, the existing data architecture has been developed over time, often in a piecemeal fashion without an enterprise-wide focus. As a result, redundancies and gaps in the architecture create challenges with data access and use.&lt;/p&gt;
 &lt;p&gt;To address these issues, design an enterprise data architecture as part of the data strategy. In addition to a high-level architectural blueprint, it should include artifacts such as data models, data flow diagrams and documents that map data use to business processes. A well-designed data architecture guides data management processes, helps teams identify data challenges and supports both operational reporting and advanced analytics.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="9. Implement security, privacy and regulatory compliance controls"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;9. Implement security, privacy and regulatory compliance controls&lt;/h2&gt;
 &lt;p&gt;Protecting the ever-increasing volumes of data that organizations collect and use is critical to avoiding business problems. In addition to strategic guardrails that set high-level boundaries on data management and use, a data strategy must include specific controls to &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Top-3-data-privacy-challenges-and-how-to-address-them"&gt;mitigate data security and privacy risks&lt;/a&gt;. For example, ensure that only authorized users can access sensitive data and that potential security threats can be detected and addressed quickly through predefined incident response plans.&lt;/p&gt;
 &lt;p&gt;Regulatory compliance is also a broader issue now due to the &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Data-governance-regulations-that-executives-should-know"&gt;growing number of data protection laws&lt;/a&gt; that require responsible management of personal information and transparency about how data is used. Integrate compliance mechanisms into the data strategy to help reduce legal risks and maintain trust with customers and business partners.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="10. Enable data accessibility and increased data literacy"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;10. Enable data accessibility and increased data literacy&lt;/h2&gt;
 &lt;p&gt;Making trusted data accessible to the people who need it is a core objective of an enterprise data strategy. Data access is no longer restricted to technical specialists. A modern data strategy supports controlled, governed access for business users and data analysts through user-friendly dashboards, self-service analytics tools and &lt;a href="https://www.techtarget.com/searchdatamanagement/answer/What-steps-are-key-to-building-a-data-catalog"&gt;centralized data catalogs&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;However, data accessibility alone isn't enough. Increased data literacy is also required across the organization to maximize the business value derived from data assets. As part of the data strategy, &lt;a href="https://www.techtarget.com/searchbusinessanalytics/tip/Develop-a-data-literacy-program-to-fit-your-company-needs"&gt;develop a data literacy program&lt;/a&gt; that sets expectations for workers and includes training to help them become more data-literate.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="11. Build in support for BI, advanced analytics and AI applications"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;11. Build in support for BI, advanced analytics and AI applications&lt;/h2&gt;
 &lt;p&gt;In the past, data strategies often focused primarily on delivering data for use in BI and reporting applications. Now, they must also address the data needed for companies' expanding deployments of advanced analytics and AI applications.&lt;/p&gt;
 &lt;p&gt;Build support for techniques and tools such as predictive analytics, machine learning and both generative AI and agentic AI into the enterprise data strategy. Used effectively, they help organizations identify patterns in large data sets, forecast trends, explore data more efficiently and optimize or automate business processes. However, they &lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/The-future-of-AI-depends-on-better-data-not-bigger-models"&gt;depend on the strong data foundation&lt;/a&gt; that a data strategy provides.&lt;/p&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="12. Define metrics to track and evolve the data strategy"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;12. Define metrics to track and evolve the data strategy&lt;/h2&gt;
 &lt;p&gt;A data strategy should evolve over time as business priorities, data sets, technologies and regulations change -- and as problems are identified. To guide this evolution, define &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Data-governance-metrics-Data-quality-data-literacy-and-more"&gt;KPIs and other metrics&lt;/a&gt; to track the effectiveness of data initiatives. Include metrics for areas such as data quality, governance activities and data availability, security and use.&lt;/p&gt;
 &lt;p&gt;Monitoring them enables data and business leaders to evaluate progress on initiatives and identify areas for improvement. Regular reviews and continuous refinement of the data strategy ensure that it remains aligned with the organization's needs and continues to deliver business value. Spell out the need for that upfront, when setting expectations for the strategy, so it isn't a surprise to anyone.&lt;/p&gt;
 &lt;p&gt;&lt;em&gt;Anne Marie Smith, Ph.D., is an information management professional and consultant with broad experience across industries. She has also designed and delivered numerous data management courses and educational programs.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Here are 12 to-do items for data leaders developing a data strategy to help their organization use data more effectively for analytics and business decision-making.</description>
            <image>https://cdn.ttgtmedia.com/visuals/search400/iseries_database_manage/search400_article_014.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/tip/Developing-an-enterprise-data-strategy-10-steps-to-take</link>
            <pubDate>Fri, 17 Apr 2026 15:50:00 GMT</pubDate>
            <title>How to develop an enterprise data strategy: 12 key steps</title>
        </item>
        <item>
            <body>&lt;p&gt;Every business today is a data business. From the corner store tracking stock levels to the multinational manufacturer predicting market trends and shipping costs worldwide, all businesses run on data.&lt;/p&gt; 
&lt;p&gt;Specifically, they run on many types of data. For example, businesses of all kinds have transaction, reference and customer relationship data. They might also have industry-specific and external data, as well as metadata describing their formats and uses. Often, they integrate all these data types to create specialized analytics data sets. A well-planned data strategy keeps this complex ecosystem in order, with a &lt;a href="https://searchdatamanagement.techtarget.com/tip/5-principles-of-a-well-designed-data-architecture"&gt;strong data architecture&lt;/a&gt; as its foundation.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="Why do you need a data strategy?"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Why do you need a data strategy?&lt;/h2&gt;
 &lt;p&gt;A data strategy defines long-term objectives for how an organization uses data, along with the policies and practices that support them. To be successful, a data strategy must cover all data use cases -- not just technical processes for &lt;a href="https://searchdatamanagement.techtarget.com/definition/data-management"&gt;data management&lt;/a&gt; and analytics, but also the human element.&lt;/p&gt;
 &lt;p&gt;No modern business can leave the management, security and use of such an important corporate asset to individual data architects or developers. A comprehensive data strategy, with broad involvement and support, ensures data is managed well and used effectively.&lt;/p&gt;
 &lt;p&gt;Data priorities differ across organizations, shaped by management strategies and business goals, so there's no generic template to follow. But there are six critical components every data strategy must include.&lt;/p&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="1. Data"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;1. Data&lt;/h2&gt;
 &lt;p&gt;This is the most fundamental component, of course. But all the advice that follows will be of no help if your data isn't safely stored and secured, well-maintained and ready for use. The strategic value of your data must be built on a solid base of enterprise data management. That includes integrating and processing your data, &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/6-dimensions-of-data-quality-boost-data-performance"&gt;validating its quality&lt;/a&gt;, governing its use and auditing the processes that affect it.&lt;/p&gt;
 &lt;p&gt;Once these basics are in place, I always recommend an &lt;a href="https://searchdatamanagement.techtarget.com/answer/What-steps-are-key-to-building-a-data-catalog"&gt;enterprise data catalog&lt;/a&gt; as a critical component of a data strategy. You can't strategize around data if you don't know what data you have. Data catalog tools are particularly useful for making data available to business users by providing detailed, descriptive metadata. Sometimes IT managers want to map their systems -- to know what data they have and where it resides. The IT team can create its own simplified data catalog for such needs.&lt;/p&gt;
 &lt;p&gt;The key questions are always the same. What data do I have? Where is it? Who can use it?&lt;/p&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="2. Tools"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;2. Tools&lt;/h2&gt;
 &lt;p&gt;Data catalog tools are provisioned by IT and data management teams that know how to set up and deploy &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/16-top-data-catalog-software-tools-to-consider-using"&gt;data catalog software&lt;/a&gt; and use its various features. We can make a useful distinction between tools provided by IT and tools adopted by end users. Both play an important role in a data strategy, complementing rather than contradicting each other.&lt;/p&gt;
 &lt;p&gt;Data management tools are almost always the domain of IT. There are some lightweight data quality and data integration tools designed for business users, but data management remains largely a behind-the-scenes function.&lt;/p&gt;
 &lt;p&gt;IT often also deploys the BI tools used to create data visualizations, dashboards and reports. But data and business analysts might have their own preferences and choose different tools. That can work well so long as we put controls in place to &lt;a href="https://searchbusinessanalytics.techtarget.com/feature/Data-governance-framework-key-to-analytics-success"&gt;govern data access and usage&lt;/a&gt;. Likewise, data scientists might feel most comfortable using tools they've already mastered or that support certain analytics methodologies.&lt;/p&gt;
 &lt;p&gt;In the past, most IT teams tried to prevent the use of unsanctioned, non-standard tools. Now, just as we've adapted to bring-your-own-device, analytics specialists commonly bring their own favored applications. A good data strategy embraces that diversity but with sensible limits. In this case, we can ask another question: What tools are appropriate to use? Enabling a data analyst to use a &lt;a href="https://searchbusinessanalytics.techtarget.com/definition/self-service-business-intelligence-BI"&gt;self-service BI&lt;/a&gt; application to build some dashboards is reasonable; allowing someone to build their own data warehouse beyond their skills and authority is not.&lt;/p&gt;
 &lt;figure class="main-article-image full-col" data-img-fullsize="https://www.techtarget.com/rms/onlineImages/data_management-key_stages_data_strategy-f.png"&gt;
  &lt;img data-src="https://www.techtarget.com/rms/onlineImages/data_management-key_stages_data_strategy-f_mobile.png" class="lazy" data-srcset="https://www.techtarget.com/rms/onlineImages/data_management-key_stages_data_strategy-f_mobile.png 960w,https://www.techtarget.com/rms/onlineImages/data_management-key_stages_data_strategy-f.png 1280w" alt="Key stages of the data strategy development process" height="267" width="560"&gt;
  &lt;figcaption&gt;
   &lt;i class="icon pictures" data-icon="z"&gt;&lt;/i&gt;These are the four main phases of developing a data strategy, according to Donna Burbank of Global Data Strategy.
  &lt;/figcaption&gt;
  &lt;div class="main-article-image-enlarge"&gt;
   &lt;i class="icon" data-icon="w"&gt;&lt;/i&gt;
  &lt;/div&gt;
 &lt;/figure&gt;
&lt;/section&gt;      
&lt;section class="section main-article-chapter" data-menu-title="3. Analytics techniques"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;3. Analytics techniques&lt;/h2&gt;
 &lt;p&gt;Just as we use various analytics tools depending on our needs, we also employ a variety of analytics techniques. Data visualization is a common example. We might also find uses for predictive analytics, text analytics, sentiment analysis and cluster analysis, to name a few advanced analytics techniques. They can be powerful and useful, but also need careful oversight. Without it, we might run afoul of &lt;a href="https://searchdatamanagement.techtarget.com/definition/data-governance"&gt;data governance&lt;/a&gt; and privacy laws.&lt;/p&gt;
 &lt;p&gt;Predictive analytics, for example, might show business value in optimizing equipment maintenance cycles. That's an uncontroversial use. But predictive techniques could also be used to help automate hiring or manage marketing promotions. In those cases, employees and consumers might have concerns about the reliability, fairness or openness of the process.&lt;/p&gt;
 &lt;p&gt;A data strategy must recognize that governing only data and tools might not suffice. We need to understand -- and train people to understand -- that not all analytics techniques are neutral. Some use cases, especially those involving personally identifiable information, won't be justified by their business value alone.&lt;/p&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="4. Collaboration"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;4. Collaboration&lt;/h2&gt;
 &lt;p&gt;In modern businesses, data use is typically more collaborative than in the past. Increased data literacy and easier-to-use tools mean more people can participate in analytics, as well as technical fields like &lt;a href="https://searchbusinessanalytics.techtarget.com/definition/data-preparation"&gt;data preparation&lt;/a&gt; and data quality.&lt;/p&gt;
 &lt;p&gt;Even closely controlled processes, such as data governance and primary data definition development, can be crowdsourced. For example, doing so can ensure that product names, error codes and managed processes reflect reality on the shop floor in a manufacturing company. Collaboration on primary data can also avoid that most frustrating customer service response: "There's no code for that."&lt;/p&gt;
 &lt;p&gt;Collaborative tools are also being used more, including file sharing, enterprise chat, messaging and video conferencing. Human beings are compulsive collaborators. We constantly share, discuss and debate with others. If collaboration isn't planned for, it will happen anyway -- unplanned.&lt;/p&gt;
 &lt;p&gt;Consider the role of data and analytics in your organization's business decisions and identify processes that involve engagement within and beyond teams. Use that insight to support the ability to share and comment on dashboards, reports and data visualizations.&lt;/p&gt;
 &lt;p&gt;For example, some &lt;a href="https://searchbusinessanalytics.techtarget.com/feature/How-to-evaluate-and-select-the-right-BI-analytics-tool"&gt;BI and analytics tools&lt;/a&gt; enable multiple users to annotate visualizations. Increasingly, they also integrate with chat and messaging apps. Even simple file sharing can be effective, especially when supported by enterprise-class scalability and security features.&lt;/p&gt;
&lt;/section&gt;      
&lt;section class="section main-article-chapter" data-menu-title="5. Documentation and auditing"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;5. Documentation and auditing&lt;/h2&gt;
 &lt;p&gt;In describing these data strategy components, I've emphasized the need to balance IT control with end users' freedom to use self-service capabilities when appropriate.&lt;/p&gt;
 &lt;p&gt;To find this balance, our strategic goals must be well documented. Successful data strategies are built on the ability to answer four questions about any element of the plan and any resource -- data, tools, etc. -- it incorporates.&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;What is appropriate?&lt;/li&gt; 
  &lt;li&gt;What is approved?&lt;/li&gt; 
  &lt;li&gt;What is the purpose?&lt;/li&gt; 
  &lt;li&gt;What is the governance policy?&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;With good documentation of both the data strategy and the underlying data architecture, we can answer these questions before any new project or initiative. We should also be able to look back at any project and answer them retrospectively. By doing so, we put ourselves in a good position to audit how the data strategy is working. It can also help us assess compliance with data governance policies and other &lt;a href="https://searchdatamanagement.techtarget.com/feature/Data-model-design-tips-to-help-standardize-business-data"&gt;internal data standards&lt;/a&gt;.&lt;/p&gt;
 &lt;div class="youtube-wrapper"&gt;
  &lt;iframe width="560" height="315" src="https://www.youtube.com/embed/BqdPuwvwPk4?rel=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen&gt;&lt;/iframe&gt;
 &lt;/div&gt;
&lt;/section&gt;      
&lt;section class="section main-article-chapter" data-menu-title="6. People"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;6. People&lt;/h2&gt;
 &lt;p&gt;The two most important elements of your data strategy are the bookends of this list: data and people. Organizations increasingly look for &lt;a href="https://www.techtarget.com/searchbusinessanalytics/tip/Data-literacy-training-requires-a-dual-approach"&gt;data literacy and analytics skills&lt;/a&gt; in new business hires. Almost every business school now teaches basic data analytics.&lt;/p&gt;
 &lt;p&gt;Data scientists remain in high demand, though the role has evolved significantly in recent years. AI and machine learning have reshaped what organizations require. The priority now is professionals who can not only analyze data, but also build and govern the systems that act on it.&lt;/p&gt;
 &lt;p&gt;You should also think carefully about IT and data management in your staffing and hiring. With so much technology running in the cloud and systems more robust than ever, it's tempting to think IT merely has to keep the lights on. That's not true. Ensuring high availability and disaster recovery, meeting service-level agreements, and supporting new business requirements and regulatory demands all fall into IT's domain.&lt;/p&gt;
 &lt;p&gt;Data architects, data integration developers, data engineers, database administrators and other &lt;a href="https://searchdatamanagement.techtarget.com/feature/Data-management-roles-Data-architect-vs-data-engineer-others"&gt;data management professionals&lt;/a&gt; also play key roles in meeting business needs. An IT staff that is savvy about the business is a great strategic advantage. That caliber of IT staff needs recognition and leadership support as much as any other role.&lt;/p&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="How to implement an effective data strategy"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;How to implement an effective data strategy&lt;/h2&gt;
 &lt;p&gt;These six key components aren't a complete guide to &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Developing-an-enterprise-data-strategy-10-steps-to-take"&gt;developing a data strategy&lt;/a&gt;. You also must consider broader concerns, such as budgets, competition, innovation, marketing plans, staffing policies and legal frameworks.&lt;/p&gt;
 &lt;p&gt;But you can &lt;a href="https://theodi.org/article/data-strategy-how-an-ecosystem-approach-can-help-shape-your-vision/"&gt;apply this thinking broadly&lt;/a&gt;. For example, your staffing plan could include guidelines for making better use of data and analytics based on strategic priorities. Product innovation is increasingly driven by data on customer feedback, user behavior and market trends.&lt;/p&gt;
 &lt;p&gt;Implementing a data strategy requires understanding your entire organization's strategic goals. From there, break down the role of data and how it will be managed and used, and apply that understanding consistently across production, finance, marketing and HR. The result will be a data strategy that remains workable and flexible amid ever-changing business pressures and needs.&lt;/p&gt;
 &lt;p&gt;&lt;strong&gt;Editor's note:&lt;/strong&gt;&amp;nbsp;&lt;em&gt;This article was republished in April 2026 to improve the reader experience.&amp;nbsp;&lt;/em&gt;&lt;/p&gt;
 &lt;p&gt;&lt;em&gt;Donald Farmer is a data strategist with 30+ years of experience, including as a product team leader at Microsoft and Qlik. He advises global clients on data, analytics, AI and innovation strategy, with expertise spanning from tech giants to startups. He lives in an experimental woodland home near Seattle.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>These six elements are essential parts of an enterprise data strategy that will help meet business needs for information when paired with a solid data architecture.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/strategy_a200792738.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/tip/6-key-components-of-a-successful-data-strategy</link>
            <pubDate>Wed, 15 Apr 2026 09:00:00 GMT</pubDate>
            <title>6 key components of a successful data strategy</title>
        </item>
        <item>
            <body>&lt;p&gt;AI made semantic search mainstream. Now, enterprise reality is forcing a strategic refinement.&lt;/p&gt; 
&lt;p&gt;Vector search became a common requirement after the release of ChatGPT and the rise of generative AI chatbots, and it's now a standard feature across many database platforms. But increasingly, vector search is no longer the sole decision point for data leaders looking for effective search and retrieval frameworks to support AI applications. As usage expands into business-critical workflows, some implementations can strain under relevance gaps and rising operational and governance overhead. What matters now is hybrid search, which combines semantic similarity with keyword precision, because enterprise queries often require both meaning and exact terms. That shift is pushing many organizations to update their search approaches to match real business use.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="Why early vector search deployments break down"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Why early vector search deployments break down&lt;/h2&gt;
 &lt;p&gt;The practical problem is that many first-generation deployments struggle &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Assemble-the-layers-of-big-data-stack-architecture"&gt;as AI initiatives expand and data volumes grow&lt;/a&gt;. A system might handle similarity search reasonably well but stumble on terms specific to the business, such as product names, acronyms, customer identifiers, error codes and policy language. As those misses add up, it's time to reassess the type of search and retrieval architecture needed to fully support a more mature environment.&lt;/p&gt;
 &lt;p&gt;Hybrid search, also referred to as hybrid retrieval, sits at the center of these discussions because it reflects how enterprise search for AI applications works. Some queries depend on exact matches, &lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/Why-data-semantics-matters-for-context-aware-systems"&gt;others on semantic similarity&lt;/a&gt;, and many require both. Hybrid search runs full-text and vector queries in parallel and blends the results into a single ranked list.&lt;/p&gt;
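 &lt;p&gt;As a minimal illustration of how that blending can work, the following Python sketch uses reciprocal rank fusion, one common technique for merging a keyword result list and a vector result list into a single ranking. The function, document IDs and result lists are hypothetical examples, not any specific platform's implementation.&lt;/p&gt;

```python
# Illustrative sketch of hybrid retrieval with reciprocal rank fusion (RRF),
# a common way to blend keyword and vector result lists into one ranking.
# All names and document IDs below are hypothetical examples.

def rrf_fuse(ranked_lists, k=60):
    """Blend several ranked result lists into a single ranked list.

    Each input list holds document IDs ordered best-first. A document's
    fused score is the sum of 1 / (k + rank) across the lists where it
    appears, so items ranked highly by either retriever rise to the top.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: full-text search matches the exact error code,
# while vector search surfaces semantically similar troubleshooting docs.
keyword_hits = ["kb-101", "kb-204", "kb-307"]
vector_hits = ["kb-204", "kb-518", "kb-101"]

fused = rrf_fuse([keyword_hits, vector_hits])
# Documents found by both retrievers ("kb-204" and "kb-101") rank above
# those returned by only one.
```

 &lt;p&gt;Production systems typically add filtering and model-based reranking on top of a fusion step like this, but the core idea is the same: neither retriever's ranking is trusted alone.&lt;/p&gt;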
 &lt;p&gt;For database buyers, it's clear that hybrid search is the baseline. Standalone vector database products still have their place, and many also now support full-text search. But many teams can &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/AI-data-governance-guidance-that-gets-you-to-the-finish-line"&gt;store and query vector embeddings&lt;/a&gt; in their current systems, including core database engines and managed services. More and more, platform differentiation comes down to relevance, including filters that narrow results to the right scope and reranking capabilities that push the best candidates to the top of search results.&lt;/p&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="Three ways to modernize a retrieval framework"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Three ways to modernize a retrieval framework&lt;/h2&gt;
 &lt;p&gt;When a framework change is needed to help the business gain the expected benefits from AI applications, organizations have three options.&lt;/p&gt;
 &lt;h3&gt;1. Extend the existing platform&lt;/h3&gt;
 &lt;p&gt;The first path is to extend an existing data platform. This is usually the right move when vector-based retrieval is primarily an upgrade to the infrastructure the organization already uses and trusts.&lt;/p&gt;
 &lt;p&gt;The goal is to keep retrieval within the existing data stack while improving search support for AI workloads. MongoDB Atlas, Databricks, Snowflake, Azure Cosmos DB and PostgreSQL with the pgvector extension fit this pattern because vector search is &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Evaluating-the-different-types-of-DBMS-products"&gt;integrated into the broader platform&lt;/a&gt;, rather than deployed as a separate system.&lt;/p&gt;
 &lt;p&gt;For buyers, this path tends to make the most sense when governance continuity, platform simplicity and reusing the skills of existing operational teams matter more than introducing another specialized layer.&lt;/p&gt;
 &lt;h3&gt;2. Upgrade the search layer&lt;/h3&gt;
 &lt;p&gt;The second path is to upgrade the search layer. If most complaints focus on the search experience, the decision is less about the database and more about adopting a search-first layer optimized for relevance at scale.&lt;/p&gt;
 &lt;p&gt;Search-first platforms are typically designed around the idea that retrieval quality comes from combining full-text search and &lt;a href="https://www.techtarget.com/searchenterpriseai/tip/Top-RAG-tools"&gt;vector-based similarity search&lt;/a&gt; with ranking and filtering across indexed content. Azure AI Search, Elasticsearch, OpenSearch, Apache Solr and Algolia belong in this broader category.&lt;/p&gt;
 &lt;p&gt;This path is best when the enterprise needs stronger discovery, ranking and search quality across data sets, documents, knowledge bases, websites and other types of content.&lt;/p&gt;
 &lt;h3&gt;3. Replace the existing vector platform or add a dedicated one&lt;/h3&gt;
 &lt;p&gt;The third path is to replace or add a specialized vector platform. This should usually be the escalation path, not the default.&lt;/p&gt;
 &lt;p&gt;While specialized platforms such as Pinecone, Weaviate, Qdrant and Milvus (including the Zilliz Cloud service) also offer hybrid search capabilities, they are centered on dedicated vector retrieval infrastructure. That can make sense when retrieval has become strategic enough to justify a separate platform, or when current data and search environments no longer fit the workload.&lt;/p&gt;
 &lt;p&gt;A few use cases help clarify the three options. An enterprise building an internal knowledge assistant might not need a new platform if it can extend its existing data stack and improve search quality there. A company with a large digital content estate and weak search relevance might get more value from modernizing the search layer than from reworking the entire data platform. And a business that uses retrieval as a service across multiple AI products might decide it needs a dedicated vector platform.&lt;/p&gt;
&lt;/section&gt;              
&lt;section class="section main-article-chapter" data-menu-title="Why relevance and governance steer retrieval choices"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Why relevance and governance steer retrieval choices&lt;/h2&gt;
 &lt;p&gt;Search and retrieval for AI is no longer just a feature that organizations set and forget. It is a core capability, and buying decisions now extend beyond whether a platform supports vector search.&lt;/p&gt;
 &lt;p&gt;For many organizations, the primary requirement is relevance quality. Can the platform support intent‑driven search while still returning precise results for specific business terms? Hybrid search has become the baseline by combining semantic understanding and keyword matching in a single request.&lt;/p&gt;
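The combination of semantic and keyword rankings can be made concrete with a small sketch. The following Python is a minimal, hypothetical illustration of reciprocal rank fusion (RRF), one common method platforms use to merge a keyword ranking and a vector ranking into a single hybrid result list; the function and document names are invented for the example, and production platforms perform this fusion server-side.

```python
# Minimal sketch of reciprocal rank fusion (RRF), a common way to combine
# a keyword (e.g., BM25) ranking with a vector-similarity ranking.
def rrf_fuse(keyword_ranking, vector_ranking, k=60):
    """Combine two ranked lists of document IDs into one hybrid ranking.

    Each document scores 1 / (k + rank) per list it appears in; documents
    ranked highly by both retrieval methods rise to the top.
    """
    scores = {}
    for ranking in (keyword_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: "doc_billing" ranks well in both lists, so it
# leads the fused ranking even though neither method ranked it first alone.
keyword_hits = ["doc_invoice", "doc_billing", "doc_payment"]
vector_hits = ["doc_billing", "doc_payment", "doc_invoice"]
hybrid = rrf_fuse(keyword_hits, vector_hits)
```

The constant `k` (60 is a conventional default) damps the influence of top ranks so that one method cannot dominate the fused list.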
 &lt;p&gt;The second buying criterion is governance fit. Search and retrieval more frequently touch regulated, sensitive and business‑critical data. Can the platform work within the organization's governance model, rather than forcing new controls or workarounds?&lt;/p&gt;
 &lt;p&gt;The governance requirement is &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Data-governance-for-AI-requires-a-cross-functional-approach"&gt;only getting sharper as AI expands&lt;/a&gt;. A March 2026 Omdia &lt;a target="_blank" href="https://research.esg-global.com/reportaction/515202164/Toc" rel="noopener"&gt;report&lt;/a&gt; said 47% of 400 technical and business stakeholders cited data privacy as their organization's top risk in a generative AI initiative, and 38% underestimated security and governance costs. (Omdia is a division of Informa TechTarget.) Gartner's November 2025 report on cloud database management systems complements Omdia's findings, noting that metadata is emerging as the connective tissue for AI and search workflows.&lt;/p&gt;
 &lt;p&gt;As platforms move toward data fabrics and self-governing systems, integrated metadata becomes central to governance, observability and operational control, requiring a search and retrieval platform that improves over time and holds up under evolving AI workloads.&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Tom Walat is an editor and reporter for TechTarget, where he covers data technologies.&lt;/i&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>As AI workloads mature, enterprises face multiple data platform choices to improve search and retrieval capabilities while meeting governance and operational demands.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/code_g684641103.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/feature/Hybrid-search-demands-reshape-retrieval-frameworks-for-AI</link>
            <pubDate>Tue, 14 Apr 2026 18:04:00 GMT</pubDate>
            <title>Hybrid search demands reshape retrieval frameworks for AI</title>
        </item>
        <item>
            <body>&lt;div&gt; 
 &lt;div&gt;&lt;/div&gt; 
&lt;/div&gt; 
&lt;p&gt;As regulatory landscapes evolve in an increasingly data-driven world, organizations face mounting pressure to ensure compliance.&lt;/p&gt; 
&lt;p&gt;Data-specific requirements govern &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Big-data-collection-processes-challenges-and-best-practices"&gt;how organizations collect, store, process and share data&lt;/a&gt;. Achieving compliance is an essential, ongoing activity that leadership must guide. To do so, executives must understand the various global regulatory requirements and the implications and risks of non-compliance. Successful compliance shows that the enterprise and its personnel fully embrace data governance across all data-related activities.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="Important data governance regulations"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Important data governance regulations&lt;/h2&gt;
 &lt;p&gt;&lt;a href="https://www.techtarget.com/searchdatamanagement/definition/data-governance"&gt;Data governance&lt;/a&gt; is an umbrella term encompassing several important activities, including data lifecycle, stewardship, security, privacy, destruction, quality, retention, access, classification and management. Identifying the specific enterprise data governance requirements under each regulation is essential for leadership, especially as their business expands internationally.&lt;/p&gt;
 &lt;p&gt;The following are data governance regulations that affect governance planning, strategies and procedures. Most of these laws apply to any organization doing business in the country, regardless of origin. &amp;nbsp;&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;&lt;b&gt;EU GDPR – &lt;/b&gt;A pivotal piece of data protection legislation, &lt;a href="https://www.techtarget.com/whatis/definition/General-Data-Protection-Regulation-GDPR"&gt;GDPR&lt;/a&gt; protects EU residents' personal data. It specifies data management strategies that organizations must follow, including conducting a data protection impact assessment to identify and address any risks. Failure to comply may result in significant financial penalties -- up to €20 million ($23 million) or 4% of the firm's worldwide annual revenue.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;CCPA – &lt;/b&gt;Consumers have the right to know how organizations collect and process their data. &lt;a href="https://www.techtarget.com/searchcio/definition/California-Consumer-Privacy-Act-CCPA"&gt;CCPA&lt;/a&gt; ensures California residents have the right to delete or limit the personal information organizations collect, to opt out of the sale of their data and to correct inaccurate information. Fines range from $2,663 per unintentional violation to $7,988 per intentional violation.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;UK GDPR and Data Protection Act – &lt;/b&gt;Enacted in 2018, &lt;a href="https://www.techtarget.com/searchdatabackup/definition/Data-Protection-Act-2018-DPA-2018"&gt;this legislation&lt;/a&gt; transposes the GDPR into UK law. It requires strong data security, collection and processing practices. Penalties can range from £8.7 million ($11.5 million) to £17.5 million ($23.1 million), or 2% to 4% of the company's worldwide annual revenue -- whichever is higher.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;HIPAA – &lt;/b&gt;&lt;a href="https://www.techtarget.com/searchhealthit/definition/HIPAA"&gt;HIPAA Security and Privacy Rules&lt;/a&gt; apply specifically to the US healthcare system, governing data access, security, use and disclosures of protected health information. They require risk assessments and employee training. Violations are either civil or criminal, and penalties vary based on severity. Unknowing civil offenders face fines as low as $100 per violation, while willful offenders face fines up to $50,000 per violation. Criminal incidents can result in a fine of up to $250,000 and 10 years in prison.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;EU Data Governance Act – &lt;/b&gt;Launched in 2023, this legislation requires secure data sharing across the EU. It advocates data altruism, which examines how data can be used in the public interest. The act doesn't specify a blanket fine but offers criteria for determining penalties.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Sarbanes-Oxley Act (SOX) – &lt;/b&gt;&lt;a href="https://www.techtarget.com/searchcio/definition/Sarbanes-Oxley-Act"&gt;SOX legislation&lt;/a&gt; addresses issues in financial management and reporting as applicable to all publicly traded companies in the US. It has strict controls on the accuracy, integrity, validation and verification of financial data. It also &lt;a href="https://www.techtarget.com/searchcio/definition/What-is-SOX-compliance-A-complete-guide-and-checklist"&gt;mandates effectiveness assessments&lt;/a&gt; for internal controls and data governance practices. Violators face 10 to 20 years in prison and hefty fines.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;UK Network and Information Systems regulations – &lt;/b&gt;These regulations focus on cybersecurity and incident reporting for network and information services providers. Cybersecurity requirements include regular security assessments and continuous improvements. Penalties cost up to £17 million ($22.4 million).&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Gramm-Leach-Bliley Act (GLBA) – &lt;/b&gt;&lt;a href="https://www.techtarget.com/searchcio/definition/Gramm-Leach-Bliley-Act"&gt;This US legislation&lt;/a&gt; mandates financial organizations establish information disclosure policies, implement security programs and perform regular risk assessments. Noncompliance can result in a $100,000 fine per violation.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Personal Information Protection Law (PIPL) –&lt;/b&gt; China's data protection law is among the toughest globally, applying to all enterprises handling personal data within China's borders. It has strict consent and trans-border data flow requirements. Penalties for non-compliance include fines up to ¥50 million RMB ($7 million) or 5% of annual revenue, as well as suspension of business operations.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Digital Personal Data Protection Act (DPDPA) –&lt;/b&gt; India's 2023 act requires data fiduciaries to provide customers notices of their rights and inform them of the type of data they're collecting and why, with specific restrictions on cross-border data flows. The &lt;a href="https://www.techtarget.com/searchdatabackup/definition/Digital-Personal-Data-Protection-Act-2023"&gt;DPDPA&lt;/a&gt; mandates consent for any processing, with additional requirements regarding children's data. Penalties include up to ₹250 crore ($26.9 million).&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Personal Data Protection Act –&lt;/b&gt; Developed in Singapore, this legislation is widely recognized throughout the Asia-Pacific region. It is consent-driven, mandates breach alerts and has retention limitations. If a company exceeds S$10 million ($7.7 million) in annual turnover in Singapore, it faces financial penalties up to 10% of that annual turnover. Otherwise, fines cannot exceed S$1 million ($778,000).&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Personal Data Protection Law –&lt;/b&gt; The UAE law regulates personal data processing, requiring consent and security, as well as strict rules for trans-border data flows. It gives individuals the right to correct inaccuracies and stop processing upon request. Noncompliance results in fines up to AED 5 million ($1.36 million).&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Law 09-08 on Personal Data Protection – &lt;/b&gt;Morocco's legislation is one of Africa's most comprehensive data protection statutes. It requires organizations to register with the national government. Penalties for noncompliance include fines up to MAD 600,000 ($64,343) and/or imprisonment from three months to four years.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="Non-compliance risks for executives"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Non-compliance risks for executives&lt;/h2&gt;
 &lt;p paraeid="{2c1fe5c8-d26a-421b-895d-711c4f19c464}{16}" paraid="2110754361"&gt;While data governance is very much a technology-centered activity, it is also an executive accountability issue. If &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/5-benefits-of-building-a-strong-data-governance-strategy"&gt;data governance initiatives&lt;/a&gt; result in regulatory violations, improper AI use or data-related incidents, the highest levels of enterprise leadership -- including the C-suite and the board -- can be held liable. Consequences include fines, litigation, reputational damage and competitive risks.&lt;/p&gt;
&lt;/section&gt;  
&lt;section class="section main-article-chapter" data-menu-title="Compliance resources"&gt;
 &lt;iframe title="Risks of non-compliance with data regulations" aria-label="Table" id="datawrapper-chart-wuECR" src="https://datawrapper.dwcdn.net/wuECR/1/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="819" data-external="1"&gt;&lt;/iframe&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Compliance resources&lt;/h2&gt;
 &lt;p&gt;Many organizations have created &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/5-data-governance-framework-examples"&gt;data governance frameworks&lt;/a&gt; that help enterprises establish data governance capabilities, including the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;&lt;b&gt;Data Management Body of Knowledge (DAMA-DMBOK) – &lt;/b&gt;Considered the industry standard for data governance, DAMA-DMBOK addresses data quality, stewardship and metadata, among other issues.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Control Objectives for Information and Related Technologies (COBIT) – &lt;/b&gt;Developed by ISACA, &lt;a href="https://www.techtarget.com/searchsecurity/definition/COBIT"&gt;COBIT&lt;/a&gt; offers strong controls and audit guidelines that align IT governance with business risk management and strategy.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;NIST Cybersecurity &amp;amp; Privacy Frameworks – &lt;/b&gt;NIST has two data governance frameworks: the &lt;a href="https://www.techtarget.com/searchsecurity/definition/NIST-Cybersecurity-Framework"&gt;Cybersecurity Framework&lt;/a&gt; for reducing cybersecurity risks and the Privacy Framework to identify and manage privacy risks.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;ISO/IEC 38500 – &lt;/b&gt;Most recently updated in 2024, this key international standard for IT governance addresses legal, regulatory and ethical data use and establishes a common governance vocabulary.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Data Management Capability Assessment Model (DCAM) – &lt;/b&gt;Developed by the EDM Council, this framework defines a maturity model addressing data governance, quality and architecture.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;A &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/15-top-data-governance-tools-to-know-about"&gt;variety of tools and resources&lt;/a&gt; can help demonstrate compliance, including master data management tools, data discovery and classification tools, data catalogs and IAM systems. Senior management support and budget funding are essential for establishing a mature data governance program.&lt;/p&gt;
 &lt;p&gt;Consider investing in AI tools, which can greatly improve performance, provide better data analytics, automate repetitive processes and identify potential compliance issues. Existing tools and resources might have upgraded versions with AI capabilities.&lt;/p&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="How to achieve compliance"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;How to achieve compliance&lt;/h2&gt;
 &lt;p&gt;The following are best practices for executives to achieve optimal data governance compliance outcomes.&lt;/p&gt;
 &lt;h3&gt;Be accountable for and own data governance&lt;/h3&gt;
 &lt;p&gt;Just as organizations should have data owners and stewards for different domains, they should also &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Data-governance-responsibilities-now-belong-in-the-C-suite"&gt;make an executive responsible for data governance&lt;/a&gt; and compliance activities. Responsibilities include defining and measuring KPIs, conducting periodic board-level governance briefings and establishing partnerships with other departments, such as legal, HR, risk management and operations.&lt;/p&gt;
 &lt;h3&gt;Ensure that data governance is risk-based&lt;/h3&gt;
 &lt;p&gt;Establish data governance as a primary risk area. Add governance to a corporate risk register and examine risk from financial and regulatory perspectives. Map governance controls to appropriate regulations and frameworks. Building scenarios to address specific risk events, such as trans-border data violations, will help if they ever occur.&lt;/p&gt;
 &lt;h3&gt;Require auditable evidence on compliance activities&lt;/h3&gt;
 &lt;p&gt;Demonstrating data governance compliance at any time is essential in case of unannounced audits. Evidence of compliance includes project reports, compliance testing results, access management issues, &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Evaluating-data-quality-requires-clear-and-measurable-KPIs"&gt;data quality measurements&lt;/a&gt; and retention/deletion rules. Schedule quarterly audits for relevant controls and create evidence trails for regulator inquiries.&lt;/p&gt;
 &lt;h3&gt;Optimize data quality at the C-level&lt;/h3&gt;
 &lt;p&gt;Data quality and lineage must be primary goals. Establish strong controls addressing &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/How-data-lineage-became-a-boardroom-metric"&gt;data quality, lineage and accuracy&lt;/a&gt;. Enforce data quality standards, launch quality checks and link metrics to business requirements. Establish beginning-to-end data lineage and ensure access to it.&lt;/p&gt;
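The enforcement step above can be sketched in a few lines of code. The following Python is a minimal, illustrative example of linking a quality metric (here, null rate) to a business threshold; the field names, thresholds and function names are assumptions for the example, not part of any specific governance tool.

```python
# Minimal sketch: tie a data quality metric (null rate) to an agreed
# business threshold, so checks are enforceable rather than ad hoc.
def null_rate(records, field):
    """Share of records where the given field is missing or empty."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)

def run_quality_checks(records, rules):
    """rules maps field name -> maximum acceptable null rate.

    Returns a pass/fail result per field for reporting or alerting.
    """
    return {field: null_rate(records, field) <= max_rate
            for field, max_rate in rules.items()}

# Hypothetical customer records: ids must always be present,
# while up to 10% missing emails is tolerated.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
]
results = run_quality_checks(customers, {"id": 0.0, "email": 0.1})
```

In practice, checks like these run inside data pipelines, with failures routed to the data stewards who own the affected domain.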
 &lt;h3&gt;Enforce data access controls&lt;/h3&gt;
 &lt;p&gt;Senior leaders must ensure data access controls are consistently monitored, enforced and applied. Implement least privilege, role-based access controls, multi-factor authentication, segregation of duties and uninterrupted monitoring. Provide support for potential audits.&lt;/p&gt;
 &lt;h3&gt;Build a culture of compliance&lt;/h3&gt;
 &lt;p&gt;This starts with the C-suite and board. Mandate training for all employees on data-related activities and endorse data literacy throughout the enterprise. Regularly reiterate the importance of data governance at major company meetings. Support whistleblowing of any violations and note governance issues in performance reviews.&lt;/p&gt;
 &lt;p&gt;Linking all governance activities into a cohesive process also helps with compliance. Data silos can spell disaster. Greater information sharing, along with the integration of security and privacy capabilities across systems, helps avoid this.&lt;/p&gt;
 &lt;h3&gt;Acquire technology that facilitates the compliance process&lt;/h3&gt;
 &lt;p&gt;The right technology ensures that governance activities are scalable, adaptable and automated. Automate data governance activities by integrating &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/AI-data-governance-guidance-that-gets-you-to-the-finish-line"&gt;AI tools&lt;/a&gt; with risk, privacy and security systems. However, be sure to provide &lt;a href="https://www.techtarget.com/searchenterpriseai/definition/AI-governance"&gt;AI governance&lt;/a&gt; oversight. When used correctly, AI tools can do the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Reduce the likelihood of human error.&lt;/li&gt; 
  &lt;li&gt;Improve performance.&lt;/li&gt; 
  &lt;li&gt;Automate repetitive tasks such as data collection and classification.&lt;/li&gt; 
  &lt;li&gt;Identify potential compliance issues.&lt;/li&gt; 
  &lt;li&gt;Deliver reports for auditors.&lt;/li&gt; 
  &lt;li&gt;Ensure cross-border flow adheres to regulations.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;h3&gt;Ensure that knowledge of regulatory activities is current&lt;/h3&gt;
 &lt;p&gt;Adapting to &lt;a href="https://www.techtarget.com/searchsecurity/tip/State-of-data-privacy-laws"&gt;regulatory changes&lt;/a&gt; and maintaining compliance are essential for enterprises. Executives should consistently monitor the global regulatory landscape. Review and assess regulatory changes, keep policies current and train governance teams to do the same.&lt;/p&gt;
 &lt;p&gt;&lt;em&gt;Paul Kirvan, FBCI, CISA, is an independent consultant and technical writer with more than 35 years of experience in business continuity, disaster recovery, resilience, cybersecurity, GRC, telecom and technical writing.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Growing national and international regulatory compliance demands aim to protect consumer data. Organizations must adhere to regulations or face noncompliance risks.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/legal_g1065824400.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/tip/Data-governance-regulations-that-executives-should-know</link>
            <pubDate>Wed, 08 Apr 2026 15:31:00 GMT</pubDate>
            <title>Data governance regulations that executives should know</title>
        </item>
        <item>
            <body>&lt;p&gt;AI and cloud analytics applications are exposing a critical security gap for enterprises. While data is typically secured at rest and in transit, it often remains unprotected when being processed -- the time it is most actively used.&lt;/p&gt; 
&lt;p&gt;This gap has pushed data-in-use protection higher on the agenda for data leaders. Within the broader landscape of privacy‑enhancing technologies, &lt;a href="https://www.techtarget.com/searchsecurity/tip/Confidential-computing-use-cases-that-secure-data-in-use"&gt;confidential computing has emerged&lt;/a&gt; as the primary way to address this processing‑stage risk. It uses hardware‑isolated trusted execution environments (TEEs) to keep data encrypted during computation, enabling teams to expand AI workloads without overhauling data pipelines or weakening security.&lt;/p&gt; 
&lt;p&gt;Adoption trends suggest confidential computing is moving from a specialized control to a baseline expectation for AI and cloud analytics deployments. In a 2024 report, for example, Grand View Research&amp;nbsp;projected the global market for confidential computing would grow from an estimated $5.46 billion in 2023 to $153.8 billion by 2030, reflecting its increased role as a foundational component of data security.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="How data-in-use protection fits into existing pipelines"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;How data-in-use protection fits into existing pipelines&lt;/h2&gt;
 &lt;p&gt;Standard security leaves data vulnerable in system memory and CPUs. Data-in-use protection addresses this exposure problem by keeping information encrypted while workloads execute.&lt;/p&gt;
 &lt;p&gt;At the hardware layer, a &lt;a href="https://www.techtarget.com/searchitoperations/definition/trusted-execution-environment-TEE"&gt;TEE&lt;/a&gt; is a secure area that runs code and processes data independently from the rest of the system. It isolates data and processing operations to prevent unauthorized access. Even cloud administrators, host OSes and hypervisors do not have access to the data in a TEE.&lt;/p&gt;
 &lt;p&gt;Because confidential computing operates at the infrastructure layer, AI training and analytics jobs can often run in a TEE with minimal architectural changes. TEEs also transparently encrypt processing for applications, minimizing operational disruption while extending protection throughout the compute stage.&lt;/p&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="Compliance pressure moves into the processing layer"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Compliance pressure moves into the processing layer&lt;/h2&gt;
 &lt;p&gt;Rapidly evolving regulations are reshaping where organizations invest to secure AI and analytics workloads. A 2025 Stanford report found that &lt;a href="https://www.techtarget.com/searchenterpriseai/feature/AI-regulation-What-businesses-need-to-know"&gt;AI-related regulations&lt;/a&gt; issued by U.S. federal agencies more than doubled from 25 in 2023 to 59 in 2024. Similarly, the number of AI-related laws passed at the state level increased from 49 to 131.&lt;/p&gt;
 &lt;p&gt;Gartner predicts that by 2029, confidential computing will be used to secure more than 75% of processing operations running in shared infrastructure, such as public cloud services.&lt;/p&gt;
 &lt;p&gt;As sensitive data moves into AI pipelines, the pressure to document security grows. Processing-stage exposure is difficult to control and even harder to record without hardware-based locks. Audit teams and data governance functions that once focused only on storage encryption now require attestation that processing workloads run in protected environments.&lt;br&gt;&lt;br&gt;Several regulatory frameworks now explicitly require data-in-use protection:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;&lt;b&gt;EU AI Act.&lt;/b&gt; This new regulation requires documented data governance controls, including evidence of protection during &lt;a href="https://www.techtarget.com/searchenterpriseai/tip/10-steps-to-achieve-AI-implementation-in-your-business"&gt;all AI lifecycle stages&lt;/a&gt;.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;GDPR.&lt;/b&gt; Enforcement of the EU's data privacy regulation is expanding to include data in use, not just in storage or transit.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;PCI DSS v4.0.1.&lt;/b&gt; Its requirements prohibit sensitive authentication data from persisting in memory, such as RAM or memory dumps.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Digital Operational Resilience Act.&lt;/b&gt; DORA mandates data-in-use protection for major EU financial institutions, including controls on data handling within cloud and third‑party processing environments.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;NIST Cybersecurity Framework 2.0.&lt;/b&gt; Commonly known as CSF 2.0, it &lt;a target="_blank" href="https://www.nist.gov/cyberframework" rel="noopener"&gt;includes&lt;/a&gt; data-in-use protection within zero-trust security designs.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="What data leaders gain from data-in-use protection"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;What data leaders gain from data-in-use protection&lt;/h2&gt;
 &lt;p&gt;For data leaders, the value of confidential computing aligns with governance, legal and audit functions.&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;&lt;b&gt;Secure access to sensitive data. &lt;/b&gt;Healthcare records, financial transaction data, personally identifiable information and other regulated data often aren't used in AI and analytics initiatives due to processing risks. Confidential computing enables access to sensitive data sets without violating governance and security rules.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Reduced legal exposure&lt;/b&gt;. Confidential computing provides verifiable proof that sensitive workloads are processed in hardware-isolated environments, which is especially valuable for documenting regulatory compliance in third-party clouds.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Increased audit efficiency.&lt;/b&gt; The records of secure processing that TEEs automatically produce reduce manual auditing work and make it easier to verify how sensitive data is used.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="Confidential computing use cases"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Confidential computing use cases&lt;/h2&gt;
 &lt;p&gt;The clearest proof of concept for confidential computing comes from highly regulated industries, where organizations face strict requirements around data handling, auditability and cross-boundary data sharing.&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;&lt;b&gt;Healthcare. &lt;/b&gt;Hospitals and clinical research networks use confidential computing to support federated AI model training across institutions, keeping patient data private rather than pooling it in shared systems or central databases.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Financial services.&lt;/b&gt; Banks and insurers use TEEs for fraud detection and risk modeling to reduce exposure when processing sensitive transaction data regulated by banking privacy rules.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Public sector.&lt;/b&gt; Agencies and partner organizations apply confidential computing to joint analytics projects without sharing raw data across organizational boundaries.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Telecom and IoT.&lt;/b&gt; Providers can use confidential computing to analyze customer and device data closer to the edge while limiting exposure during processing.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;Across these industries, common use cases include secure AI training, multi‑party analytics, &lt;a href="https://www.techtarget.com/searchenterpriseai/tip/How-to-navigate-data-sovereignty-for-AI-compliance"&gt;data sovereignty controls&lt;/a&gt;, and cloud backup and recovery workflows where restore operations can expose sensitive data.&lt;/p&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="How to evaluate data‑in‑use protection options"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;How to evaluate data‑in‑use protection options&lt;/h2&gt;
 &lt;p&gt;Approaches to data‑in‑use protection vary across cloud providers and the broader vendor ecosystem that includes data platforms, security and key management tools, and systems integrators. Before committing to a platform, data leaders should focus on proof points, regulatory alignment and how well it integrates with existing controls.&lt;/p&gt;
 &lt;p&gt;&lt;iframe title="" aria-label="Table" id="datawrapper-chart-KuZgD" src="https://datawrapper.dwcdn.net/KuZgD/1/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="721" data-external="1"&gt;&lt;/iframe&gt;&lt;/p&gt;
 &lt;p&gt; &lt;script type="text/javascript"&gt;window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});&lt;/script&gt; &lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He has pulled Token Ring, configured NetWare and been known to compile his own Linux kernel. He consults with industry and media organizations on technology issues.&lt;/i&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>As sensitive data moves into AI pipelines, organizations must evaluate how to protect it during processing and what safeguards IT platforms provide for data in use.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/security_a385093447.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/tip/AI-analytics-push-data-in-use-protection-up-priority-list</link>
            <pubDate>Tue, 07 Apr 2026 15:34:00 GMT</pubDate>
            <title>AI, analytics push data-in-use protection up priority list</title>
        </item>
        <item>
            <body>&lt;p&gt;Without properly prepared data, analytics and AI applications are unlikely to deliver the desired business outcomes. But &lt;a href="https://www.techtarget.com/searchbusinessanalytics/definition/data-preparation"&gt;data preparation&lt;/a&gt; is an inherently complex process that poses various challenges for data management and analytics teams.&lt;/p&gt; 
&lt;p&gt;Preparing data for planned uses requires substantial amounts of time and resources. Indeed, it typically accounts for most of the work involved in developing analytics applications. Large amounts of data in diverse formats &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Big-data-collection-processes-challenges-and-best-practices"&gt;collected from numerous sources&lt;/a&gt; must be combined and consolidated. The raw data routinely contains errors, anomalies, inconsistencies and other &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Proactive-practices-for-data-quality-improvement"&gt;data quality issues&lt;/a&gt;. Data sets might not include all the information an application requires. Conversely, some data might not be relevant to it.&lt;/p&gt; 
&lt;p&gt;Data preparation tools -- available as separate products or built into BI and data science platforms -- enable data scientists, data engineers, business analysts and other end users to prepare data themselves. However, these tools don't eliminate the challenges of data preparation. Data leaders must ensure users are sufficiently trained on the data prep process, including common challenges.&lt;/p&gt; 
 &lt;p&gt;Effective data preparation also requires a multipronged approach. To aid self-service users, data quality analysts &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Data-cleansing-best-practices"&gt;profile and cleanse data&lt;/a&gt; upfront. Data integration developers run initial data transformation jobs. BI teams further transform, enrich and curate data sets for planned applications. They, too, must be ready for the challenges that data preparation poses.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="7 top data preparation challenges"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;7 top data preparation challenges&lt;/h2&gt;
 &lt;p&gt;Because of its complexity, data preparation can't be left to chance. The following are seven notable challenges that disrupt efforts to create clean, consistent and complete data sets, along with advice on how to overcome each one.&lt;/p&gt;
 &lt;h3&gt;1. Inadequate or erroneous data profiling&lt;/h3&gt;
 &lt;p&gt;Data profiling should prevent end users from belatedly discovering data issues when running analytics applications -- or, worse, from having the analytics results be affected by faulty data they aren't aware of. But it might fail to do so in scenarios like the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Data team members or business users preparing data for a new application assume it's valid because it's already used in reports and dashboards. As a result, they don't fully profile the data. However, the existing uses masked underlying problems in the data set.&lt;/li&gt; 
  &lt;li&gt;Someone only profiles a sample data set from a large volume of data because of the time it would take to profile the full one. But the sampling approach doesn't detect anomalies and other issues in the full data set.&lt;/li&gt; 
  &lt;li&gt;Similarly, custom-coded SQL queries or spreadsheet functions used to profile data aren't comprehensive enough to find all the problems in the data.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;&lt;b&gt;How to overcome this challenge&lt;br&gt;&lt;/b&gt;Solid data profiling must be the starting point of the data preparation process. Data preparation tools can help: They include comprehensive functionality for profiling data sets in both source systems and the data platforms that analytics and AI applications run on.&lt;/p&gt;
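&lt;p&gt;&lt;i&gt;For illustration, the kinds of checks a profiling pass performs -- null counts, duplicate rows and out-of-range values -- can be sketched in a few lines of plain Python. The sample records, field names and valid age range below are hypothetical, not taken from any particular tool.&lt;/i&gt;&lt;/p&gt;

```python
# Minimal data-profiling sketch: count nulls, duplicate rows and
# out-of-range values before a data set is used for analytics.
# Records, fields and the 0-120 age range are illustrative assumptions.
from collections import Counter

records = [
    {"customer_id": 101, "age": 34, "state": "NY"},
    {"customer_id": 102, "age": None, "state": "ny"},
    {"customer_id": 101, "age": 34, "state": "NY"},   # exact duplicate row
    {"customer_id": 103, "age": 430, "state": "TX"},  # likely transposed digits
]

# Missing-value counts per field
null_counts = Counter(
    field for row in records for field, value in row.items() if value is None
)

# Exact-duplicate rows (keys are unique, so sorting items is safe)
seen = Counter(tuple(sorted(row.items())) for row in records)
duplicates = sum(count - 1 for count in seen.values() if count > 1)

# Values outside a reasonable range (ages 0 through 120)
bad_ages = [
    row["customer_id"] for row in records
    if row["age"] is not None and row["age"] not in range(0, 121)
]

print(null_counts)   # Counter({'age': 1})
print(duplicates)    # 1
print(bad_ages)      # [103]
```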
 &lt;h3&gt;2. Missing or incomplete data&lt;/h3&gt;
 &lt;p&gt;Missing values and incomplete entries are common data quality issues. Examples include:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Null or blank fields.&lt;/li&gt; 
  &lt;li&gt;Zeros that represent a missing value rather than the number 0.&lt;/li&gt; 
  &lt;li&gt;Other types of placeholder values.&lt;/li&gt; 
  &lt;li&gt;Partial transaction records with missing details.&lt;/li&gt; 
  &lt;li&gt;Incomplete demographic data on customers.&lt;/li&gt; 
  &lt;li&gt;An entire field or row that's missing from a data set.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;Missing or incomplete data can adversely affect business decisions driven by analytics applications and create &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Data-governance-challenges-that-can-sink-data-operations"&gt;data governance and regulatory compliance risks&lt;/a&gt;. It might also disrupt data loading processes or cause them to fail completely, forcing data teams to scramble to figure out what went wrong.&lt;/p&gt;
 &lt;p&gt;As a result, instances of missing or incomplete data raise complicated data preparation questions. Do they represent substantive data errors? If so, can valid data be inserted? If it can't be, should affected fields be deleted or kept but flagged to show users there are issues with the data?&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;How to overcome this challenge&lt;br&gt;&lt;/b&gt;Effective data profiling identifies missing or incomplete data. Decide what to do about it based on planned use cases and the significance of the data errors. Optimally, data teams or end users should then use a data preparation tool to implement the error-handling measures.&lt;/p&gt;
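&lt;p&gt;&lt;i&gt;The flag-or-fill decision described above can be sketched as follows. The order records, default value and flag names are illustrative assumptions, not a prescribed implementation.&lt;/i&gt;&lt;/p&gt;

```python
# Sketch of error handling for missing values: fill a valid default where a
# business rule supplies one, otherwise keep the field but flag the record
# so users know it has an issue. All field names here are hypothetical.

orders = [
    {"order_id": "A1", "quantity": 2, "region": "EMEA"},
    {"order_id": "A2", "quantity": None, "region": None},
]

DEFAULT_REGION = "UNKNOWN"  # assumed business rule for this example

prepared = []
for row in orders:
    flags = []
    if row["quantity"] is None:
        flags.append("quantity_missing")  # no safe default: keep and flag
    region = row["region"]
    if region is None:
        region = DEFAULT_REGION           # safe default exists: fill it in
        flags.append("region_defaulted")
    prepared.append({**row, "region": region, "quality_flags": flags})

print(prepared[1]["region"])         # UNKNOWN
print(prepared[1]["quality_flags"])  # ['quantity_missing', 'region_defaulted']
```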
 &lt;h3&gt;3. Invalid data values&lt;/h3&gt;
 &lt;p&gt;Invalid values are another common data quality issue. They include misspellings, transposed digits, unnecessary characters, duplicate entries and outliers, such as ages, dates and numbers that aren't within a reasonable range. These errors can occur even in enterprise applications with built-in data validation features and end up in analytics and AI data sets.&lt;/p&gt;
 &lt;p&gt;A small number of invalid values in a data set might not have a meaningful impact on applications, but more numerous errors can lead to faulty data analysis results. Cleaning them up should be a priority during data preparation.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;How to overcome this challenge&lt;br&gt;&lt;/b&gt;Finding and fixing invalid data is similar to handling missing values: Profile the data, decide what to do about errors and implement automated functions to address them. Data profiling should also be done on an ongoing basis to identify new issues as data is updated. Perfection is unlikely -- some data errors inevitably slip through. But minimizing them will prevent bad analytics-driven business decisions.&lt;/p&gt;
 &lt;h3&gt;4. Name and address standardization&lt;/h3&gt;
 &lt;p&gt;Inconsistencies in the names, addresses and contact information of consumers and businesses also complicate data preparation. These are legitimate data variations in different systems, not misspellings or missing values. But if not standardized, they can prevent analytics users and AI tools from getting a complete view of customers, suppliers and other business partners.&lt;/p&gt;
 &lt;p&gt;The following are common examples of such inconsistencies:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;A shortened first name or nickname versus a person's full name, such as Fred in one data field and Frederick in another.&lt;/li&gt; 
  &lt;li&gt;Middle initial, full middle name or neither.&lt;/li&gt; 
  &lt;li&gt;Acronyms vs. full business names, such as BMW and Bayerische Motoren Werke.&lt;/li&gt; 
  &lt;li&gt;Companies listed both with and without Inc., Co., Corp., LLC and other business suffixes.&lt;/li&gt; 
  &lt;li&gt;Spelled-out vs. abbreviated address data, such as Boulevard and Blvd. or New York and NY.&lt;/li&gt; 
  &lt;li&gt;Different phone numbers and email addresses for the same entity.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;&lt;b&gt;How to overcome this challenge&lt;br&gt;&lt;/b&gt;Identify inconsistencies through data profiling, then use the standardization features built into a data preparation tool. Alternatively, data teams can create customized standardization processes with a data prep tool's string-handling functionality or use software from a vendor that specializes in name and address standardization.&lt;/p&gt;
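&lt;p&gt;&lt;i&gt;A minimal sketch of rule-based standardization, assuming small illustrative mapping tables -- real standardization software ships far larger reference data plus fuzzy-matching logic.&lt;/i&gt;&lt;/p&gt;

```python
# Normalize business suffixes and address abbreviations so variants of the
# same entity compare equal. The mapping tables below are illustrative only.
import re

SUFFIXES = {"incorporated": "inc", "corporation": "corp",
            "company": "co", "limited": "ltd"}
ADDRESS_ABBREV = {"boulevard": "blvd", "street": "st", "avenue": "ave"}

def normalize(text, mapping):
    # Lowercase, strip punctuation, then apply the replacement table per word.
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return " ".join(mapping.get(w, w) for w in words)

a = normalize("Acme Incorporated", SUFFIXES)
b = normalize("ACME, Inc.", SUFFIXES)
print(a == b)  # True

print(normalize("10 Main Boulevard", ADDRESS_ABBREV))  # 10 main blvd
```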
 &lt;h3&gt;5. Inconsistent data across enterprise systems&lt;/h3&gt;
 &lt;p&gt;Organizations also encounter inconsistencies when combining data from systems in multiple departments or business units. The data might be correct in each source system, but differences in data formats and entries create problems for analytics and AI applications. It's a pervasive data preparation challenge, especially in large enterprises.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;How to overcome this challenge&lt;br&gt;&lt;/b&gt;When a data attribute, such as an ID field, has different values across source systems, data conversion or cross-reference mapping procedures provide a relatively easy fix. However, if different business rules or data definitions lead to inconsistencies, more complex data transformations are required.&lt;/p&gt;
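&lt;p&gt;&lt;i&gt;The cross-reference mapping fix can be sketched as a crosswalk table that resolves each system's local ID to one canonical ID; the system names and IDs below are hypothetical.&lt;/i&gt;&lt;/p&gt;

```python
# Cross-reference mapping sketch: reconcile the different customer IDs two
# source systems use for the same entity, so records can be joined on one
# canonical enterprise ID. All system names and IDs are hypothetical.

crosswalk = {
    # (source_system, local_id) mapped to canonical enterprise ID
    ("crm", "C-1001"): "CUST-1",
    ("billing", "98231"): "CUST-1",
    ("crm", "C-1002"): "CUST-2",
}

def to_canonical(system, local_id):
    key = (system, local_id)
    if key not in crosswalk:
        # Surface unmapped IDs instead of silently dropping records
        raise KeyError(f"unmapped ID {local_id!r} from {system!r}")
    return crosswalk[key]

# The same customer, seen through two systems, resolves to one ID:
print(to_canonical("crm", "C-1001"))     # CUST-1
print(to_canonical("billing", "98231"))  # CUST-1
```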
 &lt;h3&gt;6. Data enrichment issues&lt;/h3&gt;
 &lt;p&gt;Data enrichment helps create the required business context for effective analytics and AI uses. The following are examples of enrichment measures implemented when preparing data:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Augmenting data with entries from other internal or external sources.&lt;/li&gt; 
  &lt;li&gt;Deriving additional data attributes from the existing ones in a data set.&lt;/li&gt; 
  &lt;li&gt;Calculating business metrics and KPIs based on the data.&lt;/li&gt; 
  &lt;li&gt;Organizing data into different structures for planned applications.&lt;/li&gt; 
  &lt;li&gt;Adding tags, labels and metadata to help users understand the data.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;But enriching data isn't easy. Deciding what needs to be done is complicated, and enrichment work can be time-consuming.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;How to overcome this challenge&lt;br&gt;&lt;/b&gt;Data enrichment requires a strong understanding of business needs and goals for the planned applications. Work closely with business executives and users to develop enrichment plans, and allot sufficient resources to the process to meet application delivery schedules.&lt;/p&gt;
 &lt;h3&gt;7. Sustaining and scaling data preparation processes&lt;/h3&gt;
 &lt;p&gt;While data teams and end users sometimes prepare data on an ad hoc basis, data preparation work often becomes a recurring process. Its scope also expands as analytics and AI applications grow and become more widespread -- and valuable -- in enterprises. But organizations often struggle to sustain and scale their data preparation initiatives.&lt;/p&gt;
 &lt;p&gt;Insufficient resources and skills are a problem in some cases. Using custom-coded data preparation methods is, too. If there's no documentation of a custom-coded process, its creator might be the only person who understands how it works, which makes it hard to continue the process if they leave. Also, when modifications to a process are needed, bolting on new code makes maintaining it even more difficult.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;How to overcome this challenge&lt;br&gt;&lt;/b&gt;Ensure that data preparation programs have the required resources and that data teams and end users are properly trained. Using data preparation tools also helps avoid the traps of custom coding. They automatically document processes and &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Data-lineage-documentation-imperative-to-data-quality"&gt;track data lineage and use&lt;/a&gt;, while also providing AI capabilities, collaboration features and connectors to various data sources.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Editor's note:&lt;/b&gt; &lt;i&gt;This article was originally published in 2022. TechTarget editors updated it in March 2026 for timeliness and to add new information.&lt;/i&gt;&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Rick Sherman, who died in January 2023, was founder and managing partner of Athena Solutions, a BI, data warehousing and data management consulting firm. He had more than 40 years of professional experience in those fields.&lt;/i&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Data preparation is a crucial but complex part of analytics and AI applications. Don't let these seven common challenges send your data prep processes off track.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/storage_g539954410.jpg</image>
            <link>https://www.techtarget.com/searchbusinessanalytics/feature/Top-data-preparation-challenges-and-how-to-overcome-them</link>
            <pubDate>Tue, 31 Mar 2026 11:00:00 GMT</pubDate>
            <title>Top data preparation challenges and how to overcome them</title>
        </item>
        <item>
            <body>&lt;p&gt;Enterprises are finding the data infrastructure setups that served them well in the past cannot keep up with today's AI reality.&lt;/p&gt; 
&lt;p&gt;A shift from traditional data architectures to a modern data stack is accelerating thanks to an avalanche of AI initiatives -- and a &lt;a href="https://www.techtarget.com/searchenterpriseai/feature/AI-deployments-gone-wrong-The-fallout-and-lessons-learned"&gt;lack of trust in the data&lt;/a&gt; feeding AI systems. Survey results highlight the problems. Deloitte's 2026 "State of AI in the Enterprise" global survey found that the share of senior IT and business executives who feel strategically prepared for AI adoption rose to 42 percent from 39 percent the previous year. Over the same period, however, confidence in their organization's technology infrastructure declined from 47 percent to 43 percent, and confidence in its data management capabilities fell from 43 percent to 40 percent. A 2025 IDC study reported that 84 percent of companies have outdated storage that is not optimal for demanding AI workloads.&lt;/p&gt; 
&lt;p&gt;For enterprise data leaders, it's increasingly a priority to update aging data infrastructure so AI can be deployed with confidence while also modernizing governance and day‑to‑day data management practices that keep AI models reliable and automated decisions defensible.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="From big data complexity to streamlined AI-ready infrastructure"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;From big data complexity to streamlined AI-ready infrastructure&lt;/h2&gt;
 &lt;p&gt;The enterprise data stack is evolving out of necessity. To compete in the AI-first economy, organizations are moving &lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/2026-will-be-the-year-data-becomes-truly-intelligent"&gt;toward data as a product&lt;/a&gt;. This shift replaces brittle, manual workflows with a governed platform designed for scalability, safety and reuse. Under this modern data stack model, IT and data teams provide a secure, shared foundation, while business units maintain ownership of the application outcomes.&lt;/p&gt;
 &lt;p&gt;At each stage of this multilayered approach, data is refined and validated until it is transformed from its raw state into a reusable asset. As organizations roll out autonomous AI agents, this level of granular control over data and &lt;a href="https://www.techtarget.com/searchdatamanagement/post/Key-requirements-for-data-and-analytics-governance-platforms"&gt;comprehensive governance&lt;/a&gt; is a prerequisite for safe, reliable AI applications at scale.&lt;/p&gt;
 &lt;p&gt;Lists of modern data stack layers aren't standardized, and terminology often differs by the source. However, these are its core elements.&lt;/p&gt;
 &lt;h3&gt;1. Ingestion layer&lt;/h3&gt;
 &lt;p&gt;The first layer covers &lt;a href="https://www.techtarget.com/searchbusinessanalytics/tip/6-essential-big-data-best-practices-for-businesses"&gt;data collection&lt;/a&gt; and contains the necessary base infrastructure, including compute resources, networking, cloud services and security controls. In traditional data frameworks, this was largely an IT concern, but it is now a strategic design decision upon which the business goals of data-driven applications rest. It's no longer a choice between on-premises and cloud deployments. Instead, data leaders are designing tailored hybrid infrastructures to distribute processing across on-premises systems for data sovereignty, edge locations for real-time AI performance and cloud environments for scalable compute.&lt;/p&gt;
 &lt;p&gt;Teams can use push or pull methods to ingest data from a wide range of internal and external data sources, such as cloud applications and streaming services. In the modern data stack, there is more of a vetting process. Just because vast amounts of data can be ingested into the infrastructure doesn't mean all of it should be. The modern approach also applies a higher bar for data quality, lineage and provenance. The biggest risk in this stage is fragmentation. If data sources remain disconnected, then teams must manually integrate and clean data and redo engineering work, which slows business processes.&lt;/p&gt;
 &lt;h3&gt;2. Storage layer&lt;/h3&gt;
 &lt;p&gt;In traditional data infrastructure, this layer is often a chaotic catch-all. Companies put their ingested raw data in multiple, disconnected databases, which results in conflicting versions of the truth. This legacy approach makes ensuring AI reliability nearly impossible because there is no single, governed source of information. Data warehouses emerged first to consolidate structured data for BI and fast querying. Later, organizations used data lakes to store unprocessed data to support analytics and AI work. However, operating both a data warehouse and data lake creates redundancies with separate systems for storing and managing different data, which adds to governance and security overhead.&lt;/p&gt;
 &lt;p&gt;&lt;a href="https://www.techtarget.com/searchdatamanagement/feature/The-differences-between-a-data-warehouse-vs-data-mart"&gt;To avoid these data silos&lt;/a&gt; in the modern data stack, organizations are now moving to data lakehouses, which combine the cost efficiency of data lakes with the performance of warehouses. The lakehouse architecture enables unified governance by building a metadata layer that oversees both raw and processed data. Also, by using open table formats to build an organization-wide system of record, companies create a consistent foundation for AI model development. This method improves data processing by reducing the need for unnecessary copies of data and manual engineering.&lt;/p&gt;
 &lt;h3&gt;3. Processing layer&lt;/h3&gt;
 &lt;p&gt;This layer turns the raw data into workable assets, ready to be analyzed or fed into AI models. Processing involves preparing both batch data sets at rest and streaming data in motion for downstream analytics and AI use. This data transformation and curation process includes cleansing, standardizing, enriching, filtering, joining and aggregating the data.&lt;/p&gt;
 &lt;p&gt;In the modern data stack, this layer scales beyond the traditional nightly data update cycle designed for BI dashboard environments. The processing layer must handle real-time updates, &lt;a href="https://www.techtarget.com/searchenterpriseai/definition/multimodal-AI"&gt;multimodal&lt;/a&gt; inputs and automated lineage capture that documents every transformation. This ensures the data's journey from raw to refined is traceable and reduces the risk that AI models will &lt;a href="https://www.techtarget.com/searcherp/podcast/Can-industry-process-models-fix-the-agentic-AI-data-problem"&gt;produce hallucinations and other errors&lt;/a&gt;. Stream processing enables automated alerts and recommendations to be surfaced as quickly as possible so end users and autonomous agents can take immediate actions.&lt;/p&gt;
 &lt;p&gt;Data leaders should ensure their updated infrastructure can handle this additional work without requiring a patchwork of tools and handoffs, which could create governance gaps.&lt;/p&gt;
 &lt;h3&gt;4. Management and distribution layer&lt;/h3&gt;
 &lt;p&gt;In this layer, the processed data is organized so it is fit for purpose. Built-in features work together not just to make the data available but also to ensure it can be governed and discovered. The work here includes data cataloging, lineage visibility, governance policy enforcement and facilitation of data discovery by downstream users.&lt;/p&gt;
 &lt;p&gt;This is the most critical layer and often determines whether the entire modern data stack succeeds or fails. Ultimately, most business operations today depend on trustworthy data. Gartner predicts that 50 percent of organizations will use a &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/5-benefits-of-building-a-strong-data-governance-strategy"&gt;zero-trust model for data governance&lt;/a&gt; by 2028 due to increasing AI adoption. With the growth of AI-generated data, automated data verification and active metadata management in this layer are essential pieces of the zero-trust governance approach.&lt;/p&gt;
 &lt;p&gt;This layer tends to focus on either data mesh or data fabric architectures, each designed to make it easier for users to locate and share data without added complications. A data mesh is built on distributed domain ownership, where different departments are responsible for their own data under a federated governance structure, while a data fabric uses metadata and automated integration capabilities to join divided data assets and make it easier to reuse them.&lt;/p&gt;
 &lt;h3&gt;5. Context and semantic layer&lt;/h3&gt;
 &lt;p&gt;This is the layer where business logic is applied to both refined and raw data, giving it meaning. This context helps end users, AI systems and automation technologies understand how data should be interpreted across the organization.&lt;/p&gt;
 &lt;p&gt;Shared definitions, knowledge graphs, metrics and other structures provide semantic consistency. Connecting context and semantics to data lineage and access policies reduces decision-making time for users and AI tools alike by removing the need to question whether data is relevant to applications.&lt;/p&gt;
 &lt;h3&gt;6. Integrity and quality layer&lt;/h3&gt;
 &lt;p&gt;This layer maintains the fidelity of data as it moves through the stack. It combines data observability, data stewardship, data quality checks and privacy controls to &lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/Data-contracts-help-build-trustworthy-data-products-for-AI"&gt;ensure data is accurate&lt;/a&gt;, consistent, documented and protected for effective decision-making.&lt;/p&gt;
 &lt;p&gt;This arrangement provides structure to the stack to prevent unreliable data feeds and data silos. Data quality rules identify missing values, data duplication and freshness issues. Master data management practices create common records for business entities, such as customers and products, to maintain consistency across systems. Data stewards apply governance and security policies that &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Data-governance-challenges-that-can-sink-data-operations"&gt;dictate who gets access to data and when&lt;/a&gt;.&lt;/p&gt;
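&lt;p&gt;&lt;i&gt;As one example of the data quality rules described above, a freshness check can flag records older than an allowed staleness window. The 24-hour threshold and record shape are assumptions for illustration.&lt;/i&gt;&lt;/p&gt;

```python
# Freshness-rule sketch: flag records whose last update is older than an
# allowed staleness window. The 24-hour window and fields are assumptions.
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(hours=24)

def stale_records(rows, now=None):
    now = now or datetime.now(timezone.utc)
    return [r["id"] for r in rows if now - r["updated_at"] > MAX_STALENESS]

now = datetime(2026, 3, 20, 12, 0, tzinfo=timezone.utc)
rows = [
    {"id": 1, "updated_at": datetime(2026, 3, 20, 9, 0, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2026, 3, 18, 9, 0, tzinfo=timezone.utc)},
]
print(stale_records(rows, now))  # [2]
```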
 &lt;h3&gt;7. Consumption layer&lt;/h3&gt;
 &lt;p&gt;This is the top of the stack, the culmination of all the architectural choices designed to produce refined, trusted data and get it to the right users and systems at the right time.&lt;/p&gt;
 &lt;p&gt;Traditionally, consumption meant dashboards, reports and analytics tools, but it now includes embedded analytics, machine learning applications, and agentic AI or semi-autonomous workflows. Rather than simply adding AI to old processes, data leaders are redesigning this layer so &lt;a href="https://www.techtarget.com/searchsoftwarequality/tip/How-effective-is-your-AI-agent-benchmarks-to-consider"&gt;agents and people can work collaboratively&lt;/a&gt; with clear decision-making boundaries, ensuring IT provides the platform while business units determine results.&lt;/p&gt;
 &lt;div class="youtube-iframe-container"&gt;
  &lt;iframe id="ytplayer-0" src="https://www.youtube.com/embed/7FufIRExfpo?autoplay=0&amp;amp;modestbranding=1&amp;amp;rel=0&amp;amp;widget_referrer=null&amp;amp;enablejsapi=1&amp;amp;origin=https://www.techtarget.com" type="text/html" height="360" width="640" frameborder="0"&gt;&lt;/iframe&gt;
 &lt;/div&gt;
&lt;/section&gt;                            
&lt;section class="section main-article-chapter" data-menu-title="What matters most when reassessing the data stack"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;What matters most when reassessing the data stack&lt;/h2&gt;
 &lt;p&gt;When it's time to update how your organization processes data and data platform vendors come calling, prepare product evaluation questions to meet your specific needs rather than getting lost in talks about performance and feature checklists.&lt;/p&gt;
 &lt;p&gt;AI initiatives introduce a new set of requirements beyond the capabilities of existing data architectures. Today, the priorities include avoiding data duplication, improving data portability, and ensuring strong lineage and consistency across departments and clouds.&lt;/p&gt;
 &lt;p&gt;Tailor the requirements for a modern data stack platform to your organization, but the following are some questions to ask:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Does the platform provide a unified semantic layer and active metadata to ensure consistent logic across AI agents and BI applications?&lt;/li&gt; 
  &lt;li&gt;Does the platform support hybrid cloud and multi-cloud deployments by design for seamless workload migration based on cost, performance or data sovereignty requirements?&lt;/li&gt; 
  &lt;li&gt;Does it have policy-as-code &lt;a target="_blank" href="https://www.cncf.io/blog/2025/07/29/introduction-to-policy-as-code/" rel="noopener"&gt;capabilities&lt;/a&gt; to standardize data governance, privacy and quality across data assets, AI models and agents?&lt;/li&gt; 
  &lt;li&gt;What are the platform's capabilities related to open table formats, APIs and portable pipelines to avoid extensive work when moving data and workloads?&lt;/li&gt; 
  &lt;li&gt;What is the status of agentic AI governance, and what are the plans to close any oversight gaps?&lt;/li&gt; 
  &lt;li&gt;Is there a single management interface for data stewards to monitor policy enforcement and issue resolution?&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="What's coming next for the modern data stack?"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;What's coming next for the modern data stack?&lt;/h2&gt;
 &lt;p&gt;All signals from leading analyst firms indicate the next evolution of the data stack will refine context awareness, tighten governance and integrate more closely with business workflows and agentic AI systems. These trends are linked: as companies increasingly deploy agents, they need richer context and stronger data controls. Deloitte's 2026 AI survey &lt;a target="_blank" href="https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/state-of-ai-in-the-enterprise.html" rel="noopener"&gt;found&lt;/a&gt; that while 74 percent of companies plan to deploy agentic AI within two years, only 21 percent have a governance model for them now.&lt;/p&gt;
 &lt;p&gt;Vendors are converging the stack, joining layers, &lt;a href="https://www.techtarget.com/searchdatamanagement/news/366631576/New-consortium-to-aid-AI-by-standardizing-semantic-modeling"&gt;improving semantic structure&lt;/a&gt; and embedding oversight. They are moving toward a unified, governed data lakehouse to reduce redundant copies and data movement across silos, cutting costs and security risks. This architecture supports the federated, shared ownership model in which business leaders set standards and quality expectations, while IT manages the data lakehouse and enforces policies to keep data and AI aligned at scale.&lt;/p&gt;
 &lt;p&gt;Organizations reassessing their existing stack architecture should take a modular approach: avoid overbuying, and focus on the immediate needs for data context and trust. That flexibility gets AI and analytics work done today, rather than locking into a rigid, expensive redesign that might be obsolete in a few years.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Editor's note&lt;/b&gt;&lt;i&gt;: TechTarget editors updated this article, originally published in 2023 and written by &lt;a href="https://www.techtarget.com/contributor/Jeff-McCormick"&gt;Jeff McCormick&lt;/a&gt;, in March 2026 to add new information and improve timeliness.&lt;/i&gt;&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Tom Walat is an editor and reporter for TechTarget, where he covers data technologies.&lt;/i&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Data infrastructure and practices need an upgrade for the AI era. A modern data stack takes a layered approach that aligns teams and delivers governed, trusted data.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/container_g488602622.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/tip/Assemble-the-layers-of-big-data-stack-architecture</link>
            <pubDate>Fri, 20 Mar 2026 10:24:00 GMT</pubDate>
            <title>Understanding the layers of the AI‑ready modern data stack</title>
        </item>
        <item>
            <body>&lt;p&gt;Smart organizations use big data to better understand their customers, identify market trends and improve business operations, thereby boosting financial performance and gaining a competitive advantage over rivals. However, investing in big data technologies and applications without a strategic plan is a recipe for wasting time, money and resources.&lt;/p&gt; 
&lt;p&gt;Without a well-defined strategy for &lt;a href="https://www.techtarget.com/searchdatamanagement/The-ultimate-guide-to-big-data-for-businesses"&gt;managing and using big data assets&lt;/a&gt;, a company might end up with separate, uncoordinated initiatives. There's a risk of duplicate or conflicting analytics and AI projects, as well as ones that aren't aligned with strategic business objectives.&lt;/p&gt; 
&lt;p&gt;Developing a comprehensive strategy to underpin &lt;a href="https://www.techtarget.com/searchbusinessanalytics/feature/8-big-data-use-cases-for-businesses-and-industry-examples"&gt;big data applications&lt;/a&gt; is easier said than done, but the guidance and steps outlined below will help data leaders manage the process.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="What a big data strategy includes"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;What a big data strategy includes&lt;/h2&gt;
 &lt;p&gt;An effective big data strategy maps out how the data will be used to support business processes and decision-making. It defines specific business goals for big data applications and sets guidelines for using data to ensure compliance with privacy and regulatory requirements. To align big data initiatives with business needs and objectives, business leaders must be involved in developing the strategy from start to finish.&lt;/p&gt;
 &lt;p&gt;The strategy also specifies procedures for managing the big data environment. That includes details on how data management and analytics teams will &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/10-big-data-challenges-and-how-to-address-them"&gt;address various big data challenges&lt;/a&gt;, such as the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Collecting and processing a combination of unstructured, semistructured and structured data from both internal and external sources.&lt;/li&gt; 
  &lt;li&gt;&lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Establish-big-data-integration-techniques-and-best-practices"&gt;Integrating different data sets&lt;/a&gt; to give end users a comprehensive view of relevant data.&lt;/li&gt; 
  &lt;li&gt;Identifying and &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Data-quality-for-big-data-Why-its-a-must-and-how-to-improve-it"&gt;fixing data quality problems&lt;/a&gt; to ensure data is accurate, consistent and trustworthy.&lt;/li&gt; 
  &lt;li&gt;Controlling storage and associated data management costs.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="Building a big data strategy"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Building a big data strategy&lt;/h2&gt;
 &lt;p&gt;Here are four steps to take when formulating a big data strategy for an organization.&lt;/p&gt;
 &lt;h3&gt;1. Define business goals and objectives for big data applications&lt;/h3&gt;
 &lt;p&gt;Start by defining the business objectives that the strategy aims to achieve. Businesses aren't the same, so there's no one-size-fits-all answer here. Align the strategy with corporate business objectives, critical KPIs and key business problems the company needs to address.&lt;/p&gt;
 &lt;p&gt;Input from senior executives and business managers on business goals and needs ensures that the strategy supports their priorities -- and that the organization will adopt it. Also involve data scientists and analysts who work with the business on analytics initiatives, as well as members of the data management team.&lt;/p&gt;
 &lt;h3&gt;2. Identify relevant data sources and assess data readiness&lt;/h3&gt;
 &lt;p&gt;The next step is identifying useful data sources to incorporate into the strategy and assessing the readiness of the associated data assets. As part of the assessment, document data formats, profile data, measure quality levels and evaluate data integration and transformation requirements.&lt;/p&gt;
 &lt;p&gt;Map data sources to the strategy's business objectives and gauge data readiness accordingly. For example, if improving the customer experience is a business objective, the readiness assessment should cover any data assets related to customer touchpoints.&lt;/p&gt;
 &lt;h3&gt;3. Identify and prioritize big data use cases&lt;/h3&gt;
 &lt;p&gt;Think big on use cases, but start small when developing plans for big data applications. Be realistic about how much the data management and analytics teams can handle at once. Upfront analytics can help identify applicable -- and achievable -- use cases by uncovering patterns, correlations and other useful data insights.&lt;/p&gt;
 &lt;p&gt;Prioritize use cases based on factors such as their &lt;a href="https://www.techtarget.com/searchbusinessanalytics/feature/6-big-data-benefits-for-businesses"&gt;potential business benefits&lt;/a&gt; and the required budget and resources. Depending on the number of departments and business units involved, this process can be complex. Work with the various stakeholders to create a plan, then document which use cases will be pursued so everyone is aware of the prioritized list.&lt;/p&gt;
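&lt;p&gt;As a rough illustration of that prioritization process, the following sketch scores use cases with a simple weighted model. The factors, weights, names and scores are hypothetical assumptions, not a prescribed formula -- replace them with criteria agreed on with stakeholders.&lt;/p&gt;

```python
# Illustrative weighted scoring model for prioritizing big data use cases.
# All factors, weights, names and scores are hypothetical examples.

WEIGHTS = {"business_benefit": 0.5, "feasibility": 0.3, "cost_fit": 0.2}

use_cases = [
    # Factor scores on a 1-5 scale.
    {"name": "Churn prediction", "business_benefit": 5, "feasibility": 3, "cost_fit": 2},
    {"name": "Inventory forecasting", "business_benefit": 4, "feasibility": 4, "cost_fit": 4},
    {"name": "Log anomaly detection", "business_benefit": 3, "feasibility": 4, "cost_fit": 5},
]

def score(use_case):
    """Weighted sum of one use case's factor scores."""
    return sum(WEIGHTS[factor] * use_case[factor] for factor in WEIGHTS)

# Highest-scoring use cases first -- the documented, prioritized list.
ranked = sorted(use_cases, key=score, reverse=True)
for use_case in ranked:
    print(f"{use_case['name']}: {score(use_case):.1f}")
```

&lt;p&gt;A documented model like this keeps the prioritized list transparent: when stakeholders disagree, the discussion shifts to the weights rather than the outcome.&lt;/p&gt;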
 &lt;h3&gt;4. Create a roadmap for big data projects&lt;/h3&gt;
 &lt;p&gt;Plotting a roadmap for big data applications is often the most time-consuming step when building a big data strategy. Even after the roadmap is completed, it isn't set in stone. It will likely evolve over time as business objectives, priorities and opportunities change.&lt;/p&gt;
 &lt;p&gt;As part of the roadmap, identify gaps in data technologies, processes and skill sets that could affect the success of planned applications. The gap analysis will inform investments in the &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Building-a-big-data-architecture-Core-components-best-practices"&gt;big data architecture&lt;/a&gt; and the internal resources needed to support the applications. It might also prompt a review of the prioritized use cases to assess whether any changes are needed due to existing gaps that can't be filled immediately.&lt;/p&gt;
 &lt;figure class="main-article-image full-col" data-img-fullsize="https://www.techtarget.com/rms/onlineimages/datamanagement-4_steps_to_building_a%20big_data_strategy-f.png"&gt;
  &lt;img data-src="https://www.techtarget.com/rms/onlineimages/datamanagement-4_steps_to_building_a%20big_data_strategy-f_mobile.png" class="lazy" data-srcset="https://www.techtarget.com/rms/onlineimages/datamanagement-4_steps_to_building_a%20big_data_strategy-f_mobile.png 960w,https://www.techtarget.com/rms/onlineimages/datamanagement-4_steps_to_building_a%20big_data_strategy-f.png 1280w" alt="Graphic containing text that describes four key steps for building a big data strategy." height="301" width="559"&gt;
  &lt;figcaption&gt;
   &lt;i class="icon pictures" data-icon="z"&gt;&lt;/i&gt;Organizations should follow these steps to create a big data strategy.
  &lt;/figcaption&gt;
  &lt;div class="main-article-image-enlarge"&gt;
   &lt;i class="icon" data-icon="w"&gt;&lt;/i&gt;
  &lt;/div&gt;
 &lt;/figure&gt;
&lt;/section&gt;               
&lt;section class="section main-article-chapter" data-menu-title="Be flexible when implementing the strategy"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Be flexible when implementing the strategy&lt;/h2&gt;
 &lt;p&gt;Flexibility is often the most important principle to adopt when implementing a big data strategy. Business needs, data and &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/15-big-data-tools-and-technologies-to-know-about"&gt;available tools and technologies&lt;/a&gt; aren't static, so strategy development isn't a one-and-done exercise.&lt;/p&gt;
 &lt;p&gt;Data leaders must be prepared to quickly adjust budgets, technologies, staffing and data management and analytics processes in response to changing circumstances. For example, IT infrastructure changes might be necessary to ensure end users can access critical data from new sources. Increasing deployments of enterprise AI applications also create new data demands that must be &lt;a href="https://www.techtarget.com/searchenterpriseai/tip/How-do-big-data-and-AI-work-together"&gt;factored into big data strategies&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;Similarly, required roles and skill sets might change over time. As a result, building strong teams to support big data initiatives typically relies on a combination of external hiring and retraining or upskilling of current employees. The balance between those approaches will likely fluctuate depending on particular staffing needs.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Editor's note:&lt;/b&gt; &lt;i&gt;TechTarget editors updated this article in March 2026 for timeliness and to add new information.&lt;/i&gt;&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Kathleen Walch is director of AI engagement and community at Project Management Institute. She previously was co-founder and managing partner of Cognilytica, a technology research and training firm acquired by PMI in 2024.&lt;/i&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Big data initiatives won't deliver business benefits without a comprehensive strategy to guide data management and analytics work. Here's how to build one.</description>
            <image>https://cdn.ttgtmedia.com/visuals/searchContentManagement/collaboration_technology/contentmanagement_article_003.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/feature/How-to-build-an-enterprise-big-data-strategy-in-4-steps</link>
            <pubDate>Thu, 19 Mar 2026 16:34:00 GMT</pubDate>
            <title>How to build an effective big data strategy</title>
        </item>
        <item>
            <body>&lt;p&gt;Business leaders generally know what data they have, where it's stored and who can access it. What's less visible, but increasingly important, is how that data has changed over time.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://www.techtarget.com/searchdatamanagement/tip/How-data-lineage-tools-boost-data-governance-policies"&gt;Data lineage&lt;/a&gt; provides visibility into a data asset's complete lifecycle, including its origin, transformations and access history. This context strengthens risk mitigation, compliance readiness and data quality efforts. It's like having every set of blueprints for a house, from the initial build through every renovation; there are no secrets or surprises. When leaders can trace how business decisions were made and how their data changed over time, they gain confidence in the tools, people and information involved in determining what should change moving forward.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="Why businesses need data lineage"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Why businesses need data lineage&lt;/h2&gt;
 &lt;p&gt;Many organizations focus on present and future data use, but the past matters just as much. Without visibility into how data is created, changed and shared, teams struggle to maintain data quality, transparency and reliability -- gaps that limit effective decision-making.&lt;/p&gt;
 &lt;p&gt;Effective data lineage supports the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
   &lt;li&gt;&lt;b&gt;Accountability.&lt;/b&gt; Clear lineage shows who accessed or changed data, helping identify who made improvements and who introduced risks -- for example, distinguishing a data engineer who &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/6-dimensions-of-data-quality-boost-data-performance"&gt;improved data quality&lt;/a&gt; from someone who made malicious changes.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Data quality.&lt;/b&gt; Tracing transformations identifies &lt;a href="https://www.informationweek.com/data-management/11-irritating-data-quality-issues" target="_blank" rel="noopener"&gt;the stage where data quality dropped&lt;/a&gt;, such as deletions, additions, deduplications or merges, allowing teams to fix issues at the source.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Reduced technical debt.&lt;/b&gt; Understanding data's original purpose highlights obsolete infrastructure or unused assets that can be safely deleted or archived.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Compliance.&lt;/b&gt; Regulatory frameworks increasingly expect traceable and documented data flows. For example, the &lt;a href="https://www.techtarget.com/searchenterpriseai/opinion/Everything-you-need-to-know-about-the-new-EU-AI-Act"&gt;EU AI Act&lt;/a&gt; requires organizations to trace the provenance of an AI model's training data. While more traditional regulations, such as GDPR and HIPAA, do not explicitly require data lineage, tracking the origin and lifecycle of data is critical for meeting their privacy and auditability standards.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="The risks of poor data lineage"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;The risks of poor data lineage&lt;/h2&gt;
 &lt;p&gt;When businesses fail to maintain adequate visibility into data lineage, they expose themselves to financial and reputational risk. Without data lineage, organizations struggle with:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
   &lt;li&gt;&lt;b&gt;Limited decision-making context.&lt;/b&gt; Leaders cannot validate insights when they lack visibility into the data lifecycle. This gap can disrupt &lt;a href="https://www.techtarget.com/searchbusinessanalytics/tip/Key-steps-form-a-data-driven-decision-making-framework"&gt;effective decision-making&lt;/a&gt;. It's hard to have confidence in an analytics report without the historical perspective to know whether metrics were removed, filtered or dropped along the way, which can obscure growth opportunities.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Erosion of stakeholder trust.&lt;/b&gt; Unclear data origins undermine stakeholder confidence in business processes or decisions that depend on data. Earning back that trust can be a lengthy, costly process.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Compliance exposure.&lt;/b&gt; Many regulations require traceable data handling. Businesses without these records can face audits, sanctions or lawsuits.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Lower data quality.&lt;/b&gt; Data quality typically suffers when businesses lose visibility into data's origins and evolution, making it difficult to find and fix errors.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Reduced IT efficiency.&lt;/b&gt; Incomplete lineage makes it harder to &lt;a href="https://www.techtarget.com/searchcio/feature/Key-technical-debt-reduction-strategies"&gt;manage technical debt&lt;/a&gt; effectively, reducing IT and engineering teams' ability to focus on more innovative work.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="The AI imperative for data lineage"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;The AI imperative for data lineage&lt;/h2&gt;
 &lt;p&gt;While maintaining data lineage is important for every data asset, growing adoption of &lt;a href="https://www.techtarget.com/searchenterpriseai/tip/Agentic-AI-vs-generative-AI-Whats-the-difference"&gt;generative and agentic AI technology&lt;/a&gt; has made it even more critical. These technologies depend on AI and machine learning models trained on vast amounts of data. Without knowing data's origins or how it evolved, troubleshooting and optimizing AI models becomes significantly more difficult.&lt;/p&gt;
 &lt;p&gt;&lt;a href="https://www.informationweek.com/machine-learning-ai/ai-hallucinations-can-prove-costly" target="_blank" rel="noopener"&gt;Hallucinations are a big risk&lt;/a&gt;. When a model frequently produces unreliable outputs, then lineage helps determine whether the issue is related to insufficient, missing or altered data, or a flaw in the model design.&lt;/p&gt;
 &lt;p&gt;Beyond working with training data, data lineage also affects &lt;a href="https://www.nojitter.com/ai-voice/managing-ai-tools-requires-more-data-transparency" target="_blank" rel="noopener"&gt;AI model deployment and management&lt;/a&gt;. It provides visibility into prompt libraries that organizations use to guide how users interact with models. Data lineage is useful for complex, multi-step agentic workflows because it tracks how data evolves as it interacts with AI agents.&lt;/p&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="How to achieve effective data lineage"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;How to achieve effective data lineage&lt;/h2&gt;
 &lt;p&gt;Creating effective data lineage involves the following core processes:&lt;/p&gt;
 &lt;ol class="default-list"&gt; 
   &lt;li&gt;&lt;b&gt;Identify valuable assets.&lt;/b&gt; Not every asset requires lineage -- short-lived data, for example, often doesn't. Focus on high-value data assets.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Set lineage depth.&lt;/b&gt; Determine the amount of detail required. Some data assets require more extensive lineage than others.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Record data events.&lt;/b&gt; After establishing lineage goals, organizations should record all data-related events that fall within the scope of their plans, including data creation, transformation and changes to access policies.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Validate lineage. &lt;/b&gt;Periodically audit lineage records to detect and correct visibility gaps or enforcement oversights.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Adapt over time.&lt;/b&gt; Revisit lineage needs and strategy regularly and update practices as needed.&lt;/li&gt; 
 &lt;/ol&gt;
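&lt;p&gt;To make the recording and tracing steps concrete, the following minimal sketch shows one possible shape for a lineage event log. The schema, field names and sample events are illustrative assumptions, not a standard.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Minimal sketch of a lineage event record -- the fields are illustrative,
# not a standard schema. Each in-scope data event (creation, transformation,
# access-policy change) appends one immutable record.

@dataclass(frozen=True)
class LineageEvent:
    asset: str       # the data asset the event applies to
    event_type: str  # e.g. "created", "transformed", "policy_changed"
    actor: str       # who or what made the change (supports accountability)
    details: str = ""
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

lineage_log: list[LineageEvent] = []

def record(event: LineageEvent) -> None:
    lineage_log.append(event)

def history(asset: str) -> list[LineageEvent]:
    """Trace one asset's recorded lifecycle, oldest event first."""
    return [e for e in lineage_log if e.asset == asset]

record(LineageEvent("sales.orders", "created", "ingest-pipeline"))
record(LineageEvent("sales.orders", "transformed", "etl-job", "deduplicated rows"))
record(LineageEvent("hr.payroll", "policy_changed", "governance-team"))

print(len(history("sales.orders")))  # two recorded events for this asset
```

&lt;p&gt;In practice, a lineage tool persists and visualizes records like these; the point here is only that each event captures the asset, the action and the actor.&lt;/p&gt;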
  &lt;p&gt;Automated lineage tools help track data history, but they require human oversight to ensure accuracy. Organizations often assign data stewards to this role, and data engineers and &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Data-governance-roles-and-responsibilities-Whats-needed"&gt;data governance teams&lt;/a&gt; typically share responsibility for maintaining lineage quality.&lt;/p&gt;
 &lt;p&gt;&lt;em&gt;Chris Tozzi is a freelance writer, research adviser, and professor of IT and society who has previously worked as a journalist and Linux systems administrator.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Data lineage is critical to enterprise success. Tracing a data asset's journey through pipelines helps improve data quality, speeds up corrections and builds trust in analytics.</description>
            <image>https://cdn.ttgtmedia.com/visuals/searchDataManagement/content_management/datamanagement_article_003.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/tip/Data-lineage-documentation-imperative-to-data-quality</link>
            <pubDate>Wed, 18 Mar 2026 10:00:00 GMT</pubDate>
            <title>Data lineage documentation matters for enterprise reliability</title>
        </item>
        <item>
            <body>&lt;p&gt;Open source databases have steadily gained ground on proprietary systems, driven by the rise of Linux, cloud computing and NoSQL technologies. Open source databases and source-available alternatives provide organizations with options beyond proprietary platforms.&lt;/p&gt; 
&lt;p&gt;Open source software offers user organizations the promise of source code developed in the open, typically in a community-driven process. The aim is to expand the number of people involved in the development process and not lock users into a specific vendor's technology.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="What are open source databases?"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;What are open source databases?&lt;/h2&gt;
 &lt;p&gt;&lt;a name="_Hlk164960278"&gt;&lt;/a&gt;Open source databases are developed and released under an open source license. While &lt;i&gt;open source&lt;/i&gt; is sometimes used as a marketing term, it has a very specific definition when it comes to software licenses. To qualify as open source, a database must use a license approved by the Open Source Initiative. The OSI determines whether licenses adhere to the Open Source Definition (OSD), which is the &lt;a href="https://opensource.org/osd"&gt;guiding document&lt;/a&gt; for open source licensing.&lt;/p&gt;
 &lt;p&gt;Licensing has become more complicated, though. A growing number of vendors that created open source databases have adopted licenses that largely adhere to the tenets of the OSD but aren't OSI-approved. In many cases, these licenses require cloud providers offering database as a service (DBaaS) implementations to release modified or related source code under the same license. These are typically called source-available licenses. Databases licensed this way are often grouped with fully open source technologies as alternatives to proprietary, closed source databases.&lt;/p&gt;
 &lt;p&gt;The broad category of open source and source-available databases contains &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Evaluating-the-different-types-of-DBMS-products"&gt;various types of database software&lt;/a&gt; that support different applications. That includes SQL-based &lt;a href="https://www.techtarget.com/searchdatamanagement/definition/relational-database"&gt;relational databases&lt;/a&gt;, the most widely used type, and the four primary NoSQL technologies -- key-value stores, document databases, wide-column stores and &lt;a href="https://www.techtarget.com/whatis/definition/graph-database"&gt;graph databases&lt;/a&gt;. Open source versions of special-purpose systems, such as vector databases and time series databases, are also available. In addition, many vendors now offer databases that support more than one data model.&lt;/p&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="Potential benefits of using open source databases"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Potential benefits of using open source databases&lt;/h2&gt;
 &lt;p&gt;Open source databases &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Top-open-source-database-advantages-for-enterprises"&gt;offer many potential benefits&lt;/a&gt;, some of which also apply to source-available technologies. The following are among the primary benefits for user organizations:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
   &lt;li&gt;&lt;b&gt;Easy to get started. &lt;/b&gt;A core premise of the open source approach is that the technology is freely available. As a result, users can easily try out and deploy an open source database without first having to pay for it. In many cases, vendors also offer paid support as well as closed source versions of their databases with additional features.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Community support and engagement.&lt;/b&gt; Open source or source-available code typically comes with a community of engaged users and contributors who can help new users with the technology. It also enables a degree of participation in the code development process. For example, users can submit bug reports and feature requests and become contributors themselves.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Source code transparency.&lt;/b&gt; When source code is open and can be viewed by anyone, there's a better chance of understanding how a database works and how it can be used effectively to meet business needs.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Flexibility and customization.&lt;/b&gt; With some open source licenses, developers are free to modify the database software to meet specific custom requirements.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Improved security.&lt;/b&gt; Because the source code is open, developers, users and security researchers can thoroughly scrutinize it to identify vulnerabilities. That enables &lt;a href="https://www.techtarget.com/searchsecurity/tip/5-enterprise-patch-management-best-practices"&gt;rapid patching of vulnerabilities&lt;/a&gt; after they're discovered.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;The technologies listed below are among the most prominent open source and source-available databases. TechTarget editors compiled the list based on research into the database market, including Gartner vendor rankings and database management system (&lt;a href="https://www.techtarget.com/searchdatamanagement/definition/database-management-system"&gt;DBMS&lt;/a&gt;) popularity rankings from DB-Engines. However, the list itself is unranked and includes five relational open source databases, three NoSQL ones and four source-available technologies, organized in that order.&lt;/p&gt;
 &lt;p&gt;Each write-up outlines key features, potential use cases, licensing and commercial support options to help organizations choose the right database for their application needs.&lt;/p&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="1. MySQL"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;1. MySQL&lt;/h2&gt;
 &lt;p&gt;MySQL is among the most widely deployed open source databases. It was first released in 1996 as an independent effort led by Michael "Monty" Widenius and two other developers, who co-founded MySQL AB to create the database. The company was acquired in 2008 by Sun Microsystems, which was then bought by Oracle in 2010. MySQL has remained a core part of Oracle's database portfolio ever since while being maintained as open source software.&lt;/p&gt;
 &lt;p&gt;A relational database, MySQL was originally positioned as an online transaction processing (OLTP) system and is still primarily geared to transactional uses, although Oracle's MySQL HeatWave cloud database service now also supports analytics and &lt;a href="https://www.techtarget.com/searchenterpriseai/definition/machine-learning-ML"&gt;machine learning&lt;/a&gt; applications. MySQL gained much of its early popularity as a cornerstone of the LAMP stack of open source technologies -- Linux, Apache, MySQL and PHP, Perl or Python -- that powered the first generation of web development. It continues to be an underlying database on many websites today.&lt;/p&gt;
  &lt;p&gt;&lt;b&gt;Common use cases: &lt;/b&gt;Like other relational databases, MySQL complies with the &lt;a href="https://www.techtarget.com/searchdatamanagement/definition/ACID"&gt;ACID&lt;/a&gt; properties -- atomicity, consistency, isolation and durability -- to ensure data integrity and reliability. As a result, it supports a broad range of applications. For example, MySQL is commonly used as the back-end database for web applications, cloud applications and content management systems.&lt;/p&gt;
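&lt;p&gt;The atomicity property can be illustrated with the standard Python DB-API commit/rollback pattern. This sketch uses Python's built-in sqlite3 module as a stand-in for a MySQL connection; MySQL drivers that follow the same DB-API specification expose the same commit and rollback calls.&lt;/p&gt;

```python
import sqlite3

# Atomicity via the DB-API commit/rollback pattern. sqlite3 (stdlib) stands
# in for a MySQL connection here; the table and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Transfer 50 from alice to bob -- both updates must apply, or neither.
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    raise RuntimeError("simulated failure mid-transaction")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'bob'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # the partial debit is undone

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
print(balance)  # 100 -- alice's balance is unchanged after the rollback
```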
 &lt;p&gt;&lt;b&gt;Licensing: &lt;/b&gt;MySQL is dual-licensed under the GPL version 2 open source license and an Oracle license for organizations looking to distribute the database along with commercial applications.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository&lt;/b&gt;&lt;b&gt;:&lt;/b&gt; &lt;a target="_blank" href="https://github.com/mysql/mysql-server" rel="noopener"&gt;https://github.com/mysql/mysql-server&lt;/a&gt;&lt;/p&gt;
  &lt;p&gt;&lt;b&gt;Commercial support options: &lt;/b&gt;There are numerous commercial implementations of MySQL. Oracle offers multiple options beyond MySQL HeatWave, including Enterprise and Standard editions and an embedded version. MySQL is also available in the cloud as part of the &lt;a href="https://www.techtarget.com/searchaws/definition/Amazon-Relational-Database-Service-RDS"&gt;Amazon Relational Database Service (RDS)&lt;/a&gt; from AWS, as well as Google's Cloud SQL and Microsoft's Azure Database services. Vendors such as Aiven, PlanetScale and Percona also offer MySQL cloud services.&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="2. MariaDB"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;2. MariaDB&lt;/h2&gt;
  &lt;p&gt;MariaDB debuted in 2009 as a fork of MySQL that was created by a team also led by Widenius, who left Sun early that year because he was concerned about the direction and development of MySQL. Work on MariaDB started when he was still at Sun, and it was originally designed to be a drop-in replacement for MySQL. But that remained fully true only through the 5.5 releases of the two databases. After that, new features not in MySQL were added to MariaDB, which used different numbering on subsequent releases.&lt;/p&gt;
  &lt;p&gt;Even with newer updates, though, it's still relatively easy to migrate from MySQL to MariaDB. The latter's data files are generally binary-compatible with MySQL's, and the database client protocols are also compatible. As a result, in many cases users can simply uninstall MySQL and install MariaDB to switch between them. MariaDB PLC, which leads development of the software through the MariaDB Foundation, maintains a list of &lt;a target="_blank" href="https://mariadb.com/kb/en/mariadb-vs-mysql-compatibility/" rel="noopener"&gt;incompatibilities and feature differences&lt;/a&gt; with MySQL.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Common use cases:&lt;/b&gt; MariaDB is commonly used for the same purposes as MySQL, including in web and cloud applications involving both transaction processing and analytics workloads.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Licensing:&lt;/b&gt; The free MariaDB Server software -- referred to by the company as MariaDB Community Server -- is released under the GPLv2 license.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository&lt;/b&gt;&lt;b&gt;:&lt;/b&gt; &lt;a target="_blank" href="https://github.com/MariaDB/server" rel="noopener"&gt;https://github.com/MariaDB/server&lt;/a&gt;&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Commercial support options&lt;/b&gt;&lt;b&gt;:&lt;/b&gt; MariaDB PLC sells a MariaDB Enterprise Server version of the database with enterprise features such as &lt;a href="https://www.theserverside.com/definition/JSON-Javascript-Object-Notation"&gt;JSON&lt;/a&gt; support and columnar storage. MariaDB Cloud is the company's fully managed DBaaS offering. MariaDB is also available through Amazon RDS, while Microsoft retired Azure Database for MariaDB in September 2025.&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="3. PostgreSQL"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;3. PostgreSQL&lt;/h2&gt;
 &lt;p&gt;PostgreSQL got its start as Postgres in 1986 at the University of California, Berkeley. The Postgres project was initiated by relational database pioneer Michael Stonebraker, then a professor at the school, as a more advanced alternative to Ingres, a proprietary relational database management system (&lt;a href="https://www.techtarget.com/searchdatamanagement/definition/RDBMS-relational-database-management-system"&gt;RDBMS&lt;/a&gt;) that he also played a lead role in developing. The software became open source in 1995, when a SQL language interpreter was also added, and it was officially renamed PostgreSQL in 1996. Decades later, though, PostgreSQL and Postgres are still used interchangeably by developers, vendors and users to refer to the database.&lt;/p&gt;
  &lt;p&gt;PostgreSQL offers full RDBMS features, including ACID compliance, SQL querying and support for procedural language queries to create stored procedures and triggers in databases. Like MySQL, MariaDB and many other database technologies, it also supports multiversion concurrency control (&lt;a href="https://www.theserverside.com/blog/Coffee-Talk-Java-News-Stories-and-Opinions/What-is-MVCC-How-does-Multiversion-Concurrencty-Control-work"&gt;MVCC&lt;/a&gt;) so data can be read and updated by different users at the same time. In addition, PostgreSQL supports database object types beyond standard relational tables, and it's described as an object-relational DBMS on the open source project's website.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Common use cases: &lt;/b&gt;PostgreSQL is commonly positioned as an open source alternative to the proprietary Oracle Database. It's widely used to support enterprise applications that require complex transactions and high levels of concurrency, and sometimes in data warehousing operations.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Licensing&lt;/b&gt;: The software is available under the OSI-approved PostgreSQL License.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository&lt;/b&gt;&lt;b&gt;:&lt;/b&gt; &lt;a target="_blank" href="https://git.postgresql.org/gitweb/?p=postgresql.git;a=summary" rel="noopener"&gt;https://git.postgresql.org/gitweb/?p=postgresql.git;a=summary&lt;/a&gt;&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Commercial support options:&lt;/b&gt; PostgreSQL has a wide range of commercial support and managed cloud offerings. EDB specializes in PostgreSQL and provides both self-managed and DBaaS versions in the cloud. Managed PostgreSQL cloud services are also available from &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Cloud-database-comparison-AWS-Microsoft-Google-and-Oracle"&gt;AWS, Google, Microsoft and Oracle&lt;/a&gt;, as well as vendors such as Aiven, Percona and Instaclustr.&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="4. Firebird"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;4. Firebird&lt;/h2&gt;
 &lt;p&gt;The Firebird open source relational database's technology roots go back to the early 1980s, when the proprietary InterBase database was created. After InterBase was acquired by multiple vendors, commercial product development ended and the final release was made available under an open source license in 2000. Within a week, the Firebird project was created to continue developing a fork of the technology.&lt;/p&gt;
 &lt;p&gt;Firebird supports ACID-compliant transactions, external user-defined functions and various standard SQL features, and it includes a multi-generational architecture that provides MVCC capabilities. The software has a relatively small footprint and is available in an embedded single-user version, but it can also be used to run multi-terabyte databases with hundreds of concurrent users. It shouldn't be confused with Firestore and Firebase Realtime Database, two commercial NoSQL databases developed by Google.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Common use cases:&lt;/b&gt; Firebird can handle both operational and analytics applications. It's used in various types of enterprise applications, including ERP and CRM systems.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Licensing:&lt;/b&gt; Firebird is made available under the InterBase Public License (IPL) and the Initial Developer's Public License (IDPL). Both are variants of the Mozilla Public License Version 1.1, which is OSI-approved though now superseded by Version 2.0. The IPL covers the source code from InterBase, while the IDPL applies to added or improved code developed as part of the Firebird project.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository:&lt;/b&gt; &lt;a target="_blank" href="https://github.com/FirebirdSQL/firebird" rel="noopener"&gt;https://github.com/FirebirdSQL/firebird&lt;/a&gt;&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Commercial support options: &lt;/b&gt;Firebird is an independent open source project, and the software is free to use, including for commercial purposes. The Firebird project points users to commercial providers for technical support, consulting and training services. Firebird cloud services running on Windows Server 2019 are available for purchase in the AWS, Azure and Google clouds, although support for the Google Cloud offering ended in August 2024.&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="5. SQLite"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;5. SQLite&lt;/h2&gt;
 &lt;p&gt;SQLite is a lightweight embedded RDBMS that runs inside applications. It was created in 2000 by computer analyst and programmer D. Richard Hipp while he was working as a government contractor supporting a U.S. Navy project, which needed a database that could run without a database administrator (&lt;a href="https://www.techtarget.com/searchdatamanagement/definition/database-administrator"&gt;DBA&lt;/a&gt;) in environments with minimal resources. Hipp continues to lead development of the software as project architect through Hipp, Wyrick &amp;amp; Company Inc., a software engineering firm commonly known as Hwaci.&lt;/p&gt;
 &lt;p&gt;As an embedded database, SQLite is self-contained, meaning it's fully functional within the application it powers. The software is a library that embeds a full-featured SQL database engine supporting ACID transactions. There are no separate database server processes. Data reads and writes are done directly to ordinary disk files, and a complete SQLite database that includes tables, indices, triggers and views can be contained in a single file.&lt;/p&gt;
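 &lt;p&gt;Because the engine ships as a library, using SQLite is as simple as opening a file. Python's standard library bundles it, so a minimal sketch of the serverless, transactional model needs no installation (the table and values below are invented for illustration):&lt;/p&gt;

```python
import sqlite3

# SQLite needs no server process: a "connection" simply opens an ordinary
# disk file (or, as here, an in-memory database for demonstration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensors (id INTEGER PRIMARY KEY, reading REAL)")

# Writes are ACID transactions; this block commits atomically on success.
with conn:
    conn.executemany("INSERT INTO sensors (reading) VALUES (?)",
                     [(20.5,), (21.1,), (19.8,)])

avg = conn.execute("SELECT AVG(reading) FROM sensors").fetchone()[0]
print(round(avg, 2))  # 20.47
conn.close()
```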
 &lt;p&gt;&lt;b&gt;Common use cases:&lt;/b&gt; SQLite is commonly used in mobile applications, web browsers and IoT devices due to its small footprint and ability to operate without a separate server process.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Licensing:&lt;/b&gt; The SQLite source code is in the public domain and is free to use, modify and distribute for any purpose without a license. Hwaci does sell a warranty of title with a perpetual right-to-use license to organizations that want one for legal reasons.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository:&lt;/b&gt; &lt;a target="_blank" href="https://sqlite.org/src/doc/trunk/README.md" rel="noopener"&gt;https://sqlite.org/src/doc/trunk/README.md&lt;/a&gt;&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Commercial support options:&lt;/b&gt; Hwaci provides paid technical support, maintenance and testing services, and it offers a set of proprietary extensions to SQLite that are sold under separate licenses. As with Firebird, SQLite database services are available on AWS, Azure and Google Cloud, though the underlying marketplace images vary by platform.&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="6. Apache Cassandra"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;6. Apache Cassandra&lt;/h2&gt;
 &lt;p&gt;The Cassandra wide-column store traces its roots back to 2007, when Facebook developed it to support a new inbox search feature. The NoSQL database was converted to open source in 2008 and became part of the Apache Software Foundation in 2009, initially as an incubator project before it was elevated to top-level project status the following year.&lt;/p&gt;
 &lt;p&gt;Cassandra is a fault-tolerant distributed database that can be used to store and manage large amounts of data across a cluster consisting of numerous commodity servers. The software &lt;a href="https://www.techtarget.com/searchdisasterrecovery/definition/data-replication"&gt;replicates data&lt;/a&gt; across multiple server nodes to avoid single points of failure, and it can be scaled dynamically by adding more servers to a cluster as processing demand increases. Cassandra is designed around distributed scalability and &lt;a href="https://www.techtarget.com/searchdatacenter/definition/high-availability"&gt;high availability&lt;/a&gt;, traditionally with eventual consistency tradeoffs, though Apache Cassandra 5.0 has added transaction capabilities that expand its support for transactional workloads.&lt;/p&gt;
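 &lt;p&gt;The partitioning idea behind that scalability can be sketched in a few lines of Python. This is a deliberate simplification: the hash, ring walk and node names are illustrative stand-ins for Cassandra's actual Murmur3 partitioner and virtual nodes.&lt;/p&gt;

```python
import hashlib

# Toy sketch of Cassandra-style partitioning and replication. The node
# names are invented, and real Cassandra uses Murmur3 tokens and vnodes.
NODES = ["node-a", "node-b", "node-c", "node-d"]
REPLICATION_FACTOR = 3

def replicas_for(partition_key):
    """Hash the key onto the ring, then walk it to pick distinct replicas."""
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    start = token % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

# Every key maps deterministically to three distinct nodes, so any single
# node can fail without losing data.
print(replicas_for("user:42"))
```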
 &lt;p&gt;&lt;b&gt;Common use cases:&lt;/b&gt; Cassandra is designed for uses that require fast performance, scalability and high availability. It's deployed for various applications, including inventory management, e-commerce, social media analytics, messaging systems and telecommunications, among others.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Licensing:&lt;/b&gt; The Cassandra software is covered by the Apache License 2.0.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository:&lt;/b&gt; &lt;a target="_blank" href="https://github.com/apache/cassandra/tree/trunk" rel="noopener"&gt;https://github.com/apache/cassandra/tree/trunk&lt;/a&gt;&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Commercial support options:&lt;/b&gt; Multiple vendors provide commercial support for Cassandra and DBaaS versions of the database, including DataStax, Aiven and Instaclustr. Amazon Keyspaces (for Apache Cassandra) and Azure Managed Instance for Apache Cassandra are also available as database services from AWS and Microsoft, respectively.&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="7. Apache CouchDB"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;7. Apache CouchDB&lt;/h2&gt;
 &lt;p&gt;CouchDB is a NoSQL document database that was first released in 2005 by software engineer Damien Katz and became an Apache project in 2008. The &lt;i&gt;Couch&lt;/i&gt; part of the name is an acronym for "cluster of unreliable commodity hardware," which stems from the project's original goal: to create a reliable database system that could run efficiently on ordinary hardware. CouchDB can be deployed on a single server node, but also as a single logical system across multiple nodes in a cluster, which can be scaled as needed by adding more servers.&lt;/p&gt;
 &lt;p&gt;The database uses JSON documents to store data and JavaScript as its query language. Other key features include support for MVCC and the ACID properties in individual documents, although an eventual consistency model is used for data stored on multiple database servers -- a tradeoff that prioritizes availability and performance over absolute data consistency. Data is synchronized across servers through an incremental replication feature that can be set up for bidirectional tasks and used to support mobile apps and other offline-first applications.&lt;/p&gt;
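 &lt;p&gt;A rough sketch of the document and view model, with Python dicts standing in for stored JSON documents (the IDs, revisions and values are invented for illustration):&lt;/p&gt;

```python
# Python dicts standing in for stored JSON documents; _id and _rev support
# CouchDB's MVCC model, where every update creates a new revision. The IDs
# and values here are invented for illustration.
docs = [
    {"_id": "order:1", "_rev": "1-abc", "type": "order", "total": 40},
    {"_id": "order:2", "_rev": "2-def", "type": "order", "total": 25},
    {"_id": "cust:1", "_rev": "1-xyz", "type": "customer", "name": "Ada"},
]

# Views are defined as JavaScript map functions stored on the server:
map_fn = "function (doc) { if (doc.type === 'order') emit(doc._id, doc.total); }"

# A rough Python equivalent of the rows that map function would emit:
view_rows = [(d["_id"], d["total"]) for d in docs if d["type"] == "order"]
print(view_rows)  # [('order:1', 40), ('order:2', 25)]
```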
 &lt;p&gt;&lt;b&gt;Common use cases:&lt;/b&gt; CouchDB is used for various purposes, including data analytics, time series data storage and mobile applications that require offline storage and functionality.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Licensing: &lt;/b&gt;CouchDB is licensed under the Apache License 2.0.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository:&lt;/b&gt; &lt;a target="_blank" href="https://github.com/apache/couchdb" rel="noopener"&gt;https://github.com/apache/couchdb&lt;/a&gt;&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Commercial support options: &lt;/b&gt;The IBM Cloudant cloud database is based on CouchDB with added open source technology that supports full-text search and geospatial indexing. Several other companies also offer support for CouchDB, including packaged instances in the AWS, Azure and Google clouds.&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="8. Neo4j"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;8. Neo4j&lt;/h2&gt;
 &lt;p&gt;Neo4j is a NoSQL graph database that's well-suited for representing and querying highly connected data sets. Neo4j uses a property graph database model consisting of nodes, which represent individual data entities, and relationships -- also referred to as edges -- that define how different nodes are organized and connected. Nodes and relationships can also include properties, or attributes, in the form of key-value pairs that further describe them.&lt;/p&gt;
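 &lt;p&gt;The property graph model can be sketched with a toy in-memory structure; the nodes, relationship and Cypher query below are invented examples, not Neo4j's internal representation:&lt;/p&gt;

```python
# Toy in-memory rendering of the property graph model: nodes and a
# relationship, each carrying key-value properties. This illustrates the
# model only; it is not how Neo4j stores data internally.
nodes = {
    1: {"labels": ["Person"], "props": {"name": "Ada"}},
    2: {"labels": ["Person"], "props": {"name": "Grace"}},
}
relationships = [
    {"start": 1, "end": 2, "type": "FOLLOWS", "props": {"since": 2021}},
]

# The same question expressed as a Cypher pattern match:
cypher = "MATCH (a:Person {name: 'Ada'})-[:FOLLOWS]->(b) RETURN b.name"

# A rough Python equivalent of what that query traverses:
followed = [
    nodes[r["end"]]["props"]["name"]
    for r in relationships
    if r["type"] == "FOLLOWS" and nodes[r["start"]]["props"]["name"] == "Ada"
]
print(followed)  # ['Grace']
```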
 &lt;p&gt;First released as open source software in 2007, Neo4j is overseen by database vendor Neo4j Inc. Originally a purely Java-based graph database, it has since been expanded to include additional capabilities, including &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Vector-search-now-a-critical-component-of-GenAI-development"&gt;vector search and data storage&lt;/a&gt;. Key features include full ACID compliance, horizontal scalability through an Autonomous Clustering architecture and the Cypher query language. Neo4j Inc. is aligning Cypher with GQL, the ISO graph query language published in April 2024 that uses syntax based on both SQL and Cypher, while continuing to support Cypher extensions not yet included in the standard.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Common use cases:&lt;/b&gt; Typical uses for Neo4j include social networking, recommendation engines, network and IT operations, fraud detection, and supply chain management, with &lt;a href="https://www.techtarget.com/searchenterpriseai/definition/generative-AI"&gt;generative AI&lt;/a&gt; applications also now supported through the vector search feature.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Licensing:&lt;/b&gt; Neo4j Community Edition is licensed under the GPL version 3. An open source version of Cypher named openCypher is also available under the Apache License 2.0.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository:&lt;/b&gt; &lt;a target="_blank" href="https://github.com/neo4j/neo4j" rel="noopener"&gt;https://github.com/neo4j/neo4j&lt;/a&gt;&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Commercial support options:&lt;/b&gt; Neo4j Inc. provides several supported commercial offerings, including Neo4j Enterprise Edition with added closed source components and the subscription-based Neo4j AuraDB cloud service.&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="9. Couchbase Server"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;9. Couchbase Server&lt;/h2&gt;
 &lt;p&gt;Couchbase Server is a NoSQL document database with multimodel capabilities for storing data both in JSON documents and as key-value pairs. The technology resulted from the 2011 merger of two open source database companies: CouchOne, which had been founded by CouchDB creator Damien Katz to offer systems based on that database, and Membase, which was set up to build a key-value store by developers of the memcached distributed caching technology. The combined company became Couchbase, leading to the development of Couchbase Server.&lt;/p&gt;
 &lt;p&gt;Despite their similar names and partly shared origins, Couchbase Server and CouchDB aren't directly related or compatible -- they're different database technologies with their own code and APIs. Couchbase Server supports strong consistency, distributed ACID transactions and SQL++, a SQL-like language for querying JSON data. It also includes vector and full-text search, plus a multidimensional scaling feature that enables different database functions to be isolated and independently scaled based on workload demands.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Common use cases:&lt;/b&gt; Couchbase Server is often used to support distributed application workloads and for mobile, edge and IoT applications.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Licensing:&lt;/b&gt; Originally available under the Apache License 2.0, Couchbase Server was switched in 2021 to the Business Source License (BSL) 1.1, a source-available license that restricts commercial use of the software by other vendors. Database releases are converted back to the Apache open source license four years after they become available.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository:&lt;/b&gt; &lt;a target="_blank" href="https://github.com/couchbase/manifest" rel="noopener"&gt;https://github.com/couchbase/manifest&lt;/a&gt;&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Commercial support options:&lt;/b&gt; Couchbase offers an enterprise edition of Couchbase Server for cloud and on-premises deployments, as well as a mobile version of the database and a fully managed DBaaS technology named Couchbase Capella.&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="10. MongoDB"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;10. MongoDB&lt;/h2&gt;
 &lt;p&gt;&lt;a href="https://www.techtarget.com/searchdatamanagement/definition/MongoDB"&gt;MongoDB&lt;/a&gt; is another NoSQL document database that was initially developed as open source software and is now a source-available technology. First released in 2009, MongoDB stores data in a JSON-like document format called BSON, which is short for Binary JSON. As the name indicates, BSON encodes data in a binary structure that's designed to support more data types and faster indexing and querying performance than JSON.&lt;/p&gt;
 &lt;p&gt;The database is often seen as an attractive option for developers who want to build applications without the constraints of a fixed schema. In addition to its document data model, MongoDB includes native support for graph, geospatial and time series data. MongoDB Atlas, a cloud database service offered by lead developer MongoDB Inc., also provides vector and full-text search features that can be used free of charge for development and testing in local environments. Other key features in MongoDB include multi-document ACID transactions, sharding for horizontal scalability and automatic load balancing.&lt;/p&gt;
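 &lt;p&gt;That schema flexibility can be illustrated with a short sketch in which plain dicts stand in for stored documents and a helper mimics a simple equality filter -- a hypothetical simplification of MongoDB's query language, not the pymongo API:&lt;/p&gt;

```python
# Plain Python dicts stand in for stored BSON documents. Note the documents
# need not share a schema. find() below mimics a simple equality filter --
# a hypothetical simplification of MongoDB's query language, not pymongo.
collection = [
    {"_id": 1, "sku": "A12", "price": 9.5, "tags": ["sale"]},
    {"_id": 2, "sku": "B07", "price": 12.0},             # no "tags" field
    {"_id": 3, "sku": "A19", "price": 9.5, "stock": 4},  # extra field instead
]

def find(docs, query):
    """Return documents whose fields equal every key-value pair in query."""
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]

print([d["sku"] for d in find(collection, {"price": 9.5})])  # ['A12', 'A19']
```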
 &lt;p&gt;&lt;b&gt;Common use cases:&lt;/b&gt; MongoDB is widely deployed for uses that include AI, edge computing, IoT, mobile, payment and gaming applications, as well as website personalization, content management and product catalogs.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Licensing:&lt;/b&gt; Since 2018, new versions of MongoDB Community Server and patches for previous releases have been made available under the Server Side Public License (SSPL) Version 1, a source-available license created by MongoDB Inc.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository:&lt;/b&gt; &lt;a target="_blank" href="https://github.com/mongodb/mongo" rel="noopener"&gt;https://github.com/mongodb/mongo&lt;/a&gt;&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Commercial support options: &lt;/b&gt;In addition to MongoDB Atlas, MongoDB Inc. offers the self-managed MongoDB Enterprise Server, which provides additional capabilities beyond the community edition. MongoDB support and managed services are also available from vendors such as Datavail and Percona. Amazon DocumentDB (with MongoDB compatibility) is a fully managed DBaaS offering from AWS that supports MongoDB versions 6.0, 7.0 and 8.0 as of November 2025.&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="11. Redis"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;11. Redis&lt;/h2&gt;
 &lt;p&gt;Redis is a NoSQL in-memory database that was converted to a source-available technology in March 2024. The Redis project was created in 2009 by software programmer Salvatore Sanfilippo, known by the nickname “antirez,” to help solve a database scaling problem with a real-time website log analysis tool. Short for Remote Dictionary Server, Redis originally was positioned as software that provided a key-value data store as a caching technology to accelerate existing databases and application workloads.&lt;/p&gt;
 &lt;p&gt;The database caching functionality remains the foundation of Redis, with features that include built-in replication, on-disk data persistence and support for complex data types. But the platform has been expanded to include additional capabilities, such as support for &lt;a href="https://www.techtarget.com/searchdatamanagement/news/252514648/Redis-launches-JSON-database-capabilities-with-RedisJSON-20"&gt;storing JSON documents &lt;/a&gt;and both vector and time series data. A graph database module was also added, but lead developer Redis Inc. stopped developing it in 2023.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Common use cases:&lt;/b&gt; While Redis can be used as a full database, one of its most common uses is still as a database query caching layer. It's also often used to support real-time notifications through an integrated pub/sub capability and as a session store to help manage user sessions for web and mobile applications.&lt;/p&gt;
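 &lt;p&gt;The cache-aside pattern behind that query caching use case can be sketched as follows; a plain dict stands in for a Redis client, and the key format and TTL are illustrative assumptions:&lt;/p&gt;

```python
import time

# Cache-aside sketch: a dict holding (expiry, value) pairs stands in for a
# Redis client. The key format and TTL are illustrative assumptions; in
# production these would be GET/SETEX calls against a Redis server.
cache = {}
TTL_SECONDS = 300

def slow_database_query(user_id):
    return "profile-of-" + user_id  # placeholder for the real query

def get_profile(user_id):
    key = "profile:" + user_id
    entry = cache.get(key)
    if entry and entry[0] > time.time():      # cache hit, not expired
        return entry[1]
    value = slow_database_query(user_id)      # cache miss: query the database
    cache[key] = (time.time() + TTL_SECONDS, value)
    return value

print(get_profile("42"))  # queries the database and fills the cache
print(get_profile("42"))  # served from the cache until the TTL expires
```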
 &lt;p&gt;&lt;b&gt;Licensing:&lt;/b&gt; As of March 2026, Redis licensing now varies by version. Redis 8 is available under a tri-license of the Redis Source Available License 2.0, SSPL v1 and AGPLv3. Earlier Redis Community Edition releases from 7.4 through 7.8 were dual-licensed under RSALv2 and SSPLv1, while Redis 7.2 and earlier were released under the BSD 3-Clause license.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository:&lt;/b&gt; &lt;a target="_blank" href="https://github.com/redis/redis" rel="noopener"&gt;https://github.com/redis/redis&lt;/a&gt;&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Commercial support options:&lt;/b&gt; Redis offers Redis Software for self-managed deployments and Redis Cloud as its fully managed service. Microsoft offers Azure Cache for Redis as a managed service, but has announced a retirement timeline for all SKUs by 2028 and recommends migration to Azure Managed Redis. Redis managed services are also available from Aiven and Instaclustr, while AWS, Google and Oracle offer cloud services with Redis compatibility.&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="12. CockroachDB"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;12. CockroachDB&lt;/h2&gt;
 &lt;p&gt;CockroachDB is a source-available distributed SQL database loosely inspired by Google's proprietary Spanner database. Developed primarily by vendor Cockroach Labs, CockroachDB was first released in 2015, with an initial production version appearing two years later. Like the insect it's named after, CockroachDB is designed to be hard to kill. The cloud-native database is built to be a fault-tolerant, resilient and consistent &lt;a href="https://www.techtarget.com/searchdatamanagement/definition/data-management"&gt;data management&lt;/a&gt; platform.&lt;/p&gt;
 &lt;p&gt;CockroachDB scales horizontally and can survive various types of equipment failures with minimal disruptions to users and no manual intervention required by DBAs, according to its developers. Key features include automated repair and recovery, support for ACID transactions with strong consistency, a SQL API and geo-partitioning of data to boost application performance. It also has a "multi-active availability" model that enables users to read and write data from any cluster node with no conflicts.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Common use cases: &lt;/b&gt;CockroachDB is well suited for high-volume OLTP applications and distributed database deployments across multiple data centers and geographic regions.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Licensing:&lt;/b&gt; Since 2019, most of CockroachDB's core features have been licensed under a version of the BSL that requires other vendors to purchase a license from Cockroach Labs if they want to offer a commercial database service. Other core features are covered by the Cockroach Community License (CCL), which allows source code to be viewed and modified but not reused without an agreement with Cockroach Labs. The features licensed under the BSL convert to the Apache License 2.0 and become open source three years after a new database release, a change that doesn't apply to the CCL-licensed features.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Source code repository:&lt;/b&gt; &lt;a target="_blank" href="https://github.com/cockroachdb/cockroach" rel="noopener"&gt;https://github.com/cockroachdb/cockroach&lt;/a&gt;&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Commercial support options:&lt;/b&gt; Cockroach Labs provides technical support and additional paid enterprise features that are available in both self-managed and DBaaS deployments.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Editor's note:&lt;/b&gt; This article was originally published in May 2024 and republished in March 2026 to include product updates and improve the reader experience.&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Sean Michael Kerner is an IT consultant, technology enthusiast and tinkerer. He has pulled Token Ring, configured NetWare and been known to compile his own Linux kernel. He consults with industry and media organizations on technology issues.&lt;/i&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Open source databases offer viable alternatives to proprietary systems. This guide offers 12 popular options across relational, NoSQL and source-available technologies.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/code_g1255337870.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/feature/Top-open-source-databases-to-consider</link>
            <pubDate>Wed, 18 Mar 2026 09:00:00 GMT</pubDate>
            <title>12 top open source databases for enterprise use</title>
        </item>
        <item>
<body>&lt;p&gt;A one‑size‑fits‑all approach to AI models in data-driven applications risks leaving enterprises paying more, waiting longer for analytics results and taking on avoidable business risk.&lt;/p&gt; 
&lt;p&gt;For data and analytics leaders, the era of "bigger is better" in AI applications is giving way to right-sizing strategies that assign small language models (SLMs) to narrow, repeatable tasks and reserve large language models (LLMs) for complex reasoning workloads. The &lt;a href="https://www.techtarget.com/searchenterpriseai/tip/Why-small-language-models-are-on-the-rise"&gt;advantages of SLMs&lt;/a&gt; include easier governance, lower run-rate costs and the ability to run on-premises, which is particularly attractive to regulated industries where compliance and data privacy are paramount. Analyst outlooks and vendor guidance show organizations increasingly moving toward this mixed-model configuration.&lt;/p&gt; 
&lt;p&gt;"There is a space for smaller models to be utilized more," said FTI Consulting managing director and chief data scientist Dimitris Korres. He noted that organizations "can use a mix of large and small models to reduce costs and get better, faster outputs for certain tasks."&lt;/p&gt; 
&lt;div class="youtube-iframe-container"&gt;
 &lt;iframe id="ytplayer-0" src="https://www.youtube.com/embed/AlwWuSor_M4?autoplay=0&amp;amp;modestbranding=1&amp;amp;rel=0&amp;amp;widget_referrer=null&amp;amp;enablejsapi=1&amp;amp;origin=https://www.techtarget.com" type="text/html" height="360" width="640" frameborder="0"&gt;&lt;/iframe&gt;
&lt;/div&gt; 
&lt;section class="section main-article-chapter" data-menu-title="Small models are on the rise"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Small models are on the rise&lt;/h2&gt;
 &lt;p&gt;This &lt;a href="https://www.techtarget.com/searchenterpriseai/opinion/Small-language-models-are-taking-the-spotlight"&gt;move to rightsize models&lt;/a&gt; to make them fit for purpose rather than simply using the largest model emerged as a viable strategy in 2024. It got increased attention when Nvidia researchers published a September 2025 paper titled &lt;i&gt;Small Language Models are the Future of Agentic AI.&lt;/i&gt;&lt;/p&gt;
 &lt;p&gt;In it, they argued that SLMs are "sufficiently powerful, inherently more suitable, and necessarily more economical" for many agentic AI workloads and recommended the use of heterogeneous agents that send low-level tasks to an SLM and only escalate more advanced work to an LLM.&lt;/p&gt;
 &lt;p&gt;Interest in SLMs is still relatively limited, as most organizations remain focused on how to use the dominant LLMs to transform their workflows, products and services.&lt;/p&gt;
 &lt;p&gt;Yet, there is evidence of increasing use of SLMs, signifying a strategic shift toward task‑specific models. Gartner projects that by 2027, organizations will use SLMs three times more than general‑purpose LLMs. A 2025 SNS Insider report valued the global SLM market at $7.9 billion in 2023, predicting it will reach $29.64 billion by 2032.&lt;/p&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="Map tasks to the model size"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Map tasks to the model size&lt;/h2&gt;
 &lt;p&gt;LLMs have tens of billions to hundreds of billions of parameters. This scale makes LLMs valuable, enabling them to work through &lt;a href="https://www.techtarget.com/searchenterpriseai/tip/How-to-evaluate-LLMs-for-enterprise-use-cases"&gt;large numbers of complex, broad tasks&lt;/a&gt;. But LLMs require massive amounts of compute power, making them resource-intensive. Due to their scale, many enterprises use LLMs as an endpoint from a cloud provider rather than deploying them internally.&lt;/p&gt;
 &lt;p&gt;SLMs typically max out at 10 billion parameters and usually have far fewer -- in the millions in many cases. Their smaller size often enables SLMs to return results faster and makes them more feasible to deploy on-premises or in &lt;a href="https://www.techtarget.com/searchcio/tip/Top-edge-computing-trends-to-watch-in-2020"&gt;edge computing environments&lt;/a&gt; for specialized workloads.&lt;/p&gt;
 &lt;p&gt;SLMs have several advantages compared to LLMs:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;&lt;b&gt;Fit for purpose.&lt;/b&gt; Due to their size and speed, SLMs are ideal for simpler work, such as document sorting and summarizing meeting notes.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Better responsiveness.&lt;/b&gt; Smaller models can return answers faster on focused tasks because they require fewer computations.&lt;/li&gt; 
   &lt;li&gt;&lt;b&gt;Lower costs.&lt;/b&gt; Nvidia claims AI inference can be 10 to 30 times less expensive with SLMs than with LLMs if the hardware is fully utilized. Using SLMs can also help avoid the higher energy consumption and costs associated with LLMs.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Private by design.&lt;/b&gt; SLMs are especially suited for working with sensitive data -- for example, processing legal contracts, medical records or financial data -- because they can run on in-house servers or edge devices. This keeps the data within the organization's environment rather than in a public cloud.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Consistent results.&lt;/b&gt; SLMs are generally easier to adjust for predictable outputs.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;For data leaders, the takeaway is simple: Start small and then move up to an LLM when needed. This approach helps companies maintain service quality while keeping runtime costs predictable. For high‑volume automation, using SLMs provides steadier spending than running large, cloud‑dependent models.&lt;/p&gt;
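 &lt;p&gt;The start-small approach can be sketched as a simple router that escalates only when needed; the model names, task types and heuristic below are hypothetical assumptions, not any vendor's actual API:&lt;/p&gt;

```python
# Hypothetical router sketch: send routine tasks to a small on-premises
# model and escalate complex reasoning to a large cloud model. The model
# names, task types and heuristic are invented assumptions, not any
# vendor's actual API.
SIMPLE_TASKS = {"classify", "extract", "summarize"}

def route(task_type, needs_reasoning):
    if task_type in SIMPLE_TASKS and not needs_reasoning:
        return "slm-on-prem"        # cheap, fast, data stays local
    return "llm-cloud-endpoint"     # reserved for complex reasoning

print(route("classify", needs_reasoning=False))  # slm-on-prem
print(route("plan", needs_reasoning=True))       # llm-cloud-endpoint
```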
 &lt;p&gt;Such factors have prompted Zach Rossmiller, CIO at the University of Montana, to consider where smaller models might be a better fit than LLMs.&lt;/p&gt;
 &lt;p&gt;"It comes down to use cases for us," he said.&lt;/p&gt;
 &lt;p&gt;He has looked at using SLMs for workloads that &lt;a href="https://www.techtarget.com/searchenterpriseai/tip/Address-top-AI-privacy-concerns-with-this-checklist"&gt;keep sensitive data protected&lt;/a&gt; in on-premises systems and to support a possible digital tutoring service running on a Raspberry Pi or other end-user devices for students in rural areas that lack reliable high-speed internet access.&lt;/p&gt;
 &lt;p&gt;Rossmiller said SLMs seem like a good choice for specific use cases like those, where factors such as data privacy and connectivity concerns may prohibit the use of LLMs.&lt;/p&gt;
&lt;/section&gt;          
&lt;section class="section main-article-chapter" data-menu-title="Too many small models might lead to slowdowns"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Too many small models might lead to slowdowns&lt;/h2&gt;
 &lt;p&gt;Holger Mueller, principal analyst and vice president at Constellation Research, is more skeptical about SLMs. He doesn't see much value in them, saying that users will quickly reach their limits and ask, "That's it? That's all it can do?"&lt;/p&gt;
 &lt;p&gt;Mueller said relying on too many SLMs will also push users into an "integrator" role in which they must find ways to coordinate and integrate multiple small models, which could erase the savings in cost and time. He added that another potential disadvantage is the need for specific &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Experts-share-practices-to-overcome-AI-data-readiness"&gt;training on company data for the SLMs to be effective&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;SLMs only seem to be advantageous in a handful of rare circumstances where the scenarios are narrow in scope and remain that way, Mueller said.&lt;/p&gt;
 &lt;p&gt;However, he said if more LLM developers start adding small models to their offerings, "that would be something to pay attention to."&lt;/p&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="Consider the architectural realities for a mixed-model platform"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Consider the architectural realities for a mixed-model platform&lt;/h2&gt;
 &lt;p&gt;While SLMs can run on-premises, their usefulness depends on several factors.&lt;/p&gt;
 &lt;p&gt;Most organizations will continue to rely on LLMs, which require data platforms to support secure, high-speed transfers to cloud AI platforms. However, adding task-focused SLMs might require additional investments to build a distributed, hybrid architecture that keeps sensitive data where it already resides to &lt;a href="https://www.techtarget.com/searchenterpriseai/tip/How-to-navigate-data-sovereignty-for-AI-compliance"&gt;meet compliance and privacy requirements&lt;/a&gt; and lower data transfer costs.&lt;/p&gt;
 &lt;p&gt;Nvidia's researchers said enterprises might need to tailor their ecosystems toward AI agents that can automatically route tasks to the right model.&lt;/p&gt;
 &lt;p&gt;"Cloud is always an option, but there are also self-hosting endpoints that have ridiculously low costs," said Korres. "And if a project requires more control, or isolation of the data flow or data residency, then smaller models can definitely be hosted on local devices."&lt;/p&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="On-premises models offer privacy and predictability"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;On-premises models offer privacy and predictability&lt;/h2&gt;
 &lt;p&gt;Because they are smaller and more focused, SLMs can be easier to manage from a governance, risk and compliance perspective, especially if they run on-premises or in a private cloud. The narrow scope of SLMs helps with transparency, monitoring and auditability, reducing the risk of data exposure compared to large, public cloud models.&lt;/p&gt;
 &lt;p&gt;Furthermore, SLMs running locally can improve reproducibility, meaning "that anyone else can follow the same steps and get the same result," said Jonathan Chang, assistant professor of computer science at Harvey Mudd College. In contrast, reproducibility is not guaranteed with models hosted in public clouds.&lt;/p&gt;
 &lt;p&gt;Korres also noted that because SLMs are often used for narrowly defined tasks, organizations can more easily &lt;a href="https://www.techtarget.com/searchsoftwarequality/tip/Benchmarking-LLMs-A-guide-to-AI-model-evaluation"&gt;evaluate their performance than when using a general-purpose LLM&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;However, some risks are similar regardless of the model size.&lt;/p&gt;
 &lt;p&gt;"Models small and large will hallucinate. They can produce outputs that could be misleading. They could leak training data that could mean privacy concerns," said Chang.&lt;/p&gt;
 &lt;p&gt;&lt;iframe title="SLMs vs. LLMs: At a glance" aria-label="Table" id="datawrapper-chart-XGxUf" src="https://datawrapper.dwcdn.net/XGxUf/2/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="638" data-external="1"&gt;&lt;/iframe&gt;&lt;/p&gt;
 &lt;p&gt; &lt;script type="text/javascript"&gt;window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});&lt;/script&gt; &lt;/p&gt;
&lt;/section&gt;        
&lt;section class="section main-article-chapter" data-menu-title="Decision rules for cost vs. performance"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Decision rules for cost vs. performance&lt;/h2&gt;
 &lt;p&gt;As many enterprise technology executives are finding, the costs associated with LLM use can be high and unpredictable, Chang said.&lt;/p&gt;
 &lt;p&gt;Additionally, he pointed out the environmental impact of LLMs.&lt;/p&gt;
 &lt;p&gt;"Bigger models require more resources to run. They take more energy, so they emit more carbon," Chang said.&lt;/p&gt;
 &lt;p&gt;By contrast, SLMs provide lower, more predictable operational costs with less impact on the environment, he said. However, gaining those advantages requires having the &lt;a href="https://www.techtarget.com/searchstorage/opinion/IT-leaders-face-data-infrastructure-gaps-as-AI-workloads-grow"&gt;right infrastructure to run the models&lt;/a&gt;. The financial benefits of running SLMs locally might dissipate if the organization needs to set up a new data center or make other IT investments.&lt;/p&gt;
 &lt;p&gt;Enterprises need to fully consider the strategic use of SLMs to make sure the model choice fits the task.&lt;/p&gt;
 &lt;p&gt;"If it is a well-defined task and it is narrow enough, I see more benefits than risks to using a smaller model," said Korres. &lt;b&gt;"&lt;/b&gt;But if the function is quite general, or you need something that can manage agents or define things on the fly, that's where larger models make a difference."&lt;/p&gt;
 &lt;p&gt;Other experts had similar assessments, noting that the workload should &lt;a target="_blank" href="https://mitsloan.mit.edu/ideas-made-to-matter/looking-ahead-ai-and-work-2026" rel="noopener"&gt;determine&lt;/a&gt; whether to go with a small or large model.&lt;/p&gt;
 &lt;p&gt;"If you need a gigantic amount of data, then an LLM is the right answer," said Chang. "But if you're finding that most of your friction is with individual employees needing to automate small tasks, or you're dealing with sensitive data and want full control from end to end, or you care about reproducibility, then small models are a feasible option."&lt;/p&gt;
 &lt;p&gt;&lt;em&gt;Mary K. Pratt is an award-winning freelance journalist with a focus on covering enterprise IT and cybersecurity management.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>It doesn't need to be a binary choice. Enterprises are warming up to smaller AI models to meet compliance and cost needs while reserving large models for complex jobs.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/strategy_g1227074505.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/feature/SLM-vs-LLM-Rightsize-data-architecture-to-optimize-AI-use</link>
            <pubDate>Tue, 17 Mar 2026 09:38:00 GMT</pubDate>
            <title>SLM vs. LLM: Rightsize data architecture to optimize AI use</title>
        </item>
        <item>
            <body>&lt;p&gt;AI is in production across the enterprise, but have you adapted your data governance practices to keep pace?&lt;/p&gt; 
&lt;p&gt;As business units chase their own priorities, data remains scattered and inconsistent, heightening the risk of faulty AI output and poor decisions. The fix is not another dashboard but &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/AI-data-governance-guidance-that-gets-you-to-the-finish-line"&gt;a federated data and AI governance model&lt;/a&gt; that establishes ownership, checks accuracy and monitors data quality and lineage in real time. By implementing a combination of technological and organizational changes tailored to use AI at scale, companies can move faster without sacrificing security or compliance.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="Why production AI raises the bar on data governance"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Why production AI raises the bar on data governance&lt;/h2&gt;
 &lt;p&gt;Today, AI has become something organizations depend on, with &lt;a href="https://www.techtarget.com/searchcio/feature/Pillars-of-an-agentic-AI-strategy"&gt;AI agents embedded in workflows&lt;/a&gt; to support task coordination and faster decision-making. The larger the organization, the larger the challenge, as operating companies, business units and departments have their own priorities and data practices. This makes proactive AI data governance a must.&lt;/p&gt;
 &lt;p&gt;For example, say you want to answer a seemingly simple question like, "Who are my most profitable customers?" To do so, you need to gather data from across the enterprise on customers, their purchases and the cost of marketing and selling to them -- a challenge in complex multinational organizations where that data is spread across many units. Rather than hunt for insights in charts, business users can rely on conversational analytics and agentic BI to fetch the information they need and explain the data's context.&lt;/p&gt;
 &lt;p&gt;The landscape has shifted in recent years because organizations can no longer completely trust data or assume a person generated it. Enterprises face new risks if their AI tools, trained on outdated or inaccurate data, produce faulty answers. To combat this issue, many are adopting a zero-trust approach to AI data governance to establish authentication and verification measures.&lt;/p&gt;
 &lt;p&gt;It is estimated that unstructured data typically &lt;a href="https://www.techtarget.com/searchbusinessanalytics/feature/Ng-Biggest-benefit-of-AI-may-be-unlocking-unstructured-data"&gt;takes up about 80% of the enterprise's information assets&lt;/a&gt; and grows about four times faster than structured data. It requires significant effort to align data collection, creation, classification, formatting and use across organizational boundaries. As a result, modern governance tools have evolved into the control plane for handling data at this scale.&lt;/p&gt;
 &lt;div class="imagecaption alignLeft"&gt;
  &lt;img src="https://www.techtarget.com/rms/onlineImages/data_management-need_to_govern_data-f.png" alt="An educational graphic titled "&gt;The key benefits of data governance in organizations include improved data quality, regulatory compliance and more accurate decision-making.
 &lt;/div&gt;
&lt;/section&gt;      
&lt;section class="section main-article-chapter" data-menu-title="Build a federated governance strategy that scales"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Build a federated governance strategy that scales&lt;/h2&gt;
 &lt;p&gt;Data governance is no longer just an IT problem. Today, it's being led by the C-suite in a federated model that expands data governance to include AI governance, covering model inventory, risk assessment and post-deployment monitoring, rather than treating AI as a separate track.&lt;/p&gt;
 &lt;p&gt;Gartner predicts that by 2028, about half of organizations will adopt zero-trust data governance and recommends appointing a dedicated AI governance leader to oversee policies, AI risk management and compliance with data and analytics teams.&lt;/p&gt;
 &lt;p&gt;In the federated model, a steering committee of senior executives and data owners sets policy and resolves definition disputes. This board can require business units to align their systems and processes to approved standards. Distributed data stewards embed the policies into daily workflows.&lt;/p&gt;
 &lt;p&gt;Modern data governance also depends on the platform foundation. As organizations break down operational and analytical silos, the &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Building-a-strong-data-analytics-platform-architecture"&gt;data lakehouse has emerged as the single governed foundation&lt;/a&gt; for analytics and AI, reducing copies and data fragmentation. When paired with active metadata management, the data governance stack alerts teams when data requires recertification or updating.&lt;/p&gt;
 &lt;p&gt;Once that structure is in place, the question becomes how to realize the benefits across a collaborative, AI-driven organization.&lt;/p&gt;
 &lt;p&gt;Successful governance programs also have a clear mission statement, a business case tied to AI operational readiness, training that improves data and AI fluency, and a process for communicating progress and results. The steps outlined here will put organizations on the path to better data governance and more reliable AI.&lt;/p&gt;
 &lt;div class="youtube-iframe-container"&gt;
  &lt;iframe id="ytplayer-0" src="https://www.youtube.com/embed/BqdPuwvwPk4?autoplay=0&amp;amp;modestbranding=1&amp;amp;rel=0&amp;amp;widget_referrer=null&amp;amp;enablejsapi=1&amp;amp;origin=https://www.techtarget.com" type="text/html" height="360" width="640" frameborder="0"&gt;&lt;/iframe&gt;
 &lt;/div&gt;
&lt;/section&gt;        
&lt;section class="section main-article-chapter" data-menu-title="Why the federated governance model pays off"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Why the federated governance model pays off&lt;/h2&gt;
 &lt;p&gt;Here are the key benefits a &lt;a href="https://www.techtarget.com/searchdatamanagement/definition/data-governance"&gt;successful data governance program&lt;/a&gt; can produce in an organization.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;1. Greater efficiency. &lt;/b&gt;With access to well-governed data, AI agents improve operational efficiency across many areas. They take on manual work, such as schema mapping and duplicate merging, giving teams more time to weed out underperforming product lines and invest more in those that show promise. Analyzing business processes can reveal opportunities to improve them -- but only if the data underlying those processes is reliable. Gartner predicts that through 2027, 60% of organizations will fail to realize the full value of AI because they lack cohesive data governance.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;2. Better data quality.&lt;/b&gt; Despite significant IT spending, data quality problems persist. A 2026 LeBow College of Business study cites ongoing data-integrity gaps, saying better quality was a priority for 51% of data and analytics leaders. In the same study, 43% said data readiness was the biggest obstacle to AI initiatives. &lt;a href="https://www.techtarget.com/searchenterpriseai/feature/9-data-quality-issues-that-can-sideline-AI-projects"&gt;Improved data quality reduces AI errors&lt;/a&gt; and protects the organization in the event of a breach. An IBM study from 2025 found that the average cost of a data breach in the U.S. was $10.22 million. Well-cataloged data reduces the exposure of sensitive information and strengthens security by making it easier to run discovery, classification, access control and encryption.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;3. Better compliance. &lt;/b&gt;The EU AI Act sets obligations on AI systems, while GDPR and California's &lt;a href="https://www.techtarget.com/searchsecurity/tip/State-of-data-privacy-laws"&gt;CCPA/CPRA regulate&lt;/a&gt; the collection, use and auditing of personal data, imposing additional compliance requirements across sectors. About 20 U.S. states have enacted comprehensive privacy laws. European regulators imposed €1.2 billion in GDPR fines in 2025. With sums of this size, organizations need accurate, auditable reporting and governance-backed security and privacy controls to reduce the risk of fines and legal action.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;4. Better decision-making.&lt;/b&gt; Sound data gives executives and their teams confidence to make better business decisions on price adjustments, product strategy, customer service and other aspects of operations. This &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Metadata-management-standards-examples-that-guide-success"&gt;depends heavily on metadata management&lt;/a&gt; -- the catalogs that handle governance and notify users when data requires correction -- to ensure accurate data for strategic planning, business intelligence and advanced analytics.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;5. Improved business performance.&lt;/b&gt; Ultimately, the benefits described above should lead to higher revenue and profits as companies rewire their operating models to take advantage of AI capabilities. LeBow's 2026 study &lt;a target="_blank" href="https://www.lebow.drexel.edu/sites/default/files/2026-01/lebow-precisely-state-data-integrity-ai-readiness-2026.pdf" rel="noopener"&gt;found&lt;/a&gt; higher data trust in organizations with governance programs than in those without, at 71% vs. 50%. Business leaders now treat data and AI literacy as a basic requirement and push for integrating data and AI governance to quicken decision-making and improve business performance.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;6. Enhanced business reputation.&lt;/b&gt; In addition to tangible financial gains, effective data governance produces high-quality data, which fosters better customer interactions and drives higher satisfaction and loyalty.&lt;/p&gt;
 &lt;p&gt;&lt;b&gt;Editor's note:&lt;/b&gt; &lt;i&gt;TechTarget editors updated this article in March 2026 for timeliness and to add new information.&lt;/i&gt;&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Tom Walat is an editor and reporter for TechTarget, where he covers data technologies.&lt;/i&gt;&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Andy Hayler is an independent analyst on enterprise data management strategy.&lt;/i&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Follow this practical blueprint to adopt a modern data governance approach that aligns people, processes and platform to deliver measurable results from AI across the business.</description>
            <image>https://cdn.ttgtmedia.com/visuals/searchContentManagement/governance_strategy/contentmanagement_article_009.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/tip/5-benefits-of-building-a-strong-data-governance-strategy</link>
            <pubDate>Tue, 10 Mar 2026 09:15:00 GMT</pubDate>
            <title>Build trust on a federated governance model</title>
        </item>
        <item>
            <body>&lt;p&gt;When data sprawls across various repositories, &lt;a href="https://www.techtarget.com/searchdatamanagement/definition/data-management"&gt;data management&lt;/a&gt; becomes more challenging. Analytics and AI applications are also less effective if data scientists and other end users can't find relevant data or understand its business context. In many cases, "organizations are drowning in data yet starving for insights," said Priya Iragavarapu, managing director of the AI practice at consulting firm AArete.&lt;/p&gt; 
&lt;p&gt;Data catalogs provide a unified inventory of enterprise data assets, making them more manageable, accessible and understandable. Data management teams can use a wide range of tools to &lt;a href="https://www.techtarget.com/searchdatamanagement/answer/What-steps-are-key-to-building-a-data-catalog"&gt;build and manage catalogs&lt;/a&gt;. Data catalog tools collect metadata from various data sources and use it to organize, classify and enrich data entries. They're commonly integrated with &lt;a href="https://www.techtarget.com/searchdatamanagement/definition/data-governance"&gt;data governance&lt;/a&gt; software to help organizations manage data quality, data use and regulatory compliance.&lt;/p&gt; 
&lt;p&gt;Increasingly, data catalog software also incorporates generative AI (GenAI), machine learning (ML) and other AI technologies to streamline catalog development and use. For example, early data catalogs required custom scripts to crawl data and harvest metadata, but modern tools do so automatically. AI assistants and agents handle cataloging tasks and help end users find data.&lt;/p&gt; 
&lt;p&gt;To help inform product evaluations by data leaders, the following are 15 notable data catalog tools, listed in alphabetical order with details on their key features and capabilities. TechTarget editors compiled the list based on research of available technologies, as well as market reports and vendor rankings from Forrester Research and Gartner.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="1. Alation Data Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;1. Alation Data Catalog&lt;/h2&gt;
 &lt;p&gt;Alation Data Catalog uses AI, ML, automation and natural language processing to simplify data discovery, create business glossaries and power its core Behavioral Analysis Engine. The engine generates popularity rankings, usage recommendations and other insights about data sets. It also analyzes data usage patterns to help streamline &lt;a href="https://www.techtarget.com/searchdatamanagement/definition/data-stewardship"&gt;data stewardship&lt;/a&gt;, data governance and query optimization processes.&lt;/p&gt;
 &lt;p&gt;Allie AI is an AI copilot that documents new data assets, recommends metadata descriptions and identifies potential data stewards. Alation Data Catalog includes a set of prebuilt analytics dashboards with customizable reporting, and collaboration features enable users to create wiki articles and searchable conversations in data catalogs. The tool is part of the broader Alation Agentic Data Intelligence Platform, which also offers data governance, &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/How-data-lineage-tools-boost-data-governance-policies"&gt;data lineage&lt;/a&gt; and data product marketplace applications.&lt;/p&gt;
 &lt;p&gt;Other key features in Alation Data Catalog include the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Capabilities for flagging data health issues and defining enterprise &lt;a href="https://www.techtarget.com/searchcio/definition/data-governance-policy"&gt;data governance policies&lt;/a&gt;.&lt;/li&gt; 
  &lt;li&gt;More than 120 connectors to data sources, plus an Open Connector Framework SDK for building custom ones.&lt;/li&gt; 
  &lt;li&gt;A SQL editor for creating data queries that can be published in catalogs for sharing and reuse.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="2. Alex Augmented Data Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;2. Alex Augmented Data Catalog&lt;/h2&gt;
 &lt;p&gt;Alex Augmented Data Catalog provides various automation, AI and ML capabilities to support catalog creation and metadata management. Developed by Alex Solutions, the software automates data discovery and cataloging. It includes built-in features for data profiling, lineage tracking and metadata enrichment, plus a set of AI agents for tasks such as &lt;a href="https://www.techtarget.com/searchsecurity/tip/How-to-write-a-data-classification-policy-with-template"&gt;data classification&lt;/a&gt; and anomaly detection in data sets.&lt;/p&gt;
 &lt;p&gt;The data catalog tool also automates aspects of data governance and &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Proactive-practices-for-data-quality-improvement"&gt;data quality processes&lt;/a&gt;. Data governance managers can use it to create policies, assign data stewards and monitor compliance with internal policies and regulatory requirements. The software automatically identifies data quality issues, and data stewards can analyze their potential impact on business workflows and alert data owners about necessary fixes.&lt;/p&gt;
 &lt;p&gt;Alex Augmented Data Catalog also provides the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Google-like natural language search and query capabilities.&lt;/li&gt; 
  &lt;li&gt;Plug-and-play metadata connectors to various data sources.&lt;/li&gt; 
  &lt;li&gt;A no-code ontology for classifying and organizing data based on business terminology, processes and objectives.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="3. Ataccama Data Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;3. Ataccama Data Catalog&lt;/h2&gt;
 &lt;p&gt;Ataccama Data Catalog is a core component of Ataccama One, an AI-driven platform centered on data quality management. The tool automatically monitors data sets for anomalies, data quality issues and structural changes while providing built-in quality rules as well as capabilities for creating custom ones. It also captures data lineage documentation and includes data profiling, data classification and metadata management capabilities.&lt;/p&gt;
 &lt;p&gt;An AI agent added to Ataccama One in November 2025 handles various tasks autonomously in data catalogs, such as profiling data, &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Evaluating-data-quality-requires-clear-and-measurable-KPIs"&gt;assessing data quality&lt;/a&gt;, and creating and applying quality rules. GenAI capabilities enable catalog users to create SQL queries, generate data descriptions and perform other tasks in natural language. In addition, Ataccama Data Catalog runs AI and ML algorithms to identify patterns, trends and relationships in data sets.&lt;/p&gt;
 &lt;p&gt;The catalog software also includes the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Indexing of reports, dashboards and data stories for BI and analytics uses.&lt;/li&gt; 
  &lt;li&gt;Collaboration features that enable users to add comments and ask questions about data assets.&lt;/li&gt; 
  &lt;li&gt;A data marketplace function to make data products available for reuse.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="4. Atlan Data Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;4. Atlan Data Catalog&lt;/h2&gt;
 &lt;p&gt;Atlan Data Catalog borrows design principles from Google and tools such as GitHub and Slack. It enables users to search for data assets in natural language using keywords or associated business metrics, while providing a SQL syntax search capability for data engineers. Organizations can also integrate collaborative data workflows into catalogs. For example, users can discuss data in Slack chats and create Jira tickets to report data issues.&lt;/p&gt;
 &lt;p&gt;A Companion Sidebar feature provides at-a-glance information about data lineage, usage history, Slack threads, Jira issues and more to help users decide whether data is relevant and trustworthy. Atlan AI, a copilot tool, generates descriptions of data assets, data lineage summaries, SQL queries and definitions of business terms, metrics and KPIs.&lt;/p&gt;
 &lt;p&gt;Part of a broader data and metadata management platform that aims to create an enterprise context layer in organizations, Atlan Data Catalog also includes the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Open APIs that enable fully customizable metadata ingestion.&lt;/li&gt; 
  &lt;li&gt;More than 80 connectors to data platforms and tools.&lt;/li&gt; 
  &lt;li&gt;Role-based filtering to personalize catalog browsing for different users.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="5. AWS Glue Data Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;5. AWS Glue Data Catalog&lt;/h2&gt;
 &lt;p&gt;AWS Glue Data Catalog is the persistent metadata store in AWS Glue, a fully managed extract, transform and load (ETL) service. Data management teams can use it to store, annotate and share metadata for use in ETL data integration jobs on the AWS cloud platform. It also provides a consistent metadata layer for querying and analyzing data across various AWS data stores, using integrated analytics services such as Amazon Athena, Amazon EMR, Amazon Redshift Spectrum and Amazon SageMaker AI.&lt;/p&gt;
 &lt;p&gt;As in traditional relational database catalogs, AWS Glue Data Catalog organizes metadata into databases and tables. The software is compatible with the metastore repository in Apache Hive and can be used as an external metastore for Hive data in Amazon EMR clusters. Organizations can also import technical metadata from the catalog tool into business data catalogs in Amazon DataZone, a separate data management service.&lt;/p&gt;
 &lt;p&gt;Other features in AWS Glue Data Catalog include the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;A wizard for creating crawlers that automatically scan data sources and extract metadata.&lt;/li&gt; 
  &lt;li&gt;Automated schema management and data lineage documentation.&lt;/li&gt; 
  &lt;li&gt;Integration with AWS Lake Formation for defining and managing data access policies.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="6. Coalesce Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;6. Coalesce Catalog&lt;/h2&gt;
 &lt;p&gt;Known as &lt;i&gt;CastorDoc&lt;/i&gt; before Coalesce acquired and renamed it in March 2025, this AI-powered tool provides automated data documentation and natural language search capabilities. An AI assistant helps catalog users find relevant data, write SQL queries and understand data governance policies. Coalesce Catalog also automatically maps data lineage information and creates a metadata-driven semantic layer that applies business context to the data in a catalog.&lt;/p&gt;
 &lt;p&gt;Advanced data filtering, popularity signals, freshness indicators and certification badges help users evaluate data assets for use in analytics applications. Coalesce Catalog streamlines data classification and incorporates data governance features, including role-based access control and guided access request workflows. The catalog software is offered in a platform alongside Coalesce's original data transformation tool; the company is working to fully integrate the two tools.&lt;/p&gt;
 &lt;p&gt;Coalesce Catalog also includes the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Integration with more than 30 data platforms and related tools.&lt;/li&gt; 
  &lt;li&gt;Interfaces for searching data catalogs in Slack or Teams.&lt;/li&gt; 
  &lt;li&gt;Audit trails for monitoring data use and regulatory compliance.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="7. Collibra Data Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;7. Collibra Data Catalog&lt;/h2&gt;
 &lt;p&gt;Collibra offers a namesake &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Data-and-AI-governance-must-team-up-for-AI-to-succeed"&gt;data and AI governance&lt;/a&gt; platform centered on Collibra Data Catalog. The tool provides automated data discovery, classification and curation powered by AI and ML, including the use of GenAI to create descriptions of data assets. It also automates data profiling and data lineage mapping across source systems. An integrated AI copilot helps catalog users find data and associated business definitions.&lt;/p&gt;
 &lt;p&gt;Collibra Data Catalog includes more than 100 prebuilt integrations for ingesting metadata from various data stores, business applications, BI platforms and &lt;a href="https://www.techtarget.com/searchbusinessanalytics/feature/15-data-science-tools-to-consider-using"&gt;data science tools&lt;/a&gt;. It provides configurable workflows for managing data catalogs, as well as guided data stewardship features and controls for enforcing data security and privacy protections. An embedded semantic layer connects technical metadata to business terms and concepts.&lt;/p&gt;
 &lt;p&gt;The Collibra software also offers the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Built-in views of &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Data-governance-metrics-Data-quality-data-literacy-and-more"&gt;data quality metrics&lt;/a&gt; and support for certifying trustworthy data.&lt;/li&gt; 
  &lt;li&gt;Collaboration capabilities, including crowdsourced feedback on data assets through ratings, reviews and comments.&lt;/li&gt; 
  &lt;li&gt;An integrated data marketplace where users can search for relevant data products and other curated data assets.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;figure class="main-article-image full-col" data-img-fullsize="https://www.techtarget.com/rms/onlineimages/example_of_how_a_data_catalog_works-f.png"&gt;
  &lt;img data-src="https://www.techtarget.com/rms/onlineimages/example_of_how_a_data_catalog_works-f_mobile.png" class="lazy" data-srcset="https://www.techtarget.com/rms/onlineimages/example_of_how_a_data_catalog_works-f_mobile.png 960w,https://www.techtarget.com/rms/onlineimages/example_of_how_a_data_catalog_works-f.png 1280w" alt="Diagram showing an example of how data catalogs work." height="319" width="560"&gt;
  &lt;figcaption&gt;
   &lt;i class="icon pictures" data-icon="z"&gt;&lt;/i&gt;Data catalog tools automate the creation and management of data inventories in organizations.
  &lt;/figcaption&gt;
  &lt;div class="main-article-image-enlarge"&gt;
   &lt;i class="icon" data-icon="w"&gt;&lt;/i&gt;
  &lt;/div&gt;
 &lt;/figure&gt;
&lt;/section&gt;      
&lt;section class="section main-article-chapter" data-menu-title="8. Data.world"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;8. Data.world&lt;/h2&gt;
 &lt;p&gt;Acquired by ServiceNow in July 2025, Data.world is a cloud-native data catalog tool offered as a SaaS platform. It's built on a knowledge graph architecture that provides a semantically organized view of enterprise data assets and their associated metadata across disparate systems. It also automates data quality checks, tracks data lineage and creates visualized maps of data relationships and dependencies.&lt;/p&gt;
 &lt;p&gt;Data.world includes a set of AI bots that help organizations deploy and manage data catalogs and automate data governance tasks. Archie Chat, a conversational AI assistant, provides a chat-like data discovery interface to assist catalog users in data searches, suggest research questions and generate natural language descriptions of data assets and metadata.&lt;/p&gt;
 &lt;p&gt;Other notable features in Data.world include the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Collaborative querying capabilities and automated documentation of queries and related comments in a searchable repository.&lt;/li&gt; 
  &lt;li&gt;Customizable data governance workflows and task management processes.&lt;/li&gt; 
  &lt;li&gt;A data product marketplace with an online shopping UX.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="9. Dataplex Universal Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;9. Dataplex Universal Catalog&lt;/h2&gt;
 &lt;p&gt;Dataplex Universal Catalog ingests technical metadata from Google Cloud and on-premises data sources and enables users to enrich it with business context. Google released the tool in 2024 to replace an older data catalog service. New features include a unified web interface and API, more advanced governance capabilities and wider metadata support.&lt;/p&gt;
 &lt;p&gt;Metadata change feeds enable data teams to track metadata updates in near real time and trigger automated workflows, such as data quality scans, compliance audits and security policy updates, when specified changes occur. Dataplex Universal Catalog also integrates with Google's BigQuery data platform and Vertex AI service to support data and &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Exploding-interest-in-GenAI-makes-AI-governance-a-necessity"&gt;AI governance initiatives&lt;/a&gt;. It automatically captures data lineage documentation and includes built-in data profiling and data quality management capabilities.&lt;/p&gt;
 &lt;p&gt;The catalog software also includes the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Automated metadata harvesting from various Google Cloud data sources, and support for ingesting metadata from other systems.&lt;/li&gt; 
  &lt;li&gt;Keyword and natural language search options.&lt;/li&gt; 
  &lt;li&gt;Search-driven access to data insights generated in BigQuery Studio using Google's Gemini AI assistant.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="10. Erwin Data Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;10. Erwin Data Catalog&lt;/h2&gt;
 &lt;p&gt;Erwin Data Catalog is now part of Quest Trusted Data Management Platform, a software suite introduced in February 2026 that also includes Quest Software's data modeling, governance, quality and marketplace tools. The software automatically harvests, catalogs, enriches and curates metadata. It also supports drag-and-drop data mapping, reference data management, data lifecycle management, data lineage documentation and data classification.&lt;/p&gt;
 &lt;p&gt;Standard data connectors ingest data from commonly used databases. Optional ones are available for streaming data, cloud applications, BI environments and other data sources. Erwin Data Catalog integrates with a companion data literacy tool to aid in data discovery and governance. Built-in version management and change control functions track changes to data mappings and documentation.&lt;/p&gt;
 &lt;p&gt;The catalog tool also provides the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
&lt;li&gt;A dashboard with high-level views of data catalog metrics and drill-down analysis capabilities.&lt;/li&gt; 
   &lt;li&gt;An impact analysis function for assessing the potential effects of changes to data attributes or tables.&lt;/li&gt; 
   &lt;li&gt;Automated functions that accelerate data movement and transformation, as well as code generation and documentation.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="11. IBM Watsonx.data intelligence"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;11. IBM Watsonx.data intelligence&lt;/h2&gt;
 &lt;p&gt;IBM Watsonx.data intelligence is a data governance and metadata management software suite launched in May 2025 that includes the former IBM Knowledge Catalog tool. It catalogs structured, unstructured and semistructured data, as well as ML models and other analytics assets. It supports AI-driven data discovery and provides &lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/Human-oversight-enables-automated-data-governance"&gt;automated data governance&lt;/a&gt; functions for tasks such as data quality assessments and data privacy policy management.&lt;/p&gt;
 &lt;p&gt;The software also includes metadata enrichment capabilities powered by large language models, plus a set of Knowledge Accelerators -- industry-specific vocabularies of business terms designed to streamline data governance and analytics deployments. It visually maps relationships between data assets and governance artifacts, using a knowledge graph and the FoundationDB open source database originally developed by Apple.&lt;/p&gt;
 &lt;p&gt;The catalog tool offers the following features as well:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Data profiling, cleansing and validation capabilities.&lt;/li&gt; 
  &lt;li&gt;Support for creating data protection rules to control access to sensitive data.&lt;/li&gt; 
  &lt;li&gt;Integration with data lineage and data product marketplace tools that are also part of Watsonx.data intelligence.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="12. Informatica Data Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;12. Informatica Data Catalog&lt;/h2&gt;
 &lt;p&gt;Informatica Data Catalog is part of the AI-powered Intelligent Data Management Cloud (IDMC) platform developed by Informatica, which Salesforce acquired in November 2025. The catalog tool uses Claire, Informatica's AI engine, to automatically find, ingest, classify and inventory data. Automated data curation features also use AI and ML algorithms to identify relationships between data sets and associate business terms with technical metadata.&lt;/p&gt;
 &lt;p&gt;Data lineage capabilities track data as it moves through systems and data pipelines, supporting impact analysis when data changes. Built-in collaboration capabilities let users add reviews, ratings and annotations to data assets, and subject matter experts can answer questions through a Q&amp;amp;A feature. Informatica Data Catalog also integrates with other IDMC tools, including data governance and data marketplace services.&lt;/p&gt;
 &lt;p&gt;In addition, the catalog software provides the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Automated data profiling and built-in functions for applying data quality rules and monitoring quality levels.&lt;/li&gt; 
  &lt;li&gt;A natural language search function and browsable hierarchical views for finding relevant data in a catalog.&lt;/li&gt; 
  &lt;li&gt;A knowledge graph that visually displays the connections between related data assets.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="13. Microsoft Purview Unified Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;13. Microsoft Purview Unified Catalog&lt;/h2&gt;
 &lt;p&gt;This tool is part of Microsoft Purview, a data security, governance and compliance service that runs in the Microsoft Azure cloud. Initially known as &lt;i&gt;Microsoft Purview Data Catalog&lt;/i&gt;, it was renamed in late 2024, when Microsoft launched a revised data governance offering. The software runs on top of Microsoft Purview Data Map, a companion metadata management tool that scans data sources, ingests metadata and automatically classifies data.&lt;/p&gt;
 &lt;p&gt;In Microsoft Purview Unified Catalog, users can search for individual data assets or data products, such as tables, files and Microsoft Power BI reports. An embedded business glossary can also be used to find relevant data products by searching for terms, key data elements or business objectives. An AI copilot aids in catalog searches.&lt;/p&gt;
 &lt;p&gt;Other features in the catalog tool include the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Data curation for organizing data by governance domains and grouping related data assets and products.&lt;/li&gt; 
  &lt;li&gt;Built-in data quality rules, plus data quality scanning, scoring and alerting functions.&lt;/li&gt; 
  &lt;li&gt;Workflows to help organizations track data governance practices and address issues.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="14. OvalEdge Data Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;14. OvalEdge Data Catalog&lt;/h2&gt;
 &lt;p&gt;OvalEdge Data Catalog is the foundation of OvalEdge's namesake data governance platform. The catalog software crawls data sources and uses AI and ML algorithms to ingest, organize, enrich and curate data assets. It also provides AI-driven data classification and automated data lineage generation, including data flow diagrams that show how data moves through systems. Users can create custom fields to collect extended metadata types, such as access permissions and source-specific attributes.&lt;/p&gt;
 &lt;p&gt;OvalEdge Data Catalog includes built-in functions for data profiling and documenting relationships between data objects. It also tracks data use and generates popularity and importance scores to help teams prioritize data curation efforts. The tool supports both keyword and natural language search. OvalEdge's agentic AI chatbot, askEdgi, streamlines metadata search and analysis and triggers automated data governance workflows.&lt;/p&gt;
 &lt;p&gt;The data catalog software also includes the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Native connectors to more than 150 data sources.&lt;/li&gt; 
  &lt;li&gt;Question Wall, a centralized hub for knowledge sharing and collaboration.&lt;/li&gt; 
  &lt;li&gt;Integration with tools such as Slack, Jira and ServiceNow.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="15. Precisely Data Catalog"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;15. Precisely Data Catalog&lt;/h2&gt;
 &lt;p&gt;A foundational component of Precisely Data Integrity Suite -- a broad data management and governance platform -- Precisely Data Catalog uses AI algorithms to automatically ingest metadata and generate data descriptions. Users can then curate data with Precisely's Gio AI Assistant and Data Catalog Agent. For example, the agent identifies and tags critical data, flags personal information for oversight and aligns metadata with business processes and regulatory compliance needs.&lt;/p&gt;
  &lt;p&gt;Precisely Data Catalog automatically applies data quality rules and scores to metadata. It also creates data profiles and continuously monitors data health to detect anomalies and other issues. Connectors to more than 20 data sources are currently available, and others are planned or available on request.&lt;/p&gt;
 &lt;p&gt;The catalog tool also includes the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Use of Precisely's data governance service to add business metadata to catalog entries.&lt;/li&gt; 
  &lt;li&gt;Visualization of data lineage and relationships.&lt;/li&gt; 
  &lt;li&gt;Integration with workflow management and data security tools that are also part of the Precisely platform's foundation.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;&lt;b&gt;Editor's note:&lt;/b&gt;&lt;i&gt; TechTarget editors updated this article in March 2026 for timeliness and to add new information.&lt;/i&gt;&lt;/p&gt;
 &lt;p&gt;&lt;em&gt;George Lawton is a journalist based in London. Over the last 30 years, he has written more than 3,000 stories about computers, communications, knowledge management, business, health and other areas that interest him.&lt;/em&gt;&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Craig Stedman is an industry editor at TechTarget who creates in-depth packages of content on data technologies and processes.&lt;/i&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Organizations can use numerous tools to build and manage data catalogs. Here are 15 prominent ones that data leaders should consider for their data management needs.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/folder-files11.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/feature/16-top-data-catalog-software-tools-to-consider-using</link>
            <pubDate>Mon, 09 Mar 2026 09:00:00 GMT</pubDate>
            <title>15 top data catalog software tools to consider using in 2026</title>
        </item>
        <item>
            <body>&lt;p&gt;Storage governance is increasingly critical as organizations face more data privacy regulations and use analytics to help improve IT operations and business.&lt;/p&gt; 
&lt;p&gt;Governance initiatives once focused mainly on structured data stored in relational databases, but the process has become more complex. Structured,&amp;nbsp;&lt;a href="https://www.techtarget.com/searchstorage/feature/Managing-unstructured-data-to-boost-performance-lower-costs"&gt;unstructured and semistructured data&lt;/a&gt;&amp;nbsp;are all part of storage governance now, so it's crucial for IT leaders to have a comprehensive strategy.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="Why data storage governance is important"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Why data storage governance is important&lt;/h2&gt;
  &lt;p&gt;Organizations depend on data, databases, big data, applications and other information resources, so an approved program to protect and manage those resources is essential. Data storage governance provides an understanding of how organizations create, use, authorize, secure, protect, maintain, store, archive and destroy data.&lt;/p&gt;
 &lt;p&gt;&lt;span style="font-size: 16px;"&gt;A data storage governance policy is the first step to achieving a&amp;nbsp;&lt;/span&gt;&lt;a href="https://www.techtarget.com/searchstorage/feature/Explore-secure-data-storage-best-practices" style="font-size: 16px;"&gt;secure storage environment&lt;/a&gt;&lt;span style="font-size: 16px;"&gt;. The policy is the starting point for defining governance procedures and additional policies, such as data storage, cloud storage, data privacy, data protection and data management. These policies are important audit evidence because they establish controls that govern data storage operations.&lt;/span&gt;&lt;/p&gt;
 &lt;figure class="main-article-image full-col" data-img-fullsize="https://www.techtarget.com/rms/onlineImages/data_management-need_to_govern_data-f.png"&gt;
  &lt;img data-src="https://www.techtarget.com/rms/onlineImages/data_management-need_to_govern_data-f_mobile.png" class="lazy" data-srcset="https://www.techtarget.com/rms/onlineImages/data_management-need_to_govern_data-f_mobile.png 960w,https://www.techtarget.com/rms/onlineImages/data_management-need_to_govern_data-f.png 1280w" alt="Why govern data? " height="283" width="560"&gt;
  &lt;figcaption&gt;
   &lt;i class="icon pictures" data-icon="z"&gt;&lt;/i&gt;These are some of the top reasons to have a general data governance program.
  &lt;/figcaption&gt;
  &lt;div class="main-article-image-enlarge"&gt;
   &lt;i class="icon" data-icon="w"&gt;&lt;/i&gt;
  &lt;/div&gt;
 &lt;/figure&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="9 best practices for data storage governance"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;9 best practices for data storage governance&lt;/h2&gt;
  &lt;p&gt;There are several ways organizations can optimize their storage governance, from securing management support to training employees on policies. Best practices include the following:&lt;/p&gt;
 &lt;ol type="1" start="1" class="default-list"&gt; 
  &lt;li&gt;Understand how to establish and&amp;nbsp;&lt;a href="https://www.techtarget.com/searchstorage/Data-storage-management-What-is-it-and-why-is-it-important"&gt;manage storage&lt;/a&gt;, and obtain ongoing support from senior management.&lt;/li&gt; 
   &lt;li&gt;Comprehend the government regulations the organization must follow to&amp;nbsp;&lt;a href="https://www.techtarget.com/searchstorage/tip/6-data-storage-compliance-strategies-for-the-enterprise"&gt;achieve compliance&lt;/a&gt;, and comply with data storage and data management standards. Both are audit considerations.&lt;/li&gt; 
  &lt;li&gt;Establish an overall policy for data storage governance, which the organization can supplement with additional policies for data storage management, data privacy, data protection and other issues. The policy can also&amp;nbsp;&lt;a href="https://www.techtarget.com/searchstorage/tip/12-ways-to-manage-your-data-storage-strategy"&gt;define the overall data storage strategy&lt;/a&gt;&amp;nbsp;for the organization. Document procedures for all aspects of data storage. Both policies and procedures are important audit items.&lt;/li&gt; 
  &lt;li&gt;Periodically review and vet data storage governance activities to ensure their relevance and effectiveness. Conduct a periodic&amp;nbsp;&lt;a href="https://www.techtarget.com/searchsecurity/answer/Risk-assessment-vs-risk-analysis-vs-risk-management"&gt;risk analysis&lt;/a&gt;&amp;nbsp;of data storage activities to ensure that the governance program has identified risks, threats and vulnerabilities.&lt;/li&gt; 
  &lt;li&gt;Pay attention to cloud and hybrid storage management. Ensure governance policies cover location, redundancy, access controls and data sovereignty of data stored across cloud, hybrid and multi-cloud environments.&lt;/li&gt; 
  &lt;li&gt;Periodically train and retrain employees who regularly use storage technologies on the proper procedures. New employees should receive training on the proper storage procedures during onboarding.&lt;/li&gt; 
  &lt;li&gt;Establish&amp;nbsp;&lt;a href="https://www.techtarget.com/searchstorage/answer/How-can-organizations-prepare-for-a-data-storage-audit"&gt;IT auditing activities&lt;/a&gt;&amp;nbsp;to ensure the organization follows data storage controls correctly and that those controls are appropriate for the business's requirements. The storage strategy, which ideally is embedded in the storage governance policy, helps define the controls.&lt;/li&gt; 
  &lt;li&gt;Go beyond technology with storage governance. Identify how data storage supports strategic intelligence requirements. Organizations might focus on data value, analytics readiness and compliance, for example.&lt;/li&gt; 
  &lt;li&gt;Regularly communicate with senior IT and company management to reinforce the value of data storage programs and to report on the program's success and how it aligns with goals and objectives.&lt;/li&gt; 
   &lt;li&gt;Establish a team to support the data storage governance program. Members can include technicians, data quality technicians, a chief data officer, and advocates or stewards who promote governance activities in key divisions and departments.&lt;/li&gt; 
 &lt;/ol&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="Overcome challenges to data storage governance"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Overcome challenges to data storage governance&lt;/h2&gt;
 &lt;p&gt;Establish senior management support and a thorough understanding of how data storage helps the firm achieve its goals. These two items greatly help overcome any resistance to a storage governance program.&lt;/p&gt;
 &lt;p&gt;Supplement the above achievements with documented and approved policies and procedures for data storage governance. Once approved, circulate them among all employees so they are aware of the program and comply with the policies.&amp;nbsp;&lt;a target="_blank" href="https://www.lightsondata.com/top-10-data-governance-courses-and-training/" rel="noopener"&gt;Schedule training&lt;/a&gt;&amp;nbsp;on the data storage governance program soon after it launches. Set up refresher training as needed.&lt;/p&gt;
  &lt;p&gt;Periodically review the data storage governance program to keep it current and aligned with the organization's requirements. Risk assessments help identify and mitigate any potentially disruptive issues. In addition to program reviews, periodically&amp;nbsp;&lt;a href="https://www.techtarget.com/searchstorage/tip/Perform-data-storage-testing-to-prevent-issues"&gt;test and validate storage procedures&lt;/a&gt;&amp;nbsp;to ensure they work as intended.&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Data governance manages the availability, usability, integrity and security of data. Follow these best practices for governance as it relates to data storage.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/storage_g922017556.jpg</image>
            <link>https://www.techtarget.com/searchstorage/tip/How-to-optimize-data-storage-governance</link>
            <pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate>
            <title>How to optimize data storage governance</title>
        </item>
        <item>
            <body>&lt;p&gt;Technology-led data governance programs are no longer viable. To improve data quality, security, management and stewardship, the entire C-suite must get involved.&lt;/p&gt; 
&lt;p&gt;Many organizations still run data governance through IT, going system by system. In early 2024, executives at professional services firm BDO USA recognized that a holistic, business-led approach to data governance was more effective and created a 45-member &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/Data-governance-roles-and-responsibilities-Whats-needed"&gt;data governance team&lt;/a&gt; of data stewards, business owners and data trustees. According to Mike Gerhard, chief data and AI officer at BDO, those team members are executives who provide guidance to help achieve company objectives.&lt;/p&gt; 
&lt;p&gt;"We knew we needed data to do a lot of what we wanted to do and to innovate, so we saw the need to transform [the governance program]. We had to change our mindset to see and govern data as a shared commodity," Gerhard said. "Governance today is a team collaboration to make sure we're doing the right thing for the firm and for our clients."&lt;/p&gt; 
&lt;p&gt;To stay competitive, the C-suite must drive data governance and create a culture of shared responsibility in which each function works toward measurable results.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="'Every executive is a data leader'"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;'Every executive is a data leader'&lt;/h2&gt;
 &lt;p&gt;The 2026 Executive Benchmark Survey from technology company &lt;a target="_blank" href="https://www.businesswire.com/news/home/20260203332401/en/Workiva-Executive-Benchmark-Survey-Finds-Instability-is-Accelerating-Data-Automation-and-Governance-in-2026" rel="noopener"&gt;Workiva&lt;/a&gt; found that 79% of business leaders are prioritizing data automation and governance. Moreover, 96% of survey respondents said the CFO, CIO and CSO must unite around a shared &lt;a target="_blank" href="https://www.informationweek.com/data-management/how-to-create-a-sound-data-governance-strategy" rel="noopener"&gt;data governance strategy&lt;/a&gt;, and 96% said better access to shared data improves the likelihood of achieving optimal business outcomes.&lt;/p&gt;
 &lt;p&gt;The results confirm what Gerhard and others are seeing: in a data-driven economy, data's strategic value elevates it to a critical asset that demands attention across the C-suite.&lt;/p&gt;
 &lt;p&gt;"We're now in a world where every executive is a data leader," said Scott Beale, CEO of ISC2, a nonprofit member organization for cybersecurity professionals. "We have to ensure there is shared ownership of data governance and that everyone is aligned on risk, strategic goals and ethics."&lt;/p&gt;
 &lt;p&gt;Beale added that this process is always evolving and improving.&lt;/p&gt;
 &lt;p&gt;"Even organizations that are doing it well could do it better," he said.&lt;/p&gt;
&lt;/section&gt;      
&lt;section class="section main-article-chapter" data-menu-title="How to foster shared responsibility"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;How to foster shared responsibility&lt;/h2&gt;
 &lt;p&gt;Creating an enterprise culture of shared governance within the C-suite isn't easy, experts said. However, organizations can overcome this challenge when the board and the CEO establish the practices and policies that distribute accountability across the leadership team.&lt;/p&gt;
 &lt;p&gt;"It starts with selling a strategy and making sure that strategy is clear," Jon France, CISO of ISC2, said. "It comes with that strategy spark at the top."&lt;/p&gt;
 &lt;p&gt;Beale said chief executives and boards are already doing that work. Leadership is working to &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/7-best-practices-for-successful-data-governance-programs"&gt;mature their data governance programs&lt;/a&gt; because it's essential for automation, analytics and AI.&lt;/p&gt;
 &lt;p&gt;Similarly, Gerhard noted that BDO's business-led collaborative approach provided the firm with more reliable data, which in turn drives more &lt;a href="https://www.techtarget.com/searchitoperations/definition/agentic-process-automation"&gt;automation and AI efforts&lt;/a&gt;. That result has put a brighter spotlight on data governance across the enterprise.&lt;/p&gt;
 &lt;p&gt;"Every CEO now recognizes there is no separating the data strategy and business strategy," Beale said. "They're making sure data is governed in a way that allows the organization to be as competitive as possible and protects trust and enterprise value. They recognize that means data can't live in silos."&lt;/p&gt;
 &lt;p&gt;Beale said that it starts with top enterprise leaders establishing clear rules and a strategy for the C-suite to work together. Leadership must set expectations and determine KPIs so each team member is accountable for performance.&lt;/p&gt;
  &lt;p&gt;The executives might also want to establish a &lt;a href="https://www.techtarget.com/searchcio/definition/steering-committee"&gt;steering committee&lt;/a&gt; to guide the organization, help identify gaps in data ownership and responsibility, and foster an enterprise-wide data governance mindset, said Tom Levi, director of field CISO and cyber strategy at CYE, a cyber exposure management company. He also advises executives to work through governance scenarios in tabletop exercises to practice a collaborative approach.&lt;/p&gt;
 &lt;p&gt;Gerhard said BDO built its data governance program on the belief that the entire firm owns the data, not any single department. While the firm owns the data, business leaders are accountable for data attributes within their functions, reinforcing a culture of shared responsibility.&lt;/p&gt;
 &lt;p&gt;Along these lines, experts said it's important to note that shared data governance responsibility doesn't make the &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/The-evolution-of-the-chief-data-officer-role"&gt;chief data officer&lt;/a&gt; unnecessary. They stressed that the CDO still oversees data governance even with a team approach.&lt;/p&gt;
 &lt;p&gt;France compared a mature data governance program to financial governance. For example, the CFO is responsible for the organization's financial health, strategy and risk management, even though all executives must be good financial stewards. The same is true for CDOs with data governance.&lt;/p&gt;
&lt;/section&gt;           
&lt;section class="section main-article-chapter" data-menu-title="Challenges to creating shared responsibility"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Challenges to creating shared responsibility&lt;/h2&gt;
 &lt;p&gt;Leadership might find that creating this culture is easier on paper than in practice. The following are common challenges that leadership might encounter:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
   &lt;li&gt;&lt;b&gt;Silos.&lt;/b&gt; Sometimes technical or business structures create data silos, which hinder the ability to share responsibility and accountability across the C-suite, Gerhard said. In such cases, organizations might need to restructure those systems and processes to facilitate change.&lt;/li&gt; 
   &lt;li&gt;&lt;b&gt;Personalities and motivations.&lt;/b&gt; In other cases, egos and individual priorities can hinder shared responsibility efforts. "Each C-level person brings in their own incentives, priorities and perspectives," Levi said.&lt;/li&gt; 
  &lt;li&gt;&lt;b&gt;Data hoarding.&lt;/b&gt; Board members and executives might not recognize the value of a collaborative governance approach. France said some executives or managers might hoard data. Those business leaders might fear that sharing could affect data use and their ability to succeed in their business objectives.&lt;/li&gt; 
   &lt;li&gt;&lt;b&gt;Resistance to change.&lt;/b&gt; In still other cases, an unwillingness to change is a major obstacle. "There may be an executive who has been around for years, and they want to continue to do things their way," Levi said. "Those cultural elements can inhibit the organization from catching up with the market."&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;&lt;em&gt;Mary K. Pratt is an award-winning freelance journalist with a focus on covering enterprise IT and cybersecurity management.&lt;/em&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>To improve business outcomes, leadership must move beyond IT controls and adopt a playbook that treats data as a shared enterprise asset with clear roles and policies.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/collab_a275903017.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/feature/Data-governance-responsibilities-now-belong-in-the-C-suite</link>
            <pubDate>Wed, 04 Mar 2026 13:32:00 GMT</pubDate>
            <title>Data governance responsibilities now belong in the C-suite</title>
        </item>
        <item>
            <body>&lt;p&gt;Enterprises are rapidly adopting AI to unlock its potential, but many fail to address a key area: preparing vast, fragmented data sets so models can use them effectively.&lt;/p&gt; 
&lt;p&gt;AI efforts often slow to a crawl -- or fail entirely -- because teams quickly discover how hard it is to turn raw data into something leaders and systems trust. The real work isn't the modeling but finding the right data, cleaning and governing it, and enforcing standards to keep it consistent and reusable. &lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/2026-will-be-the-year-data-becomes-truly-intelligent"&gt;Enterprises that build momentum&lt;/a&gt; avoid inertia by continuously monitoring, refining and validating their data. Those practices build the hard-won trust AI needs to produce accurate, relevant results and push projects from experimentation to production.&lt;/p&gt; 
&lt;p&gt;"Prior to the arrival of AI, corporate decision making was centered around the trustworthiness of your existing data, and most people did not [trust their data]," said Stephen Catanzano, an analyst at Omdia, a division of Informa TechTarget.&amp;nbsp;"And our current research shows most people still don't fully trust their data. So, the question remains: can I give my data to an AI agent and have that agent make decisions for my company, like changing processes? Well, you can't. The definition of AI-ready data starts and ends with trust."&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="Why AI data readiness matters now"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Why AI data readiness matters now&lt;/h2&gt;
 &lt;p&gt;This trust factor underscores that AI can't deliver meaningful business value until enterprises address their longstanding gaps in data quality and management.&lt;/p&gt;
 &lt;p&gt;AI data readiness is increasingly recognized as the foundation of successful corporate AI initiatives. Analysts highlight its strategic importance, with&amp;nbsp;Gartner forecasting that 60% of &lt;a href="https://www.techtarget.com/searchcio/feature/AI-failure-examples-What-real-world-breakdowns-teach-CIOs"&gt;AI projects will be abandoned&lt;/a&gt; by the end of 2026 due to inadequate data management. By 2027, the failure rate is expected to climb to 80% for GenAI projects, driven by deficiencies in data quality, governance and trust.&lt;/p&gt;
 &lt;p&gt;According to Gartner, siloed data that prevents AI from seeing across multiple CRM, ERP and regulatory systems is a common barrier. Ungoverned data introduces compliance risks and can expose mission-critical data.&lt;/p&gt;
 &lt;p&gt;There's a growing consensus that scalable AI architectures rely on consistent standards to ensure data accuracy, accessibility and compliance. While no single governance model dominates, ISO/IEC 42001 -- the international standard for AI management systems -- offers structured guidance for responsible AI development and oversight. Enterprises often pair it with semantic frameworks, such as Resource Description Framework (RDF) and the Web Ontology Language (OWL). Combined, these approaches strengthen AI data governance, encourage ethical data practices and support scalability.&lt;/p&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="Making data trustworthy takes work"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Making data trustworthy takes work&lt;/h2&gt;
 &lt;p&gt;For many organizations, simply identifying what data they have and where it lives remains the biggest obstacle before they can start the refining process.&lt;/p&gt;
 &lt;p&gt;"It's all well and good to get your data AI ready, but if you don't know where your data resides, it's really hard to do that," said Jack Gold, principal analyst with J. Gold Associates.&amp;nbsp;"Companies have isolated or siloed data stashed all over the place. Unfortunately, a lot of companies are still in the middle of that process. If the majority of users were actually prioritizing this, these data lake companies would be worth trillions."&lt;/p&gt;
 &lt;p&gt;Once organizations know what data they have, governance becomes the next hurdle.&lt;/p&gt;
 &lt;p&gt;"AI systems do not just use data -- they learn from it, and that makes governance critical," Catanzano said. "Poorly governed data leads to biased, insecure, and/or noncompliant AI."&lt;/p&gt;
 &lt;p&gt;Governance gives lineage visibility, allowing teams to trace how the data moves and changes across systems. It also enforces access controls to limit exposure of sensitive information and helps organizations &lt;a href="https://www.techtarget.com/searchenterpriseai/tip/Global-AI-legislation-and-regulation-tracker"&gt;meet regulations, such as HIPAA, GDPR and the EU AI Act&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;"Adding in lineage and observability tools is becoming really important," Catanzano said. "They allow you to actually see the data and look for governance challenges, along with being able to map out compliance requirements for data-specific challenges. It has taken people a while to figure out the importance of these tools and how to achieve it. Frankly, most users haven't got to this stage yet."&lt;/p&gt;
&lt;/section&gt;       
&lt;section class="section main-article-chapter" data-menu-title="How embeddings make AI useful"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;How embeddings make AI useful&lt;/h2&gt;
 &lt;p&gt;The next step is making that data relevant at scale. AI performs better when it understands the data's context. Embeddings turn words, images and logs into &lt;a href="https://www.techtarget.com/whatis/definition/vector"&gt;vectors&lt;/a&gt;, so systems retrieve the right content rather than guessing.&lt;/p&gt;
 &lt;p&gt;Paired with strong metadata, embeddings help AI return the right information at the right time. AI is shifting toward metadata-rich, vector-based retrieval, Catanzano said.&lt;/p&gt;
 &lt;p&gt;"We have been moving towards higher levels of metadata and larger amounts of the vectorization of data," Catanzano noted. "So now, users want to use more AI because vectors create relevancy, which means AI can find the most relevant data based on vector scores and so improve the quality of data being searched for."&lt;/p&gt;
 &lt;p&gt;Vectorization is becoming a baseline requirement for effective AI implementation, Catanzano noted. Converting unstructured data into embeddings and combining them with high-fidelity metadata optimizes retrieval precision and &lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/Data-intelligence-isnt-just-a-buzzword"&gt;increases confidence in the accuracy&lt;/a&gt; of the information delivered to the end user.&lt;/p&gt;
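As an illustration of the vector-based retrieval described above, here is a minimal, stdlib-only sketch. The three-dimensional vectors and document names are hypothetical stand-ins for real embedding-model output; the point is that cosine similarity ranks candidates so the most relevant content is returned rather than guessed at.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings -- in practice these come from an embedding model.
docs = {
    "invoice policy": [0.9, 0.1, 0.0],
    "server logs":    [0.1, 0.8, 0.3],
    "travel guide":   [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # embedding of "how do I submit an invoice?"

# Rank documents by vector score, highest similarity first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # -> invoice policy
```

In production, these scores would come from a vector database over millions of embeddings, with metadata filters applied alongside the similarity search.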
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="How tokenization improves performance"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;How tokenization improves performance&lt;/h2&gt;
 &lt;p&gt;With retrieval grounded in embeddings and metadata, the next lever on performance is how text is prepared for the model. Using enterprise data effectively with AI workflows often requires converting it into formats the language models can understand.&lt;/p&gt;
 &lt;p&gt;&lt;a href="https://www.techtarget.com/searchsecurity/definition/tokenization"&gt;Tokenization&lt;/a&gt; is a key part of the pipeline, but it's only the first step. Once tokenized, the model applies learned patterns and relationships to analyze content, generate responses or make predictions. &lt;a href="https://www.techtarget.com/searchenterpriseai/feature/Can-tokenization-free-up-more-data-for-AI-model-training"&gt;Efficient tokenization&lt;/a&gt; reduces the number of tokens the system must handle, improving response times and lowering compute and inference costs in the production environment.&lt;/p&gt;
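To make the cost argument concrete, here is a toy sketch. Real pipelines use subword tokenizers such as BPE; whitespace splitting and the per-token price stand in for those here purely for illustration.

```python
# Toy tokenizer: whitespace splitting stands in for a real
# subword tokenizer (e.g. BPE) used by language models.
def tokenize(text):
    return text.lower().split()

verbose = "Please could you kindly summarize the attached quarterly report document"
concise = "Summarize the attached quarterly report"

cost_per_token = 0.002  # hypothetical unit cost per token
for prompt in (verbose, concise):
    n = len(tokenize(prompt))
    # Fewer tokens means fewer units of compute per request.
    print(n, "tokens ->", round(n * cost_per_token, 4))
```

The same prompt expressed in fewer tokens costs proportionally less to process and returns faster, which is why efficient tokenization matters at production scale.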
 &lt;p&gt;In modern AI workflows, organizations often convert documents and other unstructured data into &lt;a href="https://www.techtarget.com/searchenterpriseai/definition/vector-embeddings"&gt;vector embeddings&lt;/a&gt;. This change makes the information available for wider use, enabling more precise insights tied to specific business needs.&lt;/p&gt;
 &lt;p&gt;"Developers and users have to transform their data located, for instance, in a database, into a [format] that can travel across platforms," said Frank Dzubeck, president of Communications Network Architects. "Companies, in the pharmaceutical industry, for instance, are doing that now. It changes the way they can look at data because they can create [embeddings] that specifically address problems they are researching in their industry."&lt;/p&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="What are the building blocks of AI‑ready data?"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;What are the building blocks of AI‑ready data?&lt;/h2&gt;
 &lt;p&gt;Beyond AI data governance requirements, enterprises achieve the best results when embeddings rest on a strong data layer consisting of several key elements.&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
   &lt;li&gt;&lt;b&gt;Standardized structures.&lt;/b&gt; Common formats, such as CSV and JSON, help keep data portable, but their real value is applying consistency to the information they hold.&lt;/li&gt; 
   &lt;li&gt;&lt;b&gt;Smart labeling.&lt;/b&gt; Tagging and annotating data ensure AI models can interpret the raw values and the intended meaning behind them.&lt;/li&gt; 
   &lt;li&gt;&lt;b&gt;A shared language.&lt;/b&gt; Semantic frameworks, such as RDF and Shapes Constraint Language (SHACL), act as a translator, giving data sets a common structure to promote interoperability.&lt;/li&gt; 
   &lt;li&gt;&lt;b&gt;Deep context.&lt;/b&gt; Using logic tools, such as OWL, gives the context and meaning needed to form relationships across data sets.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;   
&lt;section class="section main-article-chapter" data-menu-title="What to ask vendors before you buy"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;What to ask vendors before you buy&lt;/h2&gt;
 &lt;p&gt;Don't be swayed by a polished demo. Press for specifics about how the system handles your data today and what could break tomorrow.&lt;/p&gt;
 &lt;p&gt;&lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/How-data-for-AI-is-changing-the-modern-data-platform"&gt;When evaluating vendors&lt;/a&gt;, anchor your questions in your governance model and data flows. Start with data use and ownership. How is proprietary data isolated? Will your content be used to train shared models, or will it only serve your tenants? What are the defaults for retention, deletion and cross-tenant safeguards?&lt;/p&gt;
 &lt;p&gt;Models trained on your organization's data tend to generate more accurate results that are less susceptible to biases inherent in internet-based data. Ask how the tuning is done and how quality is measured over time.&lt;/p&gt;
 &lt;p&gt;It's also critical to understand the data preparation workflow and the tools involved. What formats are supported? How is the data transformed? What lineage, logging and rollback is available? What are the commitments for backward compatibility as the platform evolves?&lt;/p&gt;
 &lt;p&gt;Without these answers, organizations &lt;a target="_blank" href="https://edmcouncil.org/frameworks/cdmc/14-key-controls/" rel="noopener"&gt;risk&lt;/a&gt; getting locked into proprietary formats that might not work in later versions.&lt;/p&gt;
 &lt;p&gt;"They need to actually show users their various transformation tools for databases, searching and other functions because they are all different," said Dzubeck. "As for future proofing, that's a tough question for users to get an answer to. They are only going to give you answers that directly link to their products and strategies like Google, which is focused on search, and large database makers like Oracle."&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Ed Scannell is a freelance writer and journalist based in Needham, Mass. He reports on a wide range of technologies and issues related to corporate IT. He can be reached at ed.scannell@gmail.com.&lt;/i&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>As organizations dive into AI adoption, many realize the first real bottleneck is not the model but how to prepare their information so it can be used effectively in AI workflows.</description>
            <image>https://cdn.ttgtmedia.com/rms/onlineimages/ai_a352095729.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/feature/AI-data-governance-guidance-that-gets-you-to-the-finish-line</link>
            <pubDate>Tue, 03 Mar 2026 07:36:00 GMT</pubDate>
            <title>AI data governance guidance that gets you to the finish line</title>
        </item>
        <item>
            <body>&lt;p&gt;Big data environments in organizations are only getting bigger. The ever-increasing volume and variety of data collected in them requires investments in big data tools to support analytics and AI applications. But choosing the right technologies is complicated: Enterprise data leaders have a wide variety of tools to consider.&lt;/p&gt; 
&lt;p&gt;The available choices include numerous open source big data tools, many of which are offered by technology vendors in commercial versions or as part of big data platforms. The following are 18 popular open source technologies for &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/How-to-build-an-enterprise-big-data-strategy-in-4-steps"&gt;managing and analyzing big data&lt;/a&gt;, listed in alphabetical order with an overview of each one's features, capabilities and potential uses. TechTarget editors compiled the list based on their research of available technologies and analysis from consulting firms such as Forrester Research and Gartner.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="1. Airflow"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;1. Airflow&lt;/h2&gt;
 &lt;p&gt;Apache Airflow is a workflow management platform for scheduling and running &lt;a href="https://www.techtarget.com/searchbusinessanalytics/news/365534255/Data-pipelines-deliver-the-fuel-for-data-science-analytics"&gt;complex data pipelines&lt;/a&gt; in big data systems. It enables data engineers and other users to ensure each task in a workflow can access the required system resources and is executed in the designated order. Airflow is most commonly used to orchestrate data integration and transformation processes, machine learning (ML) operations, business applications and IT infrastructure management tasks, but it also supports other types of workflows.&lt;/p&gt;
 &lt;p&gt;The platform has a modular architecture built around directed acyclic graphs that illustrate the dependencies between workflow tasks. Airflow pipelines are defined in Python and can be generated dynamically. Airbnb initially created Airflow for internal use, and the technology became a top-level project within the Apache Software Foundation in 2019.&lt;/p&gt;
 &lt;p&gt;Airflow also includes the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Time- and dependency-based scheduling of workflows, plus an event-driven scheduling option.&lt;/li&gt; 
  &lt;li&gt;A web application UI to visualize data pipelines, monitor their production status and troubleshoot problems.&lt;/li&gt; 
  &lt;li&gt;Ready-made integrations with major cloud platforms and other third-party services.&lt;/li&gt; 
 &lt;/ul&gt;
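The dependency-ordered execution that Airflow guarantees can be sketched with the standard library alone. The task names below are hypothetical; this is a toy model of the scheduling contract, not Airflow's actual API, which defines DAGs with its own operators and decorators.

```python
from graphlib import TopologicalSorter

# Toy model of what a DAG scheduler guarantees: a task runs only
# after every task it depends on has completed. Keys are tasks,
# values are the tasks they depend on (hypothetical names).
deps = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# static_order yields tasks in a dependency-respecting order.
order = list(TopologicalSorter(deps).static_order())
print(order)  # -> ['extract', 'transform', 'load', 'notify']
```

A real Airflow DAG adds scheduling intervals, retries and resource allocation on top of this ordering guarantee, but the acyclic-dependency core is the same.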
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="2. Delta Lake"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;2. Delta Lake&lt;/h2&gt;
 &lt;p&gt;Delta Lake is a table storage layer that can be used to &lt;a href="https://www.techtarget.com/searchdatamanagement/news/366545117/Lakehouse-architecture-the-best-fit-for-modern-data-needs"&gt;build a data lakehouse architecture&lt;/a&gt; combining elements of data lakes and data warehouses. The Delta Lake framework creates a unified format for structured, semistructured and unstructured data, eliminating data silos that often &lt;a href="https://www.techtarget.com/searchdatamanagement/tip/10-big-data-challenges-and-how-to-address-them"&gt;stymie big data applications&lt;/a&gt;. It also provides common semantics for both batch and stream processing of table reads and writes.&lt;/p&gt;
 &lt;p&gt;To ensure data integrity, Delta Lake supports transactions that adhere to the four &lt;a href="https://www.techtarget.com/searchdatamanagement/definition/ACID"&gt;ACID properties&lt;/a&gt;: atomicity, consistency, isolation and durability. A liquid clustering capability optimizes how data is stored based on query patterns, offering an alternative to traditional data partitioning. Databricks, a software vendor founded by the creators of the Apache Spark processing engine, developed Delta Lake and made the Spark-compatible technology open source in 2019 through the Linux Foundation.&lt;/p&gt;
 &lt;p&gt;Delta Lake also includes the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Support for storing data in an open Apache Parquet format.&lt;/li&gt; 
  &lt;li&gt;Delta Universal Format, a function commonly known as UniForm that enables Delta Lake tables to be read in Iceberg and Hudi, two other Parquet-based table formats.&lt;/li&gt; 
  &lt;li&gt;A time-travel capability that provides access to earlier versions of data sets for audits and rollbacks.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="3. Drill"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;3. Drill&lt;/h2&gt;
 &lt;p&gt;Apache Drill is a low-latency distributed query engine best suited for workloads involving large, complex data sets with diverse types of records and fields. The Drill website claims it can scale across thousands of cluster nodes and query petabytes of data using SQL and standard connectivity APIs. It handles a combination of structured and semistructured data, including nested data types such as JSON and Parquet files.&lt;/p&gt;
 &lt;p&gt;Drill is built on a schema-free JSON document model and layers on top of multiple data sources, enabling users to query a wide range of data in different formats. It supports various file types and sources, including Hadoop SequenceFiles and event logs, &lt;a href="https://www.techtarget.com/searchcloudcomputing/tip/Compare-NoSQL-database-types-in-the-cloud"&gt;NoSQL databases&lt;/a&gt; and cloud object storage. Drill users can store multiple files in a directory and query them as a single entity.&lt;/p&gt;
 &lt;p&gt;First released in 2015, the software can also do the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Query data in most relational databases through a plugin.&lt;/li&gt; 
  &lt;li&gt;Work with &lt;a href="https://www.techtarget.com/searchbusinessanalytics/feature/Top-business-intelligence-tools-to-know-about"&gt;commonly used BI tools&lt;/a&gt;, such as Tableau and Qlik Sense.&lt;/li&gt; 
  &lt;li&gt;Run in any distributed cluster environment, although Apache ZooKeeper must be installed along with it to maintain information about cluster configurations.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="4. Druid"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;4. Druid&lt;/h2&gt;
 &lt;p&gt;Apache Druid is a real-time analytics database with an interactive query engine that provides low query latency, high user concurrency, multi-tenant capabilities and instant visibility into streaming data. Hundreds or thousands of end users can simultaneously query data stored in Druid with no effect on performance, according to its developers.&lt;/p&gt;
 &lt;p&gt;Written in Java and created in 2011, Druid became an Apache technology in 2018. Best suited for storing event-driven data, it's considered a high-performance alternative to traditional data warehouses. Like a data warehouse, Druid uses column-oriented storage and can load files in batch mode. However, it also incorporates features from search systems and time series databases, including the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Compressed bitmap indexes to speed up searches and data filtering.&lt;/li&gt; 
  &lt;li&gt;Time-based data partitioning and querying.&lt;/li&gt; 
  &lt;li&gt;Flexible schemas with native support for semistructured data and nested data structures.&lt;/li&gt; 
 &lt;/ul&gt;
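The bitmap indexes mentioned in the list above can be sketched in a few lines of plain Python. This toy version (hypothetical column values, uncompressed integers as bitmaps) shows the principle Druid applies at scale with compressed bitmaps: filtering becomes fast bitwise arithmetic instead of a row scan.

```python
# Toy bitmap index: one bit per row, one bitmap per distinct column value.
rows = ["web", "mobile", "web", "web", "mobile"]  # hypothetical 'channel' column

bitmaps = {}
for i, value in enumerate(rows):
    # Set bit i in the bitmap for this value.
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

# Filtering "channel = 'web'" is now a bitmap lookup, not a scan,
# and AND/OR of multiple filters become bitwise &amp; and |.
web = bitmaps["web"]  # bits 0, 2 and 3 are set
matches = [i for i in range(len(rows)) if web >> i & 1]
print(matches)  # -> [0, 2, 3]
```

Druid adds compression (so sparse bitmaps stay small) and combines these indexes with time-based partitioning to keep query latency low.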
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="5. Flink"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;5. Flink&lt;/h2&gt;
 &lt;p&gt;Another Apache technology, Flink is a &lt;a href="https://www.techtarget.com/searchdatamanagement/definition/stream-processing"&gt;stream processing&lt;/a&gt; framework for high-performance distributed applications, including always-available ones. It supports stateful computations over both bounded and unbounded data streams and can be used for batch, graph and iterative processing. One of the main benefits touted by Flink's proponents is its speed: The software processes millions of events in real time with low latency and high throughput.&lt;/p&gt;
 &lt;p&gt;Flink began as a university research initiative in Germany and became an Apache project in 2014. In addition to event-driven applications -- such as fraud or anomaly detection -- potential use cases include continuous data pipelines and both streaming and batch analytics. Flink runs in all common cluster environments and also includes the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;In-memory computations with the ability to access disk storage when needed.&lt;/li&gt; 
  &lt;li&gt;Three layers of APIs for creating different types of applications.&lt;/li&gt; 
  &lt;li&gt;A set of libraries for complex event processing, ML and other &lt;a href="https://www.techtarget.com/searchbusinessanalytics/feature/8-big-data-use-cases-for-businesses-and-industry-examples"&gt;common big data use cases&lt;/a&gt;.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="6. Hadoop"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;6. Hadoop&lt;/h2&gt;
 &lt;p&gt;Apache Hadoop is a distributed framework for storing data and running applications on commodity hardware clusters. First released in 2006 as a pioneering big data technology, it helps users handle large volumes of structured, unstructured and semistructured data. Hadoop is also at the center of a broader technology ecosystem that includes various related tools and frameworks for processing, managing and analyzing big data. While Hadoop has been partially eclipsed by Spark and other technologies, it's still used by many organizations.&lt;/p&gt;
 &lt;p&gt;Hadoop includes these primary components:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;The Hadoop Distributed File System (&lt;a href="https://www.techtarget.com/searchdatamanagement/definition/Hadoop-Distributed-File-System-HDFS"&gt;HDFS&lt;/a&gt;) splits data into blocks for storage on cluster nodes, uses replication methods to prevent data loss and manages access to the data.&lt;/li&gt; 
  &lt;li&gt;Hadoop YARN schedules data processing jobs to run on cluster nodes and allocates system resources to them.&lt;/li&gt; 
  &lt;li&gt;Hadoop MapReduce, a built-in batch processing engine, splits up large computations and runs them on different nodes for speed and load balancing.&lt;/li&gt; 
  &lt;li&gt;Hadoop Common is a shared set of utilities and libraries.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;Initially, Hadoop was limited to running MapReduce batch applications. The addition of YARN in 2013 opened it up to other processing engines and use cases, but the framework is still most commonly used with MapReduce.&lt;/p&gt;
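The MapReduce model described above can be traced end to end on a single machine. This word-count sketch runs the three phases in-process with the standard library; Hadoop's contribution is distributing the same steps across cluster nodes with fault tolerance.

```python
from collections import defaultdict

# Hypothetical input documents.
docs = ["big data tools", "big data systems"]

# Map phase: emit (key, 1) pairs for every word.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # -> {'big': 2, 'data': 2, 'tools': 1, 'systems': 1}
```

In Hadoop, the map and reduce functions run on different nodes, and the shuffle moves intermediate pairs between them over the network.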
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="7. Hive"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;7. Hive&lt;/h2&gt;
 &lt;p&gt;Also an Apache technology, Hive is SQL-based &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Evaluating-your-need-for-a-data-warehouse-platform"&gt;data warehouse infrastructure software&lt;/a&gt; for reading, writing and managing large data sets in distributed Hadoop storage environments. It runs on top of Hadoop and processes structured data for summarization, querying and analysis. Hive supports ACID transactions, low-latency analytical processing and cost-based query optimization, the latter through integration with the Apache Calcite tool.&lt;/p&gt;
 &lt;p&gt;In addition to HDFS files, Hive can access ones stored in the Apache HBase database and other systems. It also enables users to create and read Iceberg tables. Hive Metastore Server, its central metadata repository, provides data abstraction and data discovery features similar to those in traditional data warehouses. Facebook created Hive for internal use, and it became an Apache top-level project in 2010.&lt;/p&gt;
 &lt;p&gt;Other key features include the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;HiveQL, a language with standard SQL functionality for data querying and analytics.&lt;/li&gt; 
  &lt;li&gt;Native support for cloud object storage services.&lt;/li&gt; 
  &lt;li&gt;MapReduce, Spark and Apache Tez as execution back-end options.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="8. HPCC Systems"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;8. HPCC Systems&lt;/h2&gt;
 &lt;p&gt;HPCC Systems is a big data processing platform that LexisNexis Risk Solutions developed as an alternative to Hadoop and Spark. Befitting its full name -- High-Performance Computing Cluster Systems -- the technology supports data-intensive applications requiring speed and scalability on clusters built from commodity hardware. Its primary use case is enabling rapid data engineering for analytics applications in data lake environments.&lt;/p&gt;
 &lt;p&gt;The platform includes these main components:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Thor, a data refinery engine used to cleanse, merge and transform data for use in queries.&lt;/li&gt; 
  &lt;li&gt;Roxie, a data delivery engine that serves prepared data from the refinery to end users for querying.&lt;/li&gt; 
  &lt;li&gt;Enterprise Control Language, a programming language commonly known as ECL that's used for data management and query processing.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;HPCC Systems also includes a library of ML algorithms, plus tools for monitoring clusters and profiling, curating and governing data. While still primarily overseen by LexisNexis, it became open source in 2011 and is freely available to download under the Apache 2.0 license. The current release is a cloud-native platform that runs in Docker containers on &lt;a href="https://www.techtarget.com/searchitoperations/definition/Google-Kubernetes"&gt;Kubernetes&lt;/a&gt; in both the AWS and Microsoft Azure clouds. Deployments of the original bare-metal platform are also still supported.&lt;/p&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="9. Hudi"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;9. Hudi&lt;/h2&gt;
 &lt;p&gt;Apache Hudi -- pronounced &lt;i&gt;hoodie&lt;/i&gt; -- is a platform for managing large analytics data sets stored in HDFS and other Hadoop-compatible file systems, including cloud object storage services. Short for "Hadoop upserts, deletes and incrementals," Hudi provides database-like functionality for ingesting and updating data to support real-time analytics in data lakes and lakehouses.&lt;/p&gt;
 &lt;p&gt;First developed by Uber and an Apache top-level project since 2020, Hudi is built on an open table format that supports both Parquet and Apache ORC as the base file format. The platform integrates with Spark, Flink and other data processing and query engines. It supports ACID transactions, multimodal indexing to boost query performance and historical data analysis through a time-travel feature.&lt;/p&gt;
 &lt;p&gt;Hudi also includes a data management framework that organizations can use to do the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Simplify incremental data processing and data pipeline development.&lt;/li&gt; 
  &lt;li&gt;Improve &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Data-quality-for-big-data-Why-its-a-must-and-how-to-improve-it"&gt;data quality in big data systems&lt;/a&gt;.&lt;/li&gt; 
  &lt;li&gt;Manage data set lifecycles.&lt;/li&gt; 
 &lt;/ul&gt;
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="10. Iceberg"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;10. Iceberg&lt;/h2&gt;
 &lt;p&gt;Another Apache technology, Iceberg is an open table format for managing large analytics data sets &lt;a href="https://www.techtarget.com/searchdatamanagement/opinion/Why-Apache-Iceberg-is-essential-for-modern-data-lakehouses"&gt;stored in data lakes and lakehouses&lt;/a&gt;. According to the project's website, Iceberg is typically used in applications where individual tables contain tens of petabytes of data. The tables can be read from a single cluster node, without requiring a distributed SQL engine to sort through metadata and find the files needed for queries.&lt;/p&gt;
 &lt;p&gt;To boost query performance, Iceberg tracks individual data files in tables rather than directories, using metadata files to maintain a snapshot log of changes to a table. It supports SQL commands to update, merge or delete data and enables multiple query engines to simultaneously read and write data in a single table. Created by Netflix for internal use, Iceberg became an Apache top-level project in 2020.&lt;/p&gt;
 &lt;p&gt;Other notable features include the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Schema evolution for modifying tables without rewriting or migrating data.&lt;/li&gt; 
  &lt;li&gt;Hidden partitioning that frees users from maintaining partitions and automatically updates table layouts as data or queries change.&lt;/li&gt; 
  &lt;li&gt;A time-travel capability, plus version rollback for resetting tables to a known good state.&lt;/li&gt; 
 &lt;/ul&gt;
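The snapshot-based time travel and rollback in the list above can be illustrated with a toy versioned table. This is a conceptual sketch only: Iceberg tracks immutable data files through metadata snapshots rather than copying rows, but the reader-facing behavior is the same.

```python
import copy

# Toy snapshot log: every commit records an immutable copy of the
# table, so earlier versions stay readable after later writes.
snapshots = []

def commit(table):
    snapshots.append(copy.deepcopy(table))

table = [{"id": 1, "qty": 5}]
commit(table)                        # snapshot 0

table[0]["qty"] = 7                  # update a row
table.append({"id": 2, "qty": 3})    # insert a row
commit(table)                        # snapshot 1

# Time travel: query the table as of the first commit.
print(snapshots[0])

# Version rollback: reset the live table to a known good state.
table = copy.deepcopy(snapshots[0])
```

Because old snapshots remain intact, audits can replay historical state and a bad write can be undone by pointing the table back at an earlier snapshot.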
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="11. Kafka"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;11. Kafka&lt;/h2&gt;
 &lt;p&gt;Apache Kafka is a distributed event streaming platform that supports data pipelines, data integration, streaming analytics and critical business applications. Created by LinkedIn and handed over to Apache in 2011, Kafka handles petabytes of data and trillions of event messages per day. It uses a publish-subscribe model to transmit messages and enables users to store event streams in distributed, fault-tolerant clusters for long-term use. Streams can be processed on the fly as they arrive or read back and processed later.&lt;/p&gt;
 &lt;p&gt;To boost scalability, Kafka decouples applications that produce and consume event data and partitions the data across multiple storage servers, which are called &lt;i&gt;brokers&lt;/i&gt;. It can be deployed on bare-metal hardware or in VMs and containers, both on-premises and in the cloud.&lt;/p&gt;
 &lt;p&gt;The following are some of Kafka's other key features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;A set of six core APIs for Java and the Scala programming language.&lt;/li&gt; 
  &lt;li&gt;Built-in stream processing capabilities for joining, aggregating, filtering and transforming data.&lt;/li&gt; 
  &lt;li&gt;Elastic scalability to up to 1,000 brokers per cluster.&lt;/li&gt; 
 &lt;/ul&gt;
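Kafka's publish-subscribe model with key-based partitioning can be sketched with an in-memory toy. This is a model of the concept, not Kafka's API: the broker here is a plain dictionary, and the key names are hypothetical.

```python
from collections import defaultdict

NUM_PARTITIONS = 3
# Toy broker: each partition is an append-only, ordered message log.
topic = defaultdict(list)

def publish(key, value):
    """Route a message to a partition chosen by its key."""
    partition = hash(key) % NUM_PARTITIONS
    topic[partition].append((key, value))
    return partition

# Messages sharing a key always land in the same partition,
# so consumers see them in the order they were published.
p = publish("user-42", "clicked")
publish("user-42", "purchased")

print([value for key, value in topic[p]])  # -> ['clicked', 'purchased']
```

Real Kafka adds replication across brokers, durable on-disk logs and consumer groups that divide partitions among readers, but the key-to-partition ordering guarantee shown here is the core of its design.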
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="12. Kylin"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;12. Kylin&lt;/h2&gt;
 &lt;p&gt;Apache Kylin is a distributed data warehouse and online analytical processing (&lt;a href="https://www.techtarget.com/searchdatamanagement/definition/OLAP"&gt;OLAP&lt;/a&gt;) platform designed to support large data sets and queries involving trillions of records. Kylin's storage layer is built on top of Delta Lake and Parquet. The platform includes a native compute engine added in 2024 that's based on Spark and Apache Gluten, a performance accelerator plugin for Spark.&lt;/p&gt;
 &lt;p&gt;Internal data tables that Kylin manages directly were added along with the native engine. Kylin also still supports tables imported from data sources such as Hive, Kafka and Iceberg, but the internal tables offer greater flexibility for querying data. It provides a SQL interface for querying data and connects to Excel and BI tools such as Tableau and Microsoft Power BI. Initially developed by eBay, Kylin became an Apache top-level project in 2015.&lt;/p&gt;
 &lt;p&gt;Kylin also offers the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Precalculation of multidimensional OLAP cubes to improve query performance.&lt;/li&gt; 
  &lt;li&gt;A data modeling and indexing recommendation engine.&lt;/li&gt; 
  &lt;li&gt;Combined analysis of streaming and batch data.&lt;/li&gt; 
 &lt;/ul&gt;
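The cube precalculation in the first bullet can be sketched in a few lines: precompute one aggregate table (a cuboid) for every subset of dimensions so that a query becomes a lookup instead of a scan. The rows and dimension names below are invented for illustration; real Kylin builds these structures with Spark at far larger scale.

```python
from itertools import combinations
from collections import defaultdict

# Toy fact rows: (region, product, units sold).
rows = [
    ("EU", "laptop", 10),
    ("EU", "phone", 5),
    ("US", "laptop", 7),
]
dimensions = ("region", "product")

# Precompute one aggregate table (cuboid) per dimension subset --
# the core idea behind OLAP-cube precalculation.
cuboids = {}
for r in range(len(dimensions) + 1):
    for dims in combinations(range(len(dimensions)), r):
        table = defaultdict(int)
        for row in rows:
            key = tuple(row[i] for i in dims)
            table[key] += row[2]
        cuboids[dims] = dict(table)

# "Total units per region" is now a dictionary lookup, not a scan.
print(cuboids[(0,)][("EU",)])  # 15
```

The trade-off is classic OLAP: extra storage and build time in exchange for near-constant query latency, which is why Kylin pairs the cubes with a recommendation engine to avoid materializing cuboids nobody queries.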
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="13. Pinot"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;13. Pinot&lt;/h2&gt;
 &lt;p&gt;Also an Apache project, Pinot is a real-time distributed OLAP data store that supports low-latency querying in analytics applications. According to its developers, Pinot handles petabytes of data containing trillions of records and concurrently processes hundreds of thousands of queries per second. To deliver the promised performance, Pinot has a fault-tolerant architecture with no single point of failure and supports horizontal scaling of clusters. Cluster scaling and other configuration changes can be made dynamically without affecting data availability or query performance.&lt;/p&gt;
 &lt;p&gt;Pinot uses a columnar storage format and offers various indexing techniques to filter, aggregate and group data. To simplify data storage and replication, the system assumes all stored data is immutable. However, it supports upserts to keep streaming data sets up to date, as well as background purges of sensitive data to comply with privacy laws. Created by LinkedIn for internal use, Pinot became an Apache top-level project in 2021.&lt;/p&gt;
 &lt;p&gt;The following features are also included:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Near-real-time data ingestion from streaming sources, plus batch ingestion from HDFS, Spark and cloud storage services.&lt;/li&gt; 
  &lt;li&gt;A SQL interface for interactive querying and a REST API for programming queries.&lt;/li&gt; 
  &lt;li&gt;Integration with ZooKeeper for distributed metadata storage and Apache Helix for cluster management.&lt;/li&gt; 
 &lt;/ul&gt;
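The combination of immutable storage and upserts described above can be modeled simply: ingested segments are never rewritten, and a primary-key index just points at the latest record for each key. This is a conceptual sketch of the pattern, not Pinot's actual internals; field names are invented.

```python
# Toy model of upserts over immutable segments: old rows are never
# modified, a primary-key index simply tracks the newest version.
segments = []   # each segment is an immutable batch of records
latest = {}     # primary key -> (segment index, row index)

def ingest(batch):
    """Append a new immutable segment and update the key index."""
    segments.append(batch)
    seg_idx = len(segments) - 1
    for row_idx, rec in enumerate(batch):
        latest[rec["id"]] = (seg_idx, row_idx)

def read(pk):
    """Resolve a primary key to its most recent record."""
    seg_idx, row_idx = latest[pk]
    return segments[seg_idx][row_idx]

ingest([{"id": 1, "status": "pending"}])
ingest([{"id": 1, "status": "shipped"}])   # upsert: old row untouched
print(read(1)["status"])  # shipped
```

Keeping segments immutable is what makes replication and recovery cheap; the upsert cost is confined to the small key-to-location map.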
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="14. Presto"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;14. Presto&lt;/h2&gt;
 &lt;p&gt;Presto is a SQL query engine optimized for low-latency querying of large data sets. It supports analytics applications across multiple petabytes of data in data lakes, data lakehouses and other repositories. To further boost performance and reliability, Presto's developers are converting its core execution engine from Java to a C++ version based on Velox, an open source acceleration library. An early version of Presto C++ is available, but it has a limited set of connectors and doesn't support some of Presto's built-in query functions.&lt;/p&gt;
 &lt;p&gt;Presto's development began at Facebook. When its creators left the company in 2018, the technology split into two branches: PrestoDB, which Facebook still led, and PrestoSQL, led by the original developers. In 2020, PrestoDB reverted to the Presto name, and PrestoSQL was renamed Trino. The Presto open source project is now overseen by the Presto Foundation, which is part of the Linux Foundation.&lt;/p&gt;
 &lt;p&gt;Presto also includes the following features:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Connectors to 36 data sources, including Delta Lake, Druid, Hive, Hudi, Iceberg, Pinot and various databases.&lt;/li&gt; 
  &lt;li&gt;The ability to combine data from multiple sources in a single query.&lt;/li&gt; 
  &lt;li&gt;A web-based UI and a CLI for querying, plus support for the Apache Superset data exploration tool.&lt;/li&gt; 
 &lt;/ul&gt;
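The second bullet, combining data from multiple sources in a single query, is Presto's signature federation feature. A loose, self-contained analogy is SQLite's ability to attach two separate databases and join across them in one SQL statement, shown below; the table and column names are invented for illustration, and real Presto federates far more heterogeneous catalogs (Hive, MySQL, Kafka and so on).

```python
import sqlite3

# Loose analogy for federated querying: one SQL statement that joins
# tables living in two separate databases, much as a Presto query can
# join tables from two different catalogs.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("CREATE TABLE sales.orders (customer_id INT, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INT, name TEXT)")
conn.execute("INSERT INTO sales.orders VALUES (1, 99.5)")
conn.execute("INSERT INTO crm.customers VALUES (1, 'Acme')")

# The join spans both "catalogs" in a single statement.
row = conn.execute(
    """SELECT c.name, o.amount
       FROM sales.orders o
       JOIN crm.customers c ON o.customer_id = c.id"""
).fetchone()
print(row)  # ('Acme', 99.5)
```

In Presto the same shape of query would reference two catalogs by prefix (for example, `hive.schema.table` joined to `mysql.schema.table`), with the engine pushing work down to each connector.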
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="15. Samza"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;15. Samza&lt;/h2&gt;
 &lt;p&gt;Apache Samza is a distributed stream processing system that enables users to build stateful applications for real-time processing of data from Kafka, HDFS and several other sources. It then writes the processed data back to some of the same systems. Use cases for Samza include event-based applications, real-time analytics and extract, transform and load (ETL) processes on streaming data.&lt;/p&gt;
 &lt;p&gt;The Samza website says it can handle "several terabytes" of state data, with low latency and high throughput for data analysis. The system also supports stateless stream processing. It runs on top of Hadoop YARN or in a standalone deployment mode; the latter option enables Samza to be a component of larger applications and lets users implement Kubernetes or another cluster manager instead of YARN. Originally developed by LinkedIn, Samza has been an Apache top-level project since 2015.&lt;/p&gt;
 &lt;p&gt;Other features include the following:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;A pair of high- and low-level APIs for different use cases, plus a declarative SQL interface.&lt;/li&gt; 
  &lt;li&gt;The ability to run as a lightweight embedded library in Java and Scala applications.&lt;/li&gt; 
  &lt;li&gt;Fault-tolerant features for migrating tasks in the event of system failures and rapidly recovering from them.&lt;/li&gt; 
 &lt;/ul&gt;
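The stateful processing that distinguishes Samza can be reduced to a small pattern: a task consumes keyed events, updates locally held state per key and emits enriched events downstream. Samza itself exposes this through its Java and Scala APIs with durable, partitioned state; the Python below is only a conceptual model with invented event fields.

```python
from collections import defaultdict

# Conceptual sketch of a stateful stream task: per-key state is kept
# locally (Samza backs this with durable, partitioned stores) and an
# enriched event is emitted for each input event.
state = defaultdict(int)

def process(event, output):
    """Count page views per user and emit the running total."""
    state[event["user"]] += 1
    output.append({"user": event["user"], "views": state[event["user"]]})

out = []
for ev in [{"user": "a"}, {"user": "b"}, {"user": "a"}]:
    process(ev, out)

print(out[-1])  # {'user': 'a', 'views': 2}
```

The point of the pattern is that state lives with the task rather than in a remote database, which is how Samza achieves low latency at "several terabytes" of state.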
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="16. Spark"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;16. Spark&lt;/h2&gt;
 &lt;p&gt;Apache Spark is a unified data processing and analytics engine used for data engineering in both batch and streaming applications, as well as for interactive querying, ML and exploratory data analysis. Spark often outperforms MapReduce on batch processing, making it the top choice for such tasks in many &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Building-a-big-data-architecture-Core-components-best-practices"&gt;big data environments&lt;/a&gt;. It's also widely used as a large-scale analytics platform.&lt;/p&gt;
 &lt;p&gt;Spark includes the following core modules and libraries to support its various use cases:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Spark SQL, for processing structured and unstructured data via SQL queries.&lt;/li&gt; 
  &lt;li&gt;Spark Structured Streaming, a module for building streaming applications and data pipelines.&lt;/li&gt; 
  &lt;li&gt;MLlib, a machine learning library that includes various algorithms and related utilities.&lt;/li&gt; 
  &lt;li&gt;Dataset and DataFrame APIs, which are used to organize distributed data sets for processing.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;Spark runs on clusters managed by Hadoop YARN, Kubernetes or a standalone clustering tool built into the platform. It handles data from various sources, including HDFS, flat files and both relational and NoSQL databases. In addition to SQL, Spark supports Python, Scala, Java and R for programming. A Spark Connect feature enables client applications to connect to remote servers, simplifying development and deployment. Spark was created at the University of California, Berkeley, in 2009 and became an Apache top-level project in 2014.&lt;/p&gt;
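Spark's programming model builds pipelines of transformations (such as flatMap and map) that a final action then executes. The pure-Python sketch below mirrors that shape for a word count; real Spark would distribute each stage across a cluster and shuffle data for the reduce step, so treat this as a conceptual model rather than PySpark code.

```python
from functools import reduce
from itertools import chain

# Conceptual model of a Spark pipeline: lazy transformations chained
# together, executed by a final reducing action. Spark distributes
# these stages across a cluster; this runs the same shape locally.
lines = ["to be or not", "to be"]
words = chain.from_iterable(line.split() for line in lines)  # flatMap
pairs = ((w, 1) for w in words)                              # map

def merge(counts, pair):                                     # reduceByKey
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

counts = reduce(merge, pairs, {})
print(counts["to"])  # 2
```

Because the generators above are lazy, nothing runs until `reduce` consumes them, loosely echoing how Spark defers work until an action triggers the job.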
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="17. Storm"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;17. Storm&lt;/h2&gt;
 &lt;p&gt;Storm, another Apache technology, is a distributed real-time computation system for processing unbounded data streams. Its use cases include real-time analytics, ML, continuous computation and ETL procedures on streaming data. The fault-tolerant system guarantees that data will be processed, with multiple guarantee levels -- such as at-least-once and exactly-once processing -- available to meet different application needs.&lt;/p&gt;
 &lt;p&gt;The Apache Storm website says it can integrate with any message queueing system or database to access streaming data. Storm also supports any programming language for application development, and the system's out-of-the-box cluster configurations are suitable for production use. ZooKeeper is integrated to coordinate Storm clusters.&lt;/p&gt;
 &lt;p&gt;Storm became an Apache top-level project in 2014 and also includes the following elements:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;A basic API and Trident, a higher-level interface for processing data in Storm.&lt;/li&gt; 
  &lt;li&gt;Inherent parallelism that supports high data throughput with low latency.&lt;/li&gt; 
  &lt;li&gt;An experimental Storm SQL feature that enables SQL queries to run against streaming data sets.&lt;/li&gt; 
 &lt;/ul&gt;
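A Storm application is a topology: spouts emit tuples and bolts transform them in stages. The sketch below models that dataflow shape with plain generators; real Storm runs many parallel instances of each spout and bolt across a cluster and handles acking between them, so this is only a conceptual illustration.

```python
# Conceptual model of a Storm topology: a spout emits tuples and
# bolts transform them in sequence. Real Storm parallelizes each
# component across a cluster; this models only the dataflow shape.
def sentence_spout():
    """Spout: the source of the unbounded stream (finite here)."""
    yield from ["storm processes streams", "streams of tuples"]

def split_bolt(sentences):
    """Bolt: split each sentence tuple into word tuples."""
    for s in sentences:
        yield from s.split()

def count_bolt(words):
    """Terminal bolt: aggregate word counts."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["streams"])  # 2
```

Wiring spouts to bolts this way is the same word-count topology Storm's own tutorials use as a first example, minus the parallelism and reliability machinery.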
&lt;/section&gt;     
&lt;section class="section main-article-chapter" data-menu-title="18. Trino"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;18. Trino&lt;/h2&gt;
 &lt;p&gt;As mentioned above, Trino branched off from the Presto query engine and was originally named PrestoSQL. Like Presto, it's a distributed SQL engine for use in big data analytics applications. According to the Trino website, it supports low-latency analytics in exabyte-scale data lakes and lakehouses, as well as large data warehouses.&lt;/p&gt;
 &lt;p&gt;Trino includes built-in connectors to 25 data sources, and seven external connectors are also available. It provides an interactive CLI for querying data, plus a plugin that lets users run queries in Grafana, an open source data visualization and dashboard design tool. In addition, Trino works with Tableau, Power BI and other BI and analytics tools, as well as Apache Superset and R.&lt;/p&gt;
 &lt;p&gt;Trino is overseen by the Trino Software Foundation and also supports the following capabilities:&lt;/p&gt;
 &lt;ul class="default-list"&gt; 
  &lt;li&gt;Both ad hoc interactive analytics and long-running batch queries.&lt;/li&gt; 
  &lt;li&gt;Queries that combine data from multiple sources through a federation feature.&lt;/li&gt; 
  &lt;li&gt;Deployment in Kubernetes clusters and Docker containers.&lt;/li&gt; 
 &lt;/ul&gt;
 &lt;p&gt;&lt;b&gt;Editor's note: &lt;/b&gt;&lt;i&gt;TechTarget editors updated this article in February 2026 for timeliness and to add new information.&lt;/i&gt;&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Mary K. Pratt is an award-winning freelance journalist with a focus on covering enterprise IT and cybersecurity management.&lt;/i&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Numerous tools are available to use in big data applications. Here are 18 popular open source big data technologies, with details on their key features and use cases.</description>
            <image>https://cdn.ttgtmedia.com/visuals/searchDataManagement/integration_technology/datamanagement_article_015.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/feature/15-big-data-tools-and-technologies-to-know-about</link>
            <pubDate>Thu, 26 Feb 2026 00:00:00 GMT</pubDate>
            <title>18 top big data tools and technologies to know about in 2026</title>
        </item>
        <item>
            <body>&lt;p&gt;When it comes to acting on data, timing is everything.&lt;/p&gt; 
&lt;p&gt;What organizations once analyzed yesterday, they now need to understand immediately. Real-time data streaming is becoming essential infrastructure for competitive AI applications, and the gap between companies that can use it and those that can't is widening.&lt;/p&gt; 
&lt;section class="section main-article-chapter" data-menu-title="Why real-time matters now"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Why real-time matters now&lt;/h2&gt;
 &lt;p&gt;The shift from batch processing to real-time streaming is more than a technical upgrade -- it's a fundamental change in how businesses operate. Traditional approaches to data analysis, where information is collected throughout the day and processed in scheduled batches, made sense when business moved more slowly. That world is disappearing.&lt;/p&gt;
 &lt;p&gt;Markets move in milliseconds. Customers expect instant personalization. Operational issues must be caught before they cascade into failures.&lt;/p&gt;
 &lt;p&gt;Consider fraud detection in financial services. Identifying a suspicious transaction in real time can prevent the crime, while discovering it hours later during a batch review usually means investigating after the fact. In manufacturing, streaming sensor data from equipment &lt;a href="https://www.techtarget.com/searchbusinessanalytics/feature/Real-time-edge-analytics-use-cases-for-business"&gt;enables teams to predict failures proactively&lt;/a&gt;, not just analyze why something broke. In retail, live analysis of browsing behavior and inventory levels enables dynamic pricing and personalization that batch processing cannot deliver.&lt;/p&gt;
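The fraud example above hinges on checking each transaction as it arrives rather than in a nightly batch. A minimal sketch of that streaming check follows: flag a transaction the moment it far exceeds the account's recent average. The window size and threshold are illustrative assumptions, not a production fraud rule.

```python
from collections import deque

# Minimal streaming check: compare each incoming transaction to the
# rolling average of recent ones. Window size and threshold factor
# are illustrative, not a real fraud model.
class StreamingFraudCheck:
    def __init__(self, window=5, factor=3.0):
        self.recent = deque(maxlen=window)
        self.factor = factor

    def check(self, amount):
        """Return True if this amount looks anomalous, then record it."""
        suspicious = (
            len(self.recent) == self.recent.maxlen
            and amount > self.factor * (sum(self.recent) / len(self.recent))
        )
        self.recent.append(amount)
        return suspicious

checker = StreamingFraudCheck()
history = [20, 25, 22, 18, 24]          # typical transactions
flags = [checker.check(a) for a in history + [400]]
print(flags[-1])  # True: 400 far exceeds the recent average
```

The decision is made inline, before the transaction completes -- the difference between preventing the crime and investigating it later.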
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="Implementing streaming with guardrails"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Implementing streaming with guardrails&lt;/h2&gt;
 &lt;p&gt;The challenge is that the data volume and velocity have exploded. Connected devices, digital transactions and user interactions generate continuous streams of information that require immediate analysis to create value. The tools and infrastructure exist, but implementing them effectively requires rethinking how organizations approach data strategy.&lt;/p&gt;
 &lt;p&gt;Real-time streaming introduces a layer of complexity that many enterprises underestimate. It's not just about speed. It's about building systems that handle continuous data flows, &lt;a href="https://www.techtarget.com/searchenterpriseai/feature/Optimize-AI-models-to-generate-more-bang-for-your-buck"&gt;integrate with AI models&lt;/a&gt; that make instant decisions and remain reliable when delays occur or failures ripple through the pipeline. The technical demands are significant, and the margin for error is small.&lt;/p&gt;
 &lt;p&gt;Data governance is also more complicated in real-time environments. All the concerns about data quality, privacy and compliance that exist in batch systems become more acute when data flows continuously. Organizations need to implement controls that ensure compliance and data integrity without increasing latency.&lt;/p&gt;
&lt;/section&gt;    
&lt;section class="section main-article-chapter" data-menu-title="Start where seconds make a difference"&gt;
 &lt;h2 class="section-title"&gt;&lt;i class="icon" data-icon="1"&gt;&lt;/i&gt;Start where seconds make a difference&lt;/h2&gt;
 &lt;p&gt;Despite these challenges, the competitive advantages are compelling. Enterprises that successfully implement real-time AI capabilities can respond to market changes faster than competitors, deliver superior customer experiences and improve operations to create measurable business value. The question isn't whether to embrace this shift, but how quickly you can do so effectively.&lt;/p&gt;
  &lt;p&gt;Start by identifying where instantaneous analysis delivers the most value. Not every use case requires real-time data. Some assessments are &lt;a href="https://www.techtarget.com/searchdatamanagement/feature/Building-a-big-data-architecture-Core-components-best-practices"&gt;perfectly suited to batch processing&lt;/a&gt;. The goal is to focus on applications where immediate insights drive timely action: fraud prevention, dynamic optimization, real-time personalization or operational monitoring where minutes matter.&lt;/p&gt;
 &lt;p&gt;Infrastructure decisions must align with business objectives. Stakeholders should evaluate streaming platforms based on throughput requirements, latency tolerances, integration with existing systems and operational complexity. Cloud-based products offer ease of deployment but may introduce vendor lock-in. Open-source options provide flexibility but require more internal expertise.&lt;/p&gt;
 &lt;p&gt;Integration with AI systems is critical. Models should be optimized for low-latency inference. &lt;a href="https://www.techtarget.com/searchenterpriseai/feature/How-to-build-a-machine-learning-model-in-7-steps"&gt;Feature engineering pipelines&lt;/a&gt; must support both batch and streaming data. The entire system should be monitored to catch issues before they affect business outcomes.&lt;/p&gt;
 &lt;p&gt;Most importantly, organizations need to build the cultural capability to work with real-time data. This requires cross-functional teams that understand both the business context and the technical requirements, workflows that enable rapid experimentation, and a willingness to iterate and improve as operational needs evolve.&lt;/p&gt;
 &lt;p&gt;Real-time data streaming isn't a future capability -- it's a basic &lt;a target="_blank" href="https://hbr.org/2026/02/why-your-digital-investments-arent-creating-value" rel="noopener"&gt;expectation&lt;/a&gt;. Organizations that recognize this and invest accordingly will define the competitive landscape in their industries. Those that don't will find themselves perpetually reacting to competitors who can see and act on opportunities they're still processing.&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Stephen Catanzano is a senior analyst at Omdia where he covers data management and analytics.&lt;/i&gt;&lt;/p&gt;
 &lt;p&gt;&lt;i&gt;Omdia is a division of&amp;nbsp;Informa TechTarget.&amp;nbsp;Its analysts have business relationships with technology vendors.&lt;/i&gt;&lt;/p&gt;
&lt;/section&gt;</body>
            <description>Don't let batch processing lead to missed opportunities. Build AI systems for continuous data flows that deliver instant decisions, change outcomes and justify the cost.</description>
            <image>https://cdn.ttgtmedia.com/visuals/digdeeper/2.jpg</image>
            <link>https://www.techtarget.com/searchdatamanagement/opinion/Real-time-data-streaming-for-AI-invest-where-it-matters</link>
            <pubDate>Wed, 25 Feb 2026 13:47:00 GMT</pubDate>
            <title>Real-time data streaming for AI: invest where it matters</title>
        </item>
        <title>Search Data Management Resources and Information from TechTarget</title>
        <ttl>60</ttl>
        <webMaster>webmaster@techtarget.com</webMaster>
    </channel>
</rss>
