<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Datuum on Medium]]></title>
        <description><![CDATA[Stories by Datuum on Medium]]></description>
        <link>https://medium.com/@datuum?source=rss-fd40918e71ce------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*vuQJGpx6la2HsFazqezGqA.jpeg</url>
            <title>Stories by Datuum on Medium</title>
            <link>https://medium.com/@datuum?source=rss-fd40918e71ce------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 13:10:14 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@datuum/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Data Integration Software: Build vs. Buy Dilemma]]></title>
            <link>https://medium.com/@datuum/data-integration-software-build-vs-buy-dilemma-8da21d02ac46?source=rss-fd40918e71ce------2</link>
            <guid isPermaLink="false">https://medium.com/p/8da21d02ac46</guid>
            <category><![CDATA[data-management]]></category>
            <category><![CDATA[no-code]]></category>
            <category><![CDATA[data-integration]]></category>
            <category><![CDATA[build-vs-buy]]></category>
            <category><![CDATA[data-integration-tools]]></category>
            <dc:creator><![CDATA[Datuum]]></dc:creator>
            <pubDate>Fri, 23 Jun 2023 11:07:54 GMT</pubDate>
            <atom:updated>2023-07-10T12:22:23.526Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3itFoQqM6hWQJOI1.png" /></figure><p>When it comes to solving data integration challenges, product owners often encounter a critical dilemma: should they purchase a ready-made tool or develop an in-house solution? In this article, we delve into the key factors worth considering when making this decision.</p><p>Managing product development means making countless decisions on a daily basis, and as a manager, you are equipped with a range of frameworks to help you in this process. Our goal here is to share our thoughts on one of the most critical decisions: whether to buy data integration tools or build them in-house.</p><p>Data integration <a href="https://a16z.com/2020/10/15/emerging-architectures-for-modern-data-infrastructure/">encompasses</a> a wide range of processes. Whether it’s <a href="https://datuum.ai/media/how-to-hack-the-way-to-efficient-data-extraction/">data extraction</a>, transfer, cleansing, transformation, or mapping, each of these processes may present you with the build vs. buy dilemma. We’ll shed light on the key factors worth considering and introduce several frameworks that we’ve found to be highly effective in these specific cases.</p><p>Whether you are in the midst of building a new application or already have one serving users in production, it is essential to actively assess your data needs and data management strategy. If your application is already in use, evaluating user satisfaction and devising plans to enhance its capabilities should be a priority.</p><p>Keep in mind that there is no definitive right or wrong answer to this decision. What works for one company may not work for another, as your choices will depend on your product roadmap and current business priorities.</p><p>Let’s explore the factors to weigh when you reach the build vs. buy decision point.</p>
<h3>Software Build vs. Buy Considerations</h3><h4>Needs and Requirements</h4><ul><li>Assess the volume, variety, and velocity of data that needs to be onboarded.</li><li>Determine the complexity of the integration process, along with any unique or specialized requirements.</li><li>Identify the level of customization and control needed over the data onboarding process.</li></ul><h4>Expertise and Resources</h4><ul><li>Evaluate the availability of skilled resources within the organization to develop, deploy, and maintain an in-house solution.</li><li>Consider the expertise required in terms of data integration, data mapping, and system integration.</li><li>Determine if the organization has the necessary budget, time, and resources to allocate to building and maintaining an in-house solution.</li></ul><h4>Timeframe and Deadlines</h4><ul><li>Assess the urgency of data onboarding needs and evaluate if building an in-house solution can meet desired timelines.</li><li>Consider if purchasing a pre-built tool can provide a faster implementation, helping meet time-to-market requirements.</li></ul><h4>Scalability and Flexibility</h4><ul><li>Evaluate the scalability of both options and determine if the chosen solution can handle current data onboarding needs and future growth.</li><li>Consider the solution’s flexibility to accommodate changes in data sources, formats, and integration requirements over time.</li></ul><h4>Total Cost of Ownership (TCO)</h4><ul><li>Calculate the <a href="https://datuum.ai/media/cost-effective-data-engineering-is-there-a-way-to-optimize-it/">long-term costs</a> associated with building and maintaining an in-house solution, including infrastructure, staffing, training, and ongoing enhancements.</li><li>Compare the cost of purchasing a pre-built tool, including licensing, support, and potential customization or integration expenses.</li></ul><h4>Vendor Reliability</h4><ul><li>Research the reputation and track record of vendors offering pre-built data onboarding tools.</li><li>Consider their experience, qualifications, and the expertise of their team members in the data onboarding domain.</li><li>Evaluate customer reviews and testimonials to gain insights into the vendor’s reliability, support, and responsiveness. Prioritize vendors who demonstrate transparency, effective communication, and a willingness to understand and address specific needs and challenges.</li><li>Verify that the vendor offers a user-friendly, no-code interface that empowers employees across roles to use the tool effortlessly.</li></ul><h4>Integration Capabilities</h4><ul><li>Assess how well the pre-built tool integrates with existing infrastructure, databases, and third-party applications.</li><li>Consider the ease of integration and the effort required to connect the tool with current systems.</li></ul><h4>Customization and Future Requirements</h4><ul><li>Determine if the pre-built tool offers the necessary customization options to meet specific data onboarding requirements.</li><li>Evaluate the tool’s ability to adapt to future changes in the data ecosystem and integration needs.</li></ul><p>By thoroughly evaluating these key software build vs. buy considerations, organizations can make informed decisions regarding whether to build an in-house solution, purchase a pre-built tool, or adopt a hybrid approach that combines elements of both options.</p><h3>Making the Right Decision</h3><p>When faced with the build vs. buy dilemma, it’s crucial to make the right choice. Luckily, three decision-making frameworks can be extremely helpful in evaluating the most important factors: costs, benefits, strategic considerations, and the overall suitability of each option.</p><p>Let’s take a closer look at these frameworks:</p><h4>Cost-Benefit Analysis</h4><p>This framework is essential for comparing the costs associated with building an in-house tool versus purchasing a pre-built one. It allows you to carefully weigh the advantages and disadvantages of each option in terms of financial investment, time commitment, resource allocation, and expected returns.</p><h4>SWOT Analysis</h4><p>The SWOT (Strengths, Weaknesses, Opportunities, Threats) analysis helps you identify and evaluate both internal and external factors that can influence your decision. By assessing the strengths and weaknesses of each option and considering potential opportunities and threats, you gain a comprehensive understanding of the advantages and challenges associated with both building and buying a data integration tool.</p><h4>Software Build vs. Buy Matrix</h4><p>A decision matrix is a powerful tool that enables you to evaluate and compare multiple criteria that are important to your organization. By assigning weights to each criterion and scoring each option against them, you can quantitatively assess the performance and suitability of building in-house versus purchasing a pre-built tool. This framework provides a structured and systematic approach to decision-making, allowing you to make an informed choice based on the specific priorities and requirements of your organization.</p>
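<p>To make the matrix concrete, here is a minimal sketch in Python. The criteria, weights, and scores are hypothetical placeholders rather than recommendations; substitute the factors and priorities that matter to your organization.</p><pre>
# A minimal build vs. buy decision matrix (all numbers are hypothetical).
criteria = {  # criterion: weight (weights sum to 1.0)
    "total cost of ownership": 0.30,
    "time to market": 0.25,
    "customization": 0.20,
    "scalability": 0.15,
    "vendor reliability": 0.10,
}
scores = {  # each option scored 1 (poor) to 5 (excellent) per criterion
    "build": {"total cost of ownership": 2, "time to market": 2,
              "customization": 5, "scalability": 4, "vendor reliability": 5},
    "buy":   {"total cost of ownership": 4, "time to market": 5,
              "customization": 3, "scalability": 4, "vendor reliability": 4},
}
for option, s in scores.items():
    total = sum(weight * s[name] for name, weight in criteria.items())
    print(f"{option}: {total:.2f}")  # build: 3.20, buy: 4.05
</pre><p>With these invented numbers the matrix favors buying, but the value is the method: making weights explicit turns a gut call into a decision your team can inspect and debate.</p>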
<p>By utilizing these decision-making frameworks, you’ll be well-equipped to navigate the build vs. buy dilemma. They provide a solid foundation for evaluating the key aspects of each option and ultimately making a decision that aligns with your organization’s needs.</p><p><em>Originally published at </em><a href="https://datuum.ai/media/data-integration-software-build-vs-buy-dilemma/"><em>https://datuum.ai</em></a><em> on June 23, 2023.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8da21d02ac46" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Data Supercharges Construction Tech]]></title>
            <link>https://medium.com/@datuum/data-supercharges-construction-tech-61985d805434?source=rss-fd40918e71ce------2</link>
            <guid isPermaLink="false">https://medium.com/p/61985d805434</guid>
            <category><![CDATA[contech]]></category>
            <category><![CDATA[data-integration]]></category>
            <category><![CDATA[data-onboarding]]></category>
            <category><![CDATA[data-management]]></category>
            <category><![CDATA[no-code]]></category>
            <dc:creator><![CDATA[Datuum]]></dc:creator>
            <pubDate>Mon, 19 Jun 2023 12:28:09 GMT</pubDate>
            <atom:updated>2023-06-21T17:46:35.812Z</atom:updated>
            <content:encoded><![CDATA[<p>Despite the constantly growing investments in ConTech, challenges like industry fragmentation, slow technology adoption, and scaling limitations hinder progress. McKinsey’s survey and analysis identify opportunities for ConTech firms while highlighting constraints such as point solutions and delayed payment terms.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UiiIz86pmyXu8JzY2wShDQ.png" /></figure><p>The construction technology (ConTech) industry has experienced a significant surge in investments, with a remarkable 85% increase between 2020 and 2022, reaching $50 billion. This growth is evidenced by a 30% rise in the number of deals, totaling 1,229, according to the report from McKinsey &amp; Company. Encouragingly, industry experts anticipate that this trend will continue, with 77% of survey respondents expecting similar or increased investment levels in 2023.</p><p>While these metrics highlight the immense potential for digitizing the global $12 trillion construction industry, several hurdles impede further growth. Challenges such as industry fragmentation, slow technology adoption, and limitations in scaling products that cater to multiple customers hinder progress in the sector. In this article, we delve into McKinsey’s survey findings, which shed light on specific opportunities for ConTech firms, and explore the constraints that hamper tech innovation, including the prevalence of point solutions and delayed payment terms.</p><p>McKinsey’s survey of over 100 tech company founders, investors, and large software firms, combined with an analysis of 3,000 tech companies in the architecture, engineering, and construction (AEC) space, identifies several areas of opportunity for ConTech firms.</p><h3>Fragmentation of Industry Players: A Data Integration Challenge</h3><p>One of the significant hurdles in ConTech is the fragmentation of industry players, impeding seamless collaboration and effective communication. To address this, ConTech firms might leverage <a href="https://datuum.ai/media/optimizing-data-integration-strategies-insights-for-product-development-leaders/">data integration</a> tools like Datuum. By integrating data from multiple sources, Datuum empowers stakeholders across the construction ecosystem to access real-time insights, break down silos, and enhance collaboration. Data integration enables decision-makers to coordinate efforts, optimize resource allocation, and drive operational efficiency, <a href="https://datuum.ai/media/understanding-healthy-data-usage/">promoting a more connected and collaborative industry.</a></p><h3>Slow Technology Adoption: Showcasing Value through Data</h3><p>Slow technology adoption within the construction industry hampers progress and innovation. ConTech companies can overcome this challenge by utilizing data to showcase the tangible value of their solutions. Through data-driven demonstrations and compelling case studies, these firms might provide concrete evidence of how technology improves productivity, efficiency, and project outcomes.
By leveraging data analytics, ConTech companies might highlight the <a href="https://datuum.ai/media/cost-effective-data-engineering-is-there-a-way-to-optimize-it/">positive impact</a> of their solutions on cost savings, time management, risk mitigation, and quality control, ultimately inspiring traditional construction companies to embrace technology and accelerate adoption.</p><h3>Scaling Products for Multiple Customers: Personalization Powered by Data</h3><p>The complexity of scaling products to meet the diverse needs of multiple customers presents a significant challenge in the ConTech space. Data plays a vital role in tailoring solutions to cater to various stakeholders. By harnessing advanced analytics and machine learning algorithms, ConTech companies might derive valuable insights from integrated data sources. These insights enable the customization of products to address the unique requirements of owners, contractors, subcontractors, architects, and other industry participants. Data-driven personalization ensures that ConTech solutions closely align with customer needs, driving adoption and enhancing market penetration.</p><h3>Payment Terms and Late Payments: Data-Driven Financial Optimization</h3><p>Late payments and unfavorable payment terms are prevalent in the construction industry, impacting the cash flow of ConTech vendors. Data can be instrumental in addressing this challenge by <a href="https://datuum.ai/media/how-to-hack-the-way-to-efficient-data-extraction/">optimizing</a> financial processes. ConTech companies might utilize data analytics to analyze payment trends, identify bottlenecks, and propose alternative payment structures. By leveraging data-driven insights, such as cash flow projections and payment performance analytics, ConTech firms might collaborate with industry stakeholders to establish transparent payment processes, reducing late payments and fostering a healthier financial ecosystem.</p><h3>Fueling Success with Data</h3><p>The ConTech sector holds immense potential for growth and transformation in the construction industry. Embracing a data-driven approach allows ConTech companies to overcome challenges and seize opportunities. By harnessing data integration tools, showcasing the value of technology through data, customizing solutions based on data insights, and optimizing financial processes with data analytics, ConTech companies can pave the way for accelerated growth.</p><p>Read the full report <a href="https://www.mckinsey.com/industries/private-equity-and-principal-investors/our-insights/from-start-up-to-scale-up-accelerating-growth-in-construction-technology#/">here</a>.</p><p><em>Originally published at </em><a href="https://datuum.ai/media/data-supercharges-construction-tech/"><em>https://datuum.ai</em></a><em> on June 19, 2023.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=61985d805434" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Interoperability Simplified: How to Map Your Data to FHIR at a Fraction of the Cost]]></title>
            <link>https://medium.com/@datuum/interoperability-simplified-how-to-map-your-data-to-fhir-at-a-fraction-of-the-cost-2074c894eaa7?source=rss-fd40918e71ce------2</link>
            <guid isPermaLink="false">https://medium.com/p/2074c894eaa7</guid>
            <category><![CDATA[fhir]]></category>
            <category><![CDATA[emr]]></category>
            <category><![CDATA[ehr]]></category>
            <category><![CDATA[data-onboarding]]></category>
            <category><![CDATA[interoperability]]></category>
            <dc:creator><![CDATA[Datuum]]></dc:creator>
            <pubDate>Mon, 05 Jun 2023 13:49:08 GMT</pubDate>
            <atom:updated>2023-06-05T13:49:08.683Z</atom:updated>
            <content:encoded><![CDATA[<p><strong>Many healthcare providers still rely on outdated flat files for data exchange, requiring IT and data integration teams to construct complex pipelines and convert EHR/EMR data into FHIR resources. This process is time-consuming, expensive, and difficult to maintain. Discover a simpler solution with Datuum.ai, automating the mapping process for smoother interoperability and cost savings.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eIkXBM1Xh_9y1zU2I6NnuA.png" /></figure><p>In today’s interconnected world, the importance of seamless data exchange and interoperability cannot be overstated. Enter FHIR (Fast Healthcare Interoperability Resources), the standardization solution that tackles the exorbitant costs of data integration head-on.</p><p>By adopting FHIR, healthcare professionals gain a remarkable power: a comprehensive view of patients’ medical histories across various clinics and hospitals, empowering them to make well-informed decisions. Moreover, FHIR simplifies the integration of medical devices from different manufacturers, effortlessly facilitating data exchange. This not only saves substantial time and money but also enhances the overall efficiency of healthcare operations.</p><p>However, let us not disregard the challenges that come hand in hand with adopting FHIR. While many organizations aspire to embrace this technology, mapping existing data exchange interfaces to FHIR can present a significant obstacle akin to fitting a square peg into a round hole.</p><h3>Bridging the Gap in Converting EHR Extracts to FHIR</h3><p>In numerous cases, healthcare providers still rely on outdated flat files for batch data exchange. While some interfaces support real-time integration with FHIR messages, it falls upon the IT/Data Integration teams to construct the data pipelines and convert EHR data extracts into FHIR resources. This is a tedious, time-consuming process that is challenging to maintain and can incur substantial costs.</p><p>Let us recount an inspiring tale featuring one of our customers, a HealthTech provider. They embraced FHIR as their internal canonical data model. However, they encountered a roadblock when they sought to onboard new clients (hospitals) who persisted in sending the traditional EHR flat files. Soon enough, they realized that implementing and supporting these integrations consumed a disproportionate amount of time and resources. Driven by the desire to find a better solution, they embarked on a quest.</p><h3>Picking the Right Solution</h3><p>Their quest entailed finding a solution capable of saving time and money while seamlessly integrating into their existing infrastructure without causing disruption. Moreover, it had to possess the flexibility to accommodate multiple data ontologies.</p><p>Eventually, they discovered Datuum.ai, the ultimate ally for this undertaking. Datuum’s pre-trained AI engine analyzes all inbound data, creating a mapping between the source data and the FHIR ontology, subsequently generating FHIR resources.</p><p>Furthermore, the entire process unfolds through a user-friendly no-code interface, obviating the need for developers to expend valuable time on data mapping and integration tasks. Positioned at the inception of the data journey, Datuum seamlessly integrates into the pre-existing data infrastructure. A notable advantage of Datuum is its transparency, empowering data analysts to comprehend the transformation process and address any issues that may arise.</p>
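<p>To make the conversion task concrete, here is a minimal Python sketch of the kind of per-field mapping that teams otherwise hand-code. The flat-file columns are invented for illustration, and the Patient resource is deliberately simplified; a real mapping covers many more fields and FHIR resource types.</p><pre>
import json

# One row of a hypothetical EHR flat-file extract (invented column names).
row = {"mrn": "12345", "first_name": "Jane", "last_name": "Doe", "dob": "1980-04-02"}

def row_to_patient(row):
    # A deliberately simplified FHIR Patient resource.
    return {
        "resourceType": "Patient",
        "identifier": [{"system": "urn:example:mrn", "value": row["mrn"]}],
        "name": [{"family": row["last_name"], "given": [row["first_name"]]}],
        "birthDate": row["dob"],  # FHIR dates use YYYY-MM-DD
    }

print(json.dumps(row_to_patient(row), indent=2))
</pre><p>Multiply this by dozens of client feeds, each with its own column names, formats, and quirks, and it becomes clear why automating the mapping step pays off.</p>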
<h3>Unlocking the Potential of Interoperability</h3><p>By harnessing the power of Datuum’s no-code tool, the company overcame the challenges associated with adopting FHIR. This achievement not only established smooth data exchange capabilities but also expedited the onboarding process for their customers. Consequently, the company experienced remarkable time and cost savings while underscoring its commitment to customer satisfaction.</p><p><a href="https://datuum.ai/wp-content/uploads/2023/06/emr-ehr-to-fhir-case-study.pdf">Case Study to Download</a></p><p>Learn more: <a href="https://welcome.datuum.ai/?utm_source=medium&amp;utm_medium=article&amp;utm_campaign=banner">Welcome Datuum</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2074c894eaa7" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Optimizing Data Integration Strategies: Insights for Product Development Leaders]]></title>
            <link>https://medium.com/@datuum/optimizing-data-integration-strategies-insights-for-product-development-leaders-a82b2e248835?source=rss-fd40918e71ce------2</link>
            <guid isPermaLink="false">https://medium.com/p/a82b2e248835</guid>
            <category><![CDATA[data-management]]></category>
            <category><![CDATA[costs-matter]]></category>
            <category><![CDATA[data-integration]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Datuum]]></dc:creator>
            <pubDate>Wed, 24 May 2023 17:44:21 GMT</pubDate>
            <atom:updated>2023-05-26T15:07:06.205Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kLrFU7UztroRO7nk.png" /></figure><p>Data integration presents challenges in diverse data sources and mixed tool sets. In this article, we address these hurdles and provide strategies to overcome them, enabling organizations to leverage data effectively in a data-driven world.</p><p>In the realm of data-driven organizations, the task of integrating data from multiple sources is no easy feat. Data integration plays a pivotal role in gaining a comprehensive and accurate view of business operations, customer behavior, and more. However, the process is riddled with challenges that demand careful consideration. As highlighted in TechTarget’s recent publication “8 Data Integration Challenges and How to Overcome Them,” two key hurdles grabbed our attention: handling diverse data sources and managing mixed tool sets and architecture.</p><h3>Handling Diverse Data Sources: Unifying the Unruly</h3><p>Gone are the days when data integration was a matter of dealing with structured data from a few conventional sources. Instead, the modern landscape is a vibrant tapestry of data diversity, including streaming data, social media feeds, public datasets, and ecosystem data.</p><p>To navigate this intricate landscape, a careful approach is needed. By identifying the most demanding elements within diverse data assets, we can tackle them separately, employing robust tools tailored to handle specific data types. For example, some innovative companies have achieved automated transitions from EHR/EMR records to FHIR messages within a matter of hours using our no-code data integration tool, Datuum. Another popular strategy gaining momentum is the utilization of a well-designed data lake combined with Apache open-source technologies.</p><h3>Managing Mixed Data Tool Sets and Architecture: Harmonizing Data Integration Environments</h3><p>Integrating data often entails working with a mix of tools and platforms, which can lead to a complex integration environment. It’s like managing a symphony orchestra with each musician playing their own tune: a challenging balancing act.</p><p>To achieve harmony within this intricate ensemble, attention must be turned to the cloud. Cloud platforms offer integration tools specifically designed to address diverse challenges, allowing for a more streamlined integration process. However, it’s essential to remember that success lies not only in the tools but also in the techniques employed. Savvy data engineers and integration developers understand that flexibility and customization are key. They achieve optimal integration results by employing a strategic blend of coding and a select set of tools.</p><p>Additionally, documenting integration processes, cataloging integrated data, and ensuring data integrity and availability are crucial aspects of managing mixed tool sets and architecture. These measures provide clarity, ease of use, and robustness, enabling smooth navigation through the integration landscape.</p><h3>Building a Data-Driven Future</h3><p>Data integration is a challenge that product development teams can overcome. By embracing diverse data sources and managing mixed tool sets, they can create software applications that thrive in a data-driven world.
Empower your teams, foster collaboration, and build an integration architecture that leads to success.</p><p>For more insights on data integration challenges, check out the publication “<a href="https://www.techtarget.com/searchdatamanagement/feature/5-data-integration-challenges-and-how-to-overcome-them">8 Data Integration Challenges and How to Overcome Them</a>,” published by TechTarget.</p><p><em>Originally published at </em><a href="https://datuum.ai/media/optimizing-data-integration-strategies-insights-for-product-development-leaders/"><em>https://datuum.ai</em></a><em> on May 24, 2023.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a82b2e248835" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to hack the way to efficient data extraction]]></title>
            <link>https://medium.com/@datuum/how-to-hack-the-way-to-efficient-data-extraction-8e875b41ec53?source=rss-fd40918e71ce------2</link>
            <guid isPermaLink="false">https://medium.com/p/8e875b41ec53</guid>
            <category><![CDATA[data-extraction]]></category>
            <category><![CDATA[data-integration]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[data-management]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Datuum]]></dc:creator>
            <pubDate>Wed, 17 May 2023 13:28:46 GMT</pubDate>
            <atom:updated>2023-05-17T18:12:23.502Z</atom:updated>
            <content:encoded><![CDATA[<p>Extracting data from an existing database is one of the most challenging and tedious jobs that typically fall on the plate of the Data Engineering team. And when it comes to dealing with true legacy systems, well, let’s just say it adds an extra layer of intrigue to the mix. Unraveling the secrets left behind by generations of engineers is like solving a complex puzzle. Inconsistent data modeling, along with potential challenges like insufficient documentation, limited knowledge within the existing team, and unclear semantics, are just a glimpse of what lies ahead. Yet, amidst these daunting obstacles, the team is expected to deliver results with lightning speed.</p><figure><img alt="data extraction Datuum.ai" src="https://cdn-images-1.medium.com/max/1024/1*1UwAqdikzfGzt8G2keL5hA.png" /></figure><p>In this article, we aim to impart insights gained through hard-earned experience, offering a lifeline that may save you precious time.</p><p>So, let us explore these pivotal non-technical starting points:</p><ul><li><strong>Identify a knowledge holder:</strong> Even if a system/database is very old, there is usually at least one person who has some understanding of how it works and where data is coming from. This individual can be an invaluable resource to you, but keep in mind that they may be busy, have knowledge gaps, or may not like your project at all. Nevertheless, building a positive relationship with this person can be critical to your success. They can help you save significant time and effort. Do not criticize the current state of things, even if there are obvious flaws and improvement opportunities. Focus on extracting the data as efficiently as possible with minimum dependencies on the source system.</li><li><strong>Manage project stakeholders’ expectations:</strong> It’s important to educate stakeholders about the nature of data extraction projects. Many people mistakenly assume it’s a one-time task, often due to a lack of experience in this area. However, unless you’re working with a very small database, it’s likely to require multiple iterations to ensure that the data is complete and accurate. Therefore, it’s essential to communicate realistic timeframes and the need for iterative processes to stakeholders. Estimating that effort is very challenging, so try to find the most critical piece of data to focus on initially.</li><li><strong>Establish a shared understanding of semantics with both the users of the source system and the end users of the data:</strong> In any domain, people tend to use different terms for the same things. Therefore, involving those who use the source system can help you explain the data behavior and identify any data anomalies that may arise. For instance, when we had to extract data from a widely-used EHR system, it was a surprise to find patient information stored in a table named ‘users.’ By collaborating with the end-users and the data source users, we were able to address such semantic inconsistencies and prevent any confusion down the line.</li><li><strong>Establish early success criteria:</strong> There are always some standardized reports generated from the legacy system. Typically you can use those reports as a set of reconciliation metrics. Don’t be surprised if you discover inconsistencies and/or errors during your reconciliation process.
But it’s always helpful to establish a baseline to evaluate the fullness and quality of the data.</li></ul><p>Ok, and what about the technical side of the project? There are many approaches and tools you can use to do the job, from plain SQL all the way to no-code tools. It’s up to you and your arsenal of skills, but here are some high-level suggestions that typically help:</p><ul><li><strong>Change Data Capture (CDC):</strong> It’s essential to understand how the data changes over time, including retrospective changes, in order to establish reliable Change Data Capture for the tables you’re interested in. Many legacy databases don’t have an out-of-the-box CDC feature, so you may have to deal with a full data refresh. An alternative would be to use a data replication tool with CDC capabilities integrated. Tools like Airbyte and Fivetran may help save time there.</li><li><strong>Analyze the workload to identify the most and least used objects/tables:</strong> Table names alone can be misleading, so analyzing the database workload can help you better understand which tables are critical for your extraction process and identify the critical objects in the data model.</li><li><strong>Extract and store data in its raw form:</strong> It’s always a good practice to extract the raw data and store it as is. Don’t perform any transformations in that step. Use your Data Lake or Staging area to store it until the iteration/project is complete. This minimizes the performance impact on the source database. This also enables you to compare the extracted data from the previous run to the current version, helping to identify changes you may have missed during implementation. Another advantage is the ability to rerun transformations without impacting the source database.</li><li><strong>Structure transformations in multiple stages:</strong> When it comes to transforming the extracted data, it’s best to break it down into multiple steps rather than cramming everything into a single, complex set of instructions (see the sketch after this list). This helps with the maintainability of the data transformation code and makes it easier for you and your team to understand what’s going on later on. Overcomplicated code can be a hindrance, especially when revisiting it after several iterations. Good documentation and structured code can save you and your team time.</li><li><strong>Consider using a no-code tool:</strong> Maintaining both the code and the documentation can be a lot of work. Having a no-code tool that can handle data lineage, transformations, and documentation can be helpful. When selecting a tool, always consider who will maintain the pipeline, how technical that person will be, and how to make that process as efficient as possible. Some tools require more technical skills than others.</li><li><strong>Create a Semantic Model:</strong> A semantic model can be extremely useful when integrating data from multiple sources into a single data model. Normalizing data and using transformations that can be inspected and explained are highly beneficial. This approach ensures that your data is well-structured and easy to understand, making it simpler for data users to access and analyze.</li></ul>
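<p>To illustrate the raw-then-staged pattern, here is a minimal sketch in Python, with SQLite standing in for a staging area; the table and column names are hypothetical.</p><pre>
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for your staging database

# Stage 0: land the extract exactly as it arrives (no transformations here).
con.execute("CREATE TABLE raw_users (id, name, created_at)")
con.execute("INSERT INTO raw_users VALUES ('42', ' Jane Doe ', '2021-03-05 10:00:00')")

# Stage 1: renaming, casting, and light cleanup only.
con.execute("""
    CREATE TABLE stg_patients AS
    SELECT CAST(id AS INTEGER) AS patient_id,
           TRIM(name)          AS full_name,
           DATE(created_at)    AS registered_on
    FROM raw_users
""")

# Stage 2: business rules, applied on top of stage 1; each stage can be
# rerun without touching the source system again.
con.execute("""
    CREATE TABLE dim_patient AS
    SELECT patient_id, full_name, registered_on
    FROM stg_patients
    WHERE registered_on IS NOT NULL
""")
print(con.execute("SELECT * FROM dim_patient").fetchall())
</pre><p>Each stage stays small enough to read at a glance, and the raw layer preserves the source data so transformations can be fixed and replayed without another extraction run.</p>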
<p>Data extraction can be challenging, but following these key takeaways can help you avoid issues, especially if you’re doing it for the first time. We intentionally skipped everything related to security and compliance and assumed you already have access to the database.</p><p>It’s essential to build strong relationships with knowledge holders, manage project stakeholders’ expectations, establish a common definition of semantics with data users, set clear success criteria, and structure transformations in multiple stages. No-code automation tools like <a href="https://datuum.ai/">Datuum</a> and data replication tools like Airbyte or Fivetran can simplify the process.</p><p>Documenting transformations, maintaining data lineage, and creating a semantic model are crucial to establishing a sustainable and maintainable data pipeline. These steps will benefit the entire organization in the long run.</p><p><em>Originally published at </em><a href="https://datuum.ai/media/how-to-hack-the-way-to-efficient-data-extraction/"><em>https://datuum.ai</em></a><em> on May 17, 2023.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8e875b41ec53" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[B2B Customers Speak Up: Only 29% Feel Fully Engaged During Onboarding]]></title>
            <link>https://medium.com/@datuum/b2b-customers-speak-up-only-29-feel-fully-engaged-during-onboarding-93c09369068f?source=rss-fd40918e71ce------2</link>
            <guid isPermaLink="false">https://medium.com/p/93c09369068f</guid>
            <category><![CDATA[data-onboarding]]></category>
            <category><![CDATA[data-management]]></category>
            <category><![CDATA[no-code]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[customer-onboarding]]></category>
            <dc:creator><![CDATA[Datuum]]></dc:creator>
            <pubDate>Thu, 11 May 2023 08:43:46 GMT</pubDate>
            <atom:updated>2023-05-16T12:52:14.108Z</atom:updated>
            <content:encoded><![CDATA[<p>Customer onboarding is a crucial aspect of any business’s success, and it is widely acknowledged that companies that prioritize customer onboarding can experience significant benefits, such as higher customer retention rates, increased revenue growth, and better customer satisfaction.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jb3c3JbgDlrDnbncscv1Dg.png" /></figure><p><a href="https://gitnux.com/">Gitnux Journal</a> recently shared customer onboarding statistics revealing surprising facts.</p><p>Despite the benefits of a welcoming onboarding experience, Gitnux’s study shows that 85% of customers experience a drop in satisfaction during the onboarding process, a surprising and concerning statistic. This highlights the need for businesses to invest more resources in improving their onboarding process to create a smooth and efficient customer experience.</p><p>Businesses that invest in onboarding processes can grow revenue by up to 5–10%, and high-performing companies are 17% more likely to have a well-defined customer onboarding strategy. An effective onboarding process can also reduce the time to first value by 34%. Companies that invest in customer success management can increase their customer retention by up to 10%, and customers are about 20% less likely to switch providers when they receive good treatment.</p><p>These statistics demonstrate the critical role that proper customer onboarding plays in any company’s long-term success. At Datuum, we understand that onboarding can be a challenging process for businesses, but it’s essential to get it right. Our AI-driven tool helps Customer Success Managers and Implementation Specialists create a positive customer experience, resulting in better customer satisfaction and retention and, ultimately, increasing profitability. With <a href="https://datuum.ai/">Datuum’s AI-powered no-code tool</a>, businesses can easily automate their onboarding process, reducing manual labor and streamlining the overall experience.</p><p>Read the full report <a href="https://blog.gitnux.com/customer-onboarding-statistics/">here</a>.</p><p><em>Originally published at </em><a href="https://datuum.ai/media/b2b-customers-speak-up-only-29-feel-fully-engaged-during-onboarding/"><em>https://datuum.ai</em></a><em> on May 11, 2023.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=93c09369068f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Is there a way to reduce costs in your Data Engineering department?]]></title>
            <link>https://medium.com/@datuum/is-there-a-way-to-reduce-costs-in-your-data-engineering-department-e25f218c09c1?source=rss-fd40918e71ce------2</link>
            <guid isPermaLink="false">https://medium.com/p/e25f218c09c1</guid>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[costs-matter]]></category>
            <category><![CDATA[data-management]]></category>
            <dc:creator><![CDATA[Datuum]]></dc:creator>
            <pubDate>Mon, 08 May 2023 10:17:56 GMT</pubDate>
            <atom:updated>2023-05-08T10:17:56.843Z</atom:updated>
            <content:encoded><![CDATA[<p>Many organizations face a significant disconnect between their Business and Data Engineering teams. As a result, Data Engineers are often overwhelmed by a growing number of requests, leading them to apply temporary fixes and create one-off data pipelines that lack scalability and resilience. Unfortunately, this approach results in a snowball effect of changes and bugs that require urgent attention.</p><figure><img alt="Datuum.ai" src="https://cdn-images-1.medium.com/max/752/1*Zp0sj71FjcN3KCzF1wOl8A.png" /></figure><p>Meanwhile, Business teams continue to generate ad-hoc requests and demand high-quality data at an accelerated pace.</p><p>What is the solution?</p><p>If we peel the onion, we can see that not all Data Engineering tasks are equal, and not all Data Engineers possess the same skill sets.</p><p>This idea may sound controversial, but it’s worth considering that many tasks could be accomplished by Data Analysts with low-code or no-code tools.</p><p>Data Engineers should not be burdened with building individual data pipelines and handling ad-hoc requests. Instead, they should focus on constructing a Data Platform that enables multiple teams and individuals to build on top of it.</p><p>By doing so, the platform can be used to its fullest potential, resulting in greater efficiency and reduced costs for the organization.</p><figure><img alt="Possible Data Platform. Datuum." src="https://cdn-images-1.medium.com/max/1024/0*KMUAYksaoLpUjTBU" /></figure><p>Here are the key components for this model to succeed:</p><ul><li>Data Platform foundation</li><li>Robust semantic layer</li><li>No-code toolset</li><li>Data Governance</li></ul><p>While the components required for this model’s success are straightforward, there are several important considerations to keep in mind:</p><ul><li>80% of the data in an organization is never used (or used once at most), so it’s essential to focus on the business-critical pieces of the data and establish a proper Data Governance procedure to minimize data bloat.</li><li>Without a robust semantic layer, Data Analysts (and other consumers) will struggle to interpret the data and get it right, even if you create a great Data Model. A good semantic layer, in combination with no-code tools, can significantly increase the efficiency of the Data Team.</li><li>Not all data pipelines need to be built to the same standard. One-time ad-hoc queries are inevitable but can be handled with no-code tools to reduce the burden on the Data Engineering team.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e25f218c09c1" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Context Matters in Semantic Ambiguity]]></title>
            <link>https://medium.com/@datuum/context-matters-in-semantic-ambiguity-492f887844de?source=rss-fd40918e71ce------2</link>
            <guid isPermaLink="false">https://medium.com/p/492f887844de</guid>
            <category><![CDATA[data-management]]></category>
            <category><![CDATA[data-acquisition]]></category>
            <category><![CDATA[data-mapping]]></category>
            <category><![CDATA[data-integration]]></category>
            <dc:creator><![CDATA[Datuum]]></dc:creator>
            <pubDate>Mon, 08 May 2023 10:05:45 GMT</pubDate>
            <atom:updated>2023-05-08T10:11:32.224Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="Datuum" src="https://cdn-images-1.medium.com/max/1024/1*G7YFoWAJxB9pjbmGg_OfbQ.jpeg" /></figure><p><strong>Context matters</strong>. Let’s take, for instance, the following list of strings:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/736/0*jw_qKd-VUzaRJblE" /></figure><p>If we assume all items in the list above have the same semantic value, what is it exactly?</p><p>The obvious answer to this question is “geographical place,” right? However, let’s look at it in a broader context.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/739/0*pkg2d2xM8d8S576y" /></figure><p>Now we know we were looking at a list of surnames, not cities. We had no way of telling the difference without extending the context.</p><p>In this particular case, a larger sample size would barely help, as nearly any city name can be someone’s surname, and many cities are named after people.</p><p>In everyday situations, our brain doesn’t even notice how tricky this kind of distinction is because, as humans, we rarely operate without a rich context.</p><p>NLP techniques derive some of the required context for token classification from the text surrounding a particular token and use word embeddings to make a “best guess” at what the word might mean.</p><p>When we analyze columns of data (say, from a CSV file), we don’t have any sentence to derive a context from. Instead, we have other columns and, more often than not, other files.</p><p><a href="https://www.datuum.ai/">Datuum.ai</a> technology utilizes all the context available in a data source, taking the full set of columns, their order, the full set of files/tables, etc., into account in order to determine the semantic type of the data.</p><p>It took a lot of time and countless iterations to arrive at a neural network architecture and feature set that made these results possible. There are still a lot of improvements ahead.</p><p>And this is just the tip of the iceberg: one rather simple problem of many we are solving to make our platform work.</p><p><em>Thanks to Dmytro Zhuk, Founder &amp; CTO at </em><a href="https://www.datuum.ai/"><em>Datuum.ai</em></a><em>, for the story!</em></p><p><em>Previously published </em><a href="https://www.linkedin.com/pulse/semantic-ambiguity-dmytro-zhuk/?ref=hackernoon.com"><em>here.</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=492f887844de" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Tabular data as a challenge]]></title>
            <link>https://medium.com/@datuum/tabular-data-as-a-challenge-4296090ff027?source=rss-fd40918e71ce------2</link>
            <guid isPermaLink="false">https://medium.com/p/4296090ff027</guid>
            <category><![CDATA[data-acquisition]]></category>
            <category><![CDATA[data-management]]></category>
            <category><![CDATA[data-onboarding]]></category>
            <category><![CDATA[data-integration]]></category>
            <category><![CDATA[tabular-data]]></category>
            <dc:creator><![CDATA[Datuum]]></dc:creator>
            <pubDate>Mon, 08 May 2023 08:17:33 GMT</pubDate>
            <atom:updated>2023-05-08T10:09:17.492Z</atom:updated>
            <content:encoded><![CDATA[<p>The abundance of tabular data accumulated by enterprises internally and online represents a big challenge for data integration and analysis in companies that strive to be data-driven.</p><figure><img alt="Convoluted data path from source to destination. Datuum helps" src="https://cdn-images-1.medium.com/max/1024/1*atgOwjjHAqAxGpV6KEtveA.png" /></figure><p>While the main focus of the AI/ML research community during the last several years has been on processing unstructured data, tabular data is still where most of the time and money are spent in the Data Integration world.</p><p>These days every company, whether big or small, has dozens of different systems (data sources). Of course, many of those have excellent integration capabilities. However, every day, people still export data from various systems and ingest it into proprietary enterprise Data Lakes, Data Warehouses, Reporting databases, etc.</p><p>That requires a lot of time and qualified Data Engineers.</p><p>To keep up with quickly changing business requirements and growing data sizes, companies and communities have developed many great technologies and tools: the big data stack, data wrangling tools, quality and observability, visualization, and NLU interfaces to the data.</p><p>All those technologies allowed the collection and processing of unprecedented amounts of data. However, they also make the modern data engineering stack very complex, which, in turn, results in long and expensive development cycles that don’t meet business requirements.</p><p>These tools assume that the user understands the data, knows how it’s structured, and knows how to cross-reference data that comes from different sources.</p><p>And that represents probably the biggest problem in today’s corporate world. In order to do something with the data, you have to understand it, know where it is, and know how it’s structured. There is a misconception that there are people who fully understand all the nuances of a company’s data and can efficiently manage it.</p><p>And it becomes exponentially more complicated as a company grows and people change.</p><p>In its work, <a href="https://www.datuum.ai/">Datuum</a> explicitly targets this challenge by applying state-of-the-art AI models to help data workers handle the variety and complexity of tabular data. Datuum speeds up and optimizes the integration of multiple data sources by helping data workers find the data they need and connect the dots between source and destination. Our AI learns more with every interaction with your data, so institutional knowledge is not lost.</p><h3>Why is it not easy?</h3><p>Imagine having two different data sources with seemingly similar data stored in them. For instance, data from the medical or insurance domains, in which some tables could contain hundreds or even thousands of columns. Our primary task is to unify this data into a single source of truth.</p><p>The cornerstone of this unification is the process of column matching — discovering combinations of columns from different data sources which store the same information (Fig. 1).</p>
<figure><img alt="Table cross-reference comprises column and record matching datuum" src="https://cdn-images-1.medium.com/max/1024/0*EhGmeVQrtmnoKbNm.png" /><figcaption><em>[Figure 1]: Table cross-reference comprises column and record matching</em></figcaption></figure><p>While it could be trivial to match columns with generalized data like name, email, or address, it’s not the case for columns containing domain-specific information: to map this kind of data, we usually need domain expertise and manual intervention.</p><p>The other critical phase of the unification pipeline is record matching — the process in which records from different data sources are merged into a single entity. For highly regulated domains like healthcare, this step can be crucial. And it’s hard to get right when processing large datasets, even for human annotators.</p><h3>Transformers to the rescue!</h3><p>Thankfully, the latest research in large language models gives us the key to simplifying the matching process. It doesn’t yet entirely remove the need for human experts (especially in complex domains), but it can give them a substantial boost compared to the standard setting.</p><p>By definition, language modeling is the usage of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models train by analyzing text to correctly predict the next or masked word in a sequence (Fig. 2).</p><figure><img alt="Datuum. One of the possible language model objectives is to predict a masked word in an input sequence. Models trained with this objective are called Masked Language Models." src="https://cdn-images-1.medium.com/max/1024/0*IH20NQUn3T-X-6me.png" /><figcaption><em>[Figure 2]: One of the possible language model objectives is to predict a masked word in an input sequence. Models trained with this objective are called Masked Language Models.</em></figcaption></figure><p>The task of language modeling is not new, but in recent years it has become one of the main areas for research and development in the machine learning field. Basically, there are three reasons why this sub-field experienced such enormous progress:</p><ul><li>The nature of the language modeling task allows the use of tremendous amounts of data for unsupervised learning.</li><li>The Transformer’s attention-based architecture allows massive parallelism when training on sequential data.</li><li>Transformers pre-trained on a language modeling task turned out to be great at solving other tasks unrelated to language modeling.</li></ul><p>As a result, transformer models have found use in a wide range of tasks, ranging from question answering to video understanding. So naturally, people looked at applying this architecture to tabular data as well.</p><p>Tabular data is textual in its nature. One could argue that numeric columns do not constitute text, but in fact, numerical values are nothing other than sequences of specialized text characters. This means that tabular data, similar to textual data, can be effectively modeled as a language using Transformers.</p>
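<p>To give an intuition for what “modeling a table as a language” means, here is a toy Python sketch of linearizing a table into a token sequence; the markers and format are invented for illustration, and real models such as TaBERT or TABBIE use considerably richer encodings.</p><pre>
# Linearize a small table into one token sequence a language model can consume.
rows = [
    {"name": "Alice Smith", "city": "Boston",  "dob": "1984-02-11"},
    {"name": "Bob Jones",   "city": "Jackson", "dob": "1990-07-30"},
]

def linearize(rows):
    header = " | ".join(rows[0].keys())
    body = " [ROW] ".join(" | ".join(r.values()) for r in rows)
    return f"[HEADER] {header} [ROW] {body}"

print(linearize(rows))
# [HEADER] name | city | dob [ROW] Alice Smith | Boston | 1984-02-11 [ROW] ...
</pre><p>Once a table is a sequence, the standard language-modeling machinery (masking, attention, pre-training) applies to cells, columns, and rows.</p>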
<p>Modeling tabular data with Transformers facilitates the solution of other downstream tasks, such as column type prediction or column name recovery. Some of the proposed models can also learn efficient dense representations of the table parts: columns and rows.</p><h3>Deep learning journey</h3><p>Machine learning approaches using tabular data can hardly be called new. For many years, tree-based solutions, such as random forests or gradient boosting algorithms, were used for various classification and regression tasks, such as credit score prediction or fraud detection.</p><p>However, these models only learn data patterns on the observation (row) level. For a much deeper understanding of the data, such as relations between columns in a table or between tables in a database, neural networks became very handy.</p><p>In the pre-transformer era, deep learning models had a limited capability to process sequential data like column values or names and, hence, to learn contextual information from it. As a result, they often relied on manually engineered features for model input and processed each column separately.</p><p>One prominent example of such a model is Sherlock [1], the first successful application of deep learning to the task of semantic type classification. Sherlock is a regular feed-forward neural network that uses a set of statistical features calculated from column values to predict one of 78 semantic column types. It was the first deep-learning model used to improve column matching in Datuum.</p><p>Despite being quite effective for common column types, like name, address, or zip code, Sherlock had a set of limitations, such as a high reliance on annotated data, especially in previously unseen domains, and an inability to take table context and column names into account.</p><p>Obviously, that represented an obstacle to quickly applying Datuum in new verticals, as that architecture required additional training on the new data.</p><p>Fortunately, Transformer-based models proved to be a perfect alternative to Sherlock.</p><h3>The case for transformers</h3><p>Our history with applying transformers to the task of column matching started with an extensive review of related work. As a result, we chose the TABBIE [2] and TaBERT [3] models for further experiments. It should be noted that at the beginning of 2022, almost all SOTA approaches to the task of column-type classification were based on Transformers (Fig. 3).</p><figure><img alt="Datuum — The idea behind the TABBIE model is to use semi-supervised training with large amounts of tabular data to get powerful representations of cells, columns and rows." src="https://cdn-images-1.medium.com/max/1024/0*6NkR4mMZraFC2a2u.png" /><figcaption><em>[Figure 3]: The idea behind the TABBIE model is to use semi-supervised training with large amounts of tabular data to get powerful representations of cells, columns, and rows.</em></figcaption></figure><p>By fine-tuning these models on multiple relevant datasets, we proved the superiority of transformer-based methods over Sherlock. For example, the TABBIE F1-macro score on the TURL dataset [4] was almost 20% higher than Sherlock’s.</p><p>In fact, the performance of transformers on classification tasks wasn’t what we were looking for. The property that caught our attention most was the representational ability of these models.</p><p>For instance, by training transformer-based models like TABBIE on domain-specific datasets for the semantic column type classification task, we could learn dense representations of the elements of these tables: rows and columns. This, in turn, allowed us to extend the application of the model to previously unseen domains without the need to annotate new data.</p><p>Our column-matching pipeline is specifically designed for a system that has a regular inflow of new data from previously unseen domains (Fig. 4).</p><figure><img alt="The representational ability of the TABBIE model allows us to improve column-matching performance on domain-specific data without the need to annotate it. Datuum." src="https://cdn-images-1.medium.com/max/1024/0*HlmQyIVAxtZDzP4E.png" /><figcaption><em>[Figure 4]: The representational ability of the TABBIE model allows us to improve column-matching performance on domain-specific data without the need to annotate it.</em></figcaption></figure><p>Generally, our approach works the following way:</p><ul><li>We fine-tune the TABBIE model with domain-specific data on a task that does not require additional annotation, for example, corrupted cell detection or cell in-filling. Such training is called semi-supervised.</li><li>The model learns dense representations of columns that leverage both column information and table context.</li><li>At runtime, we encode the columns of the tables that must be matched and combine them into matching pairs based on distance measures between their dense representations in vector space (see the sketch after this list).</li><li>Human experts review low-confidence matches for correctness.</li></ul>
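<p>Here is a minimal sketch of that matching step in Python. It assumes an encoder has already produced one dense vector per column; the random vectors below are stand-ins for real column embeddings, and the column names are hypothetical.</p><pre>
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for embeddings produced by a fine-tuned table encoder.
source_cols = {"patient_name": rng.normal(size=128), "dob": rng.normal(size=128)}
target_cols = {"full_name": rng.normal(size=128), "birth_date": rng.normal(size=128)}

def cosine(u, v):
    # Cosine similarity between two column embeddings.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pair each source column with its nearest target column; in a production
# pipeline, low-similarity pairs are routed to a human expert for review.
for s_name, s_vec in source_cols.items():
    best = max(target_cols, key=lambda t: cosine(s_vec, target_cols[t]))
    print(s_name, "->", best, round(cosine(s_vec, target_cols[best]), 3))
</pre>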
<p>This pipeline has a few important benefits compared to the pipeline based on the Sherlock model, namely:</p><ul><li>Matching is not limited to a defined set of column types.</li><li>The need for annotation is eliminated.</li><li>Encoded column data, metadata, and context can be combined into a single representation.</li></ul><p>Combined, these advantages set a new standard in the accuracy and adaptivity of the column matching employed in Datuum.</p><h3>Conclusion</h3><p>The representational ability of Transformer-based language models is bringing new capabilities to almost all tasks that rely on textual data, and table cross-reference is no exception.</p><p>By employing a Transformer model in our column-matching pipeline, we were able to increase the matching accuracy and generalization ability of the system while removing its dependence on annotated data, which is especially useful in domains where data access is restricted.</p><h4>References</h4><p><em>[1]</em><a href="https://hci.stanford.edu/~cagatay/projects/table-understanding/Sherlock-KDD19.pdf"><em> https://hci.stanford.edu/~cagatay/projects/table-understanding/Sherlock-KDD19.pdf</em></a></p><p><em>[2]</em><a href="https://arxiv.org/pdf/2105.02584.pdf"><em> https://arxiv.org/pdf/2105.02584.pdf</em></a></p><p><em>[3]</em><a href="https://arxiv.org/pdf/2005.08314.pdf"><em> https://arxiv.org/pdf/2005.08314.pdf</em></a></p><p><em>[4]</em><a href="https://arxiv.org/pdf/2006.14806.pdf"><em> https://arxiv.org/pdf/2006.14806.pdf</em></a></p><p><em>Originally published on </em><a href="https://www.datuum.ai/post/tabular-data-as-a-challenge"><em>Datuum.ai</em></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4296090ff027" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>