What is Data Integration?

There are many types of integration. If you remember our blog on what iPaaS is, we listed a bunch of them there: system integration, application integration, SOA integration, IoT integration, B2B integration, and, of course, data integration. Of these, data integration is probably the term used most frequently when someone needs an integration solution, although integrating data often involves other types of integration, too. This article explains in detail what data integration is, what makes it challenging, and what tools can be used to move data from one source to another.

What is Data Integration?

Data integration is the routing of data from multiple disparate sources (databases, systems, applications) into a single system/platform within an organization or across multiple ones. The reconciliation of heterogeneous data into one homogeneous format is an enormous opportunity for companies, as data integration is the key to better collaboration and, consequently, better customer service.

Nevertheless, data integration is challenging, and we will talk about that in more depth in a bit. First of all, the number of sources can be overwhelming. Often, the same application is configured differently in different customer environments, and the data is often autonomous (meaning it belongs to someone who does not want to give others access to it).

Before diving deeper into the challenges firms and integrators face when they need to integrate information from data silos, let’s have a look at customer data integration and big data integration.

Customer Data Integration (CDI)

As a customer, have you ever had an experience in which you talked to one customer rep about your problem, then on the next day you spoke with someone else from the same firm, and they had no information on the first conversation?

Yes, this means that the customer data was poorly integrated.

Did you have an experience when all customer representatives were aware of your problem, so you didn’t need to explain it again and your issue was resolved much faster than in the other case?

That is customer data integration done right.

Which one did you prefer?

The ultimate goal of customer data integration is to collect the information on customers, to process, control, and automate it, and to ensure that all the data is available to everyone in a single source, regardless of where the information comes from.

Some may debate whether CDI can be considered data integration, as it typically involves other concepts too, such as data warehousing, data governance, business process integration, and so on. Still, we think it’s a form of data integration that is growing in popularity.

For example, many retailers and insurance companies want us to help them with CDI, so they can have a 360-degree view of their customers by pulling information from separate systems and databases into a single location.

After all, you answered the question above just like everyone else: you prefer it when the company you do business with knows your problem, understands your pain, and can cure it as fast as possible.

By having all the information in one place, companies can offer more personalized experiences to their customers and have better touchpoints with them.

Big Data Integration

A few years back, I wrote my thesis on big data (it was the time when the hype had started). Little did I know that a few years later, I would be explaining how critical integrations are in the lifecycle of big data.

Organizations are collecting vast amounts of data, not only about their customers but about all aspects of the business, and the Internet of Things is generating more and more data as well.

Frankly, without integration, your big data has a lot less value than it would have otherwise (and as we know, data is your most valuable asset).

Integrating the data and transferring it into a single source is the top priority. Only then can your data scientists start doing their magic, turning raw data into insights that your analysts can use to improve your operations.

As timing is critical, it’s important to have the data available in real time instead of receiving it in batches.

Data Integration Challenges

Let’s talk about the challenges of data integration only at a very high level, as this is a topic we could blog about on its own (and we probably will in the future, so stay tuned!).

Many see data integration as too complicated. While it is complicated today, the challenges are not impossible to overcome.

In an ideal world, standards and schemas would make integrations simple. Sadly, it’s not like that. Nevertheless, all you need is an excellent integration platform and a few integration experts to develop your integrations.

1. Systems

As we said above, systems play a crucial role in data integration. To transfer all the necessary data into a single platform/system, we need to enable systems to talk to each other. This is where it gets difficult. Systems were built at very different times, with different technologies, utilizing different internal data formats and providing different interfaces. Furthermore, live business systems are often being further developed and updated; the existing specifications are constantly changing.

Data integration would be challenging even if all the sources we need to extract data from were running on the same hardware and all of them supported, say, the SQL standard and Open Database Connectivity (ODBC).

The problem with SQL is that while it’s a standard query language, implementations differ from vendor to vendor, so you need to be aware of the differences while building integrations between databases. Executing queries over multiple disparate sources is difficult, but in this example, at least all the systems are using SQL.
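To illustrate how dialects diverge, here is a small Python sketch that builds the same "first n rows" query for a few database families. The table `customers`, the column `name`, and the dialect labels are assumptions made up for this example. The three syntaxes themselves (`LIMIT`, `TOP`, and `FETCH FIRST ... ROWS ONLY`) are well-known variants.

```python
def first_rows_query(dialect: str, n: int) -> str:
    """Return a query for the first n customer names in the given SQL dialect.

    The table and column names are hypothetical and hard-coded for the sketch.
    """
    base = "SELECT name FROM customers ORDER BY name"
    if dialect in ("postgresql", "mysql", "sqlite"):
        return f"{base} LIMIT {int(n)}"
    if dialect == "sqlserver":
        # SQL Server puts the row limit right after SELECT.
        return f"SELECT TOP {int(n)} name FROM customers ORDER BY name"
    if dialect == "standard":
        # SQL:2008 syntax, also accepted by e.g. Oracle 12c+ and Db2.
        return f"{base} FETCH FIRST {int(n)} ROWS ONLY"
    raise ValueError(f"unknown dialect: {dialect}")


for name in ("postgresql", "sqlserver", "standard"):
    print(name, "->", first_rows_query(name, 5))
```

Even for a query this small, an integration has to know which database sits on the other end.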

In most cases, data integration is not this “simple”.

Data often needs to be integrated across different organizations, so you might need to deal with firewalls, and security is a top priority. Firms operate on very different hardware and software. Some of these may be legacy architectures, while others are cloud-based SaaS applications.

Still, the goal remains the same: all the relevant data must be available in a single platform for all the necessary stakeholders, preferably in real time.

The more sources you need to use, and the more these systems differ from each other, the more challenging the data integration gets from a systems point of view.

2. Logic

If you’re familiar with integrations, you know that developing the ultimate solution usually involves lots of logic.

It’s the same when you are dealing with data integrations. The way the data is logically organized within the data sources has a significant effect on how challenging the integration is.

Structured data sources are organized by a schema that typically specifies tags, classes, and properties and can also determine a set of tables and the attributes of each table. Even when two databases hold exactly the same information, two architects may define the schema in completely different ways. As a result, data coming from various sources can look very different.

The representation of the data can differ as well; for example, the same data field may be used differently in two systems. This makes matching records challenging.
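To make the matching problem concrete, here is a minimal Python sketch. The two record layouts, the field names (`full_name`, `first`, `last`, `phone`, `tel`), and the normalization rules are all invented for illustration and not taken from any real system.

```python
import re

# Hypothetical records for the same customer as two systems might store them.
crm_record = {"full_name": "Doe, Jane", "phone": "+1 (555) 010-2000"}
billing_record = {"first": "jane", "last": "DOE", "tel": "15550102000"}


def normalize_phone(raw: str) -> str:
    """Keep digits only so formatting differences disappear."""
    return re.sub(r"\D", "", raw)


def crm_key(record: dict) -> tuple:
    """Build a comparable key from the 'Last, First' layout."""
    last, first = [part.strip().lower() for part in record["full_name"].split(",")]
    return (first, last, normalize_phone(record["phone"]))


def billing_key(record: dict) -> tuple:
    """Build the same key from the separate first/last layout."""
    return (record["first"].lower(), record["last"].lower(), normalize_phone(record["tel"]))


same_customer = crm_key(crm_record) == billing_key(billing_record)
print(same_customer)
```

Real projects usually need fuzzier matching than exact key equality, but the principle is the same: bring both representations into one comparable form first.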

Nowadays, schemaless/NoSQL databases make things even more interesting, but that is another story entirely.

When developing solutions, the integration architect always takes these differences into account and develops logic to overcome the semantic heterogeneity of the information.

3. Other reasons

There are a few other reasons complicating data integration projects.

First of all, there are security and legislative reasons. Enterprises may have policies that do not allow sharing data outside the organization, or even if they do, they may use firewalls. In some countries, there are also legislative restrictions, for example, on storing data in the cloud.

In some companies, some data owners may not be willing to share information with other departments within the enterprise.

Data integration process

First, there are the data sources (data warehouses, systems, applications, etc.), and as we mentioned, these can vary. Wrappers are connected to the data sources and send queries to them, while users formulate their queries against the mediated schema.

In the warehousing approach, instead of wrappers, integration architects work with an ETL (extract, transform, load) tool that extracts data from the sources, transforms it, and loads it into the warehouse.
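The three ETL steps can be sketched in a few lines of Python. This is only a toy illustration: the source rows, the `orders` table, and the transformation rules are assumptions, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical source rows, e.g. exported from an order system.
source_rows = [
    {"order_id": "A-1", "amount": "19.90", "currency": "eur"},
    {"order_id": "A-2", "amount": "5.00", "currency": "EUR"},
]


def extract():
    """Extract: read rows from the source (here just an in-memory list)."""
    return list(source_rows)


def transform(rows):
    """Transform: convert amounts to integer cents and normalize currency codes."""
    return [
        (row["order_id"], round(float(row["amount"]) * 100), row["currency"].upper())
        for row in rows
    ]


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id TEXT PRIMARY KEY, amount_cents INTEGER, currency TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(warehouse, transform(extract()))
loaded = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors how ETL tools let you swap a source or a target without touching the transformation logic.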

Then there is the source description. This is the key to developing a data integration solution, as it connects the mediated schema with the schemas of the sources. It describes the properties of the sources so that the systems can understand the information.

Semantic mappings are the main component of the source description. They specify how the attributes of each source schema correspond to the attributes of the mediated schema.

Data mapping

‘Mapping’ is a term I frequently hear in the office when we are talking about the statuses of different projects.

Yes, all integration projects include a lot of data mapping.

Data mappings state how the data should be translated across the different sources. Data mapping is one of the most critical steps of data integration, and it requires a thorough understanding of the semantics of schemas.

Data mapping is vital for defining the source descriptions. First, we need to create semantic matches. This will specify how all elements of the source schemas semantically correspond to the mediated schema. As the next step, matches are turned into semantic mappings as structured queries written in a specific language.
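As a simplified illustration, the matches between source attributes and the mediated schema can be written down as plain lookup tables. The sketch below uses Python dictionaries instead of a dedicated mapping language, and the source names (`crm`, `webshop`) and all field names are hypothetical.

```python
# Hypothetical mediated schema: the attribute names every consumer sees.
MEDIATED_SCHEMA = ("customer_id", "email", "country")

# Semantic matches: which source attribute corresponds to which mediated attribute.
# Source names and their fields are invented for this sketch.
MAPPINGS = {
    "crm": {"CustNo": "customer_id", "EMail": "email", "Land": "country"},
    "webshop": {"user_id": "customer_id", "mail_address": "email", "country_code": "country"},
}


def to_mediated(source: str, record: dict) -> dict:
    """Translate one source record into the mediated schema."""
    mapping = MAPPINGS[source]
    translated = {mapping[field]: value for field, value in record.items() if field in mapping}
    # Attributes the source does not provide are kept explicit as None.
    return {attribute: translated.get(attribute) for attribute in MEDIATED_SCHEMA}


print(to_mediated("crm", {"CustNo": 42, "EMail": "jane@example.com", "Land": "FI"}))
print(to_mediated("webshop", {"user_id": 42, "mail_address": "jane@example.com"}))
```

Real mappings are rarely one-to-one renames; they often combine, split, or convert fields, which is why they are usually expressed as queries rather than lookup tables.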

Data governance

By data governance, we mean the overall management of information that ensures its availability, usability, integrity, and security.

Typically, integration strategies include a data governance approach.

When we integrate data, we ensure that our customers have the data available at the right time, in the right place, and for all the right stakeholders, in a form that the systems and the users can understand.

But before the data becomes available, we usually need to do quite a bit of work to ensure the quality of the information.

Data quality

Having bad-quality data at your disposal is almost as bad as not having data at all.

Data quality is probably the most important part of data governance. Having complete and accurate data is crucial. Fixing the issues with the data manually is not an option, as it is time-consuming and error-prone.

Data cleansing is an important element of data integration. It ensures that duplicates are removed, there is no missing data, and all data fields include the correct information. The data scrubbing happens according to the rules of the customers, and it often also uses the master data. The integration architects then build logic that checks all the data against these rules and ensures that only 100% clean data is forwarded.
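A minimal sketch of such cleansing logic in Python might look like the following. The required fields, the duplicate rule (same `customer_id`), and the sample records are assumptions chosen for illustration; real rules come from the customer.

```python
# Hypothetical customer-defined rule: every record needs these fields filled.
REQUIRED_FIELDS = ("customer_id", "email")


def cleanse(records):
    """Split records into clean ones and rejected ones.

    Records with missing required fields and duplicates (same customer_id)
    are rejected; only clean records are forwarded.
    """
    seen_ids = set()
    clean, rejected = [], []
    for record in records:
        missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
        if missing:
            rejected.append((record, "missing: " + ", ".join(missing)))
        elif record["customer_id"] in seen_ids:
            rejected.append((record, "duplicate"))
        else:
            seen_ids.add(record["customer_id"])
            clean.append(record)
    return clean, rejected


incoming = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 1, "email": "a@example.com"},  # duplicate
    {"customer_id": 2, "email": ""},               # missing email
]
clean, rejected = cleanse(incoming)
print(len(clean), [reason for _, reason in rejected])
```

Keeping the rejected records together with a reason makes it possible to report back what was filtered out instead of silently dropping it.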

Data validation

Essentially, data validation is strongly related to data quality. To ensure that the data is complete and structured, we need to validate it against your rules (using a data dictionary) or master data. By validating the information, we ensure that the data has undergone data cleansing and it’s of the highest quality.
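Validation against a data dictionary can be sketched as follows. The dictionary format, the field names, and the allowed country codes are hypothetical; the point is only that each record is checked against explicit rules before it moves on.

```python
# Hypothetical data dictionary: expected type and allowed values per field.
DATA_DICTIONARY = {
    "customer_id": {"type": int},
    "country": {"type": str, "allowed": {"FI", "SE", "DE"}},
}


def validate(record: dict) -> list:
    """Return a list of rule violations; an empty list means the record is valid."""
    errors = []
    for field, rule in DATA_DICTIONARY.items():
        if field not in record:
            errors.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: expected {rule['type'].__name__}")
        elif "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{field}: value {value!r} not allowed")
    return errors


print(validate({"customer_id": 7, "country": "FI"}))
print(validate({"customer_id": "7", "country": "XX"}))
```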

Data enrichment

Data enrichment is used to improve the quality of your data. Typically, certain data fields are missing and need to be filled in. For this, you can use the master data, pull data from other integrated systems, or in some cases automatically return the original data to the sender so that they can add the missing data once the gap is detected.
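As a rough sketch, enriching a record from master data can be as simple as filling in the gaps. The master data contents and the field names below are made up for illustration.

```python
# Hypothetical master data keyed by customer_id.
MASTER_DATA = {
    42: {"country": "FI", "segment": "retail"},
}


def enrich(record: dict) -> dict:
    """Fill missing or empty fields from the master data without overwriting existing values."""
    master = MASTER_DATA.get(record.get("customer_id"), {})
    enriched = dict(record)  # leave the incoming record untouched
    for field, value in master.items():
        if enriched.get(field) in (None, ""):
            enriched[field] = value
    return enriched


print(enrich({"customer_id": 42, "country": "", "segment": "insurance"}))
```

In this sketch, existing values always win over the master data; whether that is the right precedence depends on which source you trust more.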

Data Integration Tool: iPaaS

Developing data integration solutions by custom coding is not a sustainable option. Once you need to integrate more than three data sources, the integration gets exceptionally challenging.

To overcome the challenges described above and to integrate data faster than ever before in a cost-effective manner, enterprises have started to switch from Enterprise Service Bus (ESB) to iPaaS.

An integration platform as a service can connect any systems, whether they are on-premise or in the cloud. It is worth considering once you need to integrate more than two or three data sources and methods. An integration platform becomes truly necessary when the data has to come from external parties too, or when it has to flow outside the borders of an organization, not only because it gets difficult to work with all the different systems, but also because the platform offers high security.

iPaaS enables faster development of data mappings, and it is also an excellent tool for data governance and for ensuring data quality by validating and enriching the information.

Do you want to learn more about iPaaS? Get a copy of our eBook:
