Automating Data Fusion: Techniques for Handling of Join Scenarios

In real-world data integration scenarios, traditional equi-joins and other join techniques struggle because attribute values are heterogeneous and inconsistent. To address this challenge, we present AutoStarJoin, a technique for automated joins designed specifically for star-join scenarios. The core contribution of our approach is the automated detection of join attributes across arbitrary schemas, which reduces or even eliminates the need for manual specification. Our approach first applies edit-based distance measures to transform similar string values within join attributes so that they can be matched, and then uses token-based distance measures to refine the join process by identifying the optimal attribute pairs. We evaluate the effectiveness of various distance metrics in terms of join quality and computational efficiency across diverse datasets. Our goal is to generalize across different join scenarios without requiring domain-specific parameter tuning. This level of automation makes our approach suitable for integration into AutoML pipelines where minimal human intervention is desired. However, the correct choice of parameters per dataset remains crucial, and we intend to implement hyperparameter optimization in further research.
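The two stages described in the abstract can be illustrated with a minimal sketch. This is not the AutoStarJoin implementation: the function names, the choice of Levenshtein distance and Jaccard similarity, the `max_dist` threshold, and the toy tables are all assumptions made for illustration. The first stage rewrites near-duplicate strings to a reference spelling using an edit-based distance; the second scores every column pair with a token-based similarity and picks the best one as the join attribute pair.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(values, reference, max_dist=2):
    """Stage 1 (assumed): replace each value by its closest reference value
    when the edit distance is at most max_dist, otherwise keep it unchanged."""
    result = []
    for v in values:
        best = min(reference, key=lambda r: levenshtein(v.lower(), r.lower()))
        close = levenshtein(v.lower(), best.lower()) <= max_dist
        result.append(best if close else v)
    return result


def column_tokens(values):
    """Lower-cased whitespace tokens of all values in a column."""
    return {tok for v in values for tok in v.lower().split()}


def jaccard(a: set, b: set) -> float:
    """Token-based similarity: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_join_pair(left: dict, right: dict):
    """Stage 2 (assumed): score every column pair and return the best
    as a (score, left_column, right_column) tuple."""
    scored = [
        (jaccard(column_tokens(lv), column_tokens(rv)), lc, rc)
        for (lc, lv), (rc, rv) in product(left.items(), right.items())
    ]
    return max(scored)


# Hypothetical fact and dimension tables, stored column-wise.
fact = {"city": ["Berlin", "Munich", "Hamburg"], "amount": ["10", "20", "30"]}
dim = {"town": ["Berlln", "Munihc", "Hamburg"], "country": ["DE", "DE", "DE"]}

# Repair the misspelled dimension values against the fact column,
# then detect which column pair should be used for the join.
dim_clean = dict(dim, town=normalize(dim["town"], fact["city"]))
pair = best_join_pair(fact, dim_clean)
```

In this toy example the threshold `max_dist=2` is exactly the kind of per-dataset parameter the abstract flags as crucial: a transposed pair of characters already costs two edits under plain Levenshtein distance, so a stricter threshold would leave one of the values unmatched.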
