A Generic Approach to Entity Resolution Mechanisms for Big Data on Real World Match Problems in the Global Oil and Gas Sector
The global oil and gas industry faces complex challenges. Oil prices are falling because of OPEC production levels, the US oil boom, and other factors. Many experts believe that oil prices will remain low for years, at an equilibrium of around $40-50 (Blumberg, 2018; Walls and Zheng, 2018; Azar, 2019), although the 2019 oil price is expected to average $65, declining further to $62 by 2020 (Amadeo, 2019; Kasim, 2019). In addition, newly commercial resources are extremely expensive to develop because they require massive capital investment. This research intends to develop a comprehensive entity resolution framework that can search across multiple databases with disparate forms, process large amounts of data quickly, efficiently resolve multiple entities into one, and find hidden connections without human intervention. Putting in place a system to manage these entities will help not only to assign resources better but also to do so more quickly. Although the necessary information is mostly already available within oil and gas companies, it is spread across different company areas and applications. Entity resolution will help to aggregate these data, identify and exploit connections between entities, and offer holistic, all-in-one information that can help to identify and deal with potential risks. We therefore present an evaluation of existing entity resolution implementations on challenging real-world match tasks. We consider approaches both with and without machine learning for finding a suitable parameterization and combination of similarity functions. In addition to approaches from the research community, we also consider a state-of-the-art commercial entity resolution implementation. Our results indicate significant quality and efficiency differences between the approaches.
We also find that some challenging resolution tasks, such as matching product entities from the OPEC database, are not adequately solved by conventional approaches based on the similarity of attribute values.
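To make the notion of a conventional, attribute-value-based matcher concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes records are plain dictionaries of string attributes, uses the Python standard library's `difflib.SequenceMatcher` as a stand-in similarity function, and combines per-attribute scores with a weighted average. The function names, the attribute names, the weights, and the 0.85 threshold are all hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def attribute_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] for two attribute values.

    Values are lower-cased and stripped first, so the comparison
    ignores case and surrounding whitespace.
    """
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of per-attribute similarities.

    `weights` maps attribute name to a positive weight (hypothetical
    values; in practice these could be tuned by hand or learned).
    Missing attributes are treated as empty strings.
    """
    total = sum(weights.values())
    score = sum(
        w * attribute_similarity(r1.get(k, ""), r2.get(k, ""))
        for k, w in weights.items()
    )
    return score / total


def is_match(r1: dict, r2: dict, weights: dict, threshold: float = 0.85) -> bool:
    """Declare a match when the combined similarity reaches the threshold."""
    return record_similarity(r1, r2, weights) >= threshold


if __name__ == "__main__":
    weights = {"name": 2.0, "country": 1.0}  # illustrative weights only
    a = {"name": "Brent Crude", "country": "UK"}
    b = {"name": "brent crude ", "country": "uk"}
    print(record_similarity(a, b, weights), is_match(a, b, weights))
```

The weights and the threshold are exactly the parameterization that, as noted above, can be chosen either manually or with machine learning. A matcher of this kind depends entirely on attribute values being textually close, which is one plausible reason it struggles on product entities whose descriptions differ widely.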