PSOBER: PSO based entity resolution

Aassem, Y.; Hafidi, I.; Khalfi, H.; Aboutabit, N.

doi:10.23939/mmc2021.04.573

dc.citation.epage	583
dc.citation.issue	4
dc.citation.spage	573
dc.contributor.affiliation	Sultan Moulay Slimane University
dc.contributor.author	Aassem, Y.
dc.contributor.author	Hafidi, I.
dc.contributor.author	Khalfi, H.
dc.contributor.author	Aboutabit, N.
dc.coverage.placename	Lviv
dc.date.accessioned	2023-11-01T07:49:13Z
dc.date.available	2023-11-01T07:49:13Z
dc.date.created	2021-03-01
dc.date.issued	2021-03-01
dc.description.abstract	Entity resolution is the task of mapping the records in a database to their corresponding entities. The problem is challenging because records often contain incomplete information, the distribution of records varies across entities, and records of different entities sometimes overlap. This paper proposes an unsupervised method for solving it. Entity resolution is first formulated as a partitioning problem, and an optimization-algorithm-based technique is then proposed to solve it. The approach partitions the records across entities. A comparative analysis with the genetic algorithm over datasets demonstrates the efficiency of the proposed approach.
dc.format.extent	573-583
dc.format.pages	11
dc.identifier.citation	PSOBER: PSO based entity resolution / Y. Aassem, I. Hafidi, H. Khalfi, N. Aboutabit // Mathematical Modeling and Computing. — Lviv : Lviv Politechnic Publishing House, 2021. — Vol 8. — No 4. — P. 573–583.
dc.identifier.doi	10.23939/mmc2021.04.573
dc.identifier.uri	https://ena.lpnu.ua/handle/ntb/60432
dc.language.iso	en
dc.publisher	Lviv Politechnic Publishing House
dc.relation.ispartof	Mathematical Modeling and Computing, 4 (8), 2021
dc.relation.references	[1] Yin X., Han J., Yu P. S. Object Distinction: Distinguishing Objects with Identical Names. IEEE 23rd International Conference on Data Engineering. 1242–1246 (2007).
dc.relation.references	[2] Christen P., Goiser K. Quality and Complexity Measures for Data Linkage and Deduplication. Quality Measures in Data Mining. 127–151 (2007).
dc.relation.references	[3] Hernández M. A., Stolfo S. J. The merge/purge problem for large databases. ACM SIGMOD Record. 24 (2), 127–138 (1995).
dc.relation.references	[4] Mishra S., Mondal S., Saha S. Entity matching technique for bibliographic database. Database and expert systems applications. DEXA 2013. 34–41 (2013).
dc.relation.references	[5] Draisbach U., Naumann F., Szott S., Wonneberg O. Adaptive Windows for Duplicate Detection. 2012 IEEE 28th International Conference on Data Engineering. 1073–1083 (2012).
dc.relation.references	[6] Christen P. Data Matching: Concepts and Techniques for Record Linkage. Entity Resolution and Duplicate Detection. Springer (2012).
dc.relation.references	[7] Aassem Y., Hafidi I., Aboutabit N. Enhanced Duplicate Count Strategy: Towards New Algorithms to Improve Duplicate Detection. NISS2020: Proceedings of the 3rd International Conference on Networking, Information Systems & Security. Article No. 58, 1–7 (2020).
dc.relation.references	[8] Benkhaled H., Berrabah D., Boufares F. A novel approach to improve the Record Linkage process. 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT). 1504–1509 (2019).
dc.relation.references	[9] De Carvalho D. M., Laender A. H. F., Goncalves M. A., Da Silva A. S. A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering. 24 (3), 399–412 (2012).
dc.relation.references	[10] Isele R., Bizer C. Learning expressive linkage rules using genetic programming. Proceedings of the VLDB Endowment. 5 (11), 1638–1649 (2012).
dc.relation.references	[11] Lyaqini S., Nachaoui M., Quafafou M. Non-smooth classification model based on new smoothing technique. Journal of Physics: Conference Series. 1743 (1), 012025 (2021).
dc.relation.references	[12] Goldberg D. E. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Professional (1989).
dc.relation.references	[13] Ribeiro Filho J. L., Treleaven P. C., Alippi C. Genetic algorithm programming environments. Computer. 27 (6), 28–43 (1994).
dc.relation.references	[14] Mishra S., Saha S., Mondal S. GAEMTBD: Genetic algorithm based entity matching techniques for bibliographic databases. Applied Intelligence. 47, 197–230 (2017).
dc.relation.references	[15] Eberhart R. C., Kennedy J. A new optimizer using particle swarm theory. MHS’95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science. 39–43 (1995).
dc.relation.references	[16] Caliński T., Harabasz J. A dendrite method for cluster analysis. Communications in Statistics. 3 (1), 1–27 (1974).
dc.relation.references	[17] Tang J., Zhang J., Yao L., Li J., Zhang L., Su Z. Arnetminer: extraction and mining of academic social networks. KDD ’08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. 990–998 (2008).
dc.relation.references	[18] Tang J., Fong A. C. M., Wang B., Zhang J. A unified probabilistic framework for name disambiguation in digital library. IEEE Transactions on Knowledge and Data Engineering. 24 (6), 975–987 (2012).
dc.relation.references	[19] Wang X., Tang J., Cheng H., Yu P. S. ADANA: Active name disambiguation. 2011 IEEE 11th International Conference on Data Mining. 794–803 (2011).
dc.relation.references	[20] Nachaoui M. Parameter learning for combined first and second order total variation for image reconstruction. Advanced Mathematical Models & Applications. 5 (1), 53–69 (2020).
dc.relation.references	[21] Wang J., Li G., Yu J. X., Feng J. Entity matching: how similar is similar. Proceedings of the VLDB Endowment. 4 (10), 622–633 (2011).
dc.relation.references	[22] Sun Y., Wu T., Yin Z., Cheng H., Han J., Yin X., Zhao P. BibNetMiner: mining bibliographic information networks. SIGMOD ’08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 1341–1344 (2008).
dc.relation.references	[23] DeRose P., Shen W., Chen F., Lee Y., Burdick D., Doan A., Ramakrishnan R. DBLife: A community information management platform for the database research community. CIDR. 169–172 (2007).
dc.relation.references	[24] Jin H., Huang L., Yuan P. Name disambiguation using semantic association clustering. 2009 IEEE International Conference on e-Business Engineering. 42–48 (2009).
dc.relation.references	[25] Mishra S., Saha S., Mondal S. Cluster validation techniques for bibliographic databases. Proceedings of the 2014 IEEE Students’ Technology Symposium. 93–98 (2014).
dc.relation.references	[26] Rousseeuw P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 20, 53–65 (1987).
dc.relation.references	[27] Xie X. L., Beni G. A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence. 13 (8), 841–847 (1991).
dc.relation.references	[28] Mishra S., Saha S., Mondal S. On validation of clustering techniques for bibliographic databases. 2014 22nd International Conference on Pattern Recognition. 3150–3155 (2014).
dc.relation.references	[29] Cramer N. L. A representation for the adaptive generation of simple sequential programs. Proceedings of the First International Conference on Genetic Algorithms. 183–187 (1985).
dc.relation.references	[30] Holland J. H. Adaptation in natural and artificial systems. MIT (1975).
dc.relation.references	[31] De Carvalho M. G., Laender A. H., Goncalves M. A., Da Silva A. A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering. 24 (3), 399–412 (2012).
dc.relation.references	[32] Isele R., Bizer C. Learning expressive linkage rules using genetic programming. Proceedings of the VLDB Endowment. 5 (11), 1638–1649 (2012).
dc.relation.references	[33] Wagner R. A., Fischer M. J. The String-to-String Correction Problem. Journal of the ACM. 21 (1), 168–173 (1974).
dc.relation.references	[34] Kondrak G. N-gram similarity and distance. Proceedings of the 12th international conference on String Processing and Information Retrieval. 115–126 (2005).
dc.relation.references	[35] Hsu W. J., Du M. W. Computing a longest common subsequence for a set of strings. BIT Numerical Mathematics. 24, 45–59 (1984).
dc.relation.references	[36] Christen P., Churches T. Febrl–Freely extensible biomedical record linkage. ANU Computer Science Technical Reports (2002).
dc.rights.holder	© Lviv Polytechnic National University, 2021
dc.subject	entity resolution
dc.subject	cluster validity index
dc.subject	particle swarm optimization
dc.subject	distance measure
dc.subject	genetic algorithm
dc.subject	unsupervised algorithm
dc.title	PSOBER: PSO based entity resolution
dc.title.alternative	PSOBER: entity resolution based on PSO
dc.type	Article

Files

Original bundle

Name: 2021v8n4_Aassem_Y-PSOBER_PSO_based_entity_573-583.pdf
Size: 1.04 MB
Format: Adobe Portable Document Format

Name: 2021v8n4_Aassem_Y-PSOBER_PSO_based_entity_573-583__COVER.png
Size: 446.9 KB
Format: Portable Network Graphics

Collections

Mathematical Modeling And Computing. – 2021. – Vol. 8, No. 4