R2 metric dynamics for k-nearest neighbors regression model trained on series of different sizes

Бабіч, Ю.; Глазунова, Л.; Калініна, Т.; Петрович, Я.; Babich, Y.; Hlazunova, L.; Kalinina, T.; Petrovych, Y.

doi:https://doi.org/10.23939/ictee2024.02.010

dc.citation.epage	18
dc.citation.issue	2
dc.citation.journalTitle	Інфокомунікаційні технології та електронна інженерія
dc.citation.spage	10
dc.citation.volume	4
dc.contributor.affiliation	Державний університет інтелектуальних технологій і зв’язку
dc.contributor.affiliation	State University of Intellectual Technologies and Telecommunications
dc.contributor.author	Бабіч, Ю.
dc.contributor.author	Глазунова, Л.
dc.contributor.author	Калініна, Т.
dc.contributor.author	Петрович, Я.
dc.contributor.author	Babich, Y.
dc.contributor.author	Hlazunova, L.
dc.contributor.author	Kalinina, T.
dc.contributor.author	Petrovych, Y.
dc.coverage.placename	Львів
dc.coverage.placename	Lviv
dc.date.accessioned	2025-11-03T11:06:29Z
dc.date.created	2024-12-10
dc.date.issued	2024-12-10
dc.description.abstract	R2 або коефіцієнт детермінації часто використовується як метрика для оцінювання регресійних моделей. Її можна застосовувати окремо, але зазвичай її поєднують з іншими метриками, щоб підвищити точність оцінки моделі. Метою роботи є дослідження динаміки метрики R2 регресійної моделі к-найближчих сусідів, навченої на серіях різного розміру, щоб запропонувати новий підхід для підвищення надійності та точності оцінки моделі, коли метрика R2 використовується самостійно, без застосування інших метрик. Як правило, значення метрики R2 понад 0,8 вважається прийнятним, а оцінювана модель достатньо точною. Однак такий спосіб інтерпретації оцінки R2 може призвести до неправильної оцінки точності моделі, що і показано в запропонованій статті. Отримані результати свідчать, що значення метрики R2 можуть істотно відрізнятися в деяких випадках залежно від конкретних значень ознак, відібраних до тестової частини вибірки, використовуваної для оцінювання моделі. Зазначене відхилення може спричиняти завищення точності моделі, а це – призвести до некоректних результатів її застосування. Відомі методи підвищення точності оцінювання моделі передбачають використання інших метрик додатково. Натомість ця стаття зосереджена на підвищенні оцінки точності моделі без необхідності використання інших метрик. Динаміку метрики R2 досліджено за допомогою 25000 циклів навчання та оцінки регресійної моделі к-найближчих сусідів. До навчальної та тестової частин вибірки відібрано випадкові значення. Для всіх експериментів кількість сусідів фіксована та дорівнює значенню за замовчуванням n_neighbors=5 методу KNeighborsRegressor, наданого бібліотекою Sklearn. У роботі сформульовано та підтверджено гіпотезу про те, що варіація метрики R2, як очікується, збільшиться зі зменшенням розміру серії, і передбачено, що варіація буде спостерігатися для моделей, навчених на тій самій вибірці, через випадковість відбирання навчальних / тестових значень.
Експерименти дали змогу запропонувати альтернативний підхід, який не потребує додаткових метрик. Цей підхід передбачає застосування метрики R2 разом із її варіацією, яка не повинна перевищувати 0,2 для регресійної моделі к-найближчих сусідів.
dc.description.abstract	The R2 score, or coefficient of determination, is often used as a metric to evaluate regression models. It can be used on its own, but it is usually combined with other metrics to make the model evaluation more accurate. The goal of this work is to study the dynamics of the R2 score of a K-Nearest Neighbors regression model trained on series of different sizes, in order to propose a new approach that makes model evaluation more robust and accurate when the R2 score is the only metric used. Typically, an R2 score above 0.8 is considered sufficient, and the evaluated model is considered accurate enough. However, interpreting the R2 score this way may lead to a misjudgment of the model's accuracy, as this paper shows. The results show that the R2 score can vary significantly in some cases, depending on which samples are selected for the test part of the series used for model evaluation. This variation can lead to an overestimate of the model's accuracy, which in turn can lead to incorrect results when the model is applied. Known methods for making model evaluation more accurate involve additional metrics. This paper instead focuses on making the accuracy estimate more reliable without using other metrics. The R2 score dynamics are examined over 25000 cycles of training and evaluating the K-Nearest Neighbors regression model. Samples are assigned to the training or test part of a series at random. In all experiments the number of neighbors is fixed at the default value n_neighbors=5 of the KNeighborsRegressor estimator provided by the scikit-learn (sklearn) library. The paper states and confirms the hypothesis that the variation of the R2 score increases as the series size decreases, and that the variation is observed even for models trained on the same series because the training and test samples are selected at random.
The experiments made it possible to propose an alternative approach that does not require any supplementary metrics. The proposed approach applies the R2 score together with its variation, which must not exceed 0.2 for the K-Nearest Neighbors regression model.
dc.format.extent	10-18
dc.format.pages	9
dc.identifier.citation	R2 metric dynamics for k-nearest neighbors regression model trained on series of different sizes / Y. Babich, L. Hlazunova, T. Kalinina, Y. Petrovych // Infocommunication technologies and electronic engineering. — Lviv : Lviv Politechnic Publishing House, 2024. — Vol 4. — No 2. — P. 10–18.
dc.identifier.citation2015	R2 metric dynamics for k-nearest neighbors regression model trained on series of different sizes / Babich Y. et al. // Infocommunication technologies and electronic engineering, Lviv. 2024. Vol 4. No 2. P. 10–18.
dc.identifier.citationenAPA	Babich, Y., Hlazunova, L., Kalinina, T., & Petrovych, Y. (2024). R2 metric dynamics for k-nearest neighbors regression model trained on series of different sizes. Infocommunication technologies and electronic engineering, 4(2), 10-18. Lviv Politechnic Publishing House.
dc.identifier.citationenCHICAGO	Babich Y., Hlazunova L., Kalinina T., Petrovych Y. (2024) R2 metric dynamics for k-nearest neighbors regression model trained on series of different sizes. Infocommunication technologies and electronic engineering (Lviv), vol. 4, no 2, pp. 10-18.
dc.identifier.doi	https://doi.org/10.23939/ictee2024.02.010
dc.identifier.uri	https://ena.lpnu.ua/handle/ntb/116924
dc.language.iso	en
dc.publisher	Видавництво Львівської політехніки
dc.publisher	Lviv Politechnic Publishing House
dc.relation.ispartof	Інфокомунікаційні технології та електронна інженерія, 2 (4), 2024
dc.relation.ispartof	Infocommunication technologies and electronic engineering, 2 (4), 2024
dc.relation.references	[1] Sarkar, D., Bali, R., Sharma, T. (2018). Practical Machine Learning with Python. A Problem-Solver's Guide to Building Real-World Intelligent Systems. Apress Berkeley, CA, 545 p. DOI: 10.1007/978-1-4842-3207-1.
dc.relation.references	[2] Scikit-learn library web page. Sklearn.metrics.r2_score. Available at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html
dc.relation.references	[3] Nakagawa, S., Johnson, P., Schielzeth, H. (2017). The coefficient of determination R2 and intra-class correlation coefficient from generalized linear mixed-effects models revisited and expanded. Journal of The Royal Society Interface, vol. 14(134), pp. 1–11. DOI: 10.1098/rsif.2017.0213.
dc.relation.references	[4] Zhang, D. (2017). A Coefficient of Determination for Generalized Linear Models, The American Statistician, vol. 71:4, pp. 310–316. DOI: 10.1080/00031305.2016.1256839
dc.relation.references	[5] Gurubaran, K. et al. (2023). Machine Learning Approach for Soil Nutrient Prediction. 2023 IEEE Silchar Subsection Conference (SILCON), Silchar, India, 2023, pp. 1–6. DOI: 10.1109/SILCON59133.2023.10405095.
dc.relation.references	[6] Gehlot, A., Sidana, N., Jawale, D., Jain, N., Singh, B. P., Singh, B. (2022). Technical analysis of crop production prediction using Machine Learning and Deep Learning Algorithms. 2022 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Chennai, India, 2022, pp. 1–5. DOI: 10.1109/ICSES55317.2022.9914206.
dc.relation.references	[7] Tran, T. T. H., et al. (2022). Polygenic risk scores adaptation for Height in a Vietnamese population. 14th International Conference on Knowledge and Systems Engineering (KSE), Nha Trang, Vietnam, 2022, pp. 1–7. DOI: 10.1109/KSE56063.2022.9953620.
dc.relation.references	[8] Aulia, Y., Purnamasari, P. D., Zulkifli, F. Y. (2023). A Comparative Analysis of Machine Learning Algorithms for Predicting the Dimensions of Rectangular Microstrip Antennas. 2023 IEEE International Symposium On Antennas And Propagation (ISAP), Kuala Lumpur, Malaysia, 2023, pp. 1–2. DOI: 10.1109/ISAP57493.2023.10388517.
dc.relation.references	[9] Shashank, S., Gourisaria, M.K., Bilgaiyan, S. (2023). Weather Forecasting Based Shared Bike Demand Analysis using Machine Learning. 6th International Conference on Information Systems and Computer Networks (ISCON), Mathura, India, 2023, pp. 1–6. DOI: 10.1109/ISCON57294.2023.10112160.
dc.relation.references	[10] Kumar, A., Mishra, S. K., Kejriwal, A. (2022). Prediction of Happiness Score of Countries by Considering Maximum Infection Rate of People by COVID-19 using Random Forest Algorithm. 2nd International Conference on Intelligent Technologies (CONIT), Hubli, India, 2022, pp. 1–6. DOI: 10.1109/CONIT55038.2022.9847791.
dc.relation.references	[11] Géron, A. (2019). Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc., Sebastopol, CA., USA. 510 p. ISBN: 9781492032649.
dc.relation.references	[12] Pandas library web page via NumFOCUS Inc. Available at https://pandas.pydata.org/
dc.relation.references	[13] NumPy. The fundamental package for scientific computing with Python by NumPy team. Available at https://numpy.org/
dc.relation.references	[14] Scikit-learn library web page. Sklearn.neighbors.KNeighborsRegressor. Available at https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
dc.relation.references	[15] Matplotlib.pyplot by the Matplotlib development team. Available at https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html
dc.relation.uri	https://scikitlearn
dc.relation.uri	https://pandas.pydata.org/
dc.relation.uri	https://numpy.org/
dc.relation.uri	https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
dc.relation.uri	https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html
dc.rights.holder	© Національний університет „Львівська політехніка“, 2024
dc.subject	розмір вибірки
dc.subject	метрика R2
dc.subject	коефіцієнт детермінації
dc.subject	регресійна модель
dc.subject	series size
dc.subject	R2 score
dc.subject	coefficient of determination
dc.subject	regression model
dc.subject.udc	004.8.
dc.title	R2 metric dynamics for k-nearest neighbors regression model trained on series of different sizes
dc.title.alternative	Динаміка метрики R2 для моделі регресії KNN, навченої на вибірках різного розміру
dc.type	Article
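
Illustrative sketch (not the authors' code). The experiment described in the abstract, repeated training and evaluation of a K-Nearest Neighbors regressor on random training/test splits of series of different sizes, could be approximated as follows. Several things here are assumptions rather than details from the paper: the synthetic noisy-sine data produced by the hypothetical helper make_series, the 20 % test fraction, the series sizes, the reduced count of 200 cycles (the paper reports 25000), and the reading of "variation" as the difference between the largest and smallest R2 score observed.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor


def make_series(n_samples, seed=0):
    """Hypothetical synthetic series: one feature, noisy sine target."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 10.0, size=(n_samples, 1))
    y = np.sin(X[:, 0]) + rng.normal(0.0, 0.2, size=n_samples)
    return X, y


def r2_over_random_splits(X, y, n_cycles=200, test_size=0.2):
    """Fit and score a KNN regressor on n_cycles different random splits."""
    scores = np.empty(n_cycles)
    for cycle in range(n_cycles):
        # A different random_state per cycle gives a different random split.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=cycle
        )
        # n_neighbors=5 is the scikit-learn default named in the abstract.
        model = KNeighborsRegressor(n_neighbors=5)
        model.fit(X_train, y_train)
        scores[cycle] = r2_score(y_test, model.predict(X_test))
    return scores


results = {}
for n_samples in (50, 200, 1000):  # assumed series sizes
    X, y = make_series(n_samples)
    scores = r2_over_random_splits(X, y)
    # Assumption: "variation" is taken as max minus min of the R2 scores.
    spread = scores.max() - scores.min()
    results[n_samples] = (scores.mean(), spread)
    print(f"n={n_samples:5d}  mean R2={scores.mean():.3f}  spread={spread:.3f}")
```

Under these assumptions, the printed spread for each series size can be compared with the 0.2 limit proposed in the abstract, alongside the usual check of the R2 score itself.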

Collections

Infocommunication Technologies and Electronic Engineering. – 2024. – Vol. 4, No. 2