Enhancing Missing Values Imputation through Transformer-Based Predictive Modeling

Hina Ayub; Harun Jamil

doi:10.61927/igmin140

Engineering Group, Research Article, Article ID: igmin140

Subjects: Information Technology, Data Engineering, Artificial Intelligence
DOI: 10.61927/igmin140

Hina Ayub¹ and Harun Jamil²*

Affiliation

¹Interdisciplinary Graduate Program in Advance Convergence Technology and Science, Jeju National University, Jeju, 63243, Republic of Korea

²Department of Electronics Engineering, Jeju National University, Jeju, 63243, Jeju-do, Republic of Korea

Abstract

This paper addresses missing value imputation in data preprocessing. Traditional techniques such as zero, mean, and KNN imputation often fail to capture complex relationships in the data and can yield suboptimal results, while discarding records with missing values causes significant information loss. Our approach uses transformer models, which are well suited to sequential data: the proposed predictive framework trains a transformer to predict the missing values, which markedly improves imputation accuracy. In a comparative analysis, the transformer model consistently outperforms the traditional methods (zero, mean, and KNN imputation), and validation with an LSTM further supports this result. On hourly data, our model achieves an R² score of 0.96, exceeding KNN imputation by 0.195. On daily data, its R² score of 0.806 exceeds KNN imputation by 0.015 and mean imputation by 0.25. On monthly data, its R² score of 0.796 exceeds mean imputation by 0.1. These results indicate that the proposed model captures underlying patterns in the data and offer useful guidance for improving missing value imputation in data analysis.
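The kind of comparison summarized in the abstract can be illustrated with a minimal sketch: hide some values of a known series, impute them with each baseline, and score the imputed values against the hidden truth using R². Everything here is an assumption for illustration only. The synthetic sine signal, the 20% missing rate, and the function names are hypothetical, and `impute_nearest_in_time` is a simplified one-dimensional stand-in for KNN imputation, not the configuration used in the paper. The transformer model itself is not reproduced.

```python
import math
import random
from statistics import fmean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def impute_zero(series):
    """Replace every missing value (None) with 0."""
    return [0.0 if v is None else v for v in series]


def impute_mean(series):
    """Replace every missing value with the mean of the observed values."""
    mean = fmean([v for v in series if v is not None])
    return [mean if v is None else v for v in series]


def impute_nearest_in_time(series, k=2):
    """Simplified KNN stand-in: average the k observed values closest in time."""
    observed = [(i, v) for i, v in enumerate(series) if v is not None]
    filled = []
    for i, v in enumerate(series):
        if v is None:
            nearest = sorted(observed, key=lambda item: abs(item[0] - i))[:k]
            v = fmean([value for _, value in nearest])
        filled.append(v)
    return filled


# Hypothetical hourly-like signal with a 24-step cycle (not the paper's data).
random.seed(0)
truth = [10.0 + 5.0 * math.sin(2 * math.pi * t / 24) for t in range(240)]
missing = set(random.sample(range(len(truth)), 48))  # hide 20% of the points
series = [None if t in missing else v for t, v in enumerate(truth)]

# Score each method only on the positions that were hidden.
held_out = sorted(missing)
y_true = [truth[t] for t in held_out]
scores = {}
for name, impute in [("zero", impute_zero),
                     ("mean", impute_mean),
                     ("nearest", impute_nearest_in_time)]:
    filled = impute(series)
    scores[name] = r2_score(y_true, [filled[t] for t in held_out])
    print(f"{name:>8}: R2 = {scores[name]:.3f}")
```

Scoring only the held-out positions matters: including the observed points would inflate R² for every method, since those values are copied through unchanged. A constant fill such as zero or mean imputation cannot score above 0 under this metric, which is one way to read the gaps reported in the abstract.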
