Selection of Machine Learning Models for Prescription Decision-Making Based on Text Mining - Focusing on Case Studies of Single Prescriptions in Sasang Constitutional Medicine

Article information

J Korean Med. 2025;46(1):70-86
Publication date (electronic) : 2025 March 01
doi : https://doi.org/10.13048/jkm.25006
1Department of Korean Medicine, Sangji University
2Department of Korean Language & Literature, Sogang University
3Department of Sasang Constitutional Medicine, College of Korean Medicine, Sangji University
4Research Institute of Korean Medicine, Sangji University
Correspondence to: Jun-sang Yu, Korean Medicine Hospital of Sangji University, 80 Sangjidae-gil, Wonju-si, Gangwon-do, 26338, Republic of Korea, Tel: +82-33-741-9203, E-mail: hiruok@sangji.ac.kr
§ These authors contributed equally to this work.

Received 2025 January 20; Revised 2025 January 20; Accepted 2025 February 11.

Abstract

Objectives

We analyzed Sasang constitution case reports using text mining and designed a classification algorithm using machine learning to select a model suitable for determining Sasang constitution prescriptions based on text data.

Methods

Case reports on Sasang constitution published from January 1, 2000, to December 31, 2023, were collected. A total of 360 papers and 483 cases were identified, from which text was extracted for 253 cases. The extracted texts were preprocessed and tokenized using the Python-based KoNLPy package, and each morpheme was vectorized using TF-IDF values. To select the most suitable classification model for diagnosing Sasang constitution, the performance of five models—Random Forest Classifier, XGBoost, LightGBM, SVM, and Logistic Regression—was evaluated based on accuracy and F1-Score.

Results

The highest accuracy was achieved by Random Forest Classifier (0.57037), followed by SVM (0.544444), Logistic Regression (0.518519), LightGBM (0.481481), and XGBoost (0.474074). The F1 score was highest for Random Forest Classifier (0.528), followed by SVM (0.52039), Logistic Regression (0.500861), XGBoost (0.45866), and LightGBM (0.446349).

Conclusions

This study is the first to analyze Sasang constitution prescription decisions by applying text mining and machine learning to case reports, providing a concrete research model for follow-up studies. Based on case reports and text data, the most suitable machine learning model for determining Sasang constitution prescriptions is Random Forest Classifier.

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t)$$

$$\text{idf}(t) = \log\frac{1 + n_d}{1 + \text{df}(d,t)} + 1$$

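As a rough illustration of the TF-IDF weighting above, the following pure-Python sketch computes smoothed weights for a toy corpus of pre-tokenized documents. The corpus, the function name `tf_idf`, the use of raw counts for tf, and the natural logarithm are assumptions made for illustration; the source does not state them, and no length normalization is applied here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    idf(t) = ln((1 + n_d) / (1 + df(t))) + 1, with raw term counts as tf.
    """
    n_d = len(docs)                      # total number of documents
    df = Counter()                       # number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log((1 + n_d) / (1 + df_t)) + 1 for t, df_t in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

# Hypothetical tokenized case texts (not taken from the study data).
docs = [
    ["insomnia", "headache", "insomnia"],
    ["headache", "cold sweat"],
    ["dizziness", "headache"],
]
weights = tf_idf(docs)
```

A term that occurs in every document (here `headache`) receives idf = 1, so it is down-weighted relative to rarer terms but never zeroed out.
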
Fig. 1

Study flow of Text-mining and Machine learning

Fig. 2

Flow chart of literature searches and screening results

Fig. 3

Performance Comparison of Different Algorithms

Table 1

Data Refining Criteria

| Criteria | Case | Example (before) | Example (after) |
|---|---|---|---|
| Compound words | Cases where a compound word is perceived as separate components | Cold, Sweat | Cold sweat |
| | | Hyung, Geumji, Pose | Hyunggeumjipose |
| | Cases where multiple words should be considered as a single phrase | Abdominal, Bloating | Abdominal bloating |
| | | Nocturnal, sleep, disorder | Nocturnal sleep disorder |
| Synonyms | Cases with the same or similar meanings but different spellings | Feel dizzy, Dizziness, Lightheadedness | Vertigo |
| | Cases where a single word represents or encompasses other words | Sleep disorder, Difficulty falling asleep, Nocturnal sleep disorder, Insomnia, Sleep difficulties, Difficulty falling asleep | Sleep disorder |
| Stop words | Not a key variable, and used conventionally | Above-mentioned, Opinion, Usually, And, When, Time, Patient | Delete |
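
The refinement rules in Table 1 can be sketched as a small token-level routine. The dictionaries below are hypothetical and only mirror the table's examples; the study's actual rule lists and its KoNLPy-based tokenization are not reproduced here.

```python
# Hypothetical rule sets built from the Table 1 examples only.
COMPOUNDS = {
    ("cold", "sweat"): "cold sweat",
    ("abdominal", "bloating"): "abdominal bloating",
    ("nocturnal", "sleep", "disorder"): "nocturnal sleep disorder",
}
SYNONYMS = {
    "dizziness": "vertigo",
    "lightheadedness": "vertigo",
    "insomnia": "sleep disorder",
    "nocturnal sleep disorder": "sleep disorder",
}
STOP_WORDS = {"above-mentioned", "opinion", "usually", "and", "when", "time", "patient"}

# Longest compounds first, so 3-token phrases win over 2-token ones.
_COMPOUND_RULES = sorted(COMPOUNDS.items(), key=lambda kv: -len(kv[0]))

def refine(tokens):
    """Merge compounds, map synonyms to one representative term, drop stop words."""
    merged, i = [], 0
    while i < len(tokens):
        for parts, joined in _COMPOUND_RULES:
            if tuple(tokens[i:i + len(parts)]) == parts:
                merged.append(joined)
                i += len(parts)
                break
        else:
            # No compound starts at this position; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    mapped = [SYNONYMS.get(tok, tok) for tok in merged]
    return [tok for tok in mapped if tok not in STOP_WORDS]

tokens = ["patient", "usually", "cold", "sweat", "and",
          "nocturnal", "sleep", "disorder", "dizziness"]
refined = refine(tokens)
```

Merging compounds before mapping synonyms lets a merged phrase such as "nocturnal sleep disorder" be folded into its representative term in the same pass.
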

Table 2

English Word Translation Exclusion Criteria

| Translation exclusion criteria | Example |
|---|---|
| Words written in English in most of the research papers | VAS, QSCC II |
| Words that represent a unit | kg, cm |
| Names of medicines | trolac, NSAID |

Table 3

Hyperparameter Settings

| Algorithm | Hyperparameter | Input values |
|---|---|---|
| Random Forest Classifier | n_estimators | 50, 200, 500, 1000 |
| XGBoost | learning_rate | 0.01, 0.1, 0.2 |
| | n_estimators | 100, 200 |
| LightGBM | learning_rate | 0.01, 0.1, 0.2 |
| | n_estimators | 100, 200 |
| Support Vector Machine | C | 0.1, 1, 10, 20 |
| | kernel | 'linear', 'rbf', 'sigmoid', 'poly' |
| | degree | 2, 3, 4 |
| Logistic Regression | C | 0.1, 1, 10, 20 |
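
To make the size of the search space in Table 3 concrete, the sketch below enumerates every parameter combination with `itertools.product`. The dictionary layout and the names `GRIDS` and `expand` are illustrative assumptions; the source does not say which search tool or cross-validation scheme was used.

```python
from itertools import product

# Search grids transcribed from Table 3; the dict layout itself is an assumption.
GRIDS = {
    "Random Forest Classifier": {"n_estimators": [50, 200, 500, 1000]},
    "XGBoost": {"learning_rate": [0.01, 0.1, 0.2], "n_estimators": [100, 200]},
    "LightGBM": {"learning_rate": [0.01, 0.1, 0.2], "n_estimators": [100, 200]},
    "Support Vector Machine": {
        "C": [0.1, 1, 10, 20],
        "kernel": ["linear", "rbf", "sigmoid", "poly"],
        "degree": [2, 3, 4],
    },
    "Logistic Regression": {"C": [0.1, 1, 10, 20]},
}

def expand(grid):
    """Return every parameter combination in a grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

candidates = {algo: expand(grid) for algo, grid in GRIDS.items()}
counts = {algo: len(combos) for algo, combos in candidates.items()}
```

Under this enumeration the Support Vector Machine grid yields 48 candidates, against 4 to 6 for the other models. If scikit-learn's `SVC` was used, `degree` only affects the `poly` kernel, so many of those 48 settings would behave identically.
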

Table 4

Best Hyperparameter Settings

| Algorithm | Hyperparameter | Best value | Best CV F1 score |
|---|---|---|---|
| Random Forest Classifier | n_estimators | 500 | 0.508679386 |
| XGBoost | learning_rate | 0.1 | 0.490174846 |
| | n_estimators | 200 | |
| LightGBM | learning_rate | 0.1 | 0.456298434 |
| | n_estimators | 200 | |
| Support Vector Machine | C | 10 | 0.484772635 |
| | kernel | 'sigmoid' | |
| | degree | 2 | |
| Logistic Regression | C | 20 | 0.482295398 |

Table 5

Number of Cases Based on Use of Single Prescription by Sasang Constitution

| Sasang constitution | Case studies with a single prescription | Case studies with compound prescriptions | Total case studies |
|---|---|---|---|
| Taeyangin | 14 | 4 | 18 |
| Taeeumin | 73 | 81 | 154 |
| Soyangin | 111 | 105 | 216 |
| Soeumin | 55 | 40 | 95 |
| Total | 253 | 230 | 483 |

Table 6

Frequency of Prescriptions by Sasang Constitution (Single Prescription)

| Taeyangin | n | Taeeumin | n | Soyangin | n | Soeumin | n |
|---|---|---|---|---|---|---|---|
| Mihudeungsikjang-tang | 11 | Yeoldahanso-tang | 18 | Yanggyeoksanhwa-tang | 21 | Gwakhyangjeonggi-san | 14 |
| Ogapijangcheok-tang | 2 | Cheongsimyeonja-tang | 16 | Hyeongbangjihwang-tang | 18 | Sibimigwanjung-tang | 10 |
| Yeoldahanso-tang | 1 | Taeumjowi-tang | 9 | Hyeongbangdojeok-san | 18 | Hyangbujapalmul-tang | 3 |
| | | Jowiseungcheong-tang | 8 | Hyeongbangsabaek-san | 17 | Palmulgunja-tang | 3 |
| | | Galgeunhaegi-tang | 7 | Dojeokganggi-tang | 7 | Seungyangikgibuja-tang | 3 |
| | | Gwache | 3 | Dokhwaljihwang-tang | 7 | Geopung-san | 2 |
| | | Joripyeowon-tang | 2 | Yangdokbaekho-tang | 6 | Hyangsayangwi-tang | 2 |
| | | Cheongpyesagan-tang | 2 | Gamsumal | 5 | Cheongunggyeji-tang | 2 |
| | | Geonyuljeotang-tang | 1 | Hyeongbangpaedok-san | 4 | Osuyubujaijung-tang | 2 |
| | | Mahwangjeongcheon-tang | 1 | Yukmijihwang-tang | 1 | Doksampalmulgunja-tang | 2 |
| | | Mankgeummunmu-tang | 1 | Jeoryeongchajeon-tang | 1 | Bojungikgi-tang | 2 |
| | | Seunggeumjowi-tang | 1 | Jihwangbaekho-tang | 1 | Oryeong-san | 1 |
| | | Seunggijowi-tang | 1 | Palmulgunja-tang | 1 | Seonghyangjeonggi-san | 1 |
| | | Cheongrijagam-tang | 1 | Sukjihwanggosam-tang | 1 | Seungyangikgi-tang | 1 |
| | | Cheonghyeolgangih-tang | 1 | Saenghwa-tang | 1 | Samgyepalmul-tang | 1 |
| | | Handayeolso-tang | 1 | Ganghwajihwang-tang | 1 | Doksamgwangyebujaijung-tang | 1 |
| | | | | Gamijihwang-tang | 1 | Dangguibaekhaoogwanjung-tang | 1 |
| | | | | | | unggihyangso-san | 1 |
| | | | | | | Gunggichiseup-tang | 1 |
| | | | | | | Gwangyebujaijung-tang | 1 |
| | | | | | | Hwanggyeogyeji-tang | 1 |
| Total | 14 | | 73 | | 111 | | 55 |

Table 7

Average Accuracy, F1-score, Precision, and Recall of Algorithms

| Algorithm | Average accuracy | Average F1 score | Average precision | Average recall |
|---|---|---|---|---|
| Random Forest Classifier | 0.57037 | 0.528 | 0.548862 | 0.557407 |
| XGBoost | 0.474074 | 0.45866 | 0.477778 | 0.482407 |
| LightGBM | 0.481481 | 0.446349 | 0.454074 | 0.478704 |
| Support Vector Machine | 0.544444 | 0.52039 | 0.548003 | 0.546296 |
| Logistic Regression | 0.518519 | 0.500861 | 0.533915 | 0.52037 |
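
For readers who want to compute the kind of metrics reported in Table 7, here is a minimal pure-Python sketch of accuracy and macro-averaged precision, recall, and F1 for a multiclass problem. Macro averaging is an assumption, since the source does not state the averaging scheme, and the labels below are made up rather than taken from the study.

```python
def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (no predictions or no true cases) are scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Hypothetical true and predicted prescription labels.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
scores = macro_metrics(y_true, y_pred)
```

Because macro averaging weights every class equally, a model can score differently on accuracy and macro F1 when classes are imbalanced, which is one reason to report both.
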