This study aimed to select features for identifying Korean grammatical collocations and explored the feasibility of classifying them with a machine-learning model. Candidate grammatical collocations whose nodes are nouns were extracted from the Sejong Corpus. Six features were identified: the frequency of lexical chains, the entropy of adjacent words in lexical chains, the mean and variance of the distance between components of lexical chains, and the entropy of the preceding and of the following elements of lexical chains whose nodes are nouns. To address the class imbalance in the training data, eight sampling techniques were applied. XGBoost classifiers were trained on both the original and the sampled datasets, and their performance was evaluated. The results indicated that the model trained on data sampled using a combination of SMOTE and ENN achieved the highest accuracy and classified the minority class of grammatical collocations better than the other models. The most influential features were 'frequency', 'minimum entropy', and 'variance of average distance'. However, limitations were identified: frequency and adjacent-word information alone were insufficient to fully capture the contextual meaning and grammatical characteristics of collocations. Future research should use transformer-based pre-trained models and embedding techniques to extract features that reflect the contextual meaning and functions of grammatical collocations. This approach is expected to facilitate an objective review of grammatical collocation lists for Korean language education and to help establish unified criteria for identifying grammatical collocations.
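As an illustration of the adjacent-element entropy features mentioned above, the following is a minimal sketch, not the study's implementation. It assumes Shannon entropy over the empirical distribution of the tokens immediately before and after each contiguous occurrence of a lexical chain. The function names, the boundary markers `<s>` and `</s>`, and the toy token sequence are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def neighbor_entropies(tokens, chain):
    """Return (left_entropy, right_entropy) for contiguous matches of chain.

    Assumption: a lexical chain is matched as a contiguous token sequence,
    and sentence boundaries are represented by placeholder markers.
    """
    chain = list(chain)
    n = len(chain)
    left, right = [], []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == chain:
            left.append(tokens[i - 1] if i > 0 else "<s>")
            right.append(tokens[i + n] if i + n < len(tokens) else "</s>")
    return shannon_entropy(left), shannon_entropy(right)


# Toy data: placeholder symbols stand in for Korean morphemes.
tokens = "a x y b c x y b a x y d".split()
chain = ["x", "y"]

left_h, right_h = neighbor_entropies(tokens, chain)
print(f"left entropy:  {left_h:.4f}")
print(f"right entropy: {right_h:.4f}")
```

Under this reading, low entropy on one side means the chain is usually flanked by the same few elements there. The exact feature definitions used in the study may differ.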
