Romanica Olomucensia 2023, 35(1):23-40 | DOI: 10.5507/ro.2023.003

Etiquetaje de expresiones multipalabra en ensayos escritos por nativos y no nativos de español en un curso de desarrollo de gramática y composición

Miguel Da Corte, Jorge Baptista
Universidade do Algarve, Portugal
INESC-ID Lisboa, Portugal

La literatura sobre el aprendizaje de una segunda lengua postula que existen diferencias significativas entre el uso de expresiones multipalabra (EMP) por parte de hablantes nativos (HN) y no nativos (HNN). Además, considera que los niveles de competencia lingüística pueden estimarse a partir del uso de dichas expresiones. En este trabajo se analiza la producción escrita de un corpus de ensayos escritos por hablantes nativos (16 ensayos, 5.839 palabras) y no nativos de español (25 ensayos, 7.767 palabras) matriculados en un curso centrado en el desarrollo de las habilidades ortográficas, gramaticales, léxicas, semánticas y discursivas en español. Este es un curso de matriculación obligatoria para los estudiantes que aspiran a obtener el título de traductor o intérprete (español/inglés) en el centro educativo donde se realizó el estudio. Dos expertos lingüistas etiquetaron de forma manual el corpus del estudio. El esquema de clasificación utilizado se inspiró en otros esquemas encontrados en la literatura y construidos con fines similares. Los resultados mostraron que no se encontraron mayores diferencias (correlación de Pearson: 0,894) en la distribución de los tipos de EMP en el análisis del corpus de HN y HNN. Sin embargo, se observaron diferencias interesantes en algunas categorías, concretamente entre las expresiones fraseológicas verbales y los sustantivos comunes compuestos. Aunque el corpus es pequeño para llegar a conclusiones más relevantes, cabe destacar que los diferentes tipos de EMP se distribuyen de forma desigual en los textos escritos por los hablantes nativos y no nativos y que algunas categorías son un indicador más claro de un nivel de competencia más cercano al nivel de dominio del idioma materno.

Palabras clave: expresiones multipalabra; competencia lingüística; nivel de clasificación; modelos de aprendizaje automático; cursos para el desarrollo de habilidades (en español)

Multiword expression tagging of Spanish native and non-native speakers' written essays in a grammar and composition developmental course

The literature on second language learning posits that there are significant differences between the use of multiword expressions (MWE) by native speakers (NS) and non-native speakers (NNS). Furthermore, it considers that levels of language proficiency can be estimated on the basis of the use of these expressions. This paper analyses a corpus of essays written by native (16 essays, 5839 words) and non-native Spanish speakers (25 essays, 7767 words) enrolled in a course focused on the development of orthographic, grammatical, lexical, semantic, and discursive skills in Spanish. This is a required course for students pursuing a certification in Translating or Interpreting (Spanish/English) in the educational setting where the study took place. The corpus was manually tagged by two linguists. The classification scheme used was inspired by other schemes found in the literature and built for similar purposes. The results show that, in general, the distribution of MWE types in the NS and NNS partitions of the corpus was not very different (Pearson correlation: 0.894). However, interesting differences were found in some categories, specifically verbal idioms and noun constructions. Though the corpus is too small for more significant conclusions to be drawn, it is possible to point out that different types of MWE are unevenly distributed across the native and non-native speakers' written production, and some categories may be a clearer indicator of near-native-speaker proficiency.
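The Pearson correlation reported in the abstract compares how MWE types are distributed in the NS and NNS partitions. As a minimal sketch of that kind of comparison, and not the authors' actual procedure, the Python below turns per-category counts into relative frequencies and correlates the two profiles. The category labels, the counts, and the function names `relative_frequencies` and `pearson` are invented for illustration; only the essay and word totals come from the abstract.

```python
from math import sqrt


def relative_frequencies(counts):
    """Convert raw counts per category into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must not all be zero")
    return {category: n / total for category, n in counts.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # HYPOTHETICAL counts per MWE category: placeholders, not the study's data.
    ns_counts = {"verbal_idiom": 40, "compound_noun": 25, "adverbial": 30, "prepositional": 20}
    nns_counts = {"verbal_idiom": 28, "compound_noun": 45, "adverbial": 33, "prepositional": 24}

    ns_profile = relative_frequencies(ns_counts)
    nns_profile = relative_frequencies(nns_counts)

    # Align both profiles on the same category order before correlating.
    categories = sorted(ns_counts)
    r = pearson([ns_profile[c] for c in categories],
                [nns_profile[c] for c in categories])
    print(f"Pearson r between NS and NNS category profiles: {r:.3f}")

    # Corpus sizes as given in the abstract: mean essay length per group.
    print(f"NS mean words per essay:  {5839 / 16:.1f}")
    print(f"NNS mean words per essay: {7767 / 25:.1f}")
```

Using relative frequencies rather than raw counts does not change the Pearson coefficient itself, which is scale-invariant, but it makes the two profiles directly comparable given the different partition sizes. A high coefficient only says the overall profiles are similar; individual categories can still diverge, as the abstract notes for verbal idioms and noun constructions.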

Keywords: multiword expressions; language proficiency; classification level; machine-learning models; developmental education courses (in Spanish)

Received: November 12, 2022; Revised: November 12, 2022; Accepted: May 1, 2023; Published: July 26, 2023

Da Corte M, Baptista J. Multiword expression tagging of Spanish native and non-native speakers' written essays in a grammar and composition developmental course. Romanica Olomucensia. 2023;35(1):23-40. doi: 10.5507/ro.2023.003.

This is an open access article distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.