Yuval Marton
News
October 2025: I’m proud of my former Master’s student, who took the time to rerun revised experiments and write a new version of our autoregressive-LLM-powered thematic fit estimation paper (the first of its kind). To be posted / submitted soon.
July 2025: I volunteered to serve as an Associate Editor for the ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP).
January 2025: I was appointed Research Co-Supervisor of a Computer Science and Engineering PhD student. This is a unique journey for both my student (who is working full-time in parallel) and me (as a non-tenured advisor). We will be working on our first paper together towards the end of this year.
June–November 2024: I served as Area Chair for Multilingual Work and Translation at the CoNLL 2024 conference, co-located with EMNLP 2024.
I am a researcher (a.k.a. research scientist), technical mentor, and consultant in Data Science (DS), Artificial Intelligence (AI), and Machine Learning / Deep Learning (ML/DL), with a specialty in Computational Linguistics / Natural Language Processing and Understanding (NLP/NLU) and human language technologies (HLT). I am also an affiliate professor at the University of Washington. I have an entrepreneurial side as well, and have worked at both startups and large corporations (IBM, Microsoft, Morgan Stanley, Bloomberg).
My research interests include semantic processing: semantic role labeling (SRL), semantic fit, and lexical semantics (corpus-based semantic similarity measures, paraphrase generation, and document understanding), as well as syntactic parsing and statistical machine translation. I have also worked on dialog systems / bots / NLU (for sales and task completion) and information retrieval (search engine result ranking). I am interested in using and adapting machine learning methods, including deep learning, for NLP/NLU, and in using linguistically informed learning bias and feature design to make such ML-with-NLP methods more effective. I enjoy mentoring individuals and teams on how to take their ideas and domain knowledge (say, in finance) and use best practices in Data Science, NLP, and AI to increase automation in their pipelines, define useful metrics to guide their efforts, and use human labeling to measure their success.
Email: yuvalmarton @t gmail.com
Twitter: @yuvalmarton
Profiles: Google Scholar, Semantic Scholar, LinkedIn.
Bio
Apart from my position in the "industry", I am an affiliate professor at the University of Washington, teaching NLP/MT courses, and advising graduate students.
I was a Post-Doctoral Research Scientist at the Columbia University Center for Computational Learning Systems (CCLS), where I worked with Nizar Habash and Owen Rambow on syntactic parsing, focusing on Arabic parsing for statistical machine translation (SMT), including subject detection, and morphological features for parsing.
I received my Ph.D. in linguistics from University of Maryland (UMD) in 2009. My advisors were Philip Resnik and Amy Weinberg, and my focus was on computational linguistics. My dissertation, entitled “Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models”, explored using soft syntactic and semantic constraints in end-to-end state-of-the-art statistical machine translation systems. It also introduced a novel distributional paraphrase generation technique that can benefit from soft semantic constraints, and can generate paraphrases of arbitrary length on-the-fly, with dynamically set context length. Last, my dissertation presented a generalized framework, of which these soft semantic and syntactic constraints can be viewed as instances, and in which they can be potentially combined.
Following my interests in neuro-biologically plausible cognitive and linguistic models, I took several fascinating neuroscience courses at the Neuroscience and Cognitive Science (NACS) Program, and received the NACS Certificate. My qualifying paper focused on visual word recognition. I argued there for a lexical representation consisting of both lower-level visual features and higher-level abstract letter objects, interacting with statistical factors (word frequency) and partly innate factors (left or right visual field perception). During the second half of my studies, I did research in this area with Carol Whitney.
Further back in time, I dabbled in text classification research (authorship attribution and topic / genre classification). My previous-previous advisor was Lisa Hellerstein, back when I was a computer science graduate student at the Polytechnic Institute of NYU (formerly Polytechnic University, Brooklyn, NY), where I received my Computer Science Master’s.
Publications
See my Google Scholar profile, Semantic Scholar profile, and/or the list below:
Yuval Marton, Imed Zitouni. “Transliteration normalization for Information Extraction and Machine Translation”. Journal of King Saud University - Computer and Information Sciences. Volume 26, Issue 4, December 2014, Pages 379–387. DOI: 10.1016/j.jksuci.2014.06.011
Foreign name transliterations typically include multiple spelling variants. These variants cause data sparseness and inconsistency problems, increase the Out-of-Vocabulary (OOV) rate, and present challenges for Machine Translation, Information Extraction and other natural language processing (NLP) tasks. This work aims to identify and cluster name spelling variants using a Statistical Machine Translation method: word alignment. The variants are identified by being aligned to the same “pivot” name in another language (the source language in Machine Translation settings). Based on word-to-word translation and transliteration probabilities, as well as the string edit distance metric, names with similar spellings in the target language are clustered and then normalized to a canonical form. With this approach, tens of thousands of high-precision name transliteration spelling variants are extracted from sentence-aligned bilingual corpora in Arabic and English (in both languages). When these normalized name spelling variants are applied to Information Extraction tasks, improvements over strong baseline systems are observed. When applied to Machine Translation tasks, a large improvement potential is shown.
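A minimal sketch of the clustering-and-normalization idea (illustrative only: real name pairs come from word-aligned bitext, and the paper additionally weighs translation and transliteration probabilities, which this toy version replaces with frequency and edit distance):

```python
from collections import defaultdict

def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance, single-row dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def cluster_and_normalize(aligned_pairs, max_dist=2):
    """aligned_pairs: (pivot_name, target_spelling) tuples harvested from
    word alignment. Variants aligned to the same pivot form one cluster;
    each is normalized to the cluster's most frequent spelling when the
    edit distance is small enough."""
    clusters = defaultdict(list)
    for pivot, spelling in aligned_pairs:
        clusters[pivot].append(spelling)
    normalize = {}
    for spellings in clusters.values():
        canon = max(set(spellings), key=spellings.count)
        for s in set(spellings):
            if edit_distance(s, canon) <= max_dist:
                normalize[s] = canon
    return normalize

# Hypothetical alignments of English spellings to one Arabic pivot name:
pairs = [("قذافي", "Qaddafi"), ("قذافي", "Qaddafi"),
         ("قذافي", "Gaddafi"), ("قذافي", "Kadhafi")]
print(cluster_and_normalize(pairs))  # all variants map to "Qaddafi"
```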
Mona Diab and Yuval Marton. “Semantic Processing of Semitic Languages”. Book chapter, in “Semitic Language Processing”, Imed Zitouni, Ed. Springer, March 2014, pp 129-159. ISBN: 978-3-642-45357-1 (Print) 978-3-642-45358-8 (Online)
In this chapter, we cover semantic processing in Semitic languages. We will present models of semantic processing over words and their relations in sentences, namely paradigmatic and syntagmatic models. We will contrast the processing of Semitic languages against English, illustrating some of the challenges (and clues) due to the inherent unique characteristics of Semitic languages.
Yuval Marton. “Distributional Phrasal Paraphrase Generation for Statistical Machine Translation”. ACM Transactions on Intelligent Systems and Technology (TIST) special issue on paraphrasing. Eds.: Haifeng Wang, Bill Dolan, Idan Szpektor, Shiqi Zhao. Volume 4, Issue 3, June 2013.
Paraphrase generation has been shown useful for various natural language processing tasks, including statistical machine translation. A commonly used method for paraphrase generation is pivoting [Callison-Burch et al. 2006], which benefits from linguistic knowledge implicit in the sentence alignment of parallel texts, but has limited applicability due to its reliance on parallel texts. Distributional paraphrasing [Marton et al. 2009] has wider applicability and is more language-independent, but doesn’t benefit from any linguistic knowledge. Nevertheless, we show that distributional paraphrasing can yield greater gains. We report method improvements leading to higher gains than previously published (almost 2 BLEU points), and provide implementation details, complexity analysis, and further insight into this method.
Yuval Marton, Nizar Habash and Owen Rambow. “Dependency Parsing of Modern Standard Arabic with Lexical and Inflectional Features”. Computational Linguistics, Volume 39, Issue 1, pages 161-194. Posted Online March 1, 2013.
We explore the contribution of lexical and inflectional morphology features to dependency parsing of Arabic, a morphologically rich language with complex agreement patterns. Using controlled experiments, we contrast the contribution of different part-of-speech (POS) tagsets and morphological features in two input conditions: machine-predicted condition (in which POS tags and morphological feature values are automatically assigned), and gold condition (in which their true values are known). We find that more informative (fine-grained) tagsets are useful in the gold condition, but may be detrimental in the predicted condition, where they are outperformed by simpler but more accurately predicted tagsets. We identify a set of features (definiteness, person, number, gender, and undiacritized lemma) that improve parsing quality in the predicted condition, while other features are more useful in gold. We are the first to show that functional features for gender and number (e.g., “broken plurals”), and optionally the related rationality (“humanness”) feature, are more helpful for parsing than form-based gender and number. We finally show that parsing quality in the predicted condition can dramatically improve by training in a combined gold+predicted condition. We experimented with two transition-based parsers, MaltParser and Easy-First Parser. Our findings are robust across parsers, models and input conditions. This suggests that the contribution of the linguistic knowledge in the tagsets and features we identified goes beyond particular experimental settings, and may be informative for other parsers and morphologically rich languages.
Marianna Apidianaki, Ido Dagan, Jennifer Foster, Yuval Marton, Djamé Seddah, Reut Tsarfaty. "Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages" (SP-Sem-MRL). Proceedings of the 50th meeting of the Association for Computational Linguistics (ACL), Jeju Island, Korea, July 12, 2012. PDF
Morphologically Rich Languages (MRLs) are languages in which grammatical relations such as Subject, Predicate, and Object are largely indicated morphologically (e.g., through inflection) instead of positionally. This poses serious challenges for current (English-centric) syntactic and semantic processing. Furthermore, since grammatical relations provide the interface to compositional semantics, morpho-syntactic phenomena may significantly complicate processing the syntax–semantics interface. In statistical parsing, English parsing performance has reached a high plateau in certain genres. Semantic processing of English has similarly seen much progress in recent years. MRL processing presents new challenges, such as optimal morphological representation, non-position-centric algorithms, or different semantic distance measures. …
Yuval Marton, David Chiang, and Philip Resnik. “Soft Syntactic Constraints for Arabic-English Hierarchical Phrase-Based Translation”. Machine Translation Journal Special Issues on Machine Translation for Arabic. Editor-in-Chief: Andy Way, Guest Co-Editors: Nizar Habash and Hany Hassan. Paginated version: Volume 26, Issue 1 (2012), pages 137-157. Online version: Journal no. 10590, 29 October 2011
In adding syntax to statistical machine translation, there is a tradeoff between taking advantage of linguistic analysis and allowing the model to exploit parallel training data with no linguistic analysis: translation quality versus coverage. A number of previous efforts have tackled this tradeoff by starting with a commitment to linguistically motivated analyses and then finding appropriate ways to soften that commitment. We present an approach that explores the tradeoff from the other direction, starting with a translation model learned directly from aligned parallel text, and then adding soft constituent-level constraints based on parses of the source language. We argue that in order for these constraints to improve translation, they must be fine-grained: the constraints should vary by constituent type, and by the type of match or mismatch with the parse. We also use a different feature weight optimization technique, capable of handling a large number of features, thus eliminating the bottleneck of feature selection. We obtain substantial improvements in performance for translation from Arabic to English.
Marine Carpuat, Yuval Marton, and Nizar Habash. “Reordering Post-verbal Subjects for Arabic-to-English Statistical Machine Translation”. Machine Translation Journal Special Issues on Machine Translation for Arabic. Editor-in-Chief: Andy Way, Guest Co-Editors: Nizar Habash and Hany Hassan. Paginated version: Volume 26, Issue 1 (2012), pages 105-120. Online version: Journal no. 10590, 8 November 2011
We study challenges raised by the order of Arabic verbs and their subjects in Statistical Machine Translation (SMT). We show that the boundaries of post-verbal subjects (VS) are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. In addition, VS constructions have highly ambiguous reordering patterns when translated to English, and these patterns are very different for matrix (main clause) VS and non-matrix (subordinate clause) VS. Based on this analysis, we propose a novel method for leveraging VS information in SMT: we reorder VS constructions into SV order for word alignment. Unlike in previous approaches to source-side reordering, phrase extraction and decoding are performed using the original Arabic word order. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline. Limiting reordering to matrix VS yields further improvements.
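A sketch of the alignment-only reordering step (hypothetical token indices; detecting the VS spans is the hard part, done by a parser in the paper):

```python
def reorder_vs_for_alignment(tokens, vs_spans):
    """Move each post-verbal subject span to just before its verb, and
    return the permuted sentence plus a map from new positions back to
    the original ones, so alignment links learned on the permuted order
    can be projected back (phrase extraction and decoding then use the
    original Arabic order, as in the paper)."""
    order = list(range(len(tokens)))
    for verb_idx, s, e in sorted(vs_spans, reverse=True):
        subj = order[s:e + 1]
        del order[s:e + 1]
        order[verb_idx:verb_idx] = subj
    permuted = [tokens[i] for i in order]
    back = {new: old for new, old in enumerate(order)}
    return permuted, back

# Hypothetical VSO clause: verb at 0, subject span (1, 2), object at 3.
perm, back = reorder_vs_for_alignment(["V", "S1", "S2", "O"], [(0, 1, 2)])
print(perm)  # ['S1', 'S2', 'V', 'O'] -- SVO order for the aligner only
print(back)  # {0: 1, 1: 2, 2: 0, 3: 3}: projects alignment links back
```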
Yuval Marton, Ning Wu, and Lisa Hellerstein. "On Compression-Based Text Classification". Advances in Information Retrieval, Lecture Notes in Computer Science, Volume 3408, 2005, pages 300-314. Abstract, full paper, and errata note available.
Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word-stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
See my ECIR paper below.
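A minimal sketch of the general compression-based classification recipe these papers evaluate (off-the-shelf compressor, virtually no preprocessing; the papers compare several compressors and variants, while this toy version uses zlib and two tiny "corpora"):

```python
import zlib

def csize(text: str) -> int:
    return len(zlib.compress(text.encode("utf-8"), 9))

def classify(doc: str, class_corpora: dict) -> str:
    """Assign doc to the class whose training text compresses it best:
    the class for which appending doc adds the fewest extra bytes,
    approximating the document's cross-entropy under the class model."""
    def extra(train: str) -> int:
        return csize(train + " " + doc) - csize(train)
    return min(class_corpora, key=lambda c: extra(class_corpora[c]))

corpora = {
    "sports": "the striker scored a late goal and the team won the match",
    "finance": "shares fell as the bank reported lower quarterly earnings",
}
print(classify("the striker scored twice in the second half", corpora))
# -> "sports" (more shared substrings, so fewer extra compressed bytes)
```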
Conference Papers
Yuval Marton and Asad Sayeed. “Thematic fit bits: Annotation quality and quantity for event participant representation”. LREC 2022. (arXiv copy)
Modeling thematic fit (a verb–argument compositional semantics task) currently requires a very large burden of data. We take a high-performing neural approach to modeling verb–argument fit, previously trained on a linguistically machine-annotated large corpus, and replace corpus layers with output from higher-quality taggers. Contrary to popular beliefs that, in the deep learning era, more data is as effective as higher quality annotation, we discover that higher annotation quality dramatically reduces our data requirement while demonstrating better supervised predicate-argument classification. But in applying the model to a psycholinguistic task outside the training objective, we saw only small gains in one of two thematic fit estimation tasks, and none in the other. We replicate previous studies while modifying certain role representation details, and set a new state-of-the-art in event modeling, using a fraction of the data.
Yuval Marton and Kristina Toutanova. “E-TIPSY: Search Query Corpus Annotated with Entities, Term Importance, POS Tags, and Syntactic Parses”. The Tenth Edition of the Language Resources and Evaluation Conference (LREC), 23-28 May 2016, Portorož (Slovenia).
We present E-TIPSY, a search query corpus annotated with named Entities, Term Importance, POS tags, and SYntactic parses. This corpus contains crowdsourced (gold) annotations of the three most important terms in each query. In addition, it contains automatically produced annotations of named entities, part-of-speech tags, and syntactic parses for the same queries. This corpus comes in two formats: (1) Sober Subset: annotations that two or more crowd workers agreed upon, and (2) Full Glass: all annotations. We analyze the strikingly low correlation between term importance and syntactic headedness, which invites research into effective ways of combining these different signals. Our corpus can serve as a benchmark for term importance methods aimed at improving search engine quality and as an initial step toward developing a dataset of gold linguistic analysis of web search queries. In addition, it can be used as a basis for linguistic inquiries into the kind of expressions used in search.
Junhui Li, Yuval Marton, Hal Daumé III, and Philip Resnik. “A Unified Model for Soft Linguistic Reordering Constraints in Statistical Machine Translation”. The 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Baltimore, MD, June 25-29, 2014. Online version.
This paper explores a simple and effective unified framework for incorporating soft linguistic reordering constraints into a hierarchical phrase-based translation system: 1) a syntactic reordering model that explores reordering for context free grammar rules; and 2) a semantic reordering model that focuses on the reordering of predicate-argument structures. We develop novel features based on both models and use them as soft constraints to guide the translation process. Experiments on Chinese-English translation show that the reordering approach can significantly improve a state-of-the-art hierarchical phrase-based translation system. However, the gain achieved by the semantic reordering model is limited in the presence of the syntactic reordering model, and we therefore provide a detailed analysis of the behavior differences between the two.
Vladimir Eidelman, Yuval Marton, and Philip Resnik. “Online Relative Margin Maximization for Statistical Machine Translation”. The 51st Annual Meeting of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, August 4-9, 2013.
Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread of the projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of up to 1.5 BLEU and 5.9 TER over state-of-the-art optimizers.
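For orientation, the relative margin machine objective (Shivaswamy and Jebara) that the paper adapts online, and to structured translation features, is, in its original binary-classification form:

```latex
\min_{\mathbf{w},b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_i \xi_i
\quad \text{s.t.} \quad
y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i, \qquad
\lvert \mathbf{w}^\top\mathbf{x}_i + b \rvert \le B, \qquad \xi_i \ge 0 .
```

The second constraint is what bounds the spread of the projected data by B, while the first maximizes the (soft) margin.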
Yuval Marton, Nizar Habash and Owen Rambow. “Improving Arabic Dependency Parsing with Lexical and Inflectional Surface and Functional Features”. The 49th Annual Meeting of the Association for Computational Linguistics (ACL), Portland, Oregon, USA, June 19-24, 2011. Full paper.
We explore the contribution of lexical and morphological features to dependency parsing of Arabic, a morphologically rich language. Using controlled experiments, we find that definiteness, person, number, gender, and undiacritized lemma are most helpful for parsing on automatically tagged input. We further contrast the contribution of surface and functional features, and show that functional features for gender and number (e.g., “broken plurals”) and the related rationality feature improve over surface-based features. It is the first time these functional features are used for Arabic NLP.
Hao Li, Xiang Li, Heng Ji, and Yuval Marton. “Domain-Independent Novel Event Discovery and Semi-Automatic Event Annotation”. The 24th Pacific Asia Conference on Language, Information and Computation (PACLIC), Sendai, Japan, November 4-7, 2010. Full paper.
Information Extraction (IE) is becoming increasingly useful, but it is a costly task to discover and annotate novel events, event arguments, and event types. We exploit both monolingual texts and bilingual sentence-aligned parallel texts to cluster event triggers and discover novel event types. We then generate event argument annotations semi-automatically, framed as a sentence ranking and semantic role labeling task. Experiments on three different corpora (ACE, OntoNotes, and a collection of scientific literature) have demonstrated that our domain-independent methods can significantly speed up the entire event discovery and annotation process while maintaining high quality.
Yuval Marton. “Improved Statistical Machine Translation Using Monolingual Text and a Shallow Lexical Resource for Hybrid Phrasal Paraphrase Generation”. The Ninth Conference of the Association for Machine Translation in the Americas (AMTA), Denver, Colorado, October 31 – November 5, 2010. Full paper.
Paraphrase generation is useful for various NLP tasks. But pivoting techniques for paraphrasing have limited applicability due to their reliance on parallel texts, although they benefit from linguistic knowledge implicit in the sentence alignment. Distributional paraphrasing has wider applicability, but doesn’t benefit from any linguistic knowledge. We combine a distributional semantic distance measure (based on a non-annotated corpus) with a shallow linguistic resource to create a hybrid semantic distance measure of words, which we extend to phrases. We embed this extended hybrid measure in a distributional paraphrasing technique, benefiting from both linguistic knowledge and independence from parallel texts. Evaluated in statistical machine translation tasks by augmenting translation models with paraphrase-based translation rules, we show our novel technique is superior to the non-augmented baseline and both the distributional and pivot paraphrasing techniques. We train models on both a full-size dataset and a simulated “low density” small dataset.
Marine Carpuat, Yuval Marton, and Nizar Habash. “Improving Arabic-to-English Statistical Machine Translation by Reordering Post-verbal Subjects for Alignment”. The 48th Annual Meeting of the Association for Computational Linguistics (ACL), Uppsala, Sweden, July 11–16, 2010. Short paper.
We study the challenges raised by Arabic verb and subject detection and reordering in Statistical Machine Translation (SMT). We show that post-verbal subject (VS) constructions are hard to translate because they have highly ambiguous reordering patterns when translated to English. In addition, implementing reordering is difficult because the boundaries of VS constructions are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. We therefore propose to reorder VS constructions into SV order for SMT word alignment only. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline and despite noisy parses.
Marine Carpuat, Yuval Marton, Nizar Habash. “Reordering Matrix Post-verbal Subjects for Arabic-to-English SMT”. 17th Conférence sur le Traitement Automatique des Langues Naturelles (TALN; Conference on Natural Language Processing), Montréal, Canada, July 19-22, 2010. Best paper award. Full paper.
We improve our recently proposed technique for integrating Arabic verb-subject constructions in SMT word alignment (Carpuat et al., 2010) by distinguishing between matrix (or main clause) and non-matrix Arabic verb-subject constructions. In gold translations, most matrix VS (main clause verb-subject) constructions are translated in inverted SV order, while non-matrix (subordinate clause) VS constructions are inverted in only half the cases. In addition, while detecting verbs and their subjects is a hard task, our syntactic parser detects VS constructions better in matrix than in non-matrix clauses. As a result, reordering only matrix VS for word alignment consistently improves translation quality over a phrase-based SMT baseline, and over reordering all VS constructions, in both medium- and large-scale settings. In fact, the improvements obtained by reordering matrix VS on the medium-scale setting remarkably represent 44% of the gain in BLEU and 51% of the gain in TER obtained with a word alignment training bitext that is 5 times larger.
Yuval Marton, Chris Callison-Burch and Philip Resnik. “Improved Statistical Machine Translation Using Monolingually-derived Paraphrases”. Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore, August 6-7, 2009. Full paper.
Untranslated words still constitute a major problem for Statistical Machine Translation (SMT), and current SMT systems are limited by the quantity of parallel training texts. Augmenting the training data with paraphrases generated by pivoting through other languages alleviates this problem, especially for the so-called “low density” languages. But pivoting requires additional parallel texts. We address this problem by deriving paraphrases monolingually, using distributional semantic similarity measures, thus providing access to larger training resources, such as comparable and unrelated monolingual corpora. We present what is to our knowledge the first successful integration of a collocational approach to untranslated words with an end-to-end, state-of-the-art SMT system, demonstrating significant translation improvements in a low-resource setting.
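A toy sketch of the monolingual, distributional core of the method (word-level context-count vectors and cosine similarity; the paper works at the phrase level and plugs the candidates into an SMT system's translation model):

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences, window=2):
    """Distributional profile of each word: counts of neighbors within
    a +/- window, gathered from a plain (non-parallel) corpus."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vecs[w][sent[j]] += 1
    return vecs

def cosine(u, v):
    dot = sum(c * v[k] for k, c in u.items() if k in v)
    norms = sqrt(sum(c * c for c in u.values())) * \
            sqrt(sum(c * c for c in v.values()))
    return dot / norms if norms else 0.0

def paraphrase_candidates(word, vecs, top_k=3):
    return sorted(((w, cosine(vecs[word], vecs[w]))
                   for w in vecs if w != word), key=lambda p: -p[1])[:top_k]

corpus = [s.split() for s in [
    "the army seized the town", "the military seized the village",
    "the army entered the town", "the military entered the city"]]
print(paraphrase_candidates("army", context_vectors(corpus)))
# "military" tops the list: it occurs in the most similar contexts
```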
Yuval Marton, Saif Mohammad and Philip Resnik. “Estimating Semantic Distance Using Soft Semantic Constraints in Knowledge-Source / Corpus Hybrid Models”. Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore, August 6-7, 2009. Full paper.
We propose a corpus–thesaurus hybrid method that uses soft constraints to generate word-sense-disambiguated distributional profiles (DPs) from coarser “concept DPs” (derived from a small Roget-like thesaurus) and sense-unaware traditional word DPs (derived from raw text). Not relying on a large lexical resource makes this method suitable also for resource-poorer languages or specific domains. Although it uses a knowledge source, the method is not vocabulary-limited: if the target word is not in the thesaurus, the method falls back gracefully on the word’s co-occurrence information. Experiments on ranking word pairs by semantic distance show the new hybrid method to be superior to others.
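A sketch of the fallback behavior described above (an illustrative mixing scheme, not the paper's exact model: a word's profile is blended with its thesaurus concepts' profiles when available, and left purely corpus-based otherwise):

```python
def hybrid_profile(word, word_dps, concept_dps, thesaurus, alpha=0.5):
    """word_dps: word -> {context: weight} from raw text;
    concept_dps: thesaurus concept -> {context: weight};
    thesaurus: word -> list of concepts it belongs to."""
    wp = word_dps.get(word, {})
    concepts = thesaurus.get(word, [])
    if not concepts:
        return wp  # graceful fallback: co-occurrence information only
    cp = {}
    for c in concepts:  # average the profiles of the word's concepts
        for ctx, v in concept_dps.get(c, {}).items():
            cp[ctx] = cp.get(ctx, 0.0) + v / len(concepts)
    return {k: (1 - alpha) * wp.get(k, 0.0) + alpha * cp.get(k, 0.0)
            for k in set(wp) | set(cp)}

# Toy example: "bank" mixes in its concepts' profiles, while an
# out-of-thesaurus word keeps its plain corpus profile unchanged.
word_dps = {"bank": {"money": 3, "river": 2}, "fintech": {"money": 5}}
concept_dps = {"FINANCE": {"money": 10, "loan": 4}, "GEO": {"river": 8}}
thesaurus = {"bank": ["FINANCE", "GEO"]}
print(hybrid_profile("bank", word_dps, concept_dps, thesaurus))
print(hybrid_profile("fintech", word_dps, concept_dps, thesaurus))
```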
David Chiang, Yuval Marton and Philip Resnik. “Online Large-Margin Training of Syntactic and Structural Translation Features”. Conference on Empirical Methods in Natural Language Processing (EMNLP 2008). Waikiki, Honolulu, Hawaii, October 25-27, 2008. Full paper.
Minimum-error-rate training (MERT) is a bottleneck for current development in statistical machine translation because it is limited in the number of weights it can reliably optimize. Building on the work of Watanabe et al., we explore the use of the MIRA algorithm of Crammer et al. as an alternative to MERT. We first show that by parallel processing and exploiting more of the parse forest, we can obtain results using MIRA that match or surpass MERT in terms of both translation quality and computational cost. We then test the method on two classes of features that address deficiencies in the Hiero hierarchical phrase-based model: first, we simultaneously train a large number of Marton and Resnik’s soft syntactic constraints, and, second, we introduce a novel structural distortion model. In both cases we obtain significant improvements in translation performance. Optimizing them in combination, for a total of 56 feature weights, we improve performance by 2.6 BLEU on a subset of the NIST 2006 Arabic-English evaluation data.
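For reference, the per-sentence MIRA update discussed here solves a small quadratic program of roughly this standard form (the paper's particular oracle and loss choices aside):

```latex
w_{t+1} = \operatorname*{arg\,min}_{w}\ \tfrac{1}{2}\lVert w - w_t \rVert^2
          + C\sum_{j}\xi_j
\quad \text{s.t.} \quad
w^\top\!\big(f(x, y^{*}) - f(x, \hat y_j)\big) \ge \ell(y^{*}, \hat y_j) - \xi_j,
\qquad \xi_j \ge 0,
```

where f is the feature vector, y* is an oracle translation, the ŷ_j range over k-best candidate translations, and ℓ is a BLEU-based loss.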
Yuval Marton and Philip Resnik. “Soft Syntactic Constraints for Hierarchical Phrased-Based Translation”. The 46th Annual Meeting of the Association for Computational Linguistics (ACL). Columbus, Ohio, June 16-18, 2008. Full paper.
In adding syntax to statistical MT, there is a tradeoff between taking advantage of linguistic analysis, versus allowing the model to exploit linguistically unmotivated mappings learned from parallel training data. A number of previous efforts have tackled this tradeoff by starting with a commitment to linguistically motivated analyses and then finding appropriate ways to soften that commitment. We present an approach that explores the tradeoff from the other direction, starting with a context-free translation model learned directly from aligned parallel text, and then adding soft constituent-level constraints based on parses of the source language. We obtain substantial improvements in performance for translation from Chinese and Arabic to English.
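The fine-grained constraints boil down to features along the lines of this sketch (one weighted feature per constituent label and relation type; inclusive word-index spans are an assumption here, and the real system applies this inside the decoder):

```python
def constituent_feature(rule_span, parse_constituents, label):
    """Classify how a translation rule's source span relates to the
    source parse: exact match with a constituent of the given label,
    crossing its boundary, or neither. Each (label, relation) pair can
    then back a separate weighted feature in the translation model."""
    start, end = rule_span  # inclusive word indices
    for s, e, lab in parse_constituents:
        if lab != label:
            continue
        if (s, e) == (start, end):
            return "match"   # e.g., a "matches NP" feature fires
        if s < start <= e < end or start < s <= end < e:
            return "cross"   # e.g., a "crosses NP" feature fires
    return None

# Parse of a 5-word sentence: an NP over words 0-1, a VP over words 2-4.
constituents = [(0, 1, "NP"), (2, 4, "VP")]
print(constituent_feature((0, 1), constituents, "NP"))  # match
print(constituent_feature((1, 2), constituents, "NP"))  # cross
```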
Yuval Marton, Ning Wu, and Lisa Hellerstein. "On Compression-Based Text Classification". Proceedings of the 27th European Conference on Information Retrieval (ECIR), Spain, March 2005. Abstract, full paper, and errata note available.
(Same abstract as the LNCS version listed above.)
Workshop Papers
John Cadigan and Yuval Marton. “GISA: Giza++ Implementation over Spark by Apache”. AMTA 2016 Semitic Machine Translation Workshop (SeMaT). Austin, Texas, October 28 - November 1, 2016. PDF.
The application of distributed computation to statistical machine translation has been a topic of interest for both research and industry because it allows rapid processing of massive datasets in a reasonable amount of time. Apache Spark, the nascent distributed computation framework, has been found to offer 10 to 100 times speedups for machine learning algorithms compared with the state-of-the-art, Hadoop. We implemented a word alignment tool with IBM Model 1 on a cluster with Spark, yielding 2.2–4.7 times end-to-end speedup over GIZA++ and up to 1.2–1.9 over the multi-threaded MGIZA for mid-size English-French and Arabic-English corpora, with potential to scale further.
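A minimal PySpark sketch of one IBM Model 1 EM iteration, the kind of computation GISA distributes (toy data, NULL alignment omitted; not the GISA code itself):

```python
from collections import defaultdict
from pyspark import SparkContext

sc = SparkContext(appName="model1-em-sketch")
bitext = sc.parallelize([
    ("the house".split(), "la maison".split()),
    ("the car".split(), "la voiture".split()),
])

t = {}  # t(f|e); missing entries fall back to a uniform constant

def e_step(pair):
    e_sent, f_sent = pair
    for f in f_sent:
        z = sum(t.get((f, e), 0.25) for e in e_sent)  # per-f normalizer
        for e in e_sent:
            yield ((f, e), t.get((f, e), 0.25) / z)   # fractional count

# E-step as a flatMap, count aggregation as a reduceByKey:
counts = bitext.flatMap(e_step).reduceByKey(lambda a, b: a + b).collect()

totals = defaultdict(float)  # M-step: renormalize counts per source word e
for (f, e), c in counts:
    totals[e] += c
t = {(f, e): c / totals[e] for (f, e), c in counts}
print(t[("maison", "house")])  # 0.5 after one iteration, up from uniform
sc.stop()
```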
Yuval Marton, Nizar Habash, Owen Rambow, and Sarah Alkuhlani. “SPMRL’13 Shared Task System: The CADIM Arabic Dependency Parser”. The EMNLP Fourth Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL), Seattle, WA, October 18, 2013.
[Best single parser (non-system-combination parser) on predicted tokenization and part-of-speech tags; second place on gold tokenization.]
We describe the submission from the Columbia Arabic & Dialect Modeling group (CADIM) for the Shared Task at the Fourth Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL’2013). We participate in the Arabic Dependency parsing task for predicted POS tags and features. Our system is based on Marton et al. (2013).
Yuval Marton, Ahmed El-Kholy and Nizar Habash. “Filtering Antonymous, Trend-Contrasting, and Polarity-Dissimilar Distributional Paraphrases for Improving Statistical Machine Translation”. EMNLP Sixth Workshop on Statistical Machine Translation (WMT), Edinburgh, UK, July 30-31, 2011. PDF
Paraphrases are useful for statistical machine translation (SMT) and natural language processing tasks. Distributional paraphrase generation is independent of parallel texts and syntactic parses, and hence is suitable also for resource-poor languages, but tends to erroneously rank antonyms, trend-contrasting, and polarity-dissimilar candidates as good paraphrases. We present here a novel method for improving distributional paraphrasing by filtering out such candidates. We evaluate it in simulated low- and mid-resourced SMT tasks, translating from English to two quite different languages. We show statistically significant gains in English-to-Chinese translation quality, up to 1 BLEU from non-filtered paraphrase-augmented models (1.6 BLEU from baseline). We also show that yielding gains in translation to Arabic, a morphologically rich language, is not straightforward.
Yuval Marton, Nizar Habash, and Owen Rambow. “Improving Arabic Dependency Parsing with Inflectional and Lexical Morphological Features”. Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL) at Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Los Angeles, USA, June 1–6, 2010. PDF.
We explore the contribution of different lexical and inflectional morphological features to dependency parsing of Arabic, a morphologically rich language. We experiment with all leading POS tagsets for Arabic, and introduce a few new sets. We show that training the parser using a simple regular-expression-based extension of an impoverished POS tagset with high prediction accuracy does better than using a highly informative POS tagset with only medium prediction accuracy, although the latter performs best on gold input. Using controlled experiments, we find that definiteness (or determiner presence), the so-called phi-features (person, number, gender), and undiacritized lemma are most helpful for Arabic parsing on predicted input, while case and state are most helpful on gold.
Chris Dyer, Hendra Setiawan, Yuval Marton, and Philip Resnik. “The University of Maryland Statistical Machine Translation System for the Fourth Workshop on Machine Translation”. EACL 2009 Fourth Workshop on Statistical Machine Translation, March 2009, Athens, Greece. PDF.
This paper describes the techniques we explored to improve the translation of news text in the German-English and Hungarian-English tracks of the WMT09 shared translation task. Beginning with a conventional hierarchical phrase-based system, we found benefits for using word segmentation lattices as input, explicit generation of beginning and end of sentence markers, minimum Bayes risk decoding, and incorporation of a feature scoring the alignment of function words in the hypothesized translation. We also explored the use of monolingual paraphrases to improve coverage, as well as co-training to improve the quality of the segmentation lattices used, but these did not lead to improvements.
Theses and Other Publications
Yuval Marton. “Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models”. Ph.D. Dissertation, Department of Linguistics, University of Maryland, October 2009. Official format or paper-saving single-space format.
This dissertation focuses on effective combination of data-driven natural language processing (NLP) approaches with linguistic knowledge sources that are based on manual text annotation or word grouping according to semantic commonalities. I gainfully apply fine-grained linguistic soft constraints – of syntactic or semantic nature – on statistical NLP models, evaluated in end-to-end state-of-the-art statistical machine translation (SMT) systems. The introduction of semantic soft constraints involves intrinsic evaluation on word-pair similarity ranking tasks, extension from words to phrases, application in a novel distributional paraphrase generation technique, and an introduction of a generalized framework of which these soft semantic and syntactic constraints can be viewed as instances, and in which they can be potentially combined.
Fine granularity is key in the successful combination of these soft constraints, in many cases. I show how to softly constrain SMT models by adding fine-grained weighted features, each preferring translation of only a specific syntactic constituent. Previous attempts using coarse-grained features yielded negative results. I also show how to softly constrain corpus-based semantic models of words (“distributional profiles”) to effectively create word-sense-aware models, by using semantic word grouping information found in a manually compiled thesaurus. Previous attempts, using hard constraints and resulting in aggregated, coarse-grained models, yielded lower gains.
A novel paraphrase generation technique incorporating these soft semantic constraints is then also evaluated in an SMT system. This paraphrasing technique is based on the Distributional Hypothesis. The main advantage of this novel technique over current “pivoting” techniques for paraphrasing is the independence from parallel texts, which are a limited resource. The evaluation is done by augmenting translation models with paraphrase-based translation rules, where fine-grained scoring of paraphrase-based rules yields significantly higher gains.
The model augmentation includes a novel semantic reinforcement component: In many cases there are alternative paths of generating a paraphrase-based translation rule. Each of these paths reinforces a dedicated score for the “goodness” of the new translation rule. This augmented score is then used as a soft constraint, in a weighted log-linear feature, letting the translation model learn how much to “trust” the paraphrase-based translation rules.
The work reported here is the first to use distributional semantic similarity measures to improve the performance of an end-to-end phrase-based SMT system. The unified framework for statistical NLP models with soft linguistic constraints enables, in principle, the combination of both semantic and syntactic constraints (and potentially other constraints, too) in a single SMT model.
Yuval Marton. “Character-Based and Word-Based Classification: Experiments with Compression Methods and a Word-Based Language Modeling Method”. Master’s thesis, NYU/Poly, CIS Department, 2004.
Text classification is the task of taking a set of input documents that are labeled by category, and using that input information to classify other, unlabeled documents. There are many approaches to text classification. A somewhat non-standard approach is to use compression. […]
Safeyah Khaled Alshemali, Daniel Bauer and Yuval Marton. “Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation”. arXiv 2024.
The thematic fit estimation task measures the compatibility between a predicate (typically a verb), an argument (typically a noun phrase), and a specific semantic role assigned to the argument. Previous state-of-the-art work has focused on modeling thematic fit through distributional or neural models of event representation, trained in a supervised fashion with indirect labels. In this work, we assess whether pre-trained auto-regressive LLMs possess consistent, expressible knowledge about thematic fit. We evaluate both closed and open state-of-the-art LLMs on several psycholinguistic datasets, along three axes: (1) Reasoning Form: multi-step logical reasoning (chain-of-thought prompting) vs. simple prompting. (2) Input Form: providing context (generated sentences) vs. raw tuples. (3) Output Form: categorical vs. numeric. Our results show that chain-of-thought reasoning is more effective on datasets with self-explanatory semantic role labels, especially Location. Generated sentences helped only in a few settings, and lowered results in many others. Predefined categorical (compared to numeric) output raised GPT's results across the board with a few exceptions, but lowered Llama's. We saw that semantically incoherent generated sentences, which the models lack the ability to consistently filter out, hurt reasoning and overall performance too. Our GPT-powered methods set a new state-of-the-art on all tested datasets.
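A sketch of the simple-prompting, categorical-output, raw-tuple condition (hypothetical prompt wording, label set, and model name; the paper's exact prompts and models differ):

```python
from openai import OpenAI  # assumes the openai Python client is installed

client = OpenAI()

def thematic_fit(verb: str, role: str, filler: str) -> str:
    """Query an LLM for a categorical thematic fit judgment on a raw
    (verb, role, filler) tuple, with no chain-of-thought step."""
    prompt = (
        f"How plausible is '{filler}' as the {role} of the verb '{verb}'? "
        "Answer with exactly one word: implausible, somewhat_plausible, "
        "or plausible."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in; the paper evaluates several LLMs
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(thematic_fit("cut", "instrument", "knife"))   # expected: plausible
print(thematic_fit("cut", "instrument", "pillow"))  # expected: implausible
```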
Mughilan Muthupari, Samrat Halder, Asad Sayeed, Yuval Marton. “Where's the Learning in Representation Learning for Compositional Semantics and the Case of Thematic Fit”. arXiv 2022.
Observing that for certain NLP tasks, such as semantic role prediction or thematic fit estimation, random embeddings perform as well as pre-trained embeddings, we explore what settings allow for this and examine where most of the learning is encoded: the word embeddings, the semantic role embeddings, or “the network”. We find nuanced answers, depending on the task and its relation to the training objective. We examine these representation learning aspects in multi-task learning, where role prediction and role-filling are supervised tasks, while several thematic fit tasks are outside the models' direct supervision. We observe a non-monotonic relation between some tasks' quality scores and the training data size. In order to better understand this observation, we analyze these results using easier, per-verb versions of these tasks.
Carol Whitney and Yuval Marton. “The SERIOL2 Model of Orthographic Processing”. June 7, 2013. ERIC Number: ED543279
Yuval Marton. “What Can we Learn about Language Processing and Representation from Word Contour Effects on Letter Order Perception and Word Recognition in Right and Left Visual Fields?” Qualifying paper (Ling895), Department of Linguistics, University of Maryland, May 2007. Manuscript.
Resources and Tools
Large corpus with morphological analysis, syntax, and SRL silver annotation layers. Please email me to request permission if the download fails.
E-TIPSY Corpus (see Yuval Marton and Kristina Toutanova. “E-TIPSY: Search Query Corpus Annotated with Entities, Term Importance, POS Tags, and Syntactic Parses”. The Tenth Edition of the Language Resources and Evaluation Conference (LREC), 23-28 May 2016, Portorož, Slovenia.)
Please email me to request permission while the link above is under construction.
The CATiB Dependency Parser: Columbia University Arabic (MSA) syntactic dependency models with form-based and functional morphological features (see my CL article, 2012)
Please email me (or better: email Owen Rambow, Nizar Habash, and me) to request permission while the link above is under construction.
Columbia University Arabic (MSA) syntactic dependency data with form-based and functional morphological features (2011)
With a training / testing split.
Previous versions were used in my NAACL SPMRL (2010) and ACL (2011) publications.
In order to use the data, you need to have BOTH of the following:
1. License from LDC to use the Penn Arabic Treebank part 3 (v3.1)
2. License or written permission from CCLS / Columbia University to use the Arabic functional features (functional gender, functional number and rationality); once you have obtained the LDC license, please email me (or better: email Owen Rambow, Nizar Habash and me) to request the CCLS / Columbia University permission.
Columbia University Arabic (MSA) syntactic dependency data for GALE (2009-2010)
Requires a GALE license; available to GALE participants.
Honors and Awards
Best paper award, 17th Conference on Natural Language Processing (TALN), 2010
Teaching
Master's Capstone Project Mentor. Columbia University Data Science Institute, 2020-2022; University of California Santa Cruz Computer Science and Engineering, 2021; University of Massachusetts Amherst, 2022.
Affiliate Assistant Professor, University of Washington. Machine Translation (Ling575), Winter 2016; Introduction to Natural Language Processing (CSS590), Spring 2017.
Teaching Assistant, University of Maryland, College Park. Computational Linguistics II (Ling647 / CMSC828R), taught by Philip Resnik, Spring 2006; Introductory Linguistics (Ling200), taught by Tonia Bleam, Spring 2008.
Teaching Assistant, Tel Aviv University. Introduction to Linguistics, taught by Tanya Reinhart (during my undergraduate senior year).
Service
Associate Editor for the ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). July 2025 to present.
Area Chair, Multilingual Work and Translation, CoNLL 2024 conference, co-located with EMNLP 2024. November 15-16, 2024, Miami, Florida.
Area Editor for the ACL Rolling Review (ARR). 2023 to present.
Co-chair (organizing committee) of the AMTA 2016 Workshop on Semitic Machine Translation (SeMaT). November 1, 2016, Austin, TX, USA.
Publication Chair of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), September 17–21, Lisbon, Portugal.
Publication Chair of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), October 25–29, Doha, Qatar.
Co-chair of the COLING 2014 First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Language (SPMRL-SANCL 2014), August 23-24, Dublin, Ireland.
Co-chair of the EMNLP 2013 Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2013), and treebank adviser for its shared task. October 18, Seattle, Washington.
Publication Chair of the Second Joint Conference on Lexical and Computational Semantics (*SEM), co-located with NAACL-HLT 2013, June 9-14, Atlanta, GA.
General Co-chair (organizing committee member) of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages (SP-Sem-MRL 2012), July 12, Jeju Island, Republic of Korea.
Publication Chair of the First Joint Conference on Lexical and Computational Semantics (*SEM), co-located with NAACL-HLT 2012, June 7-8, Montreal, Canada.
Tutorial session: “On-Demand Distributional Paraphrasing”, at NAACL-HLT 2012, June 3, Montreal, Canada.
Other Activities
Human translation:
Under construction! (Permanently)