I am a researcher (a.k.a. research scientist) and consultant in the area of Computational Linguistics / Natural Language Processing and Understanding (NLP/NLU), and more generally, human language technologies (HLT), Machine Learning / Deep Learning (ML/DL), and Artificial Intelligence (AI). I am also an affiliate professor at the University of Washington. I have an entrepreneurial side as well, and have worked at both startups and large corporations (IBM, Microsoft, and now Morgan Stanley).
My research interests include statistical machine translation, parsing, and lexical semantics: corpus-based semantic similarity measures and paraphrase generation. More recently I have become involved in dialog systems / bots / NLU (for sales / task completion) and information retrieval (search engine result ranking). I am interested in using and adapting machine learning methods, including deep learning, for NLP/NLU, and in using linguistically informed learning bias and feature design to make such ML-with-NLP methods more effective.
Email: yuvalmarton @t gmail.com
Apart from my position in the "industry", I am an affiliate professor at the University of Washington, where I teach NLP/MT courses and advise graduate students.
I was a Post-Doctoral Research Scientist at the Columbia University Center for Computational Learning Systems (CCLS), where I worked with Nizar Habash and Owen Rambow on syntactic parsing, focusing on Arabic parsing for statistical machine translation (SMT), including subject detection, and morphological features for parsing.
I received my Ph.D. in linguistics from University of Maryland (UMD) in 2009. My advisors were Philip Resnik and Amy Weinberg, and my focus was on computational linguistics. My dissertation, entitled “Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models”, explored using soft syntactic and semantic constraints in end-to-end state-of-the-art statistical machine translation systems. It also introduced a novel distributional paraphrase generation technique that can benefit from soft semantic constraints, and can generate paraphrases of arbitrary length on-the-fly, with dynamically set context length. Last, my dissertation presented a generalized framework, of which these soft semantic and syntactic constraints can be viewed as instances, and in which they can be potentially combined.
Following my interests in neuro-biologically plausible cognitive and linguistic models, I took several fascinating neuroscience courses at the Neuroscience and Cognitive Science (NACS) Program, and received the NACS Certificate. My qualifying paper focused on visual word recognition. I argued there for a lexical representation that consists of both lower-level visual features and higher-level abstract letter objects, interacting with statistical factors (word frequency) and partly innate factors (left or right visual field perception). During the second half of my studies, I did research in this area with Carol Whitney.
Further back in time, I took part in text classification research (authorship attribution and topic / genre classification). My previous-previous advisor was Lisa Hellerstein, back when I was a computer science graduate student at the Polytechnic Institute of NYU (formerly Polytechnic University, Brooklyn, NY), where I received my Computer Science Masters.
See my Google Scholar profile and/or the list below:
Yuval Marton, Imed Zitouni. “Transliteration normalization for Information Extraction and Machine Translation”. Journal of King Saud University - Computer and Information Sciences. Volume 26, Issue 4, December 2014, Pages 379–387. DOI: 10.1016/j.jksuci.2014.06.011
Foreign name transliterations typically include multiple spelling
variants. These variants cause data sparseness and inconsistency problems,
increase the Out-of-Vocabulary (OOV) rate, and present challenges for Machine
Translation, Information Extraction and other natural language processing (NLP)
tasks. This work aims to identify and cluster name spelling variants using a Statistical
Machine Translation method: word alignment. The variants are identified by
being aligned to the same “pivot” name in another language (the source-language
in Machine Translation settings). Based on word-to-word translation and
transliteration probabilities, as well as the string edit distance metric,
names with similar spellings in the target language are clustered and then
normalized to a canonical form. With this approach, tens of thousands of
high-precision name transliteration spelling variants are extracted from
sentence-aligned bilingual Arabic-English corpora, in both languages.
When these normalized name spelling variants are applied to Information
Extraction tasks, improvements over strong baseline systems are observed. When
applied to Machine Translation tasks, a large improvement potential is shown.
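The core clustering idea can be sketched in a few lines of Python (a simplified illustration with invented toy names, function names, and thresholds, not the paper's actual system, which also uses word-to-word translation and transliteration probabilities):

```python
from collections import Counter, defaultdict

def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalize_variants(aligned_pairs, max_dist=2):
    """aligned_pairs: (pivot_name, target_spelling) tuples from word alignment.
    Returns a map from each spelling variant to a canonical form."""
    by_pivot = defaultdict(Counter)
    for pivot, spelling in aligned_pairs:
        by_pivot[pivot][spelling] += 1
    canonical = {}
    for pivot, spellings in by_pivot.items():
        # Most frequent spelling aligned to this pivot is the canonical form.
        canon, _ = spellings.most_common(1)[0]
        for s in spellings:
            # Only merge spellings that are also close in edit distance.
            if edit_distance(s, canon) <= max_dist:
                canonical[s] = canon
    return canonical

pairs = [("qdafy", "Gaddafi"), ("qdafy", "Qadhafi"),
         ("qdafy", "Gaddafi"), ("qdafy", "Kaddafi")]
print(normalize_variants(pairs, max_dist=3))
```

Here the most frequent spelling aligned to a pivot serves as the canonical form, and the edit-distance cutoff keeps genuinely different names that happen to share a pivot from being merged.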
Mona Diab and Yuval Marton. “Semantic Processing of Semitic Languages”. Book chapter, in “Semitic Language Processing”, Imed Zitouni, Ed. Springer, March 2014, pp 129-159. ISBN: 978-3-642-45357-1 (Print) 978-3-642-45358-8 (Online)
In this chapter, we cover semantic processing in Semitic languages.
We will present models of semantic processing over words and their relations in
sentences, namely paradigmatic and syntagmatic models. We will contrast the
processing of Semitic languages against English, illustrating some of the
challenges, and clues, due to the inherent unique characteristics of Semitic languages.
Yuval Marton. “Distributional Phrasal Paraphrase Generation for Statistical Machine Translation”. ACM Transactions on Intelligent Systems and Technology (TIST) special issue on paraphrasing. Eds.: Haifeng Wang, Bill Dolan, Idan Szpektor, Shiqi Zhao. Volume 4, Issue 3, June 2013.
Paraphrase generation has been shown useful for various natural
language processing tasks, including statistical machine translation. A
commonly used method for paraphrase generation is pivoting [Callison-Burch et
al. 2006], which benefits from linguistic knowledge implicit in the sentence
alignment of parallel texts, but has limited applicability due to its reliance
on parallel texts. Distributional paraphrasing [Marton et al. 2009] has wider
applicability, is more language-independent, but doesn’t benefit from any
linguistic knowledge. Nevertheless, we show that distributional paraphrasing
can yield greater gains. We report method improvements leading to higher gains
than previously published (almost 2 BLEU points), and provide implementation
details, complexity analysis, and further insight into this method.
Yuval Marton, Nizar Habash and Owen Rambow. “Dependency Parsing of Modern Standard Arabic with Lexical and Inflectional Features”. Computational Linguistics, Volume 39, Issue 1, pages 161-194. Posted Online March 1, 2013.
We explore the contribution of lexical and inflectional morphology
features to dependency parsing of Arabic, a morphologically rich language with
complex agreement patterns. Using controlled experiments, we contrast the
contribution of different part-of-speech (POS) tagsets and morphological
features in two input conditions: machine-predicted condition (in which POS tags
and morphological feature values are automatically assigned), and gold
condition (in which their true values are known). We find that more informative
(fine-grained) tagsets are useful in the gold condition, but may be detrimental
in the predicted condition, where they are outperformed by simpler but more
accurately predicted tagsets. We identify a set of features (definiteness, person,
number, gender, and undiacritized lemma) that improve parsing quality in the
predicted condition, while other features are more useful in gold. We are the
first to show that functional features for gender and number (e.g., “broken
plurals”), and optionally the related rationality (“humanness”) feature, are
more helpful for parsing than form-based gender and number. We finally show
that parsing quality in the predicted condition can dramatically improve by training
in a combined gold+predicted condition. We experimented with two
transition-based parsers, MaltParser and Easy-First Parser. Our findings are
robust across parsers, models and input conditions. This suggests that the
contribution of the linguistic knowledge in the tagsets and features we
identified goes beyond particular experimental settings, and may be informative
for other parsers and morphologically rich languages.
Marianna Apidianaki, Ido Dagan, Jennifer Foster, Yuval Marton, Djamé Seddah, Reut Tsarfaty. "Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages" (SP-Sem-MRL). Proceedings of the 50th meeting of the Association for Computational Linguistics (ACL), Jeju Island, Korea, July 12, 2012. PDF
Morphologically Rich Languages (MRLs) are languages in which
grammatical relations such as Subject, Predicate, and Object, are largely
indicated morphologically (e.g., through inflection) instead of positionally.
This poses serious challenges for current (English-centric) syntactic and
semantic processing. Furthermore, since grammatical relations provide the
interface to compositional semantics, morpho-syntactic phenomena may
significantly complicate processing the syntax–semantics interface. In
statistical parsing, English parsing performance has reached a high plateau in
certain genres. Semantic processing of English has similarly seen much progress
in recent years. MRL processing presents new challenges, such as optimal
morphological representation, non-position-centric algorithms, or different
semantic distance measures. …
Yuval Marton, David Chiang, and Philip Resnik. “Soft Syntactic Constraints for Arabic-English Hierarchical Phrase-Based Translation”. Machine Translation Journal Special Issue on Machine Translation for Arabic. Editor-in-Chief: Andy Way, Guest Co-Editors: Nizar Habash and Hany Hassan. Paginated version: Volume 26, Issue 1 (2012), pages 137-157. Online version: Journal no. 10590, 29 October 2011
In adding syntax to statistical machine translation, there is a
tradeoff between taking advantage of linguistic analysis and allowing the model
to exploit parallel training data with no linguistic analysis: translation
quality versus coverage. A number of previous efforts have tackled this
tradeoff by starting with a commitment to linguistically motivated analyses and
then finding appropriate ways to soften that commitment. We present an approach
that explores the tradeoff from the other direction, starting with a
translation model learned directly from aligned parallel text, and then adding
soft constituent-level constraints based on parses of the source language. We
argue that in order for these constraints to improve translation, they must be
fine-grained: the constraints should vary by constituent type, and by the type
of match or mismatch with the parse. We also use a different feature weight
optimization technique, capable of handling a large number of features, thus
eliminating the bottleneck of feature selection. We obtain substantial
improvements in performance for translation from Arabic to English.
Marine Carpuat, Yuval Marton, and Nizar Habash. “Reordering Post-verbal Subjects for Arabic-to-English Statistical Machine Translation”. Machine Translation Journal Special Issue on Machine Translation for Arabic. Editor-in-Chief: Andy Way, Guest Co-Editors: Nizar Habash and Hany Hassan. Paginated version: Volume 26, Issue 1 (2012), pages 105-120. Online version: Journal no. 10590, 8 November 2011
We study challenges raised by the order of Arabic verbs and their
subjects in Statistical Machine Translation (SMT). We show that the boundaries
of post-verbal subjects (VS) are hard to detect accurately, even with a
state-of-the-art Arabic dependency parser. In addition, VS constructions have
highly ambiguous reordering patterns when translated to English, and these
patterns are very different for matrix (main clause) VS and non-matrix
(subordinate clause) VS. Based on this analysis, we propose a novel method for
leveraging VS information in SMT: we reorder VS constructions into SV order for
word alignment. Unlike in previous approaches to source-side reordering, phrase
extraction and decoding are performed using the original Arabic word order.
This strategy significantly improves BLEU and TER scores, even on a strong large-scale
baseline. Limiting reordering to matrix VS yields further improvements.
Yuval Marton, Ning Wu, and Lisa Hellerstein. "On Compression-Based Text Classification". Advances in Information Retrieval, Lecture Notes in Computer Science, Volume 3408, 2005, pages 300-314. Abstract. Full paper and errata note available.
Compression-based text classification methods are easy to
apply, requiring virtually no preprocessing of the data. Most such methods are
character-based, and thus have the potential to automatically capture non-word
features of a document, such as punctuation, word-stems, and features spanning
more than one word. However, compression-based classification methods have
drawbacks (such as slow running time), and not all such methods are equally effective.
We present the results of a number of experiments designed to evaluate the
effectiveness and behavior of different compression-based text classification
methods on English text. Among our experiments are some specifically designed
to test whether the ability to capture non-word (including super-word) features
causes character-based text compression methods to achieve more accurate
classification. See my ECIR paper below.
Yuval Marton and Kristina Toutanova. “E-TIPSY: Search Query Corpus Annotated with Entities, Term Importance, POS Tags, and Syntactic Parses”. The Tenth Edition of the Language Resources and Evaluation Conference (LREC), 23-28 May 2016, Portorož (Slovenia).
We present E-TIPSY, a search query corpus annotated with named Entities, Term Importance, POS tags, and SYntactic parses. This corpus contains crowdsourced (gold) annotations of the three most important terms in each query. In addition, it contains automatically produced annotations of named entities, part-of-speech tags, and syntactic parses for the same queries. This corpus comes in two formats: (1) Sober Subset: annotations that two or more crowd workers agreed upon, and (2) Full Glass: all annotations. We analyze the strikingly low correlation between term importance and syntactic headedness, which invites research into effective ways of combining these different signals. Our corpus can serve as a benchmark for term importance methods aimed at improving search engine quality and as an initial step toward developing a dataset of gold linguistic analysis of web search queries. In addition, it can be used as a basis for linguistic inquiries into the kind of expressions used in search queries.
Junhui Li, Yuval Marton, Hal Daume III, and Philip Resnik. “A Unified Model for Soft Linguistic Reordering Constraints in Statistical Machine Translation”. The 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Baltimore, MD, June 25-29, 2014. Online version.
This paper explores a simple and effective unified framework for incorporating soft linguistic reordering constraints into a hierarchical phrase-based translation system: 1) a syntactic reordering model that explores reordering for context free grammar rules; and 2) a semantic reordering model that focuses on the reordering of predicate-argument structures. We develop novel features based on both models and use them as soft constraints to guide the translation process. Experiments on Chinese-English translation show that the reordering approach can significantly improve a state-of-the-art hierarchical phrase-based translation system. However, the gain achieved by the semantic reordering model is limited in the presence of the syntactic reordering model, and we therefore provide a detailed analysis of the behavior differences between the two.
Vladimir Eidelman, Yuval Marton, and Philip Resnik. “Online Relative Margin Maximization for Statistical Machine Translation”. The 51st Annual Meeting of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, August 4-9, 2013.
Recent advances in large-margin learning have shown that better
generalization can be achieved by incorporating higher order information into
the optimization, such as the spread of the data. However, these solutions are
impractical in complex structured prediction problems such as statistical
machine translation. We present an online gradient-based algorithm for relative
margin maximization, which bounds the spread of the projected data while
maximizing the margin. We evaluate our optimizer on Chinese-English and
Arabic-English translation tasks, each with small and large feature sets, and
show that our learner is able to achieve significant improvements of up to 1.5
BLEU and 5.9 TER over state-of-the-art optimizers.
Yuval Marton, Nizar Habash and Owen Rambow. “Improving Arabic Dependency Parsing with Lexical and Inflectional Surface and Functional Features”. The 49th Annual Meeting of the Association for Computational Linguistics (ACL), Portland, Oregon, USA, June 19-24, 2011. Full paper.
We explore the contribution of lexical and morphological features to dependency parsing of Arabic, a morphologically rich language. Using controlled experiments, we find that definiteness, person, number, gender, and undiacritized lemma are most helpful for parsing on automatically tagged input. We further contrast the contribution of surface and functional features, and show that functional features for gender and number (e.g., “broken plurals”) and the related rationality feature improve over surface-based features. This is the first time these functional features have been used for Arabic NLP.
Hao Li, Xiang Li, Heng Ji, and Yuval Marton. “Domain-Independent Novel Event Discovery and Semi-Automatic Event Annotation”. The 24th Pacific Asia Conference on Language, Information and Computation (PACLIC), Sendai, Japan, November 4-7, 2010. Full paper.
Information Extraction (IE) is becoming increasingly useful, but it is a costly task to discover and annotate novel events, event arguments, and event types. We exploit both monolingual texts and bilingual sentence-aligned parallel texts to cluster event triggers and discover novel event types. We then generate event argument annotations semi-automatically, framed as a sentence ranking and semantic role labeling task. Experiments on three different corpora -- ACE, OntoNotes and a collection of scientific literature -- have demonstrated that our domain-independent methods can significantly speed up the entire event discovery and annotation process while maintaining high quality.
Yuval Marton. “Improved Statistical Machine Translation Using Monolingual Text and a Shallow Lexical Resource for Hybrid Phrasal Paraphrase Generation”. The Ninth Conference of the Association for Machine Translation in the Americas (AMTA), Denver, Colorado, October 31 – November 5, 2010. Full paper.
Paraphrase generation is useful for various NLP tasks. But pivoting techniques for paraphrasing have limited applicability due to their reliance on parallel texts, although they benefit from linguistic knowledge implicit in the sentence alignment. Distributional paraphrasing has wider applicability, but doesn’t benefit from any linguistic knowledge. We combine a distributional semantic distance measure (based on a non-annotated corpus) with a shallow linguistic resource to create a hybrid semantic distance measure of words, which we extend to phrases. We embed this extended hybrid measure in a distributional paraphrasing technique, benefiting from both linguistic knowledge and independence from parallel texts. Evaluated in statistical machine translation tasks by augmenting translation models with paraphrase-based translation rules, we show our novel technique is superior to the non-augmented baseline and both the distributional and pivot paraphrasing techniques. We train models on both a full-size dataset as well as a simulated “low density” small dataset.
Marine Carpuat, Yuval Marton, and Nizar Habash. “Improving Arabic-to-English Statistical Machine Translation by Reordering Post-verbal Subjects for Alignment”. The 48th Annual Meeting of the Association for Computational Linguistics (ACL), Uppsala, Sweden, July 11–16, 2010. Short paper.
We study the challenges raised by Arabic verb and subject detection and reordering in Statistical Machine Translation (SMT). We show that post-verbal subject (VS) constructions are hard to translate because they have highly ambiguous reordering patterns when translated to English. In addition, implementing reordering is difficult because the boundaries of VS constructions are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. We therefore propose to reorder VS constructions into SV order for SMT word alignment only. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline and despite noisy parses.
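The reorder-for-alignment-only strategy can be illustrated with a small sketch (gloss tokens and the function name are invented for illustration, not the paper's code): the post-verbal subject span is moved before its verb for the word-alignment step, and the returned permutation lets alignment links be projected back onto the original Arabic word order used for phrase extraction and decoding.

```python
def reorder_vs_for_alignment(tokens, verb_idx, subj_span):
    """Move a post-verbal subject before its verb. Returns the reordered
    tokens plus a permutation: order[j] is the original position of the
    token now at position j, so alignment links computed on the reordered
    sentence can be mapped back to the original word order."""
    s0, s1 = subj_span  # subject occupies positions s0..s1-1, after the verb
    order = (list(range(verb_idx))            # everything before the verb
             + list(range(s0, s1))            # the subject, moved forward
             + [verb_idx]                     # the verb itself
             + [i for i in range(verb_idx + 1, len(tokens))
                if not s0 <= i < s1])         # the rest, minus the subject
    return [tokens[i] for i in order], order

# VS order, glossed: said(V) the-president the-egyptian(S) a-speech(Obj)
toks = ["said", "the-president", "the-egyptian", "a-speech"]
reordered, order = reorder_vs_for_alignment(toks, 0, (1, 3))
print(reordered)  # ['the-president', 'the-egyptian', 'said', 'a-speech']
```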
Marine Carpuat, Yuval Marton, Nizar Habash. “Reordering Matrix Post-verbal Subjects for Arabic-to-English SMT”. 17th Conférence sur le Traitement Automatique des Langues Naturelles (TALN; Conference on Natural Language Processing), Montréal, Canada, July 19-22, 2010. Best paper award. Full paper.
We improve our recently proposed technique for integrating Arabic
verb-subject constructions in SMT word alignment (Carpuat et al., 2010) by
distinguishing between matrix (or main clause) and non-matrix Arabic
verb-subject constructions. In gold translations, most matrix VS (main clause verb-subject)
constructions are translated in inverted SV order, while non-matrix
(subordinate clause) VS constructions are inverted in only half the cases. In
addition, while detecting verbs and their subjects is a hard task, our
syntactic parser detects VS constructions better in matrix than in non-matrix
clauses. As a result, reordering only matrix VS for word alignment consistently
improves translation quality over a phrase-based SMT baseline, and over
reordering all VS constructions, in both medium- and large-scale settings. In
fact, the improvements obtained by reordering matrix VS on the medium-scale
setting remarkably represent 44% of the gain in BLEU and 51% of the gain in TER
obtained with a word alignment training bitext that is 5 times larger.
Yuval Marton, Chris Callison-Burch and Philip Resnik. “Improved Statistical Machine Translation Using Monolingually-derived Paraphrases”. Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore, August 6-7, 2009. Full paper.
Untranslated words still constitute a major problem for Statistical
Machine Translation (SMT), and current SMT systems are limited by the quantity
of parallel training texts. Augmenting the training data with paraphrases
generated by pivoting through other languages alleviates this problem,
especially for the so-called "low density" languages. But pivoting
requires additional parallel texts. We address this problem by deriving
paraphrases monolingually, using distributional semantic similarity measures,
thus providing access to larger training resources, such as comparable and
unrelated monolingual corpora. We present what is to our knowledge the first
successful integration of a collocational approach to untranslated words with
an end-to-end, state-of-the-art SMT system, demonstrating significant
translation improvements in a low-resource setting.
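A minimal sketch of monolingual distributional paraphrase candidate ranking (toy corpus, window size, and function names invented for illustration; the paper's similarity measures and extension to phrases are more elaborate): build co-occurrence profiles from raw text and rank other words by profile similarity.

```python
from collections import Counter, defaultdict
from math import sqrt

def build_profiles(sentences, window=2):
    """Distributional profiles: co-occurrence counts within a sliding window."""
    prof = defaultdict(Counter)
    for sent in sentences:
        toks = sent.split()
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    prof[w][toks[j]] += 1
    return prof

def cosine(p, q):
    # Cosine similarity of two sparse count vectors (Counters).
    dot = sum(c * q[w] for w, c in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0

def paraphrase_candidates(word, profiles, k=3):
    """Rank other words by similarity of their distributional profiles."""
    sims = [(cosine(profiles[word], p), w)
            for w, p in profiles.items() if w != word]
    return [w for s, w in sorted(sims, reverse=True)[:k]]

corpus = ["the big dog barked loudly", "the large dog barked loudly",
          "the big cat slept quietly", "the large cat slept quietly"]
profs = build_profiles(corpus)
print(paraphrase_candidates("big", profs))  # "large" ranks first
```

Because "big" and "large" occur in identical contexts in this toy corpus, their profiles coincide and "large" surfaces as the top candidate, which is the Distributional Hypothesis at work.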
Yuval Marton, Saif Mohammad and Philip Resnik. “Estimating Semantic Distance Using Soft Semantic Constraints in Knowledge-Source / Corpus Hybrid Models”. Conference on Empirical Methods in Natural Language Processing (EMNLP). Singapore, August 6-7, 2009. Full paper.
We propose a corpus–thesaurus hybrid method that uses soft
constraints to generate word-sense disambiguated distributional profiles (DPs)
from coarser "concept DPs" (derived from a small Roget-like
thesaurus) and sense-unaware traditional word DPs (derived from raw text). Not
relying on a large lexical resource makes this method suitable also for
resource-poorer languages or specific domains. Although it uses a knowledge
source, the method is not vocabulary-limited: if the target word is not in the
thesaurus, the method falls back gracefully on the word’s co-occurrence information.
Experiments on word-pairs ranking by semantic distance show the new hybrid
method to be superior to others.
David Chiang, Yuval Marton and Philip Resnik. “Online Large-Margin Training of Syntactic and Structural Translation Features”. Conference on Empirical Methods in Natural Language Processing (EMNLP 2008). Waikiki, Honolulu, Hawaii, October 25-27, 2008. Full paper.
Minimum-error-rate training (MERT) is a bottleneck for current development in statistical machine translation because it is limited in the number of weights it can reliably optimize. Building on the work of Watanabe et al., we explore the use of the MIRA algorithm of Crammer et al. as an alternative to MERT. We first show that by parallel processing and exploiting more of the parse forest, we can obtain results using MIRA that match or surpass MERT in terms of both translation quality and computational cost. We then test the method on two classes of features that address deficiencies in the Hiero hierarchical phrase-based model: first, we simultaneously train a large number of Marton and Resnik’s soft syntactic constraints, and, second, we introduce a novel structural distortion model. In both cases we obtain significant improvements in translation performance. Optimizing them in combination, for a total of 56 feature weights, we improve performance by 2.6 BLEU on a subset of the NIST 2006 Arabic-English evaluation data.
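The single-constraint core of a MIRA-style update can be sketched as follows (a simplified one-hypothesis version with made-up feature vectors; the paper optimizes against many hypotheses from the parse forest, in parallel): move the weights just enough that the gold hypothesis outscores the predicted one by a margin equal to its loss, with the step size capped by an aggressiveness constant C.

```python
def mira_update(w, feat_gold, feat_pred, loss, C=0.01):
    """One MIRA step on weight vector w (plain Python lists)."""
    diff = [g - p for g, p in zip(feat_gold, feat_pred)]
    margin = sum(wi * di for wi, di in zip(w, diff))  # gold's current advantage
    norm_sq = sum(d * d for d in diff)
    if norm_sq == 0:
        return w
    # Step size: close the (loss - margin) gap, but never exceed C.
    tau = min(C, max(0.0, (loss - margin) / norm_sq))
    return [wi + tau * di for wi, di in zip(w, diff)]

w = mira_update([0.0, 0.0, 0.0], feat_gold=[1, 0, 1],
                feat_pred=[0, 1, 0], loss=1.0, C=0.5)
print(w)  # gold now outscores pred by exactly the loss (margin = 1.0)
```

In SMT the loss would typically be a sentence-level error measure (e.g., 1 minus a smoothed BLEU), and the update is applied online over the tuning set.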
Yuval Marton and Philip Resnik. “Soft Syntactic Constraints for Hierarchical Phrased-Based Translation”. The 46th Annual Meeting of the Association for Computational Linguistics (ACL). Columbus, Ohio, June 16-18, 2008. Full paper.
In adding syntax to statistical MT, there is a tradeoff between taking advantage of linguistic analysis, versus allowing the model to exploit linguistically unmotivated mappings learned from parallel training data. A number of previous efforts have tackled this tradeoff by starting with a commitment to linguistically motivated analyses and then finding appropriate ways to soften that commitment. We present an approach that explores the tradeoff from the other direction, starting with a context-free translation model learned directly from aligned parallel text, and then adding soft constituent-level constraints based on parses of the source language. We obtain substantial improvements in performance for translation from Chinese and Arabic to English.
Yuval Marton, Ning Wu, and Lisa Hellerstein. "On Compression-Based Text Classification". Proceedings of the 27th European Conference on Information Retrieval (ECIR), Spain, March 2005. Abstract. Full paper and errata note available.
Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word-stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
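The basic recipe behind such methods can be sketched with an off-the-shelf compressor (an illustrative approximation using zlib, not the specific compression models compared in the paper): assign a document to the class whose training text yields the smallest increase in compressed size when the document is appended, which approximates the document's cross-entropy under each class model.

```python
import zlib

def classify(doc, class_corpora):
    """Pick the class whose training text best 'explains' the document:
    smallest increase in compressed size when the document is appended."""
    def c(text):
        return len(zlib.compress(text.encode("utf-8"), 9))
    scores = {label: c(corpus + " " + doc) - c(corpus)
              for label, corpus in class_corpora.items()}
    return min(scores, key=scores.get)

corpora = {
    "weather": "rain sun cloud storm wind snow forecast temperature rain cloud",
    "finance": "stock bond market price trade profit interest rate stock market",
}
print(classify("storm and heavy rain expected", corpora))
```

Note that this operates on characters, so punctuation, word-stems, and multi-word substrings all contribute to the match, which is exactly the non-word-feature property discussed above.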
The application of distributed computation to statistical machine
translation has been a topic of interest for both research and industry because
it allows rapid processing of massive datasets in a reasonable amount of time.
Apache Spark, the nascent distributed computation framework, has been found to
offer 10 to 100 times speedups for machine learning algorithms compared with
the state-of-the-art, Hadoop. We implemented a word alignment tool with IBM
Model 1 on a cluster with Spark, yielding 2.2--4.7 times end-to-end speedup
over GIZA++ and up to 1.2--1.9 over the multi-threaded MGIZA for mid-size
English-French and Arabic-English corpora, with potential to scale further.
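For reference, the (non-distributed) core of IBM Model 1 EM training can be sketched as follows (toy data; the contribution above is the distributed Spark implementation, which this sketch does not show):

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=10):
    """EM training of IBM Model 1 lexical translation probabilities t(f|e)
    from sentence-aligned pairs (e_tokens, f_tokens)."""
    f_vocab = {f for _, fs in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in bitext:
            for f in fs:
                # E-step: distribute each f-word's mass over candidate e-words.
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

bitext = [("the house".split(), "la maison".split()),
          ("the book".split(), "le livre".split()),
          ("a house".split(), "une maison".split())]
t = ibm_model1(bitext)
print(t[("maison", "house")] > t[("la", "house")])  # True: EM concentrates mass
```

Each EM iteration is an independent pass over the bitext accumulating counts, which is what makes the algorithm a natural fit for a map-reduce style framework such as Spark.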
Yuval Marton, Nizar Habash, Owen Rambow, and Sarah Alkuhlani. “SPMRL’13 Shared Task System: The CADIM Arabic Dependency Parser”. The EMNLP Fourth Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL), Seattle, WA, October 18, 2013
[Best single parser (non-system-combination parser) on predicted tokenization and part-of-speech tags; second place on gold tokenization.]
We describe the submission from the Columbia Arabic & Dialect Modeling group (CADIM) for the Shared Task at the Fourth Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL’2013). We participate in the Arabic Dependency parsing task for predicted POS tags and features. Our system is based on Marton et al. (2013).
Yuval Marton, Ahmed El-Kholy and Nizar Habash. “Filtering Antonymous, Trend-Contrasting, and Polarity-Dissimilar Distributional Paraphrases for Improving Statistical Machine Translation”. EMNLP Sixth Workshop on Statistical Machine Translation (WMT), Edinburgh, UK, July 30-31, 2011. PDF
Paraphrases are useful for statistical machine translation (SMT)
and natural language processing tasks. Distributional paraphrase generation is
independent of parallel texts and syntactic parses, and hence is suitable also
for resource-poor languages, but tends to erroneously rank antonyms,
trend-contrasting, and polarity-dissimilar candidates as good paraphrases. We
present here a novel method for improving distributional paraphrasing by
filtering out such candidates. We evaluate it in simulated low and
mid-resourced SMT tasks, translating from English to two quite different
languages. We show statistically significant gains in English-to-Chinese
translation quality, up to 1 BLEU from non-filtered paraphrase-augmented models
(1.6 BLEU from baseline). We also show that yielding gains in translation to
Arabic, a morphologically rich language, is not straightforward.
Yuval Marton, Nizar Habash, and Owen Rambow. Improving Arabic Dependency Parsing with Inflectional and Lexical Morphological Features. Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL) at Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Los Angeles, USA, June 1–6, 2010. PDF.
We explore the contribution of different lexical and inflectional
morphological features to dependency parsing of Arabic, a morphologically rich language.
We experiment with all leading POS tagsets for Arabic, and introduce a few new
sets. We show that training the parser using a simple regular-expression-based
extension of an impoverished POS tagset with high prediction accuracy does
better than using a highly informative POS tagset with only medium prediction
accuracy, although the latter performs best on gold input. Using controlled
experiments, we find that definiteness (or determiner presence), the so-called
phi-features (person, number, gender), and undiacritized lemma are most helpful
for Arabic parsing on predicted input, while case and state are most helpful on
gold input.
Chris Dyer, Hendra Setiawan, Yuval Marton, and Philip Resnik. “The University of Maryland Statistical Machine Translation System for the Third Workshop on Machine Translation”. EACL 2009 Fourth Workshop On Statistical Machine Translation, March 2009, Athens, Greece. PDF.
This paper describes the techniques we explored to improve the translation of news text in the German-English and Hungarian-English tracks of the WMT09 shared translation task. Beginning with a conventional hierarchical phrase-based system, we found benefits for using word segmentation lattices as input, explicit generation of beginning and end of sentence markers, minimum Bayes risk decoding, and incorporation of a feature scoring the alignment of function words in the hypothesized translation. We also explored the use of monolingual paraphrases to improve coverage, as well as co-training to improve the quality of the segmentation lattices used, but these did not lead to improvements.
Yuval Marton. “Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models”. Ph.D. Dissertation, Department of Linguistics, University of Maryland, October 2009. Official format or paper-saving single-space format.
This dissertation focuses on effective combination of data-driven natural language processing (NLP) approaches with linguistic knowledge sources that are based on manual text annotation or word grouping according to semantic commonalities. I gainfully apply fine-grained linguistic soft constraints – of syntactic or semantic nature – on statistical NLP models, evaluated in end-to-end state-of-the-art statistical machine translation (SMT) systems. The introduction of semantic soft constraints involves intrinsic evaluation on word-pair similarity ranking tasks, extension from words to phrases, application in a novel distributional paraphrase generation technique, and an introduction of a generalized framework of which these soft semantic and syntactic constraints can be viewed as instances, and in which they can be potentially combined.
Fine granularity is key in the successful combination of these soft constraints, in many cases. I show how to softly constrain SMT models by adding fine-grained weighted features, each preferring translation of only a specific syntactic constituent. Previous attempts using coarse-grained features yielded negative results. I also show how to softly constrain corpus-based semantic models of words (“distributional profiles”) to effectively create word-sense-aware models, by using semantic word grouping information found in a manually compiled thesaurus. Previous attempts, using hard constraints and resulting in aggregated, coarse-grained models, yielded lower gains.
A novel paraphrase generation technique incorporating these soft semantic constraints is then also evaluated in a SMT system. This paraphrasing technique is based on the Distributional Hypothesis. The main advantage of this novel technique over current “pivoting” techniques for paraphrasing is the independence from parallel texts, which are a limited resource. The evaluation is done by augmenting translation models with paraphrase-based translation rules, where fine-grained scoring of paraphrase-based rules yields significantly higher gains.
The model augmentation includes a novel semantic reinforcement component: In many cases there are alternative paths of generating a paraphrase-based translation rule. Each of these paths reinforces a dedicated score for the “goodness” of the new translation rule. This augmented score is then used as a soft constraint, in a weighted log-linear feature, letting the translation model learn how much to “trust” the paraphrase-based translation rules.
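At their simplest, the distributional profiles underlying this paraphrasing technique are co-occurrence count vectors compared with a similarity measure such as cosine. A toy sketch, assuming a symmetric context window (the window size, function names, and corpus here are illustrative, not the dissertation's actual models):

```python
import math
from collections import Counter, defaultdict

def distributional_profiles(corpus, window=2):
    """Build a co-occurrence count vector ('distributional profile')
    for each word, over a symmetric context window of +/- `window`."""
    profiles = defaultdict(Counter)
    for sent in corpus:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    profiles[w][sent[j]] += 1
    return profiles

def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * q[w] for w, c in p.items())
    norm = math.sqrt(sum(c * c for c in p.values())) * \
           math.sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0
```

Per the Distributional Hypothesis, words that occur in similar contexts get similar profiles, so near-synonyms score high under cosine; the soft semantic constraints described above then refine such profiles with thesaurus-based sense groupings.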
The work reported here is the first to use distributional semantic
similarity measures to improve performance of an end-to-end phrase-based SMT
system. The unified framework for statistical NLP models with soft linguistic
constraints enables, in principle, the combination of both semantic and
syntactic constraints – and potentially other constraints, too – in a single model.
Yuval Marton. “Character-Based and Word-Based Classification: Experiments with Compression Methods and a Word-Based Language Modeling Method”. Master’s thesis, NYU/Poly, CIS Department, 2004.
Text classification is the task of taking a set of input documents that are labeled by category, and using that input information to classify other, unlabeled documents. There are many approaches to text classification. A somewhat non-standard approach is to use compression. […]
Yuval Marton. “What Can we Learn about Language Processing and Representation from Word Contour Effects on Letter Order Perception and Word Recognition in Right and Left Visual Fields?” Qualifying paper (Ling895), Department of Linguistics, University of Maryland, May 2007. Manuscript.
E-TIPSY Corpus (see Yuval Marton and Kristina Toutanova. “E-TIPSY: Search Query Corpus Annotated with Entities, Term Importance, POS Tags, and Syntactic Parses”. The Tenth International Conference on Language Resources and Evaluation (LREC 2016), 23-28 May 2016, Portorož, Slovenia.)
Please email me to request permission while the link above is under construction.
The CATiB Dependency Parser: Columbia University Arabic (MSA) syntactic dependency models with form-based and functional morphological features (see my CL article, 2012)
Please email me (or better: email Owen Rambow, Nizar Habash and me) to request permission while the link above is under construction.
Columbia University Arabic (MSA) syntactic dependency data with form-based and functional morphological features (2011)
With a training / testing split.
Previous versions were used in my NAACL SPMRL (2010) and ACL (2011) publications.
In order to use the data, you need to have BOTH of the following:
1. License from LDC to use the Penn Arabic Treebank part 3 (v3.1)
2. License or written permission from CCLS / Columbia University to use the Arabic functional features (functional gender, functional number and rationality); once you have obtained the LDC license, please email me (or better: email Owen Rambow, Nizar Habash and me) to request the CCLS / Columbia University permission.
Columbia University Arabic (MSA) syntactic dependency data for GALE (2009-2010)
Requires a GALE license; available to GALE participants.
Affiliate Assistant Professor, University of Washington. Machine Translation (Ling575), Winter 2016; Introduction to Natural Language Processing (CSS590), Spring 2017.
Teaching Assistant, University of Maryland, College Park. Computational Linguistics II (Ling647 / CMSC828R), taught by Philip Resnik, Spring 2006; Introductory Linguistics (Ling200), taught by Tonia Bleam, Spring 2008.
Co-chair (organizing committee) of the AMTA 2016 Workshop on Semitic Machine Translation (SeMaT). November 1, 2016, Austin, TX, USA.
Publication Chair of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), September 17–21, Lisbon, Portugal.
Publication Chair of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), October 25–29, Doha, Qatar.
Co-chair of the COLING 2014 First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Language (SPMRL-SANCL 2014), August 23-24 in Dublin, Ireland
Co-chair of the EMNLP 2013 Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2013), and treebank adviser for its shared task. October 18, Seattle, Washington
Publication Chair of the NAACL-HLT 2013 collocated Second Joint Conference on Lexical and Computational Semantics (*SEM), June 9-14, Atlanta, GA
General Co-chair (organizing committee member) of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages (SP-Sem-MRL 2012), July 12, Jeju Island, Republic of Korea
Publication Chair of the NAACL-HLT 2012 collocated First Joint Conference on Lexical and Computational Semantics (*SEM), June 7-8, Montreal, Canada