Back in business

Sun 18 Jul 2010 by mskala

I just got back from a week in Sweden at ACL 2010. It went pretty well. I presented my paper, which you can read online as a PDF; I didn't get a lot of response or questions right at the presentation, but I said what I wanted to say and at least they didn't throw things, and I'm told there was a lot of interest in it offline. In one of the workshops I actually found several people who wanted to talk to me about my dissertation research, and it'd be really cool if I could somehow redeem some of the years of work I put into that, so that's good. I took photos and some may end up posted here eventually.

Eddy, if you read this I hope you aren't offended I didn't try to visit you while in Sweden. I was making my travel arrangements under a bunch of constraints and I had a lot of obligations to my employers for how I'd use my time during the trip.

I got many ideas for new research to do. Here's one that should be fairly accessible to my Web log readers: adjective ordering, and especially whether it's transitive.

By way of background, the original Japanese version of the light novel that in English is called The Melancholy of Haruhi Suzumiya contains an occurrence of the noun phrase 「黒のミニタイトスカート」 (kuro no mini taito sukaato), which literally translates to something like "*black mini- tight skirt," except that in English we would never ever say that. It sounds laughably incorrect. Note that "mini," "tight," and "skirt" are all loan-words from English. But in English it would have to be a "tight black mini-skirt." Now, that's partly because the literal version is ungrammatical in English; "mini-" isn't an adjective, it's a prefix that becomes part of the noun, whereas "black" and "tight" are adjectives that don't bind so tightly to the noun, so we have to put "mini-" last. In Japanese, on the other hand, 「黒」 meaning "black" is a noun, not an adjective; it's used with a particle that changes it into something kind of like an adjective, and that may be a reason for it to come first. These issues are relatively easy to deal with because they flow from the syntax.

However, "*black tight mini-skirt" is wrong in English too; to be really correct, "tight" must come before "black" (though you can bend this rule by means of commas). To get it right you need to keep track of semantic information (information about meanings) beyond the basic syntax: colour adjectives like "black" come after adjectives of whatever semantic kind "tight" is.

That much is fairly well known and the people who study such things have written a lot about it. It's reasonable to suppose, and one of the talks I heard at ACL described this, that there should be a master list of adjectives somewhere and they always occur in that order. But is that really true? There may be classes of adjectives within which order doesn't matter; but what's worse, there may be adjectives for which there is a preferred order but it's intransitive! I mean, maybe there are adjectives A, B, and C where proficient native speakers would prefer to say "A B noun" not "B A noun"; "B C noun" not "C B noun"; but "C A noun" not "A C noun."

One of my colleagues ran a fast experiment during a coffee break at the conference and found a lot of triples with that property as far as occurrences go - where you are allowed to say "A B", "B C", and "C A." That doesn't close the question, though, because most of them seem to be situations where the three adjectives are all in the same class and don't really have a preferred ordering at all. I'm more interested in finding cases where the preferred ordering violates transitivity. Even if there are no such cases, showing that transitivity does apply would be interesting too.
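Here is a minimal sketch of how one might separate the two situations, assuming we already have counts of how often each ordered adjective pair shows up before a noun in some corpus. This is not my colleague's coffee-break experiment. The function names, the thresholds `min_ratio` and `min_count`, and the counts themselves are made up for illustration. An order only counts as "preferred" when it clearly dominates its reverse, so adjectives from the same class, which occur freely in both orders, don't register as cycles.

```python
from itertools import permutations


def preferred(counts, a, b, min_ratio=2.0, min_count=5):
    """True if the order "a b" clearly dominates "b a" in the counts.

    min_ratio and min_count are arbitrary illustrative thresholds.
    """
    ab = counts.get((a, b), 0)
    ba = counts.get((b, a), 0)
    return ab >= min_count and ab >= min_ratio * max(ba, 1)


def intransitive_triples(counts):
    """Return triples (a, b, c) with a>b, b>c, c>a in preferred order.

    Each cycle is reported once, rotated so its smallest word is first.
    """
    adjectives = sorted({word for pair in counts for word in pair})
    found = set()
    for a, b, c in permutations(adjectives, 3):
        if a != min(a, b, c):
            continue  # skip the other two rotations of the same cycle
        if (preferred(counts, a, b)
                and preferred(counts, b, c)
                and preferred(counts, c, a)):
            found.add((a, b, c))
    return sorted(found)


# Made-up counts: a well-behaved (transitive) set of adjectives.
transitive = {
    ("big", "red"): 40, ("red", "big"): 2,
    ("red", "wooden"): 30, ("wooden", "red"): 1,
    ("big", "wooden"): 25, ("wooden", "big"): 1,
}

# Made-up counts: placeholder adjectives A, B, C with a preference cycle.
cyclic = {("A", "B"): 10, ("B", "C"): 10, ("C", "A"): 10}

# Made-up counts: all orders occur about equally, so no real preference.
same_class = {
    ("A", "B"): 10, ("B", "A"): 9,
    ("B", "C"): 10, ("C", "B"): 9,
    ("C", "A"): 10, ("A", "C"): 9,
}

print(intransitive_triples(transitive))
print(intransitive_triples(cyclic))
print(intransitive_triples(same_class))
```

The third data set is the kind my colleague mostly found: "A B", "B C", and "C A" all occur, but so do their reverses, so nothing here counts as a violation of transitivity. Whether raw corpus counts are a fair stand-in for what proficient native speakers prefer is a separate question that this sketch doesn't address.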

Of course it's quite possible that others have already looked at this; my first step in studying it would be to look for whether the answer has already been found and published. Something else I'd like to look into is to what extent there's a preferred ordering (semantics-based, not just syntactic) in other languages, especially including Japanese. My tutor told me there isn't such an ordering in Japanese, but I have my doubts about that.

It is my theory that paper titles of the form "SystemNameAcronym: A Complicated Noun Phrase" correlate pretty well with what we might call baloney. I'm a co-author of one paper like that myself ("SEIDAM: A flexible and interoperable metadata-driven system for intelligent forest monitoring") and there are good reasons why people do use them; it's not something that absolutely must be avoided. Nonetheless, here without further comment is a list of the titles of papers presented at the 5th International Workshop on Semantic Evaluation, as extracted from their BibTeX file.

  • 273. Task 5. Keyphrase Extraction Based on Core Word Identification and Word Expansion
  • 372:Comparing the Benefit of Different Dependency Parsers for Textual Entailment Using Syntactic Constraints Only
  • BART: A Multilingual Anaphora Resolution System
  • BUAP: An Unsupervised Approach to Automatic Keyphrase Extraction from Scientific Articles
  • CFILT: Resource Conscious Approaches for All-Words Domain Specific WSD
  • CLR: Linking Events and Their Participants in Discourse Using a Comprehensive FrameNet Dictionary
  • COLEPL and COLSLM: An Unsupervised WSD Approach to Multilingual Lexical Substitution, Tasks 2 and 3 SemEval 2010
  • Cambridge: Parser Evaluation Using Textual Entailment by Grammatical Relation Comparison
  • CityU-DAC: Disambiguating Sentiment-Ambiguous Adjectives within Context
  • Combining Dictionaries and Contextual Information for Cross-Lingual Lexical Substitution
  • Corry: A System for Coreference Resolution
  • DERIUNLP: A Context Based Approach to Automatic Keyphrase Extraction
  • DFKI KeyWE: Ranking Keyphrases Extracted from Scientific Articles
  • Duluth-WSI: SenseClusters Applied to the Sense Induction Task of SemEval-2
  • ECNU: Effective Semantic Relations Classification without Complicated Features or Multiple External Corpora
  • Edinburgh-LTG: TempEval-2 System Description
  • FBK-IRST: Semantic Relation Extraction Using Cyc
  • FBK_NK: A WordNet-Based System for Multi-Way Classification of Semantic Relations
  • FCC: Modeling Probabilities with GIZA++ for Task 2 and 3 of SemEval-2
  • GPLSI-IXA: Using Semantic Classes to Acquire Monosemous Training Examples from Domain Texts
  • HERMIT: Flexible Clustering for the SemEval-2 WSI Task
  • HIT-CIR: An Unsupervised WSD System Based on Domain Most Frequent Sense Estimation
  • HITSZ_CITYU: Combine Collocation, Context Words and Neighboring Sentence Sentiment in Sentiment Adjectives Disambiguation
  • HR-WSD: System Description for All-Words Word Sense Disambiguation on a Specific Domain at SemEval-2010
  • HUMB: Automatic Key Term Extraction from Scientific Articles in GROBID
  • HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal Expressions
  • ID 392:TERSEO + T2T3 Transducer. A systems for Recognizing and Normalizing TIMEX3
  • IIITH: Domain Specific Word Sense Disambiguation
  • ISI: Automatic Classification of Relations Between Nominals Using a Maximum Entropy Classifier
  • ISTI@SemEval-2 Task 8: Boosting-Based Multiway Relation Classification
  • JAIST: Clustering and Classification Based Approaches for Japanese WSD
  • JU: A Supervised Approach to Identify Semantic Relations from Paired Nominals
  • JU_CSE_TEMP: A First Step towards Evaluating Events, Time Expressions and Temporal Relations
  • KCDC: Word Sense Induction by Using Grammatical Dependencies and Sentence Phrase Structure
  • KP-Miner: Participation in SemEval-2
  • KSU KDD: Word Sense Induction by Clustering in Topic Space
  • KUL: Recognition and Normalization of Temporal Expressions
  • KX: A Flexible System for Keyphrase eXtraction
  • Kyoto: An Integrated System for Specific Domain WSD
  • Likey: Unsupervised Language-Independent Keyphrase Extraction
  • MARS: A Specialized RTE System for Parser Evaluation
  • MSS: Investigating the Effectiveness of Domain Combinations and Topic Features for Word Sense Disambiguation
  • NCSU: Modeling Temporal Relations with Markov Logic and Lexical Ontology
  • OWNS: Cross-lingual Word Sense Disambiguation Using Weighted Overlap Counts and Wordnet Based Similarity Measures
  • OpAL: Applying Opinion Mining Techniques for the Disambiguation of Sentiment Ambiguous Adjectives in SemEval-2 Task 18
  • PKU_HIT: An Event Detection System Based on Instances Expansion and Rich Syntactic Features
  • PengYuan@PKU: Extracting Infrequent Sense Instance with the Same N-Gram Pattern for the SemEval-2010 Task 15
  • Proceedings of the 5th International Workshop on Semantic Evaluation
  • RACAI: Unsupervised WSD Experiments @ SemEval-2, Task 17
  • RALI: Automatic Weighting of Text Window Distances
  • RelaxCor: A Global Relaxation Labeling Approach to Coreference Resolution
  • SCHWA: PETE Using CCG Dependencies with the C&C Parser
  • SEERLAB: A System for Extracting Keyphrases from Scholarly Documents
  • SEMAFOR: Frame Argument Resolution with Log-Linear Models
  • SJTULTLAB: Chunk Based Method for Keyphrase Extraction
  • SUCRE: A Modular System for Coreference Resolution
  • SWAT: Cross-Lingual Lexical Substitution using Local Context Matching, Bilingual Dictionaries and Machine Translation
  • SZTERGAK : Feature Engineering for Keyphrase Extraction
  • SemEval-2 Task 15: Infrequent Sense Identification for Mandarin Text to Speech Systems
  • SemEval-2 Task 9: The Interpretation of Noun Compounds Using Paraphrasing Verbs and Prepositions
  • SemEval-2010 Task 10: Linking Events and Their Participants in Discourse
  • SemEval-2010 Task 11: Event Detection in Chinese News Sentences
  • SemEval-2010 Task 12: Parser Evaluation Using Textual Entailments
  • SemEval-2010 Task 13: TempEval-2
  • SemEval-2010 Task 14: Word Sense Induction & Disambiguation
  • SemEval-2010 Task 17: All-Words Word Sense Disambiguation on a Specific Domain
  • SemEval-2010 Task 18: Disambiguating Sentiment Ambiguous Adjectives
  • SemEval-2010 Task 1: Coreference Resolution in Multiple Languages
  • SemEval-2010 Task 2: Cross-Lingual Lexical Substitution
  • SemEval-2010 Task 3: Cross-Lingual Word Sense Disambiguation
  • SemEval-2010 Task 5 : Automatic Keyphrase Extraction from Scientific Articles
  • SemEval-2010 Task 7: Argument Selection and Coercion
  • SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of Nominals
  • SemEval-2010 Task: Japanese WSD
  • Single Document Keyphrase Extraction Using Sentence Clustering and Latent Dirichlet Allocation
  • TANL-1: Coreference Resolution by Parse Analysis and Similarity Clustering
  • TIPSem (English and Spanish): Evaluating CRFs and Semantic Roles in TempEval-2
  • TRIPS and TRIOS System for TempEval-2: Extracting Temporal Information from Text
  • TUD: Semantic Relatedness for Relation Classification
  • TreeMatch: A Fully Unsupervised WSD System Using Dependency Knowledge on a Specific Domain
  • Twitter Based System: Using Twitter for Disambiguating Sentiment Ambiguous Adjectives
  • UBA: Using Automatic Translation and Wikipedia for Cross-Lingual Lexical Substitution
  • UBIU: A Language-Independent System for Coreference Resolution
  • UC3M System: Determining the Extent, Type and Value of Time Expressions in TempEval-2
  • UCD-Goggle: A Hybrid System for Noun Compound Paraphrasing
  • UCD-PN: Selecting General Paraphrases Using Conditional Probability
  • UCF-WS: Domain Word Sense Disambiguation Using Web Selectors
  • UHD: Cross-Lingual Word Sense Disambiguation Using Multilingual Co-Occurrence Graphs
  • UMCC-DLSI: Integrative Resource for Disambiguation Task
  • UNITN: Part-Of-Speech Counting in Relation Extraction
  • UNPMC: Naive Approach to Extract Keyphrases from Scientific Articles
  • USFD2: Annotating Temporal Expresions and TLINKs for TempEval-2
  • UTD: Classifying Semantic Relations by Combining Lexical and Semantic Resources
  • UTDMet: Combining WordNet and Corpus Data for Argument Coercion Detection
  • UoY: Graphs of Unambiguous Vertices for Word Sense Induction and Disambiguation
  • UvT-WSD1: A Cross-Lingual Word Sense Disambiguation System
  • UvT: Memory-Based Pairwise Ranking of Paraphrasing Verbs
  • UvT: The UvT Term Extraction System in the Keyphrase Extraction Task
  • VENSES++: Adapting a deep semantic processing system to the identification of null instantiations
  • WINGNUS: Keyphrase Extraction Utilizing Document Logical Structure
  • YSC-DSAA: An Approach to Disambiguate Sentiment Ambiguous Adjectives Based on SAAOL

4 comments

Axel
As a translator for 56 years (counting the work I did for my father at 14) I have come to believe that words are meaningless; only texts have meaning. Maybe there are degrees of meaninglessness, with pronouns at the top of the list and Latin plant names at the bottom, but essentially it's all a cloud that makes sense in the eye of a distant beholder. Wrt "black tight mini-skirt", I agree that thrown at me as a random example, yes, it's not "really correct" and the order should be "tight black mini-skirt". But the perception of not-quite-rightness is mitigated, if not removed entirely, if I write "One girl wore a black tight mini-skirt and the other a blue tight mini-skirt" to emphasize the colours. It's all in the context and in the intentions of the artist.

I always say only two groups of people have authority when it comes to language: plumbers and poets. Axel - 2010-07-19 10:48
Matt
I'm not entirely convinced that "black tight mini-skirt" is the best-sounding phrasing when used with "blue tight mini-skirt" as in your example, but if you're right, it means there's a long-range dependency: the preferred order in one phrase changes depending on what's going on some distance away in the sentence. I think your example could even be split into two sentences and it would remain equally valid, so the dependency can jump outside the sentence level: "One girl wore a black tight mini-skirt. The other wore a blue tight mini-skirt." Long-range dependencies like that make a lot of trouble for mathematical models of language. They require you to use a much more powerful computational system to process the language than you would otherwise. Matt - 2010-07-19 11:28
eloj
No offense taken. eloj - 2010-07-19 22:54
Axel
Odd, I said pronouns were at the top of the list for meaninglessness. A slip: pronouns are fairly reliable (although Indo-European languages could use a few more). I mean prepositions. The English say on the island, the French say in the island - or, to be precise, they use the same preposition as for in the box - and don't get me going about countries, to Mexico, in Germany, und so weiter. Axel - 2010-07-22 22:58

