<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>antmarakis.github.io</title>
    <description>nlp (and other) musings</description>
    <link>https://antmarakis.github.io/</link>
    <atom:link href="https://antmarakis.github.io/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Mon, 29 May 2023 22:54:01 +0000</pubDate>
    <lastBuildDate>Mon, 29 May 2023 22:54:01 +0000</lastBuildDate>
    <generator>Jekyll v3.9.3</generator>
    
      <item>
        <title>LGBT Representation on Wikipedia</title>
        <description>&lt;p&gt;The authors examine the portrayal of LGBT people on Wikipedia, comparing it with that of non-LGBT people. A dataset of LGBT and non-LGBT bios was collected and annotated from Wikipedia. Further, the authors paired LGBT with non-LGBT people based on the similarity of their characteristics. The main tool they use to frame this problem is &lt;em&gt;contextual affective analysis&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Contextual affective analysis is an NLP technique that analyzes portrayals along the power, agency and sentiment axes. To help in this analysis, connotation frames have been developed, which are lexicons of verbs and the implications they elicit. To collect those from Wikipedia, annotators were asked questions such as “does the subject have less/equal/more power than the object?” and “does the subject have low/moderate/high agency?”.&lt;/p&gt;

&lt;p&gt;The way the agency of LGBT people is portrayed is sometimes misleading. The authors bring up Alan Turing as an example: in the Spanish and Russian Wikipedia entries he ‘chose’ and ‘preferred’ the hormonal injection treatment, verbs that show high agency, in contrast to the ‘accepted’ used in the English entry, which shows low agency.&lt;/p&gt;

&lt;p&gt;Overall, the authors find that in Russian, LGBT people are portrayed more negatively than non-LGBT people. In English, LGBT people are portrayed more positively, while in Spanish there is a balance between LGBT and non-LGBT people. It is noted that this finding aligns with the perception of LGBT people in these countries, according to a recent study they cite. There is little difference when it comes to a person’s date of birth (although in English, connotations were neutral for people born before 1900 and positive both for those born between 1900 and 1960 and for those born after 1960). There are, though, more significant differences when it comes to occupation. In Spanish, entertainers are viewed negatively while other occupations are neutral, but in Russian, entertainers are viewed neutrally while other occupations are more negative.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;Chan Young Park et al. Multilingual Contextual Affective Analysis of LGBT People Portrayals in Wikipedia. 2021. arXiv: 2010.10820 [cs.CL].&lt;/p&gt;
</description>
        <pubDate>Thu, 10 Mar 2022 00:00:00 +0000</pubDate>
        <link>https://antmarakis.github.io/2022/lgbt_wikipedia/</link>
        <guid isPermaLink="true">https://antmarakis.github.io/2022/lgbt_wikipedia/</guid>
        
        
        <category>Paper Notes</category>
        
      </item>
    
      <item>
        <title>WANLI - Worker/AI Process for NLI Dataset Creation</title>
        <description>&lt;p&gt;A main challenge in data curation is that while humans can write correct text, we often write repetitively and without diversity. A model trained on the resulting dataset is prone to overfitting, which may lead to low performance on out-of-domain data.&lt;/p&gt;

&lt;p&gt;In this work, the authors propose a joint human-machine scheme for data collection, bringing together the generative strengths of machine learning models and the evaluative strengths of humans. With Data Maps, they identify the examples most helpful for training (for example, the ambiguous ones) and use GPT-3 to generate similar examples (by asking the model to generate premise-hypothesis pairs for the entailment/neutral/contradiction labels, in a paradigm similar to few-shot learning). Then, humans evaluate and complete the annotation of these examples, either by adjusting the text (for example, improving its fluency) or by amending labels.&lt;/p&gt;

&lt;p&gt;WaNLI results in improvements over multiple out-of-domain datasets, including an 11% improvement on HANS and 9% on Adversarial NLI. Notably, the proposed technique is not tied to NLI, but can be used for any classification task. Further, when finetuning on MNLI plus an auxiliary dataset, WaNLI performs best as the auxiliary dataset. Still, finetuning solely on WaNLI yields the second-best performance, behind only pairing WaNLI with an auxiliary adversarial NLI dataset.&lt;/p&gt;

&lt;p&gt;Finally, the authors analyze artifacts present in WaNLI. They show that a model that sees only one of premise/hypothesis (so that there is no way to infer the label) reaches an accuracy of approximately 42%, while the same experiment on MNLI yields 50%. Further, the semantic similarity between the premise and the hypotheses of the three labels (neutral, entailment and contradiction) is computed for MNLI and WaNLI. In MNLI, the premise is most similar to the entailment, followed by the neutral and then the contradiction hypothesis. In WaNLI, the representations are much closer, with more overlap, making them harder to distinguish (especially neutral and contradiction). Finally, with Data Maps the authors show that in MNLI most examples are easy to learn, while in WaNLI the distribution is more spread out.&lt;/p&gt;

&lt;p&gt;In conclusion, the authors not only propose an interesting and useful data collection process, but also deliver an NLI dataset with (seemingly) more robustness than current ones.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;Alisa Liu et al. WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation. 2022. arXiv: 2201.05955 [cs.CL].&lt;/p&gt;
</description>
        <pubDate>Mon, 21 Feb 2022 00:00:00 +0000</pubDate>
        <link>https://antmarakis.github.io/2022/wanli/</link>
        <guid isPermaLink="true">https://antmarakis.github.io/2022/wanli/</guid>
        
        
        <category>Paper Notes</category>
        
      </item>
    
      <item>
        <title>Lipstick on a Pig - Evaluation of Gender Debiasing Methods</title>
        <description>&lt;p&gt;The authors evaluate gender debiasing methods on a series of proposed experiments, showing that gender bias is not removed but merely hidden in word embeddings. The two evaluated methods reduce the projection of words on a gender direction, either via post-processing or during training. These methods are shown to only reduce the bias of certain words, while the biased relationships between words remain.&lt;/p&gt;

&lt;p&gt;First, the authors employ a clustering method to cluster the top-500 most biased words into male and female groups. Clusters align with the original gender bias 99.9% of the time. After debiasing, this number drops to 92.5% with one method and to 85.2% with the other. This shows that whereas the bias of individual words is reduced, biased words remain neighbors of other biased words.&lt;/p&gt;

&lt;p&gt;Finally, the authors train an SVM classifier to predict the gender association of words based only on their representations. Before debiasing, accuracy is 98%; after debiasing, it is 88% and 96% for the two methods, respectively.&lt;/p&gt;
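&lt;p&gt;This kind of probe is easy to reproduce on synthetic data. The sketch below is entirely illustrative (made-up vectors, and a nearest-centroid linear classifier standing in for the paper’s SVM): the gender signal is planted both in an explicit gender direction and in a correlated dimension, the gender direction is zeroed out in the spirit of hard debiasing, and the classifier still recovers the labels:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 50

# Synthetic "male-" and "female-biased" word vectors: the gender signal sits
# both in an explicit gender direction (dim 0) and in a correlated dim (dim 1).
male = rng.normal(0.0, 0.3, (n, d))
female = rng.normal(0.0, 0.3, (n, d))
male[:, 0] += 1.0; male[:, 1] += 1.0
female[:, 0] -= 1.0; female[:, 1] -= 1.0

# Hard debiasing in the spirit of Bolukbasi et al.: zero out the projection
# on the (known) gender direction.
X = np.vstack([male, female])
X[:, 0] = 0.0
y = np.array([1] * n + [0] * n)

# Nearest-centroid linear classifier as a stand-in for the SVM probe.
mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
pred = ((X - (mu1 + mu0) / 2) @ (mu1 - mu0) > 0).astype(int)
acc = (pred == y).mean()
print(round(acc, 2))  # near 1.0: the bias survives in the correlated dimension
```

&lt;p&gt;Accuracy stays near 100% because the classifier exploits the correlated dimension that the projection never touched, which is exactly the residual structure the authors describe.&lt;/p&gt;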

&lt;p&gt;Thus, the authors show that these pioneering methods in gender debiasing only work on a shallow level.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;Hila Gonen and Yoav Goldberg. “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, url: https://aclanthology.org/N19-1061.&lt;/p&gt;
</description>
        <pubDate>Wed, 09 Feb 2022 00:00:00 +0000</pubDate>
        <link>https://antmarakis.github.io/2022/lipstick-on-a-pig/</link>
        <guid isPermaLink="true">https://antmarakis.github.io/2022/lipstick-on-a-pig/</guid>
        
        
        <category>Paper Notes</category>
        
      </item>
    
      <item>
        <title>Racial Bias in Hate Speech</title>
        <description>&lt;p&gt;It is known that bias exists in hate speech datasets across many axes. In this work, the authors investigate staple Twitter hate speech datasets for bias against African American English (AAE). They find that classifiers predict AAE tweets as hate speech more often than White American English ones. This corroborates previous findings and the intuition that AAE has been disproportionately labeled as hate speech.&lt;/p&gt;

&lt;p&gt;To identify AAE tweets, they make use of the popular Blodgett classifier, denoting AAE and WAE tweets as “black-aligned” and “white-aligned”. The null hypothesis (that there is no racial bias) is that the classifier labels examples independently of the author’s race. If the probability of assigning a given class is higher for tweets aligned with one race than with the other, the null hypothesis can be rejected and racial bias does exist. The authors expect white-aligned tweets to be more likely to use racist language, since white people are more likely to be racist against black people than black people themselves are. They also expect no difference between the two groups on the axis of sexism.&lt;/p&gt;
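&lt;p&gt;The flavor of this hypothesis test can be sketched as a two-proportion z-test. The counts below are made up purely for illustration, and the paper’s actual statistical machinery differs; the point is only how “labels are independent of race” gets rejected:&lt;/p&gt;

```python
import math

# Hypothetical counts: tweets flagged as hateful per group (illustrative only).
black_flagged, black_total = 520, 2000
white_flagged, white_total = 310, 2000

p1 = black_flagged / black_total
p2 = white_flagged / white_total
# Pooled proportion under the null hypothesis of equal flagging rates.
p = (black_flagged + white_flagged) / (black_total + white_total)
se = math.sqrt(p * (1 - p) * (1 / black_total + 1 / white_total))
z = (p1 - p2) / se
print(round(z, 2))  # a large z rejects "labels are independent of race"
```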

&lt;p&gt;The authors show that there is a racial bias in datasets, with tweets from black authors being more likely to be hateful, offensive, or sexist, while white authors are more likely to be racist.&lt;/p&gt;

&lt;p&gt;In the discussion section, an interesting point is made about the use of the “n-word” in AAE. Citing Waseem et al. (2018), they claim that in this case the word shouldn’t be considered offensive and that African Americans shouldn’t be penalized for using it. I believe this debate is at the core of hate speech research: the task is so heavily dependent on context and world knowledge that it is impossible to tackle with current methods. What happens if a white user uses the term? What if they are quoting it? This is such a fine point that it has caused paralysis among researchers in the area (and unsurprisingly so, for it is a daunting task).&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. &lt;a href=&quot;https://aclanthology.org/W19-3504/&quot;&gt;Racial bias in hate speech and abusive language detection datasets.&lt;/a&gt; In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Florence, Italy. Association for Computational Linguistics.&lt;/p&gt;
</description>
        <pubDate>Fri, 29 Oct 2021 00:00:00 +0000</pubDate>
        <link>https://antmarakis.github.io/2021/racial_bias_in_hatespeech/</link>
        <guid isPermaLink="true">https://antmarakis.github.io/2021/racial_bias_in_hatespeech/</guid>
        
        
        <category>Paper Notes</category>
        
      </item>
    
      <item>
        <title>Documenting Large Webtext Corpora</title>
        <description>&lt;p&gt;The Colossal Clean Crawled Corpus (C4) is a corpus curated for pretraining large language models. It is clean in that a set of filters was applied, from removing text containing words from a banned-word list to language detection (keeping only English examples). The authors here provide an extensive analysis of this corpus.&lt;/p&gt;

&lt;p&gt;One of the main findings is that a lot of machine-generated text was found, for example translations of patents. A lot of content also comes from patent websites themselves (the top website by far is patents.google.com, while patents.com is in the top 10), as well as US government websites (.gov and .mil, for the military). Other NLP datasets (including their test sets) were found in the corpus as well, mostly because the datasets were extracted from Wikipedia/IMDB/etc. or because they were uploaded to, for example, GitHub. Worryingly, since a lot of the filtering was keyword-based (removing text containing words from a banned-word list), text about and from African Americans and the LGBT+ community, among other communities, was filtered out.&lt;/p&gt;

&lt;p&gt;In an analysis of metadata, the authors found that while 92% of the documents were from the last decade (2010-2019), some of the documents date back to 1996. Also, around 50% of the documents come from the US, followed by Germany and the United Kingdom.&lt;/p&gt;

&lt;p&gt;As for demographic biases, it was found that positive mentions of Jewish people stood at 73% in the corpus, versus 65% for Arabs. Further, mentions of sexual orientation were filtered out (both heterosexual and LGBT+), even though only approximately 30% of these mentions were actually offensive (after a manual inspection). African American and Hispanic-aligned English is also filtered out at a disproportionate rate (around 4 times more likely to be removed than White American English).&lt;/p&gt;

&lt;p&gt;The authors also found that patent data was not only the most prevalent, but also quite often translated from another language, and that some of it originally came from OCR systems that could have added further noise.&lt;/p&gt;

&lt;p&gt;Finally, to increase transparency, the authors propose that data excluded by the filters should also be released, both for completeness and to allow further analysis.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. &lt;a href=&quot;https://arxiv.org/abs/2104.08758&quot;&gt;Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus&lt;/a&gt;, Empirical Methods on Natural Language Processing (EMNLP), 2021.&lt;/p&gt;
</description>
        <pubDate>Thu, 21 Oct 2021 00:00:00 +0000</pubDate>
        <link>https://antmarakis.github.io/2021/documenting_large_corpora/</link>
        <guid isPermaLink="true">https://antmarakis.github.io/2021/documenting_large_corpora/</guid>
        
        
        <category>Paper Notes</category>
        
      </item>
    
      <item>
        <title>Expert-Annotated Misogyny Dataset</title>
        <description>&lt;p&gt;In this paper, a Reddit dataset annotated by trained experts is presented, alongside a hierarchical taxonomy of misogyny found online. The entries in the dataset take context into account, following conversation threads. They also annotate the specific span of the text that includes misogynistic content.&lt;/p&gt;

&lt;p&gt;Targeted sampling was used, with the authors picking some known misogynistic subreddits. To balance the data bias towards specific terms being associated with misogynistic content, the authors also sampled text from random subreddits. Thus, certain keywords will appear in a general context as well as in a misogynistic one.&lt;/p&gt;

&lt;p&gt;Labeling took place on three levels. First is a binary class (misogynistic and non-misogynistic), then granularity is added within each class, while finally labels such as “threatening language” are applied. For misogynistic content, the following labels are presented:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Misogynistic Pejoratives: derogatory language terms against women&lt;/li&gt;
  &lt;li&gt;Misogynistic Treatment: comments that advocate for dangerous or disrespectful actions against women&lt;/li&gt;
  &lt;li&gt;Misogynistic Derogation: comments that imply that women are inferior&lt;/li&gt;
  &lt;li&gt;Gendered Personal Attacks: attacks that are about the recipient’s gender&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For non-misogynistic content, the authors capture non-misogynistic personal attacks and counter speech (replies that counter abusive language) as well as neutral content.&lt;/p&gt;

&lt;p&gt;Annotators were involved in developing the hierarchy of definitions, and during the annotation phase weekly meetings were held where annotation disagreements were hashed out with an expert PhD researcher as intermediary, until a consensus was reached. For the binary task (first level), inter-annotator agreement was 0.48 for both Fleiss’ Kappa and Krippendorff’s Alpha, which is higher than or at least on par with the other work mentioned. Kappa values for the granular misogynistic labels (level two) can be seen in the following table (as presented in the paper).&lt;/p&gt;
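&lt;p&gt;For reference, Fleiss’ Kappa can be computed directly from a table of per-item category counts. A minimal sketch with a made-up binary (misogynistic / not) annotation table, not data from the paper:&lt;/p&gt;

```python
import numpy as np

def fleiss_kappa(counts):
    """counts[i, j] = number of annotators assigning item i to category j."""
    counts = np.asarray(counts, dtype=float)
    n_raters = counts.sum(axis=1)[0]
    p_j = counts.sum(axis=0) / counts.sum()   # overall category proportions
    # Per-item observed agreement, then mean observed vs. chance agreement.
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 5 items, 3 annotators, 2 categories.
table = [[3, 0], [2, 1], [1, 2], [0, 3], [2, 1]]
print(round(fleiss_kappa(table), 3))  # 0.196 on this toy table
```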

&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/v1bhp22.png&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As for the distribution of classes, 10% are misogynistic, with the most prevalent subclasses being pejoratives and derogation (both at 4%).&lt;/p&gt;

&lt;p&gt;In their experiments, model performance was overall quite low. The best examined model is BERT with class weighting (to account for the low percentage of positive examples), achieving precision of 0.38, recall of 0.5, F1 of 0.43, and accuracy of 89%. In their error analysis, the authors found many false positives: mentions of women get classified as misogynistic even when they are not.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;Ella Guest, Bertie Vidgen, Alexandros Mittos, Nishanth Sastry, Gareth Tyson, and Helen Margetts. 2021. &lt;a href=&quot;https://www.aclweb.org/anthology/2021.eacl-main.114/&quot;&gt;An expert annotated dataset for the detection of online misogyny&lt;/a&gt;. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1336–1350, Online. Association for Computational Linguistics.&lt;/p&gt;
</description>
        <pubDate>Fri, 25 Jun 2021 00:00:00 +0000</pubDate>
        <link>https://antmarakis.github.io/2021/misogyny_annotated_data/</link>
        <guid isPermaLink="true">https://antmarakis.github.io/2021/misogyny_annotated_data/</guid>
        
        
        <category>Paper Notes</category>
        
      </item>
    
      <item>
        <title>Intersectional Bias in Hate Speech Datasets</title>
        <description>&lt;p&gt;Previous work has examined bias in hate speech datasets with regard to race and with regard to gender, but the intersection of the two had not yet been examined. This is an oversight, since intersectionality is all the rage in social science circles and poses an interesting question to the NLP community.&lt;/p&gt;

&lt;p&gt;The dataset investigated here is the Founta Twitter dataset. Labels for gender, race and political party affiliation were generated by models trained on relevant data. For race, a previously developed model is used to classify the language of tweets as closer to African American or White American speech. For gender and political affiliation, the authors trained models on datasets containing such labels. A big issue with this approach is accuracy: if a white male writes in African American vernacular, the tweet will be classified as coming from an African American, and vice versa, a black man writing without dialect will likely be marked as White American.&lt;/p&gt;

&lt;p&gt;Inspired by social science literature, the authors hypothesize that tweets from African Americans will be marked as offensive/hateful more often than their White American counterparts, and furthermore that tweets by African American males will be labeled as problematic more often than those written by African American females. This hypothesis is found to hold, although the intersectional effect seems small: while African American males are almost four times more likely than White Americans to have tweets labeled offensive/hateful, compared to African American women the increase is only 77%.&lt;/p&gt;

&lt;p&gt;An interesting follow-up would be to see whether African Americans themselves agree that the flagged tweets by African Americans are hateful. Maybe the problem again lies with annotator bias. Maybe, though, the problem lies with the setup of this study, where these sensitive labels were generated by models and may not depict reality very accurately.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Jae Yeon Kim et al. Intersectional Bias in Hate Speech and Abusive Language Datasets. 2020. arXiv: 2005.05921 [cs.CL].&lt;/em&gt;&lt;/p&gt;
</description>
        <pubDate>Fri, 05 Mar 2021 00:00:00 +0000</pubDate>
        <link>https://antmarakis.github.io/2021/intersectionality-bias-hate-speech/</link>
        <guid isPermaLink="true">https://antmarakis.github.io/2021/intersectionality-bias-hate-speech/</guid>
        
        
        <category>Paper Notes</category>
        
      </item>
    
      <item>
        <title>Words Are Windows to the Soul (Of News Spreaders)</title>
        <description>&lt;p&gt;Alongside representations of fake news text, representations of the users who spread fake news can also be utilized in detection. It has been shown that users who share fake news share some common characteristics. In this paper, user representations are generated from the text these users produce and combined with the representation of the news, providing an increase in detection accuracy. The authors also analyze the language used by fake news spreaders and find that, compared to normal users, they not only have a distinct focus of topics and emotions, but also typographic variations (e.g. abnormal use of punctuation). Their analysis takes place over the FakeNewsNet datasets (PolitiFact and GossipCop).&lt;/p&gt;

&lt;p&gt;For user representations, both their tweets and descriptions are used. The model has two CNN-based modules, one for users and one for news. These two resulting representations are given to a one-layer feedforward network.&lt;/p&gt;

&lt;p&gt;The authors experiment with providing the model with only the news, only the user information, or both. Providing both performs best for CNNs. With only user information, performance drops, although not by much.&lt;/p&gt;

&lt;p&gt;An analysis of linguistic features is also presented. N-grams are categorized into multiple feature groups, such as punctuation, pronouns, topics, proper names, etc. Among topics, Negative Emotion and Death appear often in fake news, whereas for real news the most prominent topics are Government and Politics. As for the spreaders themselves, there seems to be little vocabulary overlap between fake and real news spreaders: fake news spreaders use a similar vocabulary across both datasets, whereas real news spreaders do not. In line with previous studies, fake news spreaders use more punctuation marks, emotive language and first-person pronouns, while real news spreaders use more domain-specific words. Finally, the similarity between fake news and its spreaders is low, whereas it is higher for real news.&lt;/p&gt;

&lt;p&gt;As a further analysis, they investigate whether echo chambers exist within spreader circles. They conjecture that the similarity between user representations drops the further apart two users are in the social graph. This indeed seems to be the case, both for fake and real news spreaders.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Marco Del Tredici and Raquel Fernández. “Words are the Window to the Soul: Language-based User Representations for Fake News Detection”. In: Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 5467–5479. url: https://www.aclweb.org/anthology/2020.coling-main.477.&lt;/em&gt;&lt;/p&gt;
</description>
        <pubDate>Tue, 09 Feb 2021 00:00:00 +0000</pubDate>
        <link>https://antmarakis.github.io/2021/words-windows-to-soul/</link>
        <guid isPermaLink="true">https://antmarakis.github.io/2021/words-windows-to-soul/</guid>
        
        
        <category>Paper Notes</category>
        
      </item>
    
      <item>
        <title>How Decoding Methods Affect Generated Text Verifiability</title>
        <description>&lt;p&gt;Work on analyzing generated text has focused on fluency and grammar. In this paper, the accordance of generated text with real-world knowledge is put to the test. The decoding method turns out to play a major role in how verifiable text is, and there is a high correlation between repetitiveness and verifiability: whereas top-k and nucleus sampling produce less repetitive text, that text is also less verifiable. The authors propose a decoding strategy that produces text that is at once less repetitive and more verifiable.&lt;/p&gt;

&lt;p&gt;Text that is verifiable is text that can be either corroborated or refuted by a knowledge base (for example, Wikipedia).&lt;/p&gt;

&lt;p&gt;Their decoding method is a hybrid of sampling and likelihood-maximization approaches: the first L tokens are generated with sampling and the rest with beam search. By tuning L, they can strike a balance between the two approaches. Text is generated after the model is given the first five tokens of a Wikipedia article.&lt;/p&gt;
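&lt;p&gt;The decoding scheme itself is simple to sketch. Below, a toy deterministic distribution stands in for the language model and greedy decoding (beam size 1) stands in for beam search; only the switch from sampling to likelihood maximization after L tokens mirrors the paper:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 20

def next_token_probs(prefix):
    # Toy stand-in for a language model: any deterministic distribution works.
    logits = np.cos(np.arange(VOCAB) * (len(prefix) + 1))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def hybrid_decode(prompt, L, total_len):
    """Sample the first L continuation tokens, then maximize likelihood."""
    seq = list(prompt)
    while len(seq) != total_len:
        p = next_token_probs(seq)
        if len(seq) - len(prompt) >= L:       # past the sampling window:
            seq.append(int(np.argmax(p)))     # greedy step (beam size 1)
        else:
            seq.append(int(rng.choice(VOCAB, p=p)))
    return seq

out = hybrid_decode(prompt=[1, 2, 3, 4, 5], L=3, total_len=12)
print(out)
```

&lt;p&gt;Setting L to the total length recovers pure sampling; setting it to 0 recovers pure likelihood maximization.&lt;/p&gt;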

&lt;p&gt;Generated text can be verified (supported or refuted) or unverified. The authors measure both the number of supported sentences over all generated sentences and the number of supported sentences over all verified sentences. To counter duplicated text, only unique supports are counted. To measure repetitiveness, they employ two further metrics: distinct 4-grams and the 4-gram overlap between human and machine-generated text, where the human text is the first 256 tokens of the corresponding Wikipedia article.&lt;/p&gt;
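&lt;p&gt;The distinct n-gram metric is straightforward to compute; a minimal sketch (the example sentences are my own, not from the paper):&lt;/p&gt;

```python
def distinct_n(tokens, n=4):
    """Fraction of unique n-grams among all n-grams: higher means less repetitive."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / max(len(grams), 1)

repetitive = "the cat sat on the mat the cat sat on the mat".split()
varied = "the quick brown fox jumps over the lazy sleeping dog today".split()
print(round(distinct_n(repetitive), 2), round(distinct_n(varied), 2))  # 0.67 1.0
```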

&lt;p&gt;To fact-check the statements, they make use of an off-the-shelf BERT-based model trained on FEVER. It is shown that likelihood approaches (like greedy or beam search) indeed produce more verifiable text (even though it is more repetitive). Their hybrid decoding approach achieves a good balance between repetitiveness and verifiability. Human evaluation was also carried out, and the results from the fact-checker model were corroborated.&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Luca Massarelli et al. “How Decoding Strategies Affect the Verifiability of Generated Text”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, Nov. 2020, pp. 223–235. doi: 10.18653/v1/2020.findings-emnlp.22. url: https://www.aclweb.org/anthology/2020.findings-emnlp.22&lt;/em&gt;&lt;/p&gt;
</description>
        <pubDate>Wed, 03 Feb 2021 00:00:00 +0000</pubDate>
        <link>https://antmarakis.github.io/2021/text-verifiability-decoding-methods/</link>
        <guid isPermaLink="true">https://antmarakis.github.io/2021/text-verifiability-decoding-methods/</guid>
        
        
        <category>Paper Notes</category>
        
      </item>
    
      <item>
        <title>Stolen Probabilities</title>
        <description>&lt;p&gt;In this paper, the inability of neural network language models (NNLMs) to assign high probabilities to rare words in high-probability contexts (for example, to “America” in “United States of America”) is examined. The concrete models studied are AWD-LSTMs.&lt;/p&gt;

&lt;p&gt;Words with a larger embedding norm (that is, more frequent words) are assigned higher probabilities than less frequent words, even when the two appear in similar contexts. The authors show that words on the convex hull of the embedding space can achieve high probabilities, whereas words in the interior of the convex hull are bounded by the probabilities of words &lt;em&gt;on&lt;/em&gt; the hull.&lt;/p&gt;
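&lt;p&gt;The geometric argument can be checked numerically. In the toy setup below (my own construction, not the paper’s), one output embedding sits at the centroid of the others; under a dot-product softmax it can never win the argmax, whatever the hidden state:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Output embeddings: four "hull" words and one word at their centroid (interior).
hull = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
interior = hull.mean(axis=0, keepdims=True)   # strictly inside the hull
E = np.vstack([hull, interior])               # row 4 is the interior word

# Softmax ranks follow the dot-product logits, so it suffices to check the
# argmax: for any hidden state h, some hull word has a larger logit than the
# interior word, whose probability is therefore structurally capped.
wins = 0
for _ in range(10_000):
    h = rng.normal(size=2)
    logits = E @ h
    wins += int(np.argmax(logits) == 4)
print(wins)  # 0
```

&lt;p&gt;The interior word never attains the highest probability, no matter how appropriate it would be in context, which is the stolen probability effect in miniature.&lt;/p&gt;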

&lt;p&gt;The authors develop an approximation algorithm to find interior and non-interior words, with high precision and low recall. They extract the top 500 words of each set by probability and compute trigram probabilities over them; these trigram probabilities do not exhibit the stolen probability effect.&lt;/p&gt;

&lt;p&gt;In general, the difference between non-interior and interior words is much starker for NNLMs, with the average probability of interior words being lower in that setup. The authors show that ensembling the NNLM probabilities with models that do not exhibit the stolen probability effect improves performance substantially. The more dimensions, the smaller this effect (i.e. higher probabilities in the interior).&lt;/p&gt;

&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;David Demeter, Gregory Kimmel, and Doug Downey. Stolen Probability: A Structural Weakness of Neural Language Models. 2020. arXiv: 2005.02433 [cs.LG].&lt;/em&gt;&lt;/p&gt;
</description>
        <pubDate>Wed, 27 Jan 2021 00:00:00 +0000</pubDate>
        <link>https://antmarakis.github.io/2021/stolen-probability/</link>
        <guid isPermaLink="true">https://antmarakis.github.io/2021/stolen-probability/</guid>
        
        
        <category>Paper Notes</category>
        
      </item>
    
  </channel>
</rss>
