A Data-Driven Look at Tone: Part 1

Mon, Aug 28, 2017 R, twitter, rtweet, tone, open science


Many a blog post has been written on the topic of “tone” in Psychology these days. Many believe there are camps within the field: some espouse the idea of “open science”, others believe the old ways are tried and true.

In September 2016, Susan Fiske published an editorial in the APS Observer titled “Mob Rule? or Wisdom of Crowds?”. A draft version of this article, released on Datacolada, described “methodological terrorists” who were “self-appointed data police volunteering critiques of such personal ferocity and relentless frequency that they resemble a denial-of-service attack that crashes a website by sheer volume of traffic”.

That claim launched a thousand ships…er, blog posts, tweet storms, and email chains, and has been reverberating for the past year.

Susan Fiske seems to be critiquing the “open science” movement - a grassroots uprising within psychology and other fields that is attempting to change publishing practices and increase awareness of issues in data collection and analysis in an attempt to decrease the number of false positive findings in the published academic literature.

Does this group have an issue with tone? Let’s try and collect some data!

I reached for the data that was easiest to get - tweets (attack of the convenience sample!). Twitter also has the reputation of being a hive of scum and villainy, so if I am going to find tone issues, this is a good place to look.

I first searched for users who have “open science” in their Twitter biography. I did one search using rtweet, which has a cap of 1,000 users per search - so I went with the first 1,000 people.

##   n_distinct(screen_name)
## 1                    1000

I then took each Twitter handle and scraped the most recent 200 tweets from that profile. If the profile had fewer than 200 tweets, I took all of them.

After errors collecting tweets from some accounts, I wound up with tweets from 975 of the 1,000 users: 180,439 tweets in total.

##   n_distinct(screen_name) length(text)
## 1                     975       180439
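The collection step described above might look something like the sketch below. This is not the code used for this post: `collect_timelines` and its `fetch` argument are hypothetical names, and the real calls (`rtweet::search_users()` and `rtweet::get_timeline()`) need Twitter API credentials, so the demonstration swaps in a stand-in fetcher that fails for one handle.

```r
# Sketch only: `collect_timelines` and `fetch` are hypothetical names.
# With API credentials, the user search would be something like:
#   users <- rtweet::search_users("open science", n = 1000)
collect_timelines <- function(users, n = 200, fetch = rtweet::get_timeline) {
  results <- lapply(users, function(u) {
    # A failed request (protected or deleted account, rate limit, ...)
    # returns NULL instead of stopping the whole loop
    tryCatch(fetch(u, n = n), error = function(e) NULL)
  })
  # Drop the failures and stack the rest into one data frame
  # (assumption: for real rtweet output, dplyr::bind_rows() may be a safer choice)
  ok <- !vapply(results, is.null, logical(1))
  do.call(rbind, results[ok])
}

# Offline demonstration with a stand-in fetcher that errors for one handle
fake_fetch <- function(user, n) {
  if (user == "broken") stop("simulated API error")
  data.frame(screen_name = user,
             text = c("tweet one", "tweet two"),
             stringsAsFactors = FALSE)
}

tweets <- collect_timelines(c("a", "broken", "b"), fetch = fake_fetch)
length(unique(tweets$screen_name))  # users actually collected
nrow(tweets)                        # tweets actually collected
```

Wrapping each request in `tryCatch()` is one way to end up with fewer users than were requested without losing the tweets that did come back.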

I then split all the tweets into individual words - this is the most basic way of looking at text content. I’ll delve into more later, but as a fast first pass, this may be informative. I removed common stop words (things like “and”, “if”, “but”, etc.) and an additional list of common non-words (“http”, “rt”, “1-10”, “amp”, “de”, “la”, “en”).

I also removed “trump” - this word is a confound in the forthcoming sentiment analysis. Typically “trump” is scored as a positive word, as in “I trumped my opponent” - but odds are that these days the word refers to the current president. As I didn’t want to wade into political tweets, I removed this word for now.

After removing the stop words, I had a total of 1,603,751 words tweeted from “open science” tweeters (is that what they’re called?).

Here is a table of word frequencies after removing the stop words and the custom list of words.

## # A tibble: 323,313 x 2
##           word     n
##          <chr> <int>
##  1     science  9293
##  2        data  7394
##  3    research  5403
##  4        time  3648
##  5      people  3424
##  6 openscience  3173
##  7         day  3018
##  8       check  2458
##  9        join  2411
## 10        read  2383
## # ... with 323,303 more rows
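A frequency table like the one above can be built in a few lines. The sketch below assumes the dplyr and tidytext packages and uses two made-up tweets in place of the real data. The `custom_stop` name is hypothetical, and reading “1-10” as the numbers 1 through 10 is my assumption.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the collected tweets (made-up text)
tweets <- tibble(
  screen_name = c("a", "b"),
  text = c("RT Check out the open science data",
           "Trump and science data: join us at http")
)

# Extra words to drop: the non-words listed in the post, plus "trump".
# Assumption: "1-10" is read here as the numbers 1 through 10.
custom_stop <- tibble(
  word = c("http", "rt", "amp", "de", "la", "en", "trump", as.character(1:10))
)

word_counts <- tweets %>%
  unnest_tokens(word, text) %>%             # one lowercased word per row
  anti_join(stop_words, by = "word") %>%    # tidytext's built-in stop word list
  anti_join(custom_stop, by = "word") %>%   # the custom list above
  count(word, sort = TRUE)                  # frequency table, most common first

word_counts
```

`unnest_tokens()` lowercases by default, which is why “RT” and “Trump” are matched by the lowercase entries in the custom list.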

So it is good that, from these 975 users with “open science” in their twitter biography, “science”, “data” and “research” are the most common words in their tweets!

Here is the list in graph form, looking at the top words by frequency.

Next, I kept only the sentiment words. This reduced the count from 1,603,751 total words (after stop word removal) to 108,397 sentiment words. Words like “science” and “data” carry no sentiment, but words like “love” and “hate” do - so these are the words I continued with.

What is the breakdown of positive and negative words?

## # A tibble: 2 x 2
##   sentiment     n
##       <chr> <int>
## 1  positive 62933
## 2  negative 45464

## [1] 1.384238

Here we can see that positive words are more frequent than negative words: the ratio of positive to negative words is 1.384.
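For readers who want to reproduce this kind of breakdown, here is a minimal sketch. It assumes the Bing lexicon that ships with tidytext (the post does not say which lexicon was used) and a handful of made-up words in place of the real tweet data.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the cleaned word list (made-up words)
words <- tibble(
  word = c("love", "great", "happy", "hate", "wrong", "science", "data")
)

# Keep only words found in the lexicon, then count by sentiment.
# Assumption: Bing lexicon, which labels words "positive" or "negative".
sentiment_counts <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)

pos <- sentiment_counts$n[sentiment_counts$sentiment == "positive"]
neg <- sentiment_counts$n[sentiment_counts$sentiment == "negative"]
pos / neg  # ratio of positive to negative words in the toy data

# The same arithmetic on the totals reported in this post
62933 / 45464  # about 1.384
```

The `inner_join()` is what drops neutral words such as “science” and “data”: they have no row in the lexicon, so they never reach the count.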

Below is a word cloud of the 100 most frequent sentiment words (both positive and negative).

And here is a word cloud of the top 50 words, broken down into positive (blue) and negative (red) words.

So far, it does not seem that those who identify with “open science” are producing critiques with the frequency to resemble a denial-of-service attack.

But maybe the “open science” folks are more negative than some other sub-group of scientific tweeters?

Next post, I’ll find a comparison group of tweeters and do a side-by-side comparison!

.. .. ..

Cliffhanger ending!