Author: Akim van Eersel
Date: 2021-04-21
The main analysis aims to gain insight into two major topics.
To address these topics, three specific questions were selected to obtain some of the information sought.
After the data was collected, cleaned, and transformed, a few standard analysis steps were carried out.
Two JSON files make up the dataset: one lists the Twitter accounts of members of Congress, and the other records all the actions (tweets, retweets, etc.) of these accounts. The following entity-relationship diagram (ERD) summarizes the tables built from the features needed to answer our questions.
US states and their populations were also gathered and added as a table.
Among the members of Congress on Twitter, there are extreme cases: a few have far more followers, or have written far more tweets, than their colleagues.
From the distributions of the color-related variables, it is difficult to draw convincing conclusions. However, among the 5 color features, the profile link color seems to contain mostly shades of red and blue. At least one of the variables therefore supports our initial hypothesis.
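As an illustration of how a link color can be bucketed into shades of red or blue, the sketch below converts a hex color to HSV and labels it by hue. The function name `hue_family`, the hue and saturation thresholds, and the assumption that colors are stored as 6-digit hex strings are illustrative choices, not taken from the original analysis.

```python
import colorsys


def hue_family(hex_color):
    """Return a coarse color family ('red', 'blue', 'neutral', 'other').

    Assumes a 6-digit hex string such as '1DA1F2' (a leading '#' is allowed).
    The thresholds below are arbitrary and only meant as a starting point.
    """
    hex_color = hex_color.lstrip("#")
    # Split into red, green, blue components scaled to [0, 1]
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    # Greys, whites, and blacks carry no meaningful hue
    if s < 0.2 or v < 0.2:
        return "neutral"
    hue_deg = h * 360
    if hue_deg < 20 or hue_deg >= 340:
        return "red"
    if 190 <= hue_deg < 260:
        return "blue"
    return "other"
```

Counting the labels returned for every account would give a rough check of whether red and blue dominate a given color feature.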
To get an overview, a variable that counts the individual words of each tweet according to NLP rules was created.
Most tweets follow a similar pattern, with a number of words per tweet close to the average across all Congress members' tweets. This is expected, as Twitter limits the number of characters per tweet to a low amount, 140 in total. This necessarily limits the number of words but leaves a little flexibility in how a concise message is constructed.
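The exact tokenization rules are not given here, so the following is only a minimal sketch of a per-tweet word count. It uses a simple regular expression as a stand-in for a full NLP tokenizer; `count_words` and the choice to drop URLs before counting are assumptions made for illustration.

```python
import re

# Assumed token pattern: runs of letters, digits, and apostrophes
WORD_RE = re.compile(r"[A-Za-z0-9']+")
URL_RE = re.compile(r"https?://\S+")


def count_words(text):
    """Count word-like tokens in a tweet, ignoring URLs."""
    text = URL_RE.sub(" ", text)
    return len(WORD_RE.findall(text))
```

Applying `count_words` to every tweet and averaging the result per account gives the kind of words-per-tweet metric discussed above.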
The initial variables did not follow a normal distribution, as seen previously in their distribution plots. A logarithmic transformation was therefore used to create metrics suitable for the analysis.
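To show the effect of such a transformation, the sketch below applies a base-10 logarithm to a small made-up sample and compares skewness before and after. The base, the `+ 1` offset (to tolerate zero counts), and the sample values are assumptions, not details of the original analysis.

```python
import math


def log_transform(values):
    """Apply log10(v + 1) so that zero counts remain defined."""
    return [math.log10(v + 1) for v in values]


def skewness(values):
    """Population skewness: mean cubed deviation over std cubed."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / n
    return sum((v - m) ** 3 for v in values) / n / var ** 1.5


# Hypothetical follower counts with one extreme account
raw = [10, 100, 1000, 10000, 1000000]
logged = log_transform(raw)
print(skewness(raw), skewness(logged))
```

On this sample the skewness drops markedly after the transformation, which is the property that makes the transformed metrics easier to use in correlation and regression analyses.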
There are clear, statistically significant correlations and linear relationships between the number of followers and the number of tweets, as well as between the number of followers and the account creation date.
However, the population of the state represented by a member of Congress has no measurable effect on the number of followers, which contradicts the initial hypothesis.
scatter(congress_members)
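The statistical test used is not stated here. As one possible approach, the sketch below computes a Pearson correlation coefficient and the associated t statistic (with n - 2 degrees of freedom), which can then be compared against a critical value. The function names are hypothetical, and the inputs are assumed to be the log-transformed variables.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r != 0, with n - 2 degrees of freedom.

    Undefined when |r| == 1 (division by zero).
    """
    return r * math.sqrt((n - 2) / (1 - r * r))
```

A large absolute t value relative to the critical value for n - 2 degrees of freedom indicates a significant linear association, while a value near zero is consistent with no linear relationship.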
After removing meaningless words (i.e. stop words such as "the", "a", etc.), the motion chart below shows the 10 most written words between 2008 and 2017.
fig
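The data behind such a chart can be prepared by counting words per year after dropping stop words. The sketch below is one way to do this; the tiny stop-word set, the `(year, text)` input format, and the sample tweets are all hypothetical and stand in for the real data and a full stop-word list.

```python
import re
from collections import Counter, defaultdict

# Deliberately tiny stop-word set for illustration only
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "for", "on", "is"}


def top_words_by_year(tweets, n=10):
    """Return the n most frequent non-stop words for each year.

    `tweets` is an iterable of (year, text) pairs.
    """
    counts = defaultdict(Counter)
    for year, text in tweets:
        # Lowercase and strip URLs before tokenizing
        text = re.sub(r"https?://\S+", " ", text.lower())
        words = re.findall(r"[a-z']+", text)
        counts[year].update(w for w in words if w not in STOP_WORDS)
    return {year: [w for w, _ in c.most_common(n)] for year, c in counts.items()}


# Made-up sample tweets
sample = [
    (2013, "Repeal Obamacare now"),
    (2013, "Obamacare is a failure"),
    (2017, "RT the bill"),
]
print(top_words_by_year(sample, n=3))
```

Note that a token such as "rt" survives this filtering unless it is added to the stop-word set explicitly.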
Many of the most used words are hardly surprising, given either the lexical field or the political context. Indeed, "american", "US", "bill", "care", "health", "house", "congress" are all words that are strongly related to the nation, to the function, and to the law.
One would clearly have expected "rt" to be one of the most used words on Twitter, but this was not taken into account when the hypotheses were formulated.
The words "obamacare" and "trump" appeared in the top 10 most quoted words in 2013 and 2017 respectively. The initial hypothesis is therefore partially verified.
To gain additional information regarding the current conclusions, here are some ideas to consider: