Please read the assignment overview page carefully before proceeding. It contains information about formatting, file formats, group sizes, and many other aspects of handing in the assignment.
If you fail to follow these simple instructions, it will negatively impact your grade!
Due date and time: The assignment is due on Tuesday, April 5th at 23:55. Hand in your Jupyter notebook file (with extension .ipynb) via DTU Learn (Course Content, Assignments, Assignment 2).
Remember to include in the first cell of your notebook:
For this exercise, you need the following data:
Exercise: TF-IDF
- Tokenize the text of each submission. Create a column tokens in your dataframe containing the tokens. Remember to follow the instructions in Week 6, Exercise 3.
- Find submissions discussing at least one of the top 15 stocks you identified above (follow the instructions in Week 6, Exercise 3).
- Now, we want to find out which words are important for each stock, so we're going to create several *large documents, one for each stock*. Each document includes all the tokens related to the same stock. We will also have a document including discussions that do not relate to the top 15 stocks.
- Now, we're ready to calculate the TF for each word (a sketch of the full TF/IDF pipeline follows this list). Find the top 5 terms within 5 stocks of your choice.
- Describe similarities and differences between the stocks.
- Why are the TFs not necessarily a good description of the stocks?
- Next, we calculate IDF for every word.
- What base logarithm did you use? Is that important?
- We're ready to calculate TF-IDF. Do that for the 5 stocks of your choice.
- List the 10 top TF words for each stock.
- List the 10 top TF-IDF words for each stock.
- Are these 10 words more descriptive of the stock? If yes, what is it about IDF that makes the words more informative?
- Visualize the results in a Wordcloud and comment on your results (follow the instructions in Week 6, Exercise 4).
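For reference, here is a minimal sketch of the whole TF/IDF/TF-IDF pipeline, assuming `docs` is a dict mapping each stock ticker (plus one document for the non-top-15 discussions) to its list of tokens; all variable names are illustrative, not prescribed:

```python
import math
from collections import Counter

def tf(tokens):
    """Term frequency: counts normalized by document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: c / total for term, c in counts.items()}

def idf(docs, base=2):
    """Inverse document frequency over the collection of stock documents."""
    n_docs = len(docs)
    df = Counter()                      # in how many documents each term appears
    for tokens in docs.values():
        df.update(set(tokens))
    return {term: math.log(n_docs / d, base) for term, d in df.items()}

def tf_idf(docs, base=2):
    idf_scores = idf(docs, base)
    return {stock: {term: f * idf_scores[term] for term, f in tf(tokens).items()}
            for stock, tokens in docs.items()}

# Example: top 10 TF-IDF terms for one stock
# scores = tf_idf(docs)["GME"]
# print(sorted(scores, key=scores.get, reverse=True)[:10])
```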
Exercise: Creating Word Shifts
- Pick a day of your choice in 2020. We call it $d$. It is more interesting if you pick a day where you expect something relevant to occur (e.g. Christmas, New Year, the start of the Corona pandemic, a market crash...).
- Build two lists $l$ and $l_{ref}$ containing all tokens for submissions posted on r/wallstreetbets on day $d$, and in the 7 days preceding day $d$, respectively.
- For each token $i$, compute the relative frequency in the two lists $l$ and $l_{ref}$. We call them $p(i,l)$ and $p(i,l_{ref})$, respectively. The relative frequency is computed as the number of times a token occurs over the total length of the document. Store the result in a dictionary.
- For each token $i$, compute the difference in relative frequency $\delta p(i) = p(i,l) - p(i,l_{ref})$. Store the values in a dictionary. Print the top 10 tokens (those with the largest difference in relative frequency). Do you notice anything interesting?
- Now, for each token, compute the happiness $h(i) = labMT(i) - 5$, using the labMT dictionary. Here, we subtract $5$, so that positive tokens will have a positive value and negative tokens will have a negative value. Then, compute the product $\delta \Phi(i) = h(i)\cdot \delta p(i)$. Store the results in a dictionary.
- Print the top 10 tokens, ordered by the absolute value $|\delta \Phi(i)|$. Explain in your own words the meaning of $\delta \Phi$. If that is unclear, have a look at this page.
- Now install the `shifterator` Python package. We will use it for plotting Word Shifts.
- Use the function `shifterator.WeightedAvgShift` to plot the Word Shift, showing which words contributed the most to making your day of choice $d$ happier or sadder than the preceding 7 days. Comment on the figure.
- How do the words you printed in step 6 relate to those shown by the Word Shift?
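A sketch of the computations above, assuming `tokens_day` and `tokens_ref` hold the token lists $l$ and $l_{ref}$, and `labMT` is a dict mapping words to their average happiness scores (1 to 9); the variable names are illustrative:

```python
from collections import Counter
import shifterator as sh

def rel_freq(tokens):
    """Relative frequency: occurrences divided by total document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

p_day, p_ref = rel_freq(tokens_day), rel_freq(tokens_ref)

# delta p(i) = p(i, l) - p(i, l_ref); tokens absent from a list get frequency 0
delta_p = {t: p_day.get(t, 0) - p_ref.get(t, 0) for t in set(p_day) | set(p_ref)}

# delta Phi(i) = h(i) * delta p(i), for tokens with a labMT score
delta_phi = {t: (labMT[t] - 5) * dp for t, dp in delta_p.items() if t in labMT}
print(sorted(delta_phi, key=lambda t: abs(delta_phi[t]), reverse=True)[:10])

# Word Shift: reference system = preceding 7 days, comparison system = day d
shift = sh.WeightedAvgShift(
    type2freq_1=Counter(tokens_ref),
    type2freq_2=Counter(tokens_day),
    type2score_1=labMT,
    reference_value=5,
    handle_missing_scores="exclude",   # skip tokens without a labMT score
)
shift.get_shift_graph()
```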
Exercise: Zachary's karate club: In this exercise, we will work on Zachary's karate club graph (refer to the Introduction of Chapter 9). The dataset is available in NetworkX, by calling the function `karate_club_graph`.
Visualize the graph using netwulf. Set the color of each node based on the club split (the information is stored as a node attribute). My version of the visualization is below.
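A minimal sketch, under the assumption that netwulf colors nodes by a `group` node attribute (worth double-checking against the netwulf documentation):

```python
import networkx as nx
from netwulf import visualize

G = nx.karate_club_graph()

# Color nodes by the club split ("Mr. Hi" vs. "Officer")
for node, data in G.nodes(data=True):
    data["group"] = data["club"]

visualize(G)
```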
Write a function to compute the modularity of a graph partitioning (use equation 9.12 in the book). The function should take a networkX Graph and a partitioning as inputs and return the modularity.
Explain in your own words the concept of modularity.
Compute the modularity of the Karate club split partitioning using the function you just wrote. Note: the Karate club split partitioning is available as a node attribute, called "club".
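One possible sketch, assuming the partitioning is passed as a dict mapping each node to its community label, and reading equation 9.12 as $M = \sum_c \left[ \frac{L_c}{L} - \left( \frac{k_c}{2L} \right)^2 \right]$, with $L_c$ the number of links inside community $c$ and $k_c$ its total degree:

```python
import networkx as nx

def modularity(G, partition):
    """Modularity of a partitioning; partition maps node -> community label."""
    L = G.number_of_edges()
    M = 0.0
    for c in set(partition.values()):
        members = [n for n in G if partition[n] == c]
        L_c = G.subgraph(members).number_of_edges()   # links inside community c
        k_c = sum(d for _, d in G.degree(members))    # total degree of community c
        M += L_c / L - (k_c / (2 * L)) ** 2
    return M

# Modularity of the club split
G = nx.karate_club_graph()
club = nx.get_node_attributes(G, "club")
print(modularity(G, club))
```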
We will now perform a small randomization experiment to assess whether the modularity you just computed is statistically different from $0$. To do so, we will implement the double edge swap algorithm. The double edge swap algorithm is quite old... it was described in 1891 (!) by the Danish mathematician [Julius Petersen](https://en.wikipedia.org/wiki/Julius_Petersen). Given a network G, this algorithm creates a new network, such that each node has exactly the same degree as in the original network, but different connections. Here is how the algorithm works.
- a. Create an identical copy of your original network.
- b. Consider two edges in your new network, (u,v) and (x,y), such that u != y and x != v (so that the swap in step c. does not create self-loops).
- c. If neither edge (u,y) nor (x,v) exists already, add them to the network and remove edges (u,v) and (x,y).
- d. Repeat steps b. and c. until you have performed at least N swaps (I suggest choosing N larger than the number of edges).
Double-check that your algorithm works, by showing that each node has the same degree in the original network and in the new 'randomized' version of the network.
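A sketch of steps a. through d., followed by the degree check; two distinct random edges are drawn on each attempt:

```python
import random

def double_edge_swap(G, n_swaps):
    """Degree-preserving randomization of G via double edge swaps."""
    R = G.copy()                                            # step a: identical copy
    swaps = 0
    while swaps < n_swaps:
        (u, v), (x, y) = random.sample(list(R.edges()), 2)  # step b: two random edges
        if len({u, v, x, y}) < 4:
            continue                             # avoid self-loops and degenerate swaps
        if R.has_edge(u, y) or R.has_edge(x, v):
            continue                             # step c: new edges must not exist yet
        R.add_edge(u, y); R.add_edge(x, v)
        R.remove_edge(u, v); R.remove_edge(x, y)
        swaps += 1                               # step d: repeat until N swaps
    return R

# Degree check: every node keeps its original degree
R = double_edge_swap(G, 2 * G.number_of_edges())
assert all(G.degree(n) == R.degree(n) for n in G)
```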
Create $1000$ randomized versions of the Karate Club network using the double edge swap algorithm you wrote in step 5. For each of them, compute the modularity of the "club" split and store it in a list.
Compute the average and standard deviation of the modularity for the random networks.
Plot the distribution of the "random" modularity. Plot the actual modularity of the club split as a vertical line (use axvline).
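A sketch of the experiment, reusing `modularity` and `double_edge_swap` from the sketches above:

```python
import numpy as np
import matplotlib.pyplot as plt

n_edges = G.number_of_edges()
random_mods = [modularity(double_edge_swap(G, 2 * n_edges), club)
               for _ in range(1000)]

print(f"mean = {np.mean(random_mods):.3f}, std = {np.std(random_mods):.3f}")

plt.hist(random_mods, bins=30, label="randomized networks")
plt.axvline(modularity(G, club), color="red", label="club split")
plt.xlabel("modularity")
plt.ylabel("count")
plt.legend()
plt.show()
```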
Comment on the figure. Is the club split a good partitioning? Why do you think I asked you to perform a randomization experiment? Why did we preserve the node degrees?
Use the Python Louvain-algorithm implementation to find communities in this graph. Report the value of modularity found by the algorithm. Is it higher or lower than what you found above for the club split? What does this comparison reveal?
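A sketch using the `python-louvain` package (note: it is imported as `community`); `best_partition` returns a node-to-community dict, and the package ships its own modularity helper:

```python
import community as community_louvain  # pip install python-louvain

louvain_partition = community_louvain.best_partition(G)
print("communities found:", len(set(louvain_partition.values())))
print("modularity:", community_louvain.modularity(louvain_partition, G))
```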
Compare the communities found by the Louvain algorithm with the club split partitioning by creating a matrix $D$ with dimension $2 \times A$, where $A$ is the number of communities found by Louvain. We set entry $D(i,j)$ to be the number of nodes that club-split group $i$ has in common with Louvain community $j$. The matrix $D$ is what we call a confusion matrix. Use the confusion matrix to explain how well the communities you've detected correspond to the club split partitioning.
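A sketch of the confusion matrix, assuming `louvain_partition` from the previous sketch and the club labels in `club`:

```python
import numpy as np

clubs = sorted(set(club.values()))               # the 2 club-split groups
comms = sorted(set(louvain_partition.values()))  # the A Louvain communities
D = np.zeros((len(clubs), len(comms)), dtype=int)

for node in G:
    i = clubs.index(club[node])
    j = comms.index(louvain_partition[node])
    D[i, j] += 1        # nodes shared by club group i and Louvain community j

print(D)
```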
Exercise: Community detection on the GME network.
- Consider the GME network you built in Week 4, part 2.
- Use the Python Louvain-algorithm implementation to find communities. How many communities do you find? What are their sizes? Report the value of modularity found by the algorithm. Is the modularity significantly different from 0?
- Visualize the network using netwulf (see Week 4). This time, assign each node a color based on its community. Describe the structure you observe.
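A sketch tying the steps together, assuming the Week 4 network is loaded as `G_gme` (the name is illustrative) and that netwulf colors nodes by the `group` attribute, as above:

```python
from collections import Counter
import community as community_louvain
from netwulf import visualize

gme_partition = community_louvain.best_partition(G_gme)
print("community sizes:", Counter(gme_partition.values()).most_common())
print("modularity:", community_louvain.modularity(gme_partition, G_gme))

# Color each node by its community before visualizing
for node in G_gme:
    G_gme.nodes[node]["group"] = gme_partition[node]
visualize(G_gme)
```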