Community Detection in Rohingya Twittersphere using NetworkX — Part 1
Community detection is one of the key tasks in social network analysis. It seeks to identify the number of communities in a given network (Kewalramani, 2011; Lu & Halappanavar, 2014), and then to identify where connections exist between communities and between the nodes within each community. A community is a cluster of nodes or vertices grouped together by certain variables, such as shared interests or backgrounds (Kewalramani, 2011). The objective of community detection is to assign each node or vertex in the graph to a community (Cornellisen et al., 2019). It should be noted that before communities can be detected and identified, the dataset must first be converted into a graph.
To build the graph, we will be using the python package NetworkX which converts Twitter users into nodes and their interactions amongst each other into edges. In this article, we are looking at Rohingya related communities on Twitter. If you are looking into doing your own social network/community detection analysis, you would need to obtain your own Twitter data.
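To get a feel for this node/edge mapping before working with real data, here is a toy sketch (the usernames are made up) showing how a handful of interactions becomes a NetworkX graph:

```python
import networkx as nx

# Hypothetical interactions: (source user, mentioned user)
interactions = [("alice", "bob"), ("alice", "carol"), ("dave", "bob")]

g = nx.Graph()
g.add_edges_from(interactions)  # users become nodes, interactions become edges

print(g.number_of_nodes())  # 4
print(g.number_of_edges())  # 3
```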
First, all the python libraries that we will be using for the analysis and for the visualisation need to be imported.
import networkx as nx
from networkx.algorithms import community
import pandas as pd
The NetworkX version that I am using is 2.1, which can be found using the following code:
nx.__version__
If you have a newer version such as 2.4, you can install version 2.1 by running the command below in your terminal.
pip install networkx==2.1
Now, we can load the two .csv files that will represent our edges and nodes.
fields = ['source', 'to']
rh_edge = pd.read_csv(r"filepath\filenameedge.csv", usecols=fields)
rh_node = pd.read_csv(r"filepath\filenamenode.csv")
rh_node.head(15)
Prior to uploading the files, I created a unique alphanumeric code using the random module to represent each unique user for reference later in the analysis.
import random
''.join(random.choice('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ') for i in range(10))
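As a sketch of how such codes could be attached to a user table (the column names and usernames below are made up for illustration):

```python
import random

import pandas as pd

random.seed(42)  # seeded only so this example is reproducible

def random_code(length=10):
    """Return a random alphanumeric code of the given length."""
    alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    return ''.join(random.choice(alphabet) for _ in range(length))

users = pd.DataFrame({'username': ['alice', 'bob', 'carol']})
users['code'] = [random_code() for _ in range(len(users))]
print(users)
```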
There are two columns in the rh_edge dataframe which represent who communicates with whom.
rh_edge.head(15)
As one Twitter user (source) can tag multiple users (to) in one tweet, we need to split the to column on commas (,). For this, I created another column on the rh_edge dataframe and called it splitted_users.
rh_edge['splitted_users'] = rh_edge['to'].apply(lambda x: x.split(','))
rh_edge.head(15)
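If you are on pandas 0.25 or later, the same split can also be flattened into one row per (source, tagged user) pair with explode — a sketch with toy data mirroring the rh_edge layout:

```python
import pandas as pd

# Toy edge list mirroring the rh_edge columns (usernames are made up)
rh_edge = pd.DataFrame({
    'source': ['u1', 'u2'],
    'to': ['u3,u4', 'u5'],
})

rh_edge['splitted_users'] = rh_edge['to'].str.split(',')
edges = rh_edge.explode('splitted_users')  # one row per tagged user
print(edges[['source', 'splitted_users']])
```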
Let’s plot our nodes on a NetworkX graph.
graph = nx.Graph()
graph.add_nodes_from(rh_node['username'])

# Store each node's username as a node attribute for reference later
nx.set_node_attributes(graph, {n: n for n in graph.nodes()}, 'username')
Then, the edges…
for e in rh_edge.iterrows():
    for user in e[1]['splitted_users']:
        graph.add_edge(e[1]['source'], user)
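The loop above can also be expressed as a single add_edges_from call over a generator, which is the more idiomatic NetworkX pattern — a sketch with toy data:

```python
import networkx as nx
import pandas as pd

# Toy frame mirroring rh_edge after the split (usernames are made up)
toy_edge = pd.DataFrame({
    'source': ['u1', 'u1', 'u2'],
    'splitted_users': [['u3', 'u4'], ['u5'], ['u3']],
})

g = nx.Graph()
g.add_edges_from(
    (row.source, user)
    for row in toy_edge.itertuples()
    for user in row.splitted_users
)
print(g.number_of_edges())  # 4 distinct edges
```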
We can now check that our code worked by running the following:

graph.number_of_nodes()
# 74908

graph.number_of_edges()
# 130859
We have over 74,000 Twitter users and roughly 131,000 interactions. In our analysis, we want our graph to be connected. A graph is connected if there is a path between every pair of nodes in the graph.
nx.is_connected(graph)
The line above gives a boolean value. If it says False, then the graph is disconnected — which is the case for our graph. Let’s run the code below to see how many connected components there are in our graph. Then, we can extract the largest connected component and work from there.
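A toy graph makes the connectivity idea concrete: two separate pairs of nodes form a disconnected graph with two components.

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([('a', 'b'), ('c', 'd')])  # two separate pairs

print(nx.is_connected(g))                 # False: no path from a to c
print(nx.number_connected_components(g))  # 2
```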
nx.number_connected_components(graph)
Our main graph (graph) has 5,496 connected components. We only want to take the largest connected component for our analysis and drop the remaining 5,495 smaller graphs. This can be done using the following code:
largest_connected_graph = max(nx.connected_component_subgraphs(graph), key=len)
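Note that connected_component_subgraphs was removed in NetworkX 2.4, so if you stay on a newer version the equivalent extraction looks like this (toy graph for illustration):

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([('a', 'b'), ('b', 'c'), ('d', 'e')])  # two components

# NetworkX >= 2.4: connected_components yields node sets, not subgraphs
largest_cc = max(nx.connected_components(g), key=len)
largest = g.subgraph(largest_cc).copy()
print(largest.number_of_nodes())  # 3
```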
largest_connected_graph will be the graph that we use for the analysis; it has 62,787 nodes and 123,426 edges. Now that we have this graph, we can perform the community detection analysis on this dataset using the Louvain algorithm, also known as Louvain modularity. This method maximises the modularity score of each community. Modularity measures how readily the nodes of a network can be grouped together into communities (See paper here). Note that best_partition is provided by the python-louvain package (pip install python-louvain), not by networkx.algorithms.community.

import community as community_louvain  # Louvain implementation from the python-louvain package

rohingya_community = community_louvain.best_partition(largest_connected_graph)
values = [rohingya_community.get(node) for node in largest_connected_graph.nodes()]
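If you want a quick sanity check of modularity-based detection without the python-louvain dependency, NetworkX (2.2+) ships its own greedy modularity method — not Louvain, but it optimises the same score. A sketch on the small karate club benchmark graph bundled with NetworkX:

```python
import networkx as nx
from networkx.algorithms import community

# Karate club graph: a standard small benchmark network
g = nx.karate_club_graph()
comms = community.greedy_modularity_communities(g)
score = community.modularity(g, comms)

print(len(comms))        # number of detected communities
print(round(score, 2))   # modularity score of the partition
```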
I created a new dataframe called rohingya_users_community which presents each node with their respective community number.
rohingya_users_community = pd.DataFrame.from_dict(rohingya_community, orient='index', columns=['Community Number'])
rohingya_users_community.index.rename('username', inplace=True)
Let’s try to count the communities in rohingya_users_community.
empty_nodes = {}
for comm_nod, comm_num in rohingya_community.items():
    if comm_num in empty_nodes:
        # Append this node to the community's space-separated member string
        empty_nodes[comm_num] = " ".join([empty_nodes[comm_num], str(comm_nod)])
    else:
        empty_nodes[comm_num] = str(comm_nod)
Then, we can create a new graph where we can add the nodes from empty_nodes so that we can calculate the total number of communities that we have created.
rohingya_community_graph = nx.Graph()
rohingya_community_graph.add_nodes_from(empty_nodes)
len(rohingya_community_graph.nodes())
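Incidentally, the same count can be read directly off the partition dictionary, since its values are the community numbers — a minimal sketch with a toy partition:

```python
# Toy best_partition-style output: node -> community number
partition = {'u1': 0, 'u2': 0, 'u3': 1, 'u4': 2}

num_communities = len(set(partition.values()))
print(num_communities)  # 3
```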
This code gives us 216 communities. To find out which communities have the most members, we can use .value_counts() on the rohingya_users_community dataframe that we created earlier.
rohingya_users_community['Community Number'].value_counts()

0     8478
3     8429
14    6640
26    2993
23    2510
Name: Community Number, Length: 216, dtype: int64
Communities 0, 3, 14, 26 and 23 are the top five communities with the largest membership. In the next articles, I will be focusing on the analysis and the visualisation of the graphs using Gephi.