Title: Unraveling Twitch Streamers Networks with Neo4j: Clustering and Prediction
Introduction
In today’s data-driven world, graph databases have emerged as a powerful tool for storing and analyzing relationships between interconnected entities. One such system is Neo4j, which stores relationships directly as part of the data rather than reconstructing them through joins, as a relational database would. In this post, we will explore clustering and prediction on a Twitch streamers dataset in Neo4j, covering the steps required to set up the database, run Cypher queries, and perform Louvain clustering with the Graph Data Science (GDS) plugin.
Setting Up the Database
We will work with a dataset of over 150,000 nodes and 6.5 million edges, representing Twitch streamers and their connections. To load a dataset of this size quickly, we will use Neo4j’s neo4j-admin import command-line tool instead of Cypher’s LOAD CSV clause. The import tool loads the entire network in less than a minute, whereas LOAD CSV would take more than 40 minutes to load the edges alone.
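As a rough illustration of what the bulk importer expects, the Python sketch below writes the two small header files that describe the node and relationship CSVs. The header fields (:ID, :START_ID, :END_ID, and typed properties such as views:int) follow the importer's header format. Everything else is an assumption made for illustration: the file names are hypothetical, and the node file is assumed to contain only the numeric_id and views columns mentioned in this post, in that order.

```python
import csv
import tempfile
from pathlib import Path


def node_header(id_column, typed_properties):
    """Build the header row for a node file: the ID column plus typed properties."""
    return [f"{id_column}:ID"] + [f"{name}:{kind}" for name, kind in typed_properties]


def relationship_header():
    """Build the header row for a relationship file: start and end node IDs."""
    return [":START_ID", ":END_ID"]


def write_header(path, columns):
    """Write a single-row CSV header file."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerow(columns)


# Assumed columns: numeric_id and views are the only properties named in this post.
nodes = node_header("numeric_id", [("views", "int")])
edges = relationship_header()

out_dir = Path(tempfile.mkdtemp())
write_header(out_dir / "streamers_header.csv", nodes)  # hypothetical file name
write_header(out_dir / "edges_header.csv", edges)      # hypothetical file name
print(nodes, edges)
```

Keeping headers in separate files means the large data files can be passed to the importer untouched, provided they contain no header row of their own.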
Exploring the Data with Cypher Queries
Neo4j’s query language, Cypher, is instrumental in deriving insights from large datasets with complex relationships. We will first use Cypher to run a few exploratory queries on the Twitch streamers dataset, and then use the Louvain community detection algorithm from the Graph Data Science (GDS) plugin, which identifies 19 distinct clusters in this network.
Cypher Queries and Clustering Techniques
Cypher is declarative: a query describes the pattern of nodes and relationships to find, and the database works out how to retrieve it. This keeps even complex queries readable to others, which is one reason Cypher is widely used in data science work on highly connected data.
To display all nodes in the network, the query “match (n) return n” can be executed. However, the Neo4j Browser caps the number of nodes it draws per query; the cap is adjustable in the Browser settings. With a network of this size, only a subset of the nodes, at most that cap, will be visible on screen at any given moment.
The following Cypher command returns the top 10 nodes by number of connections. The pattern follows relationship direction, so it counts outgoing connections only; writing (s)-[]-(t) instead would count connections in both directions:
match (s)-[]->(t) return s.numeric_id, count(t) as connections order by connections desc limit 10
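To make the logic of this query concrete, here is a small plain-Python sketch of the same ranking on a toy edge list. It is not how Neo4j evaluates the query; the function name top_by_out_degree and the node IDs are made up for illustration.

```python
from collections import Counter


def top_by_out_degree(edges, k=10):
    """Return the k source nodes with the most outgoing edges, as (node, count) pairs."""
    out_degree = Counter(source for source, _target in edges)
    return out_degree.most_common(k)


# Toy edge list of (source, target) pairs; the IDs are made up for illustration.
toy_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
top = top_by_out_degree(toy_edges, k=2)
print(top)  # [(1, 3), (2, 2)]
```

As in the Cypher pattern, a node with no outgoing edges (node 4 here) never appears in the result, because it never matches as a source.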
To rank by number of views instead, the Cypher command would be:
match (n) return n.numeric_id, n.views as views order by views desc limit 10
Clustering groups similar nodes according to a chosen criterion. Neo4j’s Graph Data Science (GDS) plugin provides several clustering algorithms under its “Community Detection” category. For this dataset we used Louvain, which groups nodes so that connections are dense within a cluster and sparse between clusters; it produced 19 clusters.
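Louvain works by searching for a partition with a high modularity score, a measure of how much denser the connections inside clusters are than would be expected by chance. The Python sketch below is not the GDS implementation and does not perform the search; it only computes the modularity of a given partition on a toy undirected, unweighted graph, to show what is being optimized. The function name and the toy graph are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the edge count, L_c the number of edges inside c,
    and d_c the summed degree of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)    # L_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: total degree of nodes in community c
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph: two triangles joined by a single bridge edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_cluster = {node: 0 for node in range(6)}

print(modularity(toy_edges, two_clusters))  # 5/14, about 0.357
print(modularity(toy_edges, one_cluster))   # 0.0
```

Splitting the toy graph along the bridge scores higher than lumping everything together, which is the kind of partition a modularity-based method favors.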
Before running any GDS algorithm, the network has to be projected into an in-memory graph. The following command does this:
CALL gds.graph.project.cypher('twitch', 'MATCH (n) RETURN id(n) AS id, n.views AS views', 'MATCH (n)-[]->(m) RETURN id(n) AS source, id(m) AS target') YIELD graphName, nodeCount AS nodes, relationshipCount AS rels RETURN graphName, nodes, rels
This command creates an in-memory projection named “twitch” that carries the views property on each node. The projection lives only in the GDS graph catalog and does not survive a database restart.
The Louvain clustering method can be invoked using:
call gds.louvain.write('twitch', {writeProperty:'louvain'})
This command runs Louvain on the projection and writes each node’s community ID back to the database as a node property named “louvain”. To display clusters individually, the property can be converted into a node label with the APOC plugin’s apoc.create.addLabels procedure, after which the property is removed:
match (n) call apoc.create.addLabels([id(n)], [toString(n.louvain)]) yield node with node remove node.louvain return node
Conclusion
Neo4j is a popular database management system for storing and retrieving interconnected data in graph form, using the property graph model and the Cypher query language. In this post we worked with a Twitch streamers dataset of over 150,000 nodes and 6.5 million edges, loaded with the neo4j-admin import command-line tool so that all node attributes were preserved. We used Cypher queries to display nodes and to find the top 10 nodes by number of connections and by number of views. We then used the Graph Data Science (GDS) plugin to project the network into an in-memory graph and run Louvain clustering, which produced 19 distinct clusters, and finally converted the resulting node property into node labels so that each cluster can be displayed on its own.