2. Cypher

Graph databases have their own query language, GQL, an ISO standard for graph databases. Cypher is Neo4j’s implementation of GQL. Cypher is a declarative language, meaning the database is responsible for finding the most optimal way of executing that query

Reading Graphs

find the person with name

MATCH (n:Person)
WHERE n.name = 'Tom Hanks'
RETURN n

Get the data from the nodes which is derived from a particular node, with a particular relationship and label

MATCH (m:Movie)<-[r:ACTED_IN]-(p:Person)
WHERE m.title = 'Toy Story'
RETURN m, r, p

To get the data in a table form(not in a graph form) return the data in objects format

MATCH (m:Movie)-[r:IN_GENRE]->(g:Genre)
WHERE m.title = 'Toy Story'
RETURN m.title, g.name

Pattern Matching

Cypher is a declarative query language that allows you to identify patterns in your data using an ASCII-art style syntax consisting of brackets, dashes and arrows
Take the Below Query as Example

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[r:ACTED_IN]-(p2:Person)
WHERE p.name = 'Tom Hanks'
RETURN p2.name AS actor, m.title AS movie, r.role AS role

In the above Query, this line acts as Pattern Matching

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[r:ACTED_IN]-(p2:Person)

MATCH - Keyword
Person - Node - The query written in () are used to indicate nodes
ACTED_IN - Relationship - The query written in [] are used to indicate nodes - I we won’t assign to a variable, if we not have a need to reference it later. For example, [:ACTED_IN] not assigned to a variable as we don’t need to use it, where [r:ACTED_IN] assigned as we need to return it
p, m, r, p2 - Variables

Creating Graphs

Creating a Node

MERGE (m:Movie {title: "Arthur the King"})
SET m.year = 2024
RETURN m

Creating Relationships

MERGE (m:Movie {title: "Arthur the King"})
MERGE (u:User {name: "Adam"})
MERGE (u)-[r:RATED {rating: 5}]->(m)
RETURN u, r, m

Search for most Recent Movies

MATCH (m:Movie)
WHERE m.released IS NOT NULL
RETURN m.title AS title, m.url AS url, m.released AS released
ORDER BY released DESC LIMIT 5

Relationships

Getting Nodes with Label Movie who has relationship as ACTED_IN and linked with node with label Person with name Tom Hanks

MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie)
RETURN m.title

There is a DIRECTED relationship between the Person nodes and the Movie nodes - it represents the people who directed the movies. Which MATCH clauses will return the names of people who directed movies

MATCH (m:Movie)<-[:DIRECTED]-(p:Person) RETURN p.name
// or
MATCH (m:Movie)<-[:DIRECTED]-(p) RETURN p.name

Filtering Queries

For filtering we use WHERE clause

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
// filter via released
WHERE m.released = 2008 OR m.released = 2009
// filter with title
WHERE m.title='The Matrix'
// filter with node and title matching
WHERE p:Person AND m:Movie AND m.title='The Matrix'
// range release filtering
WHERE 2000 <= m.released <= 2003
// select name matching where tagline is present 
WHERE p.name='Jack Nicholson' AND m.tagline IS NOT NULL
// name starts with micheal (case-sensitive)
WHERE toLower(p.name) STARTS WITH 'michael'
// name ends with john (case-sensitive)
WHERE toLower(p.name) ENDS WITH 'john'
// name contains river (case-sensitive)
WHERE toLower(p.name) CONTAINS 'River'
// return distinct elements - remove duplicates
RETURN DISTINCT p.name

People who wrote the movie but not directed it. use NOT exists

MATCH (p:Person)-[:WROTE]->(m:Movie)
WHERE NOT exists( (p)-[:DIRECTED]->(m) )
RETURN p.name, m.title

Filtering using Lists

MATCH (p:Person)
WHERE p.born IN [1965, 1970, 1975]
RETURN p.name, p.born

Comparing with existing lists

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
WHERE  'Neo' IN r.roles AND m.title='The Matrix' // here we are comparing 'Neo' in r.roles 
RETURN p.name, r.roles

Filtering with Labels

MATCH (p:Person)
WHERE  p.born.year > 1960
AND p:Actor
AND p:Director
RETURN p.name, p.born, labels(p)

Scalar List Functions

MATCH (m:Movie)
WHERE  date(m.released).year = 2000
WITH m ORDER BY date(m.released)
WITH collect(m) AS Movies
// getting starting data of the list for a gist
WITH head(Movies) as First
RETURN First.title AS FirstTitle, First.released AS FirstDate
// getting ending data of the list for a gist
WITH last(Movies) as Last 
RETURN Last.title as LastTitle, Last.released AS LastDate

Using Reduce function - used to perform calculations in the code - If initial value is an empty list. Each element in the list is added to the initial value

WITH [[1,2,3], [4,5,6], [7,8,9]] AS StartingList
RETURN reduce(Calc = [], r IN StartingList | Calc + r) AS FlattenedList

Read the whole file and convert it into a list using split() function clean it

LOAD CSV WITH HEADERS FROM 'https://data.neo4j.com/importing/MovieDataUnclean.csv'
AS row
WITH
row.title AS Title,
row.languages AS Field,
split(row.languages, "|") AS FieldList
RETURN Title, Field, FieldList LIMIT 10

Type Conversions

WITH [
    "abc", 
    false, 
    1, 
    1.5, 
    null, 
    datetime(), 
    date(), 
    LocalDateTime(), 
    point({x: 2.3, y: 4.5})
] AS MyList
 
RETURN 
    MyList, 
    toStringList(MyList) AS StringList, 
    toIntegerList(MyList) AS IntegerList, 
    toFloatList(MyList) AS FloatList, 
    toBooleanList(MyList) AS BooleanList

Test Lists with predicate functions

MATCH p = (a:Actor)-[:ACTED_IN*2]-(a2:Actor)
WHERE a.name = 'Meg Ryan'
 
RETURN 
    all(x IN nodes(p) WHERE x.year > 1900 OR x:Actor) AS AllValid,  // Checks if all nodes are valid
    any(x IN nodes(p) WHERE x.year = 2000) AS AnyFrom2000,         // Checks if any movie is from 2000
    isEmpty(collect(m.title)) AS MoviesEmpty,                      // Checks if the list of movie titles is empty
    none(x IN nodes(p) WHERE x.name = 'Tom Hanks' AND x:Actor) AS NoTomHanks, // Checks if Tom Hanks is not in the path
    single(x IN nodes(p) WHERE x.born.year = a.born.year AND x:Actor) AS OnlyOneSameBirthYear // Checks if only one actor has the same birth year as Tom Hanks

Percentiles and Percentages - Calculate the percentage of movies released in 2000 with a rating > 4

MATCH (m:Movie) WHERE m.year = 2000
WITH count(*) AS TotalMovies
MATCH ()-[r:RATED]-(m) WHERE r.rating > 4 AND m.year = 2000
WITH count(DISTINCT m) AS GoodMovies, TotalMovies
RETURN 
    TotalMovies, 
    GoodMovies, 
    round(toFloat(GoodMovies) / toFloat(TotalMovies) * 100) AS PercentGood, // Percentage calculation
 
// Determine the 50th percentile of a list of numbers
UNWIND [80, 10, 20, 30, 40, 50, 60, 70] AS x
RETURN percentileCont(x, 0.50) AS FiftyPercentile // 50th percentile calculation

Understanding Properties

Get the keys of the properties that a graph dB has

CALL db.propertyKeys()

Get properties of a particular node

MATCH (p:Person)
RETURN p.name, keys(p)

Creating Nodes

We can use both MERGE and CREATE to create nodes
MERGE - Find, if not present then Create
CREATE - Just Create, not care about Duplicates

MERGE (p:Person {name: 'Katie Holmes'})
MERGE (m:Movie {title: 'The Dark Knight'})
RETURN p, m

Creating Relationships

We use MERGE to create relationships

Creating relationship for existing nodes

MATCH (p:Person {name: 'Michael Caine'})
MATCH (m:Movie {title: 'The Dark Knight'})
MERGE (p)-[:ACTED_IN]->(m)

Creating both nodes and relationship in one go

MERGE (p:Person {name: 'Chadwick Boseman'})
MERGE (m:Movie {title: 'Black Panther'})
MERGE (p)-[:ACTED_IN]-(m)
 
// or
 
MERGE (p:Person {name: 'Emily Blunt'})-[:ACTED_IN]->(m:Movie {title: 'A Quiet Place'})
RETURN p, m

Updating Properties

We use SET to add properties and REMOVE to remove them

Add a property and remove the Value of another property

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
WHERE p.name = 'Michael Caine' AND m.title = 'The Dark Knight'
SET r.roles = ['Alfred Penny'], m.released = null   // Adds role, remove value of released
RETURN p, r, m

Remove or Delete a property along with key

MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
WHERE p.name = 'Michael Caine' AND m.title = 'The Dark Knight'
REMOVE r.roles // deleting roles key
RETURN p, r, m

Note: We should never remove the property that is used as the primary key for a node 3. Matching with relationship types

MATCH (p:Person)-[r]->(m:Movie)
WHERE  p.name = 'Tom Hanks'
RETURN m.title AS movie, type(r) AS relationshipType // make sure role is of relationshipType

Getting data from properties. It gets both name and born properties from the person node

MATCH (p:Person)
WHERE p.name CONTAINS "Thomas"
RETURN p { .name, .born } AS person
ORDER BY p.name

Merge Processing

We use ON CREATE SET and ON MATCH SET to create or update fields on creation either using CREATE or MERGE

// Find or create a person with this name
MERGE (p:Person {name: 'McKenna Grace'})
// Only set the `createdAt` property if the node is created during this query
ON CREATE SET p.createdAt = datetime()
// Only set the `updatedAt` property if the node was created previously
ON MATCH SET p.updatedAt = datetime()
// Set the `born` and `gender` properties regardless
SET p.born = 2006, p.gender = 'Male'
RETURN p

Deleting Data

For Deleting nodes we use DELETE, for removing relationship and deleting nodes in one go we use DETACH DELETE. For adding label we use SET and to remove label we use REMOVE

Deleting Node

MATCH (p:Person {name: 'Jane Doe'})
DELETE p // deletes nodes
// or
DETACH DELETE p // deletes node with relationship

For Adding and removing labels

MATCH (p:Person {name: 'Jane Doe'})
// adding label
SET p:Developer
RETURN p
// removing label
REMOVE p:Developer
RETURN p
// checking labels
CALL db.labels()

Counting & Ordering

We use count after filtering

MATCH (p:Person)-[:ACTED_IN]-(m:Movie)
WHERE p.name = 'Tom Hanks'
RETURN m.title AS Movie

We use ORDER BY for index, DESC/ASC for order and LIMIT for limiting

MATCH (p:Person)-[:ACTED_IN]-(m:Movie)
WHERE m.title = 'Hoffa'
RETURN  p.name AS Actor, p.born as `Year Born` ORDER BY p.born DESC LIMIT 1

Inequality

Names of all actors that acted in the movie Captain Phillips where Tom Hanks is excluded

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name <> 'Tom Hanks'
AND m.title = 'Captain Phillips'
RETURN p.name

Profiling Queries

We use profile keyword for profiling the query, example goes like this

PROFILE MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE  p.name = 'Tom Hanks'
AND exists {(p)-[:DIRECTED]->(m)}
RETURN m.title

In this we are getting the movies and then running the matching of the relationship onto it. Rather we can do something like this

PROFILE MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(p)
WHERE  p.name = 'Tom Hanks'
RETURN  m.title

Note: when profiling the dB hits will not change but the difference will be the elapsed ms times. If the elapsed ms times not change, then try to create indexes which decrease the dB hits

Conditional Returning

We want to return information about actors who acted in the Toy Story movies. We want to return the age that an actor will turn this year or that the actor died

MATCH (m:Movie)<-[:ACTED_IN]-(p:Person) 
WHERE m.title CONTAINS 'Toy Story' 
RETURN m.title AS movie,
p.name AS actor,
p.born AS dob,
 
CASE 
WHEN p.died IS NULL THEN date().year - p.born.year 
WHEN p.died IS NOT NULL THEN "Died"  
AS ageThisYear

If the cinema runtime is less then 120 mins return “Short”, If more than that return “Long”

MATCH (m:Movie)<-[:ACTED_IN]-(p:Person)
WHERE p.name = 'Charlie Chaplin'
RETURN m.title AS Movie,
CASE
WHEN m.runtime < 120 THEN "Short"
WHEN m.runtime >= 120 THEN "Long"
ELSE "Unknown"
END AS Runtime

Aggregating Data

During a query, the data retrieved is collected or counted. The aggregation of values or counts are either returned from the query or are used as input for the next part of a multi-step query. If you view the execution plan for the query, you will see that rows are returned for a step in the query. The rows that are returned in a query step are an aggregation of property values, nodes, relationships, or paths in the graph

This query will return the movies that have one or two actors

MATCH (m)<-[:ACTED_IN]-(a:Person)
WITH  m, collect(a.name) AS Actors
WHERE size(Actors) <= 2
RETURN m.title AS Movie, m.year as Year, Actors ORDER BY m.year

This query returns a list of movie titles associated with each actor. The rows are presented such that actors with the greatest number of titles are returned first.

MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN a.name AS actor,
count(*) AS total,
collect(m.title) AS movies
ORDER BY total DESC LIMIT 10

Eliminating Duplicates in list

MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.year = 1920
RETURN  collect( DISTINCT m.title) AS movies,
collect(a.name) AS actors

Slicing the data

MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN m.title AS movie,
collect(a.name)[2..] AS castMember,
size(collect(a.name)) as castSize

Pattern comprehension is a very powerful way to create lists without changing the cardinality of the query. It behaves like an OPTIONAL MATCH combined with collect(). Notice that for pattern comprehension we specify the list with the square braces to include the pattern followed by the pipe character to then specify what value will be placed in the list from the pattern - [<pattern> | value]

MATCH (m:Movie)
WHERE m.year = 2015
RETURN m.title,
[(dir:Person)-[:DIRECTED]->(m) | dir.name] AS directors,
[(actor:Person)-[:ACTED_IN]->(m) | actor.name] AS actors

Using List Comprehensions to Aggregate

MATCH (m:Movie)-[:ACTED_IN]-(a:Actor)
WHERE a.name = 'Tom Hanks'
WITH m ORDER by m.released DESC
WITH collect(m) AS Movies
WITH [x IN Movies | x.title + ": " + toString(date().year - date(x.released).year + 1)] AS Summary
RETURN Summary

Date & Time

Setting date and time as properties

MERGE (x:Test {id: 1})
SET x.date = date(),
    x.datetime = datetime(),
    x.time = time()
RETURN x

Setting date time values

MATCH (x:Test {id: 1})
SET x.datetime1 = datetime('2022-01-04'),         // only date
    x.datetime2 = datetime('2022-04-09T18:33:05') // both date and time
RETURN x

working with durations

MATCH (x:Test {id: 1})
// difference between
RETURN duration.between(x.date1, x.date2)
// difference in days
RETURN duration.inDays(x.datetime1, x.datetime2).days
// adding durations
RETURN x.date1 + duration({months: 6})
// formatting dates
RETURN x.datetime as Datetime, apoc.temporal.format( x.datetime, 'HH:mm:ss.SSSS') AS formattedDateTime

Varying Length Traversal

This refers to the ability to explore paths in a graph that can have different lengths. For example, in a social network, you might want to find out how many connections (hops) away someone is from another person. Determining Proximity - we can assess how “close” nodes are to each other based on the number of hops (relationships) between them.

Shortest Path Query - This query finds the shortest path between the nodes representing Eminem and Charlton Heston, regardless of the type of relationship

MATCH p = shortestPath((p1:Person)-[*]-(p2:Person))
WHERE p1.name = "Eminem" AND p2.name = "Charlton Heston"
RETURN p

Specific Relationship types - This query finds the shortest path but only considers the ACTED_IN relationship

MATCH p = shortestPath((p1:Person)-[:ACTED_IN*]-(p2:Person))
WHERE p1.name = "Eminem" AND p2.name = "Charlton Heston"
RETURN p

Exactly two hops away - This retrieves all Person nodes that are exactly two hops away from Eminem using the ACTED_IN relationship

MATCH (p:Person {name: 'Eminem'})-[:ACTED_IN*2]-(others:Person)
RETURN others.name

Up to four Hops away - This retrieves all Person nodes that are up to four hops away from Eminem, allowing for a range of connections

MATCH (p:Person {name: 'Eminem'})-[:ACTED_IN*1..4]-(others:Person)
RETURN others.name

Scoping Variables

In Cypher, you can define variables for nodes and relationships in your queries. These variables can be used to filter results or return specific data. The scope of a variable refers to the part of the query where that variable is accessible. The WITH clause is used to pass variables from one part of the query to another. It can also be used to define new variables or expressions

Basic variable definition

MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
RETURN m.title AS movies

Initializing with WITH

WITH 'Tom Hanks' AS actorName
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = actorName
RETURN m.title AS movies

Redefining scope - In this query, after the first MATCH, a new WITH clause is used to redefine the scope. The variable m (the movie node) is passed along, and movieTitle is defined as a transformation of m.title

WITH 'toy story' AS mt, 'Tom Hanks' AS actorName
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH m, toLower(m.title) AS movieTitle
WHERE p.name = actorName AND movieTitle CONTAINS mt
RETURN m.title AS movies, movieTitle

Using Map projections

MATCH (n:Movie)
WHERE n.imdbRating IS NOT NULL AND n.poster IS NOT NULL
WITH n {
  .title,
  .year,
  .languages,
  .plot,
  .poster,
  .imdbRating,
  directors: [(n)<-[:DIRECTED]-(d) | d { tmdbId: d.imdbId, .name }]
}
ORDER BY n.imdbRating DESC LIMIT 4
RETURN collect(n)

Pipelining Queries

Pipelining refers to the process of passing intermediate results from one part of a query to another using the WITH clause. This allows you to build complex queries in a modular way, where the output of one part can be used as input for the next

Basic Pipelining - the first WITH clause limits the results to five movie nodes (m). The second MATCH clause then uses these five nodes to find the directors of those movies. This demonstrates how results can be pipelined from one part of the query to another

WITH 'Tom Hanks' AS theActor
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name = theActor
WITH m LIMIT 5
MATCH (d:Person)-[:DIRECTED]->(m)
RETURN d.name AS director, m.title AS movies

Using WITH for aggregation - The query counts the number of movies in each genre and sums their IMDb ratings. The results are aggregated and passed on to the final RETURN statement, demonstrating how intermediate results can be used to create final outputs

MATCH (:Movie {title: 'Toy Story'})-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(m)
WHERE m.imdbRating IS NOT NULL
WITH g.name AS genre,
count(m) AS moviesInCommon,
sum(m.imdbRating) AS total
RETURN genre, moviesInCommon, total/moviesInCommon AS score
ORDER BY score DESC

Collecting Results - This query collects the names of actors for movies containing “New York” and counts the number of actors. The results are then returned, showcasing how WITH can be used to aggregate data into lists

MATCH (m:Movie)--(a:Actor)
WHERE m.title CONTAINS 'New York'
WITH m, collect(a.name) AS actors, count(*) AS numActors
RETURN m.title AS movieTitle, actors
ORDER BY numActors DESC

Unwinding Lists

The UNWIND clause is used to transform a list into a set of rows. Each element of the list becomes a separate row in the result set. This is particularly useful when you want to work with individual elements of a list in subsequent parts of your query

converting language lists into a list

MATCH (m:Movie)-[:ACTED_IN]-(a:Actor)
WHERE a.name = 'Tom Hanks'
UNWIND m.languages AS lang
RETURN m.title AS movie,
       m.languages AS languages,
       lang AS language

clean the data in the list and return it as row

MATCH (m:Movie)
UNWIND m.languages AS lang
WITH m, trim(lang) AS language
WITH language, collect(m.title) AS movies
RETURN language, movies[0..10]

Subqueries

A subquery is a set of Cypher statements that execute within their own scope. It allows you to perform operations that are separate from the main query, helping to limit the number of rows processed and manage memory usage. Subqueries are enclosed in curly braces {} and can be called from an outer query

Basic Query - The CALL query works as an separate query and it gives filtered results to next code and get processed later

CALL {
   MATCH (m:Movie) WHERE m.year = 2000
   RETURN m ORDER BY m.imdbRating DESC LIMIT 10
}
MATCH (:User)-[r:RATED]->(m)
RETURN m.title, avg(r.rating)

Passing values in subquery

MATCH (m:Movie)
CALL {
    WITH m
    MATCH (m)<-[r:RATED]-(u:User)
    WHERE r.rating = 5
    RETURN count(u) AS numReviews
}
RETURN m.title, numReviews
ORDER BY numReviews DESC

Combining Sub Queries with UNION - combining the result of both the queries

MATCH (m:Movie) WHERE m.year = 2000
RETURN {type:"movies", theMovies: collect(m.title)} AS data
UNION ALL
MATCH (a:Actor) WHERE a.born.year > 2000
RETURN { type:"actors", theActors: collect(DISTINCT a.name)} AS data

Combining result in subqueries

MATCH (p:Person)
WITH p LIMIT 100
CALL {
  WITH p
  OPTIONAL MATCH (p)-[:ACTED_IN]->(m:Movie)
  RETURN m.title + ": " + "Actor" AS work
UNION
  WITH p
  OPTIONAL MATCH (p)-[:DIRECTED]->(m:Movie)
  RETURN m.title + ": " + "Director" AS work
}
RETURN p.name, collect(work)

Transactions
- Single Transaction: Cypher queries run within a single transaction. If a failure occurs, all changes are rolled back, leaving the graph unchanged
- Performance Issues: Importing large volumes of data in one transaction can lead to performance problems and potential failures
- Multiple Transactions: You can split a query into multiple transactions using the CALL clause with IN TRANSACTIONS
- Auto-committing Transactions: The :auto command allows the query to run in auto-committing transactions, which is generally not recommended but necessary for this example

:auto
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/importing-cypher/persons.csv' AS row
CALL {
    MERGE (p:Person {tmdbId: toInteger(row.person_tmdbId)})
    SET
    p.imdbId = toInteger(row.person_imdbId),
    p.bornIn = row.bornIn,
    p.name = row.name,
    p.bio = row.bio,
    p.poster = row.poster,
    p.url = row.url,
    p.born = date(row.born),
    p.died = date(row.died)
} IN TRANSACTIONS OF 100 ROWS

Sadiq's Knowledge Vaults

Explorer