I attended the GraphConnect conference in New York, organized by Neo Technology, the creator of Neo4J, the most popular graph database (according to them and to Wikipedia as well). There were three main presentations and numerous smaller ones; below I summarize what I learned and what I found interesting.
Neo4J is by far the most popular graph database according to dbengines.com. The main difference between Neo4J and its closest competitors (closest in the way a village bar is close to McDonald’s in popularity) is that Titan and OrientDB are by design distributed graph databases with a multi-master architecture, while Neo4J has a single-master, multiple-slave replication architecture.
The first presentation was given by Emil Eifrem, who started Neo4J and is now the CEO of the company behind it. He talked about how he expects graph databases to gain ever-increasing adoption in all domains, not only in the few they first imagined. Google has its Knowledge Graph, Twitter its Interest Graph. The domains of application seem endless: at the conference I saw examples in enterprise applications, health care, hiring and career management, the entertainment industry, travel and bioinformatics.
Application in enterprise application development – Slides
The following talk was given by David Colebatch, presenting lightmesh.com, a framework for speeding up enterprise application development. It is built on a hierarchical structure where entities have sub-entities (parts). From declarative definitions of the entities, their attributes and the relationships between them, the framework generates the UI and the controller logic that ties it to the Neo4J database underneath.
Without going into further details (look for more information on their site if interested), the presentation was a slight disappointment for me, as it was more marketing than useful information on lessons learned while developing the system. It is an interesting concept for boilerplate code generation, but not very practical from a developer’s or architect’s point of view.
Application in health care
Next came Gino Pirri from Curaspan to present a medical system built on Neo4J. They manage a whole network of hospitals, doctors, nursing homes, insurance companies and patients, and they model this entire network in the graph database. They incorporated geo data as well, so they can issue really complicated queries based on provided services, location, insurance coverage and other attributes, e.g. to find the best-fitting provider for a patient.
Access control is also represented in the graph. For example, administrators of a given company can be assigned to different levels of the hierarchy so that they have access to all entities contained in the given subgraph. Exceptions can also be defined; these are represented as REVOKED relationships between the user and the entity the user must not have access to.
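The access rule described above can be sketched in a few lines of Python. This is only an illustration and not Curaspan's implementation: the hierarchy, the user names and the `ADMIN_OF` mapping are made up, plain dictionaries stand in for the graph, and it assumes that a REVOKED relationship also hides everything below the revoked entity.

```python
# Hypothetical containment hierarchy: child -> parent.
PARENT = {
    "hospital_a": "network",
    "hospital_b": "network",
    "ward_1": "hospital_a",
    "ward_2": "hospital_a",
}

# Hypothetical administrator assignments: user -> entities they administer.
ADMIN_OF = {"alice": {"hospital_a"}, "bob": {"network"}}

# Exceptions: user -> entities connected to the user by a REVOKED relationship.
REVOKED = {"alice": {"ward_2"}}


def ancestors_and_self(entity):
    """Yield the entity and then every entity above it in the hierarchy."""
    while entity is not None:
        yield entity
        entity = PARENT.get(entity)


def has_access(user, entity):
    """Walk upwards from the entity: a REVOKED node blocks access,
    an administered node grants it, whichever is reached first."""
    granted = ADMIN_OF.get(user, set())
    revoked = REVOKED.get(user, set())
    for node in ancestors_and_self(entity):
        if node in revoked:
            return False
        if node in granted:
            return True
    return False
```

With this data, `has_access("alice", "ward_1")` is true because the ward sits under a hospital that alice administers, while `has_access("alice", "ward_2")` is false because of the REVOKED exception.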
Their simplified architecture can be seen on the right. The interesting parts are:
- Neo4J server extensions to implement their bulk loading REST interface on top of Neo4J
- Neo4J server extension for their REST based search API
- they use distributed transactions spanning the Oracle and Neo4J databases, which requires at least Neo4J 2.0
- Oracle is used as the single source of truth; it contains all their data, but there is no need to optimize it for queries, as all queries are run against the graph store
One more lesson: when querying the graph, they filter first by geographic location (a 30-mile radius around the patient’s address) and only then filter on the other criteria.
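The "filter by location first" idea can be illustrated with a small Python sketch. It is not their query: the provider records, coordinates and field names are invented, and the distance is computed with the haversine formula instead of a spatial index.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Hypothetical providers; names and attributes are made up.
PROVIDERS = [
    {"name": "Midtown Rehab", "lat": 40.7549, "lon": -73.9840,
     "services": {"rehab"}, "insurers": {"acme"}},
    {"name": "Brooklyn Nursing", "lat": 40.6782, "lon": -73.9442,
     "services": {"nursing"}, "insurers": {"acme"}},
    {"name": "Boston Rehab", "lat": 42.3601, "lon": -71.0589,
     "services": {"rehab"}, "insurers": {"acme"}},
]


def find_providers(lat, lon, service, insurer, radius_miles=30):
    # Step 1: the cheap, highly selective geographic filter.
    nearby = [p for p in PROVIDERS
              if haversine_miles(lat, lon, p["lat"], p["lon"]) <= radius_miles]
    # Step 2: the remaining criteria, applied to the much smaller set.
    return [p["name"] for p in nearby
            if service in p["services"] and insurer in p["insurers"]]
```

For a patient in lower Manhattan, `find_providers(40.7128, -74.0060, "rehab", "acme")` only has to check services and insurance for the two nearby providers; the Boston one is discarded by the radius filter.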
Design and build graph database applications – Slides
Ian Robinson from Neo Technology gave a talk on guidelines for building applications with Neo4J. Following an Agile approach and starting from a user story, he showed how it can be transformed into a data model and a query that answers the user requirement.
The next question is how to test the implementation: by applying TDD, using well-understood datasets and the in-memory version of Neo4J (neo4j-kernel), and leveraging ExecutionEngine to execute Cypher queries. Neo4J-kernel is not yet production-ready but can be used for testing. If the implementation language is not a JVM language we cannot use the in-memory version, so we need to fall back to a standalone Neo4J server.
If we need complex logic on the server side, or want to improve performance in certain use cases, we can implement Neo4J server extensions. To do that we implement a JAX-RS annotated REST service that is instantiated every time a request is executed. Besides implementing complex logic, an extension can also transform the data of the incoming request so that Neo4J can consume it, or change the response generated by Neo4J.
To test our server extension we can use the CommunityServerBuilder class (I could not find apidoc, here is the link to its source on github) that will start up with our JAX-RS extension and then:
- create an HTTP client
- send request to the server
- assert the response
Application in entertainment industry
Peter Olson from Marvel Entertainment presented how they model the Marvel Universe, with its countless superheroes, their enemies and their appearances in comic books, movies and other media. The benefit of using a graph database is that they can model very complex relationships: a superhero can be embodied by different persons, appear under different names, belong to different superhero teams, and appear sometimes as the main character and sometimes as a villain; heroes die, are resurrected, die again and come back to life. All of this is mapped to media appearances. Even ordering, e.g., the comic book issues is challenging in itself, as there are multiple ways to look at them: they belong to series, and the publication date and the timeline of the story are usually two orthogonal aspects.
Using a graph they can answer questions like: which comic books do I need to read in which order if I want to know everything about Spiderman (including the Amazing Spiderman, the Superior Spiderman, the Avenging Spiderman, the Spider-men, the Avengers vs X-Men and some other series).
Their tech stack includes:
- LAMP (Linux, Apache, MySQL, PHP)
- Node.js and Go
Their model includes hyperedges, which are edges connecting more than two nodes. In Neo4J a hyperedge is represented by a node that is connected to all the nodes joined by the hyperedge in question. For example, they use an Alias to denote a superhero with a given name (Amazing Spiderman), a Person to represent a “real world” person (Peter Parker), and a Moment to represent something that happens at a given time, e.g. a hero appearing in a comic book. In this case Appearance is a hyperedge connecting the Amazing Spiderman and Peter Parker to a Moment. It is represented by an Appearance node that is connected to the Amazing Spiderman and to Peter Parker with one edge each, and to the Moment with a third edge.
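To make the hyperedge-as-node idea concrete, here is a minimal Python sketch that uses plain dictionaries and tuples in place of the graph database. The node identifiers and the relationship names (`ALIAS`, `PERSON`, `AT`) are my own placeholders, not Marvel's actual schema.

```python
# Nodes of the (hypothetical) model, keyed by an identifier.
nodes = {
    "alias:amazing_spiderman": {"label": "Alias", "name": "Amazing Spiderman"},
    "person:peter_parker": {"label": "Person", "name": "Peter Parker"},
    "moment:issue_1": {"label": "Moment", "title": "hypothetical comic issue"},
    # The hyperedge itself becomes an ordinary node ...
    "appearance:1": {"label": "Appearance"},
}

# ... that is linked to each participant by one ordinary binary edge.
edges = [
    ("appearance:1", "ALIAS", "alias:amazing_spiderman"),
    ("appearance:1", "PERSON", "person:peter_parker"),
    ("appearance:1", "AT", "moment:issue_1"),
]


def participants(appearance_id):
    """Return everything joined by one Appearance hyperedge."""
    return {rel: target for source, rel, target in edges
            if source == appearance_id}


def moments_for_alias(alias_id):
    """Traverse Alias -> Appearance -> Moment."""
    appearances = {s for s, rel, t in edges
                   if rel == "ALIAS" and t == alias_id}
    return sorted(t for s, rel, t in edges
                  if rel == "AT" and s in appearances)
```

The price of this representation is one extra hop per traversal (Alias to Appearance to Moment), but every edge stays binary, which is what a property graph supports.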
Application in bioinformatics – Slides
Jonathan Freeman from Open Software Integrators presented their approach to using Hadoop and Neo4J in bioinformatics, more precisely in genetic sequence analysis. This field has developed rapidly in recent years. The first human genome sequencing took 13 years and $2.7 billion; nowadays it takes about 1 day (approximately 1000 CPU hours) and $5000. This means it is increasingly realistic for people to have their genome analysed and get a more precise risk analysis of which diseases they are predisposed to, so prevention can be more effective and people can live longer and healthier lives. Not to mention personalized medicine, which can cure diseases better and with fewer side effects once it is known how the body will react to a treatment. Just imagine the prospect of sequencing the genome of a newborn baby and having a plan for her entire life on what to avoid and what lifestyle to follow.
The main challenge of genome sequencing is that the results from the laboratory tests are a host of short nucleotide sequences that need to be aligned with each other to build up the whole sequence. The human genome is about 100Gb in size and contains a lot of noise. However, as more and more genomes are analysed, more known patterns become available to help with the sequence alignment.
One of the many use cases is finding Single-Nucleotide Polymorphisms (SNP for short, pronounced “snip”). These are differences in the nucleotide sequence that occur among members of the same species and can be identified as indicators of a predisposition to a given disease.
Because of the large amount of data, the preprocessing is done using Hadoop; the resulting smaller dataset is then represented as a graph, and the further processing is done in Neo4J. One can try to answer questions like: find all the people in a dataset who have SNPs related to, e.g., Parkinson’s disease. Or: given a set of genomes from people with a given disease, find possible SNPs that could be the cause.
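A toy Python sketch of the first kind of question, assuming the sequences are already aligned and of equal length. The sequences, the person identifiers and the disease marker are invented, and real pipelines work on far larger data with dedicated tools; this only shows the shape of the lookup.

```python
# Toy reference and already-aligned sample sequences (made up).
REFERENCE = "ACGTACGTAC"
SAMPLES = {
    "person_1": "ACGTACGTAC",
    "person_2": "ACGAACGTAC",
    "person_3": "ACGAACGTTC",
}

# Hypothetical knowledge base: disease -> set of (position, base) markers.
DISEASE_SNPS = {"hypothetical_disease": {(3, "A")}}


def find_snps(reference, sample):
    """Return the (position, base) pairs where the sample differs
    from the reference by a single nucleotide."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return {(i, base)
            for i, (ref, base) in enumerate(zip(reference, sample))
            if ref != base}


def carriers(disease):
    """People whose SNPs include at least one marker of the disease."""
    markers = DISEASE_SNPS[disease]
    return sorted(person for person, seq in SAMPLES.items()
                  if markers & find_snps(REFERENCE, seq))
```

In a graph model the same lookup becomes a traversal from a disease node to its SNP nodes and on to the people who carry them.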
A workflow application – Slides
Tero Paananen from Jibe.com presented their recruiting platform, where job seekers can apply for positions via a mobile application. They pull their data from partner recruiting companies and present the search results to the candidates. Their tech stack includes the following:
- AngularJS, Node.js
- Spring Data Neo4J
- Neo4J (embedded)
- Redis, Memcached
- Interface to interact with third party systems implemented in Ruby
They use Neo4J to guide candidates through the sequence of questions they are required to answer to apply for a given position. The questions cover personal data (name, date of birth, contact info), professional expertise, and legal questions required by the employer. The questions and possible answers are modeled as a graph in which, for a multiple-choice question, the answer can determine the questions that follow. For example, if the question is about years of experience, the answer “3 years or less” can terminate the question flow because the candidate has insufficient experience, while the other options continue in the normal flow.
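The question flow can be pictured as a small state machine. The following Python sketch is only an illustration with invented questions and answers; a dictionary plays the role of the graph, where each question is a node and each possible answer is an outgoing edge.

```python
END = "END"            # the application is complete
REJECTED = "REJECTED"  # the flow was terminated early
ANY = "*"              # edge followed for free-text answers

# Hypothetical flow: question -> {answer: next question or final state}.
FLOW = {
    "name": {ANY: "experience"},
    "experience": {
        "3 years or less": REJECTED,
        "4 to 6 years": "work_permit",
        "7 years or more": "work_permit",
    },
    "work_permit": {"yes": END, "no": REJECTED},
}


def walk(answers, start="name"):
    """Follow the edges selected by the candidate's answers.

    Returns the list of questions asked and the final state.
    Raises KeyError if an answer has no matching edge.
    """
    current, visited = start, []
    while current not in (END, REJECTED):
        visited.append(current)
        edges = FLOW[current]
        answer = answers[current]
        current = edges[answer] if answer in edges else edges[ANY]
    return visited, current
```

A candidate answering “3 years or less” is never asked about the work permit: `walk({"name": "Ann", "experience": "3 years or less"})` stops after two questions in the `REJECTED` state.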
A career path management solution – Slides
Matthew Harris from College Miner presented their application patheer.com, where users can analyze their career path in relation to others. Their initial research question was whether graduates actually get a job related to their major and, given a certain major, which jobs a graduate can apply for with a good chance of success. Later they extended the application to answer questions like “given the positions a user has held so far and the dream position she wants to reach, which positions does she need to apply for?”. The application analyzes users’ career paths, finds the users who have a background similar to that of the user in question and who are or were in the position she wants to reach, and then returns the most frequently occurring paths that lead from the user’s current position to her dream job.
In their data model the positions of a user are represented as nodes, and two positions are connected with an edge when one directly preceded the other. These paths are stored in Neo4J. They obtain the career path data by parsing resumes. They get the raw resumes in many different formats from partner HR agencies and parse them using daxtra.com’s parser.
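A very reduced Python sketch of the "most frequent path to the dream job" idea. The careers are invented, "similar background" is simplified to "has held the user's current position", and plain lists stand in for the paths stored in the graph, so this is not how patheer.com actually works.

```python
from collections import Counter

# Hypothetical career paths, each an ordered list of positions.
CAREERS = [
    ["intern", "developer", "senior developer", "architect"],
    ["developer", "senior developer", "architect"],
    ["developer", "team lead", "architect"],
    ["developer", "senior developer", "manager"],
]


def paths_to(current, dream):
    """Count the sub-paths that lead from `current` to `dream`,
    most frequent first."""
    counts = Counter()
    for career in CAREERS:
        if current in career and dream in career:
            start, end = career.index(current), career.index(dream)
            if start < end:
                counts[tuple(career[start:end + 1])] += 1
    return counts.most_common()
```

For a developer who wants to become an architect, the path through senior developer occurs twice in this toy data and the path through team lead once, so the former would be suggested first.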
Application on a travel site – Slides
The presentation of Eddy Wong of wanderu.com was one of the most interesting for me because he talked about many lessons they learned (mostly by trial and error) while building the site. Wanderu is a travel planner for intercity bus and train trips that allows users to find routes between towns even if each leg of a trip is provided by a different bus company. This is a challenging task, as the bus companies (at least their websites) still live in the last century and provide no convenient API for obtaining their schedules. Therefore Wanderu buys schedule data from a company that specializes in crawling these websites and extracting the schedules.
When they started to use Neo4J they dumped all their data into it, which turned out to be a not-so-good solution. Their model consisted of a few hundred nodes representing cities and a few million edges representing each and every bus trip (if there was a bus between two towns every hour, that meant 24 edges). The problem with this approach is that Neo4J stores the edges of each node in a linked list, so traversing the graph involved going through long linked lists to find the connected nodes, which led to an average query time of over 1 minute.
Also the attributes are stored in a linked list which means having many attributes in a node can slow down the query in case the query depends on (e.g. filters on or returns) attributes closer to the end of the list.
Therefore they decided to use Neo4J to store only the edges that represent which cities are connected, while all the other information (scheduled times and other attributes) is stored in MongoDB. They developed a custom connector from MongoDB to Neo4J using MongoDB replication in order to keep the Neo4J database up to date with MongoDB, which serves as the single source of truth.
To serve a search request between two cities they wrote a custom join operation that combines the results from Neo4J and MongoDB. To bridge the different ID formats the two systems use (Neo4J uses mostly sequential integer IDs starting at 1, MongoDB uses a 12-byte ObjectID), they use abbreviations for city names (e.g. NYC, DC), and the connecting edges are identified by the abbreviations of the starting and ending cities (e.g. NYC-DC).
Another area where they shared their experience is geo lookups. They provide a map showing the user the trip in question, implemented using Google’s Places Autocomplete and Google Maps. They also need the geolocation of bus stops and stations, which is stored in their database: initially in Neo4J, but later moved to MongoDB for performance reasons.
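The join can be sketched in Python with two dictionaries standing in for the two stores. Everything here is hypothetical (the cities, the schedule records, the function names and the two-leg limit); the only detail taken from the talk is the shared key format built from city abbreviations, such as NYC-DC.

```python
# Stand-in for the graph store: which cities are directly connected.
CONNECTED = {
    "BOS": ["NYC"],
    "NYC": ["BOS", "DC"],
    "DC": ["NYC"],
}

# Stand-in for the document store: schedules keyed by "<FROM>-<TO>".
SCHEDULES = {
    "BOS-NYC": [{"departs": "07:00", "carrier": "Hypothetical Bus Co"}],
    "NYC-DC": [{"departs": "12:30", "carrier": "Another Bus Co"}],
}


def find_routes(origin, destination, max_legs=2):
    """Graph side: depth-limited search for city sequences."""
    routes, stack = [], [[origin]]
    while stack:
        path = stack.pop()
        if path[-1] == destination:
            routes.append(path)
            continue
        if len(path) > max_legs:  # path already has max_legs legs
            continue
        for city in CONNECTED.get(path[-1], []):
            if city not in path:  # do not revisit a city
                stack.append(path + [city])
    return routes


def join_with_schedules(route):
    """Document side: look up each leg by its shared edge identifier."""
    legs = []
    for start, end in zip(route, route[1:]):
        edge_id = f"{start}-{end}"
        legs.append({"edge": edge_id,
                     "departures": SCHEDULES.get(edge_id, [])})
    return legs
```

With this split the graph only has to hold one edge per connected city pair instead of one per bus trip, and the schedule details are fetched by key after the route is known.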
Graph Marketing – Slides
Jeremi Karnell from beeha.us talked about the emerging concept of graph marketing, where advertisers and sellers focus on the different graphs that already exist (Google’s Knowledge Graph, Facebook’s social graph, Twitter’s interest graph, LinkedIn’s professional graph) and extract important information from the relationships between the users represented in these graphs.
He talked about the influence loop where users search for a product, read reviews and decide, buy the product, then write a review on it that will attract (or repel) new users.
But graph marketing can bring privacy concerns into focus. This field is changing and evolving rapidly, much faster than governments are usually able to keep up with through legislation, and it raises tough questions. Just compare the problem of spam with the problem of exploiting knowledge of your social or professional network in advertising. Spam was seen as a critical problem in its time (and it was: without legislation and an opt-in/opt-out system, email could have died), but to me it seems a far smaller intrusion into my privacy than knowing that advertising companies have my social graph, while I can only hope they will not be malicious in using it. The NSA vs Snowden case (although it is not exactly about marketing) is a great canary in the coal mine, highlighting the problems of private data collection and the lack of effective legislation.
The ideal situation according to the speaker would be the following: I authorize the marketing agency to use my graph data to give me more relevant ads. I think this is already the case, although it is hidden in those lengthy terms and conditions you usually accept without reading. The ideal situation for me would be the same, but with ads shown only when I need them, when I am looking for something. That would of course mean far fewer advertisements shown to me and far fewer opportunities to lure me into impulse buying, which would kill the consumer society, which in turn would hurt marketing agencies. So there is a conflict of interests; luckily we have ad blockers that filter ads, in browsers at least.
Another application of social graphs in marketing is to target users who are influential within a small community: if such a user’s opinion of a brand or product can be turned positive, one can expect the other users in his or her social graph to start liking the brand or product as well.
A little graph theory – Slides
The last presentation was given by Jim Webber from Neo Technology. It was a great presentation with interesting information and great humor (this is a video from the GraphConnect in San Francisco, but I’m sure it’s worth watching).
Jim talked about two structures in graphs that can be used to infer relationships that are not explicitly present. Triadic Closure: simply put, if A likes B and A likes C, then there is a significant probability that B likes C. Structural Balance: in a nutshell, when A likes B and A hates C, the triangle is balanced if B hates C; otherwise there is tension between A and B (who like each other), because only one of them likes C while the other hates C. Another good example of structural balance is “the enemy of my enemy is my friend”. Antal, Krapivsky and Redner used Triadic Closure and Structural Balance to model how the alliances in Europe in the 19th and 20th centuries led to a balanced graph of allied and hostile states clashing in World War I.
These implied relationships, or weak relationships, are important because they can influence real-world events and can lead to outcomes that contradict the results predicted by theories that use the graph without the weak links. See this Wikipedia article for more on the topic.
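Both ideas fit into a few lines of Python. This is my own minimal sketch, not code from the talk: relationships are treated as undirected, "likes" is encoded as +1 and "hates" as -1, and a triangle counts as balanced when the product of its three signs is positive (which covers both "three mutual friends" and "two friends with a common enemy").

```python
from itertools import combinations


def suggest_links(friends):
    """Triadic closure: pairs that share a common neighbour
    but are not connected to each other yet."""
    suggestions = set()
    for node, neighbours in friends.items():
        for b, c in combinations(sorted(neighbours), 2):
            if c not in friends.get(b, set()):
                suggestions.add((b, c))
    return sorted(suggestions)


def is_balanced(signs):
    """Structural balance of one triangle given its three edge
    signs (+1 = likes, -1 = hates)."""
    first, second, third = signs
    return first * second * third > 0
```

For example, with `friends = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}}` the only suggested link is between B and C. The triangle from the text (A likes B, A hates C, B hates C) has signs `(1, -1, -1)` and is balanced, whereas `(1, 1, -1)` is the tense case where B likes C but A hates C.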
I really enjoyed the conference: lots of interesting talks, good presenters and inspiring ideas both in technology and around it. I highly recommend visiting the conference website, watching the videos and checking out the slides on Slideshare. If you are interested in graph databases, or not yet but want to know more about an emerging technology that will surely get more and more attention in the coming years, I advise you to go to the next conference or watch the online materials.
Finally, a list of ideas and thoughts I collected throughout the day and want to keep for future reference:
- Currently there is no tool for Neo4J comparable to Liquibase or Flyway that helps with upgrading/migrating databases during development and in production
- To provide transparent master election when the master Neo4J server fails in an HA setup, you need to configure your favorite load balancer to hold client requests until the new master is elected
- Distributed transactions involving Neo4J are possible only using version 2.x
- Running MapReduce on Amazon’s elastic cloud is relatively cheap: a human genome can be sequenced in 3 hours for about $85 using a 40-node, 320-core cluster rented from Amazon Web Services
- Large public datasets for playing with Hadoop on Amazon
- Open source toolkit for genome resequencing analysis: Crossbow
- Facebook’s public API to work with their social graph: Open Graph