[GIE Compiler] Fix query params of getV base in path expand operator (#2773)
What do these changes do?
In the Cypher statement
(a:person)-[b:knows]-(c:person)
that represents a path expansion, the logical plan includes two getV operators. One is the getV base inside the path_expand b; the other is the endV operator following b. It’s important to note that these operators have different query parameters.
The getV base can retrieve vertices of any type that is adjacent to the type specified in the path_expand operation. This PR specifically addresses this requirement. The endV operator is constrained by the type specified in the given query.
Related issue number
Fixes
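The difference between the two operators described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the compiler's implementation: the toy schema, the bot label, and both helper functions are hypothetical; only person and knows come from the statement above.

```python
# Toy schema as (source label, edge label, destination label) triples.
# Hypothetical: only `person` and `knows` appear in the original statement.
SCHEMA = {
    ("person", "knows", "person"),
    ("person", "knows", "bot"),
    ("person", "created", "software"),
}


def inner_getv_labels(edge_label):
    """Labels the getV base inside a path expansion may return:
    every vertex type adjacent to the expanded edge type."""
    labels = set()
    for src, edge, dst in SCHEMA:
        if edge == edge_label:
            labels.update((src, dst))
    return labels


def end_getv_labels(query_label):
    """Labels the endV operator may return: only the type named in the query."""
    return {query_label}


print(sorted(inner_getv_labels("knows")))  # intermediate vertices on the path
print(sorted(end_getv_labels("person")))   # the final vertex `c`
```

In this toy schema the inner getV may see both person and bot vertices, while the end vertex stays restricted to person.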
Co-authored-by: Longbin Lai longbin.lailb@alibaba-inc.com
GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba (graph analytics, graph query, graph machine learning)
GraphScope is a unified distributed graph computing platform that provides a one-stop environment for performing diverse graph operations on a cluster of computers through a user-friendly Python interface. GraphScope makes multi-staged processing of large-scale graph data on compute clusters simple by combining several important pieces of Alibaba technology: GRAPE, MaxGraph, and Graph-Learn (GL) for analytics, interactive, and graph neural network (GNN) computation, respectively, and the Vineyard store, which offers efficient in-memory data transfers.
Visit our website at graphscope.io to learn more.
Getting Started
We provide a Playground with a managed JupyterLab. Try GraphScope straight away in your browser!
GraphScope supports running in standalone mode or on clusters managed by Kubernetes within containers. To get started quickly, let’s begin with the standalone mode.
Installation for Standalone Mode
The GraphScope pre-compiled package is distributed as a Python package and can be easily installed with pip.
Note that graphscope requires Python >= 3.8 and pip >= 19.3. The package is built for and tested on the most popular Linux (Ubuntu 20.04+ / CentOS 7+) and macOS 11+ (Intel) / macOS 12+ (Apple silicon) distributions. For Windows users, you may want to install Ubuntu on WSL2 to use this package.
Next, we will walk you through a concrete example to illustrate how GraphScope can be used by data scientists to effectively analyze large graphs.
Demo: Node Classification on Citation Network
ogbn-mag is a heterogeneous network composed of a subset of the Microsoft Academic Graph. It contains four types of entities (i.e., papers, authors, institutions, and fields of study), as well as four types of directed relations connecting two entities.
Given the heterogeneous ogbn-mag data, the task is to predict the class of each paper. Node classification can identify papers in multiple venues, which represent different groups of scientific work on different topics. We apply both the attribute and structural information to classify papers. In the graph, each paper node contains a 128-dimensional word2vec vector representing its content, which is obtained by averaging the embeddings of words in its title and abstract. The embeddings of individual words are pre-trained. The structural information is computed on-the-fly.
Loading a graph
GraphScope models graph data as a property graph, in which the edges and vertices are labeled and have many properties. Taking ogbn-mag as an example, the figure below shows the model of the property graph.
This graph has four kinds of vertices, labeled paper, author, institution, and field_of_study. There are four kinds of edges connecting them; each kind of edge has a label and specifies the vertex labels for its two ends. For example, cites edges connect two vertices labeled paper. Another example is writes, which requires that the source vertex be labeled author and the destination be a paper vertex. All vertices and edges may have properties; for example, paper vertices have properties such as features, publish year, and subject label.
To load this graph into GraphScope with our retrieval module, use the following code:
We provide a set of functions to load graph datasets from ogb and snap for convenience. Please find all the available graphs here. If you want to use your own graph data, please refer to this doc to load vertices and edges by labels.
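The labeled property graph model described in this section can be sketched in plain Python. This is a minimal illustration, not GraphScope's data structures or loading API: the Vertex and Edge classes, the EDGE_SCHEMA table, and the sample data are all hypothetical, and only the cites and writes constraints are taken from the text.

```python
from dataclasses import dataclass, field

# Edge label -> (required source label, required destination label).
# Only the two edge kinds spelled out in the text are listed here.
EDGE_SCHEMA = {
    "cites": ("paper", "paper"),
    "writes": ("author", "paper"),
}


@dataclass
class Vertex:
    vid: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    src: Vertex
    dst: Vertex
    properties: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject edges whose endpoint labels violate the schema for this label.
        expected = EDGE_SCHEMA[self.label]
        actual = (self.src.label, self.dst.label)
        if actual != expected:
            raise ValueError(f"{self.label} edge needs {expected}, got {actual}")


author = Vertex(1, "author")
paper_a = Vertex(10, "paper", {"year": 2015})
paper_b = Vertex(11, "paper", {"year": 2018})

writes = Edge("writes", author, paper_a)  # author -> paper: accepted
cites = Edge("cites", paper_b, paper_a)   # paper -> paper: accepted
```

Constructing Edge("cites", author, paper_a) would raise ValueError here, because a cites edge must connect two paper vertices.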
Interactive query
Interactive queries allow users to directly explore, examine, and present graph data in an exploratory manner in order to locate specific or in-depth information in time. GraphScope adopts a high-level language called Gremlin for graph traversal, and provides efficient execution at scale.
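To make the idea of a traversal concrete, here is a plain-Python sketch of one: start at an author, follow writes edges to papers, and keep the papers that a second author also wrote. It is not Gremlin and does not use any GraphScope API; the edge list and helper names are made up for illustration.

```python
# Toy `writes` edges as (author_id, paper_id) pairs; all data is made up.
WRITES = [
    (2, 100), (2, 101), (2, 102),
    (4307, 101), (4307, 102), (4307, 103),
    (7, 100),
]


def papers_written_by(author_id):
    """Follow outgoing `writes` edges from one author vertex."""
    return {paper for author, paper in WRITES if author == author_id}


def count_coauthored(author_a, author_b):
    """Count papers reachable from both authors over `writes` edges."""
    return len(papers_written_by(author_a) & papers_written_by(author_b))


print(count_coauthored(2, 4307))  # prints 2
```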
In this example, we use graph traversal to count the number of papers two given authors have co-authored. To simplify the query, we assume the authors can be uniquely identified by IDs 2 and 4307, respectively.
Graph analytics
Graph analytics is widely used in the real world. Many algorithms, such as community detection, paths and connectivity, and centrality, have proven very useful in various businesses. GraphScope ships with a set of built-in algorithms that enable users to easily analyze their graph data.
Continuing our example, below we first derive a subgraph by extracting publications within a specific time range from the entire graph (using Gremlin!), and then run k-core decomposition and triangle counting to generate the structural features of each paper node.
Please note that many algorithms may only work on homogeneous graphs; therefore, to evaluate these algorithms over a property graph, we first need to project it into a simple graph.
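The projection step and the two algorithms named above can be sketched in plain Python. This is a small illustration of the techniques on a toy graph, not GraphScope's built-in implementations; the function names and edge data are hypothetical, and the projected graph is treated as undirected.

```python
from collections import defaultdict

# Labeled edges of a toy property graph: (source, edge label, destination).
EDGES = [
    ("p1", "cites", "p2"), ("p2", "cites", "p3"), ("p1", "cites", "p3"),
    ("p3", "cites", "p4"),
    ("a1", "writes", "p1"),
]


def project(edges, edge_label):
    """Keep a single edge label and return an undirected adjacency map."""
    adj = defaultdict(set)
    for src, label, dst in edges:
        if label == edge_label and src != dst:
            adj[src].add(dst)
            adj[dst].add(src)
    return dict(adj)


def k_core(adj, k):
    """Vertices of the k-core: repeatedly peel vertices with degree < k."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if len(adj[v] & alive) < k:
                alive.remove(v)
                changed = True
    return alive


def triangle_counts(adj):
    """Number of triangles each vertex participates in."""
    counts = {}
    for v, nbrs in adj.items():
        # A triangle at v is a pair of adjacent neighbours;
        # the double loop sees every such pair twice.
        links = sum(1 for u in nbrs for w in nbrs if u != w and w in adj[u])
        counts[v] = links // 2
    return counts


simple = project(EDGES, "cites")
print(sorted(k_core(simple, 2)))  # prints ['p1', 'p2', 'p3']
print(triangle_counts(simple))
```

The writes edge and the author vertex disappear in the projection, which is why the algorithms only ever see paper vertices.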
In addition, users can write their own algorithms in GraphScope. Currently, GraphScope supports writing custom algorithms in the Pregel model and the PIE model.
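For readers unfamiliar with the Pregel model, the following plain-Python sketch shows its vertex-centric, superstep-based style using single-source shortest paths. It does not use GraphScope's Pregel interface; the function name, message format, and toy graph are assumptions made for illustration, and the PIE model is not covered.

```python
def pregel_sssp(edges, source):
    """Single-source shortest paths in a vertex-centric, superstep style.

    `edges` maps every vertex to a list of (neighbour, weight) pairs.
    """
    dist = {v: float("inf") for v in edges}
    inbox = {source: [0]}          # messages to deliver in the next superstep

    while inbox:                   # halt once no messages are in flight
        outbox = {}
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:     # vertex program: keep the smallest distance
                dist[v] = best
                for nbr, weight in edges[v]:
                    outbox.setdefault(nbr, []).append(best + weight)
        inbox = outbox             # synchronisation barrier between supersteps
    return dist


GRAPH = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [("d", 1)],
    "d": [],
}
print(pregel_sssp(GRAPH, "a"))
```

Each loop iteration is one superstep: vertices only react to messages received in the previous step, and a vertex that does not improve its distance sends nothing, so the computation stops on its own.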
Graph neural networks (GNNs)
Graph neural networks (GNNs) combine the advantages of both graph analytics and machine learning. GNN algorithms can compress both structural and attribute information in a graph into low-dimensional embedding vectors on each node. These embeddings can be further fed into downstream machine learning tasks.
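As a rough illustration of how structure and attributes are combined into an embedding, here is one simplified graph-convolution step in plain Python. It is a sketch only: it uses mean aggregation with self-loops rather than the symmetric normalization of the standard GCN formulation, it is unrelated to the learning engine's API, and the graph, features, and weights are made up.

```python
def gcn_layer(adj, features, weight):
    """One simplified graph-convolution step.

    Each node averages its own feature vector with its neighbours'
    (mean aggregation with self-loops), applies a linear map, then ReLU.
    """
    out = {}
    for node, nbrs in adj.items():
        group = [node, *nbrs]
        dim = len(features[node])
        mean = [sum(features[n][i] for n in group) / len(group) for i in range(dim)]
        projected = [
            sum(mean[i] * weight[i][j] for i in range(dim))
            for j in range(len(weight[0]))
        ]
        out[node] = [max(0.0, x) for x in projected]
    return out


# Toy undirected graph, 2-dimensional input features, 2x2 weight matrix.
ADJ = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2"]}
FEATURES = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [1.0, 1.0]}
WEIGHT = [[1.0, -1.0], [0.5, 1.0]]

embeddings = gcn_layer(ADJ, FEATURES, WEIGHT)
print(embeddings["p1"], embeddings["p3"])
```

In a trained model the weight matrix is learned and several such layers are stacked; here it is fixed by hand to keep the example small.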
In our example, we train a GCN model to classify the nodes (papers) into 349 categories, each of which represents a venue (e.g., pre-print or conference). To achieve this, we first launch a learning engine and build a graph with the features from the previous step.
Then we define the training process, and run it.
A Python script with the entire process is available here; you may try it out yourself.
Processing Large Graph on Kubernetes Cluster
GraphScope is designed for processing large graphs, which are usually hard to fit in the memory of a single machine. With Vineyard as the distributed in-memory data manager, GraphScope supports running on a cluster managed by Kubernetes (k8s).
To continue this tutorial, please ensure that you have a k8s-managed cluster and know the credentials for the cluster (e.g., the address of the k8s API server, usually stored in a ~/.kube/config file).
Alternatively, you can set up a local k8s cluster for testing with Kind. We provide a script to set up this environment.
If you did not install the graphscope package in the above step, you can install a subset of the whole package with client functions only.
Next, let’s revisit the example by running on a cluster instead.
The figure shows the flow of execution in the cluster mode. When users run code in the python client, it will:
(Note that graphscope.gremlin and graphscope.graphlearn need to be changed to sess.gremlin and sess.graphlearn, respectively. sess is the name of the Session instance the user created.)
Creating a session
To use GraphScope in a distributed setting, we need to establish a session in a Python interpreter.
For convenience, we provide several demo datasets and an option, with_dataset, to mount the datasets in the GraphScope cluster. The datasets will be mounted to /dataset in the pods. If you want to use your own data on a k8s cluster, please refer to this.
For macOS, the session needs to be established with the LoadBalancer service type (the default is NodePort).
A session tries to launch a coordinator, which is the entry point for the back-end engines. The coordinator manages a cluster of resources (k8s pods), and the interactive, analytical, and learning engines run on them. For each pod in the cluster, there is a vineyard instance serving distributed in-memory data.
Loading a graph and processing computation tasks
Similar to the standalone mode, we can still use the functions to load a graph easily.
Here, the graph g is loaded in parallel via vineyard and stored in vineyard instances in the cluster managed by the session.
Next, we can conduct graph queries with Gremlin, invoke various graph algorithms, or run graph-based neural network tasks as we did in the standalone mode. We do not repeat the code here, but an .ipynb notebook processing the classification task on k8s is available on the Playground.
Closing the session
One additional step in the distributed setting is closing the session. We close the session after processing all graph tasks.
This operation notifies the backend engines and vineyard to safely unload graphs and their applications. Then the coordinator releases all the resources it acquired in the k8s cluster.
Please note that we have not hardened this release for production use and it lacks important security features such as authentication and encryption, and therefore it is NOT recommended for production use (yet)!
Development
Building on local
To build the GraphScope Python package and the engine binaries, you need to install some dependencies and build tools.
Then you can build GraphScope with pre-configured make commands.
Building Docker images
GraphScope ships with a Dockerfile that can build Docker images for releasing. The images are built on a builder image with all dependencies installed and copied to a runtime-base image. To build images with the latest version of GraphScope, go to the k8s/internal directory under the root directory and run this command.
Building client library
The GraphScope Python interface is separate from the engines image. If you are developing the Python client and not modifying the protobuf files, the engines image does not need to be rebuilt.
You may want to reinstall the Python client locally.
Note that the learning engine client has C/C++ extension modules, and setting up the build environment is a bit tedious. By default, the locally built client library does not include support for the learning engine. If you want to build the client library with the learning engine enabled, please refer to Build Python Wheels.
Testing
To verify the correctness of your developed features, your code changes should pass our tests.
You may run the whole test suite with commands:
Documentation
Documentation can be generated using Sphinx. Users can build the documentation using:
The latest version of online documentation can be found at https://graphscope.io/docs
License
GraphScope is released under the Apache License 2.0. Please note that third-party libraries may not have the same license as GraphScope.
Publications
Contributing
Any contributions you make are greatly appreciated!