In the coming year, what is the future of the graph domain, and why do we need graphs?
What is the advantage of a graph?
Graph can
For the graph domain, the essential question is how to find the graph structure. For instance, a neural network can be viewed as a graph, e.g., its DAG or computation graph, but such a view is less informative. How we define the graph leads to different properties; defining a better graph as the network structure could be a promising direction (the recent ICLR has two good submissions).
Graphs can serve different purposes, especially for System II intelligence. Here are a few examples:
Why do GNNs work?
Why do we need graphs?
How do we build a meaningful graph from non-graph data?
How do we remove the structure from graph data?
The mountain light and the forms of things play in the spring sunshine; do not turn back merely because of a passing shade. Even on a clear day without rain, deep among the clouds your clothes will still be dampened.
Today was quite an interesting day: the paper I co-authored with Zhikai finally passed 100 stars, and the rebuttal for another collaborative paper nudged the scores up a bit. Zhikai and I have been collaborating for a long time now, and over dinner tonight we briefly reminisced about how this year brought us, step by step, to today. I suddenly felt like writing down some stories from this year of exploring the graph domain together with Zhikai, who was the first student to truly collaborate with me (on a complete piece of research). This is just a casual record of my life, and I am grateful that Zhikai appeared in it; the PhD journey is tedious, but having a few friends around really makes it feel good.
My first contact with Zhikai was two days before he came to the US, when he was added to our group chat. I had long heard from jt (my advisor) that two new students would enroll in the spring, but nobody knew who they were. As soon as he joined the chat I added him as a friend and asked whether he needed anything before arriving. I remember pulling him into the MSU Chinese student group so he could look for a sublease, and giving him the Michigan Flyer address with directions from Detroit to Lansing. Our first meeting in person was not particularly smooth, though. At the Monday group meeting jt asked whether the new student had arrived; Zhikai showed up that afternoon (and, as I recall, barely said a few words). jt's comment was "so independent" (at the time, it seemed only haoyu and I had proactively added Zhikai; everyone felt the new student was shrouded in mystery). And so our lab gained another friend.
Like most students working on graphs, the first thing Zhikai did after joining the group was read up on spectral graph theory and graph signal processing. "Go back to old school. When I was a PhD student…", along with the classic remark that juanhui and haoyu did not go deep into this direction; every newcomer seems to hear this speech once. Carrying jt's expectations, Zhikai started his reading and presentations. And what was I doing then? It was the second most painful stretch of my PhD (the first being right now). As a typical "jumpy-minded" type, my first PhD project was not so much bold as fanciful; but as a fresh PhD student, young and arrogant, I did not see it that way. GNNs are so bad; surely I could design an MLP that beats them! I tried all kinds of objectives and training schemes (and have never trusted bilevel optimization since), yet could not even catch the shadow of a GNN. To break the deadlock, I would make a 180° turn in the project direction every two weeks, from synthetic design to test-time aggregation; looking at the slides from that period still makes me cringe. I remember one night, after yet another pivot in the slides had left all my collaborators confused, jt talked with me on Zoom for over an hour, then another hour over Slack after I got home, and I also went back and forth with wei and haoyu in the lab for half an hour. There was a snowstorm that night, ice crystals glittering on the trees, and I thought that perhaps this project really could not be done. (I am also grateful that jt consistently stuck with it.)
"After endless mountains and rivers that leave doubt whether there is a path out, suddenly there appears a village amid willow shade and bright flowers." Two days later an idea popped into my head: if I wanted so badly for an MLP to beat a GNN, why not directly compare MLPs with GNNs? It deviated slightly from the original track, but it seemed it could still reach the destination. (A few weeks later, the night before a group meeting, another flash of inspiration produced the example in the paper's introduction; my verdict on that first paper is that it truly ran on pure inspiration.) jt felt the new route was workable but insisted the original direction was still meaningful, so he asked Zhikai to help me run some experiments and carry it forward. (I remember we even came up with a rather abstract heterophily label propagation idea.) Zhikai's experimental results were, unsurprisingly, terrible, and that direction eventually fizzled out, while Zhikai and I began collaborating, with him running baselines for me. That paper was eventually accepted at NeurIPS, but I do not consider it a good collaboration. In hindsight, I did not yet understand the skills of collaboration and communication, and I really did make Zhikai do a lot of grunt work (hyperparameter tuning).
The days went by like this. One day, when the experiments were nearly done but the deadline was still far off (writing that paper took a full month, omg), the child of my father's classmate (on exchange at UMich) was coming to MSU to visit me. To pick him up, zhiyu and I left home a bit late, and while waiting, bored, I opened YouTube and happened upon Yao Fu's talk, which the two of us watched together in the living room. My understanding of GPT was still incomplete then, but I was lucky to have friends lighting the way: haonan's history of hardware development, and jieyu pointing out the path of prompting back at NeurIPS; such far-sighted friends are truly my treasure. Amid anxiety and unease, this was the first time I really sat down and took LLMs seriously. New technology, new directions, everything felt fresh and exciting; it felt like a brand-new beginning. That afternoon, after taking the kid to feed the squirrels, I slumped exhausted into my lab chair and decided to apply a technology as good as GPT to graphs. I messaged Zhikai, sent him the talk, and asked whether he wanted to work on this direction. He was half-bewildered, and that, it seems, was where everything started.
Before the next week's one-on-one with jt, I asked Zhikai again whether he wanted to do this. The morning after we discussed its feasibility, I remember talking through the idea with jt at the whiteboard that same day; he thought it was worth a try. In hindsight, the idea was indeed not that novel, but the process was far harder than expected. Few people had thought about how an irregular data structure like a graph could be combined with a sequence model like GPT. Apart from two BERT-based works by jianan and Eli Chien, there was almost no prior work to reference. Worse, few datasets came with text attributes for us to process; we were not NLP students, did not really understand how to use GPT, and were unfamiliar with the API. Even jt told us, "I think this idea is weird but you guys can have a try," and "We do not change our direction; it is how high we climb on the mountain." I remember replying, "but this is a new generation."
Hesitant but moving forward step by step: with no dataset, Zhikai dug into old literature and built one by hand. The turning point came two or three days after NeurIPS ended. While bidding on papers, we saw xiaoxin's graph LLM paper and discovered they were also working on LLMs; more surprisingly, they had already prepared several standard datasets. Since we could not see the full text, we even guessed the paper came from the MSRA group where jianan had interned. Once xiaoxin open-sourced the datasets, our experiments finally filled out. Unexpectedly, the paper ended up as a technical report with an assortment of findings. Completely beyond our expectations, it quickly drew wide attention (it was also the paper I promoted hardest among all my papers), the GitHub repo's stars kept climbing, and two months after releasing it we realized we might have bet right.
Building on the observations from the first paper, we naturally found a right way to use LLMs in the graph domain, but we did not expect to submit so soon. The part where my friendship with Zhikai deepened was probably the 72 hours we spent in the lab without leaving, rushing the paper day and night, napping on the desks when tired. On the last night we both nearly collapsed: while editing his paper I suddenly saw him go offline on Overleaf; I walked over to his desk to check and found him dozing off in his seat without even lowering his head. Half an hour later the same happened to me, writing, eyes closing, falling asleep with my hands still on the keyboard. The moment the paper was submitted on time, we were like two brothers crawling out of the trenches at the end of a war; we went home and slept for a whole day. Unlike past deadline rushes, this time I wondered: is it over just like that? It is wonderful to have someone working alongside me. I hope all this hard work will be worth it.
I have always felt Zhikai is slow to warm up, but after nearly a year of daily contact we gradually became friends who talk about everything, from the stiffness of his first visit to my home for soup, to the happy hours in my Lansing kitchen. Perhaps we have not traveled to many places, but in research we have certainly seen plenty of scenery. Maybe there is nothing to be proud of in our A+B style of research; it is not as magnificent as a startup success story, and it has drawn plenty of criticism, but the joy of each step forward is real, and we felt it firsthand. A steady and efficient executor, a surprise that life brought me.
Thank you, Zhikai, for shaping who I am today: a bolder, less constrained, more communicative PhD student. I hope everything in the rest of my PhD can be like this, ordinary days with a few surprises and a few close friends, happening for real. I also hope every future piece of work we do is better than the last. It may all look natural, but the story behind it was not simple. Even on a clear day without rain, deep among the clouds one's clothes will still be dampened.
Graph is a very basic data structure we learned in the Data Structures and Algorithms course. It naturally represents each instance as a node and each pairwise relationship as an edge, making it a natural representation for arbitrary data. For instance, in the computer vision domain, an image can be viewed as a grid graph. In the natural language processing domain, a sentence can be viewed as a path graph. In AI4Science, graphs adapt readily to many scientific problems.
Graph Neural Networks (GNNs) were proposed to bring the strong capability of neural networks to graph-structured data. GNN architectures have found wide applicability across a myriad of contexts, with graph data drawn from diverse sources such as social networks, citation networks, transportation networks, financial networks, and chemical molecules. Nonetheless, there is no consistently winning solution across all datasets, owing to the varied concepts these graphs encode. For instance, GCN may work well on a particular social network while falling short on molecular graphs because it cannot capture certain key patterns.
Motivated by this problem, in this blog we focus on the question: when do GNNs work, and when do they not?
The insights gleaned from this understanding will serve as a catalyst for the advancement of model development for novel graph datasets, thereby fostering the wider adoption of GNNs in emerging applications.
Our typical findings are:
We will dive deep into both the underlying graph data mechanism and model mechanism in this blog and provide a full picture of Graph Neural Networks for node classification.
If you have any questions on the blog, feel free to send email to haitaoma@msu.edu
In this section, we will provide a brief introduction to the task, model, and data properties we focus on, as well as the main analysis tools we utilize.
The semi-supervised node classification (SSNC) task is to predict the categories or labels of unlabeled nodes based on the graph structure and the labels of a few labeled nodes. We normally use message propagation through the connections in the graph to make educated guesses about the labels of the unlabeled nodes. It has wide applications in inferring node attributes, social influence prediction, traffic prediction, air quality prediction, and so on.
Graph neural networks learn node representations by aggregating and transforming information over the graph structure. There are different designs and architectures for the aggregation and transformation, which leads to different graph neural network models.
We will mainly introduce GCN, a fundamental yet representative model. For a particular node, GCN aggregates the transformed features from its neighbors and averages them.
$\textbf{Graph Convolutional Network (GCN).}$
From a local perspective of node $i$, a single GCN layer can be written as a feature averaging process:
\[\mathbf{h}_i = \frac{1}{d_i}\sum_{j \in \mathcal{N}(i)}\mathbf{W}\mathbf{x}_j\]where $\mathbf{h}_i$ denotes the aggregated feature, $d_i$ denotes the degree of node $i$, and $\mathcal{N}(i)$ denotes the neighbors of node $i$, i.e., $d_i = \left| \mathcal{N}(i) \right|$. $\mathbf{W} \in \mathbb{R}^{l \times l}$ is a parameter matrix that transforms the features, while $\mathbf{x}_j$ denotes the initial feature of node $j$. Notably, the weight transformation step is not our focus since it is common throughout deep learning; we focus on the aggregation step. The key assumption behind aggregation is that neighboring nodes are similar to the center node, which is called the homophily assumption. Under this assumption, aggregation can exploit the similarity to achieve smooth and discriminative representations.
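As a minimal sketch of the averaging step above (plain NumPy, dense adjacency matrix, and a toy weight matrix standing in for a trained $\mathbf{W}$), the per-node aggregation can be written as:

```python
import numpy as np

def gcn_aggregate(X, adj, W):
    """Mean-aggregate transformed neighbor features: h_i = (1/d_i) * sum_{j in N(i)} W x_j."""
    H = np.zeros((X.shape[0], W.shape[0]))
    for i in range(X.shape[0]):
        nbrs = np.nonzero(adj[i])[0]          # N(i): indices of node i's neighbors
        H[i] = (W @ X[nbrs].T).mean(axis=1)   # average of W x_j over the neighbors
    return H
```

With $\mathbf{W}$ set to the identity, each output row is simply the mean of the node's neighbor features, which matches the color-mixing intuition used in the toy figures below.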
Recent works reveal that different graph properties, e.g., degree or the length of the shortest path, can influence the effectiveness of GNNs. Among them, homophily and heterophily are recognized as the most important properties, and they are the key focus of this post. People generally believe that neighboring nodes are similar to the center node, which is called homophily; aggregation can therefore exploit neighborhoods to achieve smoother and more discriminative representations.
Homophily. If all edges only connect nodes with the same label, this property is called homophily, and the graph is called a homophilous graph.
In Fig.1, the number denotes the label, and different colors denote distinct features. It is shown that all nodes with similar features have edges connected, and also share the same label, illustrating a perfect homophily.
Heterophily. If all edges only connect nodes with different labels, this property is called heterophily, and the graph is called a heterophilous graph. Fig. 2 below shows a heterophilous graph. In this toy example, each node with label 0 (resp. 1) only connects to nodes with label 1 (resp. 0).
Graph Homophily Ratio.
Given a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ and a node label vector $y$, the edge homophily ratio is defined as the fraction of edges that connect nodes with the same labels. Formally, we have:
\[h(\mathcal{G}, \{y_i; i \in \mathcal{V}\}) = \frac{1}{\left| \mathcal{E} \right| } \sum_{(j, k) \in \mathcal{E}} \mathbb{I}(y_j = y_k)\]where $\left| \mathcal{E} \right|$ is the number of edges in the graph and $\mathbb{I}(\cdot)$ denotes the indicator function.
A graph is typically considered to be highly homophilous when $0.5 \le h(\cdot) \le 1$. On the other hand, a graph with a low edge homophily ratio ($0 \le h(\cdot) < 0.5$) is considered to be heterophilous.
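The edge homophily ratio is straightforward to compute from an edge list; a small sketch (edge list as `(u, v)` pairs, labels as a list indexed by node id):

```python
def edge_homophily(edges, labels):
    """Edge homophily ratio h(G): fraction of edges whose two endpoints share a label."""
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)
```

For example, a path 0-1-2-3 with labels [0, 0, 1, 1] has two same-label edges out of three, giving h = 2/3, i.e., a mildly homophilous graph by the threshold above.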
Node Homophily Ratio.
Node homophily ratio is defined as the proportion of a node’s neighbors sharing the same label as the node. It is formally defined as:
\[h_i = \frac{1}{d_i} \sum_{j \in \mathcal{N}(i)} \mathbb{I}(y_j = y_i)\]where $\mathcal{N} (i)$ denotes the neighbor node set of $v_i$ and $d_i = | \mathcal{N}(i) |$ is the cardinality of this set.
Similarly, node $i$ is considered to be homophilic when $h_i \ge 0.5$, and is considered heterophilic otherwise. Moreover, this ratio can be easily extended to higher-order cases $h_{i}^{(k)}$ by considering $k$-order neighbors $\mathcal{N}_k(v_i)$.
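The node-level counterpart follows directly from the definition; a minimal sketch, assuming the neighborhood is given as an adjacency list (dict of node id to neighbor list):

```python
def node_homophily(i, adj_list, labels):
    """Node homophily ratio h_i: fraction of node i's neighbors sharing its label."""
    nbrs = adj_list[i]
    return sum(labels[j] == labels[i] for j in nbrs) / len(nbrs)
```

A node with neighbors [1, 2] and labels [0, 0, 1] has h_i = 0.5, sitting exactly at the homophilic/heterophilic boundary used above.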
To examine whether GNNs perform well, we focus on whether a GNN can encode a discriminative representation. The ideal discriminative representation can be described by: (1) Cohesion: nodes with the same label are mapped to similar representations; (2) Separation: nodes with different labels are mapped to dissimilar representations. Fig. 3 below illustrates an example with high cohesion and good separation, where each color indicates one class. We can observe that each cluster is of a single class while different clusters are distant from each other. We can then expect a simple linear classifier to achieve high performance, which indicates an ideal representation.
In this section, we illustrate that GNNs can do better than commonly believed: homophily, where nodes connect to similar nodes, is not a necessity for the success of GNNs, and GNNs can still work on various heterophilous datasets (where nodes connect to dissimilar ones). To this end, we examine whether GNNs can achieve discriminative representations in different settings.
In this subsection, we examine when GCN can map nodes with the same label to similar embeddings. We first play with the toy homophilous and heterophilous graph examples shown in Section 2.3. In particular, we examine node representations from different classes after GNN aggregation.
GCN under homophily: The aggregation process for the homophilous graph is shown in Fig. 4, where node color and number represent node features and labels, respectively. We can easily observe that, after mean aggregation, all the nodes of class 1 are in blue and those of class 2 in red, indicating good discriminative ability.
GCN under heterophily: The aggregation process for the heterophilous graph is shown in Fig. 5, where node color and number represent node features and labels, respectively. We can easily observe a color alternation: before aggregation, all the nodes of class 1 are in blue and those of class 2 in red, whereas after mean aggregation, all the nodes of class 1 are in red and those of class 2 in blue. Nonetheless, such alternation does not affect the discriminative ability. Notably, nodes of the same class are still in the same color while nodes of different classes are in different colors, indicating good discriminative ability.
More rigorously, we provide a theoretical understanding of what kinds of graphs could benefit from the GNNs and how. GNN can perform well on the graphs satisfying:
The rigorous theoretical analysis is as follows (you can skip this part if you want to avoid heavy math!):
Consider a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \{\mathcal{F}_{c}: c \in \mathcal{C}\}, \{\mathcal{D}_{c}: c \in \mathcal{C}\}\}$, where $\mathcal{F}_c$ is the feature distribution of class $c$ and $\mathcal{D}_c$ is the neighborhood label distribution of class $c$.
For any node $i\in \mathcal{V}$, the expectation of the pre-activation output of a single GCN operation is given by \(\mathbb{E}[{\bf h}_i] = {\bf W}\left( \mathbb{E}_{c\sim \mathcal{D}_{y_i}, {\bf x}\sim \mathcal{F}_c } [{\bf x}]\right).\)
and for any $t>0$, the probability that the distance between the observation ${\bf h}_i$ and its expectation is larger than $t$ is bounded by
\[\mathbb{P}\left( \|{\bf h}_i - \mathbb{E}[{\bf h}_i]\|_2 \geq t \right) \leq 2 \cdot l\cdot \exp \left(-\frac{ deg(i) t^{2}}{ 2\rho^2({\bf W}) B^2 l}\right)\]where $l$ denotes the feature dimensionality and $\rho({\bf W})$ denotes the largest singular value of ${\bf W}$, $B\geq\max _{i, j}|\mathbf{X}[i, j]|$.
We can then draw the rigorous conclusion that the intra-class distance (the distance between $\mathbf{h}_i$ and the expectation $\mathbb{E}[\mathbf{h}_i]$ of its class) in the GCN embedding space is small with high probability, owing to the sampling from the neighborhood distribution $\mathcal{D}_{y_i}$. Notably, the key step in the proof is the Hoeffding inequality. Details can be found in the paper.
To further verify the validity of the theoretical results, we provide more empirical evidence as follows. In particular, we manually add synthetic edges to control the homophily ratio of a graph and examine how the performance varies.
When adding synthetic heterophily edges on a homophily graph, there are two typical things to control:
As we insert heterophilous edges, the graph homophily ratio will also continuously decrease. The results are plotted in Fig.6.
Each point in Fig. 6 represents the performance of a GCN model, and the corresponding value on the $x$-axis denotes the homophily ratio. The point with homophily ratio $h=0.81$ denotes the original Cora graph, i.e., $K=0$.
The observations are shown as follows:
The experiment verifies our findings. If the neighborhood follows a similar distribution, GCN is still able to perform well under extreme heterophily. However, if we introduce noise to the neighborhood distribution, the effectiveness of GCN will not be guaranteed.
In section 3, we discuss the scenario when GNN can do well, including both homophily and heterophily graphs. All the above analyses are from a graph (global) perspective, verifying that the GNN can achieve overall performance gain. However, when we look closer into a node (local) perspective, we find the overlooked vulnerability of GNNs.
Instead of understanding from a graph perspective, the following analyses focus on nodes in the same graph but with different properties. We first plot the distribution of the node homophily ratio on different datasets, shown in Fig. 7. We include two homophilous graphs and two heterophilous ones; additional results on ten datasets can be found in the original paper. $h$ in brackets indicates the graph homophily ratio, and $h_{node}$ on the $x$-axis denotes the node homophily ratio. We can clearly observe that:
Equipped with the analysis of node-level data patterns, we then investigate how GNN performs on nodes with different patterns. In particular, we compare GCN with MLP-based models since they only take the node features as the input, ignoring the structural patterns. If GCN performs worse than MLP, it indicates the vulnerability of GNNs. Experimental results are illustrated in Fig. 8.
We can observe that:
Similar to Section 3.2, we first conduct an analysis on a similar toy example. This time, instead of considering GNNs under homophily and heterophily separately, we take the homophily and heterophily patterns together into consideration. The illustration is shown in Fig. 9, where node color and number represent node features and labels, respectively.
We can observe that when considering the homophily and heterophily together:
The above observations on the toy example show that GNNs cannot work well on homophilous and heterophilous nodes simultaneously. We then further ask which of the two a GNN will learn well; the answer is whichever forms the majority in the training set.
Motivated by the toy example, we then provide a rigorous theoretical understanding from a node level. We find two key factors for test performance:
The following theorem is based on the PAC-Bayes analysis, showing that both large aggregation distance and homophily ratio difference between train and test nodes lead to worse performance. (You can skip the following part for heavy math!)
The theory aims to bound the generalization gap between the expected margin loss $\mathcal{L}_{m}^{0}$ on a test subgroup $V_m$ for margin $0$ and the empirical margin loss $\hat{\mathcal{L}}_{\text{tr}}^{\gamma}$ on the train subgroup $V_{\text{tr}}$ for margin $\gamma$. These losses are commonly used in PAC-Bayes analysis. The formulation is as follows:
Theorem (Subgroup Generalization Bound for GNNs):
Let $\tilde{h}$ be any classifier in the classifier family $\mathcal{H}$ with parameters $\{\widetilde{W}_{l}\}_{l=1}^{L}$.
For any $0< m \le M$, $\gamma \ge 0$, and a large enough number of training nodes $N_{\text{tr}}=|V_{\text{tr}}|$, there exists $0<\alpha<\frac{1}{4}$ such that with probability at least $1-\delta$ over the sample of $y^{\text{tr}} = \{ y_i \}_{i \in V_{\text{tr}}}$ we have:
\[\mathcal{L}_m^0(\tilde{h}) \le \mathcal{L}_\text{tr}^{\gamma}(\tilde{h}) + O\left( \underbrace{\frac{K\rho}{\sqrt{2\pi}\sigma} (\epsilon_m + |h_\text{tr} - h_m|\cdot \rho)}_{\textbf{(a)}} + \underbrace{\frac{b\sum_{l=1}^L\|\widetilde{W}_l\|_F^2}{(\gamma/8)^{2/L}N_\text{tr}^{\alpha}}(\epsilon_m)^{2/L}}_{\textbf{(b)}} + \mathbf{R} \right)\]The bound is related to three terms:
(a) describes how both a large homophily ratio difference $|h_{\text{tr}} - h_m|$ and a large aggregated-feature distance $\epsilon_m = \max_{j\in V_m}\min_{i\in V_{\text{tr}}} \|g_i(X, G)-g_j(X, G)\|_2$ between the test node subgroup $V_m$ and the training nodes $V_{\text{tr}}$ lead to a large generalization error. $\rho= \|\mu_1 - \mu_2 \|$ denotes the original feature separability, independent of structure. $K$ is the number of classes.
(b) further strengthens the effect of nodes with the aggregated feature distance $\epsilon$, leading to a large generalization error.
(c) $\mathbf{R}$ is a term independent of the aggregated feature distance and homophily ratio difference, given by $\frac{1}{N_\text{tr}^{1-2\alpha}} + \frac{1}{N_\text{tr}^{2\alpha}} \ln\frac{LC(2B_m)^{1/L}}{\gamma^{1/L}\delta}$, where $B_m= \max_{i\in V_\text{tr}\cup V_m}\|g_i(X,G)\|_2$ is the maximum feature norm. $\mathbf{R}$ vanishes as the training size $N_\text{tr}$ grows.
Our theory suggests that both homophily ratio difference and aggregated feature distance to training nodes are key factors contributing to the performance disparity. Typically, nodes with large homophily ratio differences and aggregated feature distance to training nodes lead to performance degradation.
To further verify the validity of the theoretical results, we provide more empirical evidence showing the empirical performance disparity. In particular, we compare the performance of different node subgroups divided with both homophily ratio difference and aggregated feature distance to training nodes. For a test node $i$, we measure the node disparity by
We then sort test nodes by $s_1$ and $s_2$ and divide them into 5 equal-binned subgroups accordingly. We include popular GNN models: GCN, SGC (Simplified Graph Convolution), GAT (Graph Attention Network), GCNII (a deeper GCN with initial residual and identity mapping), and GPRGNN (Generalized PageRank Graph Neural Network). The performance of different node subgroups is presented in Fig. 9. We note a clear test accuracy degradation with respect to increasing differences in aggregated features and homophily ratios.
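The subgroup construction above is a simple sort-and-split; a sketch with NumPy (the disparity `scores` vector stands in for $s_1$ or $s_2$ computed per test node):

```python
import numpy as np

def equal_binned_subgroups(scores, n_bins=5):
    """Sort test nodes by a disparity score and split the sorted node indices
    into n_bins (near-)equal-sized subgroups, smallest scores first."""
    order = np.argsort(scores)          # node indices, ascending by score
    return np.array_split(order, n_bins)
```

Per-subgroup accuracy can then be computed by indexing predictions and labels with each returned index array, reproducing the bar plots in Fig. 9.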
We then conduct an ablation study that only considers aggregated features distance and homophily ratios in Figures 10 and 11, respectively. We can observe that the decrease tendency disappears in many datasets. Only combining these factors together provides a more comprehensive and accurate understanding of the reason for GNN performance disparity.
Inspired by the findings, we investigate the effectiveness of deeper GNN models on SSNC tasks.
Deeper GNNs enable each node to capture more complex, higher-order graph structure than vanilla GCN while mitigating the over-smoothing problem, and they empirically exhibit overall performance improvements. Nonetheless, on which structural patterns deeper GNNs excel, and the reason for their effectiveness, remain unclear.
To investigate this problem, we compare vanilla GCN with different deeper GNNs, including GPRGNN, APPNP, and GCNII, on node subgroups with varying homophily ratios. Experimental results are shown in Fig. 11. We observe that deeper GNNs primarily surpass GCN on minority node subgroups, with slight performance trade-offs on the majority node subgroups. We conclude that the effectiveness of deeper GNNs mainly stems from improved discriminative ability on minority nodes.
Having identified where deeper GNNs excel, the reason why the improvement primarily appears in the minority node group remains elusive. Since the superiority of deeper GNNs stems from capturing higher-order information, we further investigate how the higher-order homophily ratio difference $|h_u^{(k)}-h_v^{(k)}|$ varies on the minority nodes, where node $u$ is a test node and node $v$ is the closest training node to $u$. We concentrate on the minority nodes $V_{\text{mi}}$ in terms of the default one-hop homophily ratio $h_u$ and examine how $\sum_{u\in V_{\text{mi}}} |h_u^{(k)}-h_v^{(k)}|$ varies with the order $k$.
Experimental results are shown in Fig.12, where a decreasing trend of homophily ratio difference is observed along with more neighborhood hops. The smaller homophily ratio difference leads to smaller generalization errors with better performance.
In this blog, we investigated when GNNs work and when they do not. We found that the effectiveness of vanilla GCN is not limited to homophilous graphs; nonetheless, a vulnerability hides beneath the success of GNNs. We provide some suggestions before you build your own solution to a graph problem.
We leave some questions for future work:
[1] Ma, Yao, and Jiliang Tang. "Deep learning on graphs." Cambridge University Press, 2021.
[2] Ma, Yao, Xiaorui Liu, Neil Shah, and Jiliang Tang. "Is homophily a necessity for graph neural networks?" arXiv preprint arXiv:2106.06134 (2021).
[3] Mao, Haitao, Zhikai Chen, Wei Jin, Haoyu Han, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. "Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All?" arXiv preprint arXiv:2306.01323 (2023).
[4] Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
[5] Hamilton, Will, Zhitao Ying, and Jure Leskovec. "Inductive representation learning on large graphs." Advances in Neural Information Processing Systems 30 (2017).
[6] Xu, Keyulu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. "How powerful are graph neural networks?" arXiv preprint arXiv:1810.00826 (2018).
[7] Fan, Wenqi, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. "Graph neural networks for social recommendation." In The World Wide Web Conference, pp. 417-426. 2019.
[8] Zhu, Jiong, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. "Beyond homophily in graph neural networks: Current limitations and effective designs." Advances in Neural Information Processing Systems 33 (2020): 7793-7804.
Paper: https://arxiv.org/abs/2307.03393
Code: https://github.com/CurryTang/Graph-LLM
Graphs are a very important form of structured data with broad application scenarios. In the real world, graph nodes are often associated with attributes in textual form. Taking the product graph in e-commerce (the OGBN-Products dataset) as an example, each node represents a product on an e-commerce site, and the product description can serve as the node's attribute. In the graph learning literature, such graphs with text as node attributes are commonly called Text-Attributed Graphs (TAGs). TAGs are very common in graph machine learning research; for example, the most widely used paper citation datasets are TAGs. Besides the structural information of the graph itself, the text attributes of nodes provide important textual information, so one needs to account for the graph structure, the text, and the interplay between the two. However, past research has often overlooked the importance of the textual information. For example, the standard datasets shipped with common libraries such as PyG and DGL (e.g., the classic Cora dataset) do not provide the raw text attributes, only embedded bag-of-words features. Meanwhile, the commonly used GNNs focus more on modeling the graph topology and lack an understanding of node attributes.
Compared with previous work, this paper mainly studies how to better process the textual information, and how different text embeddings, once combined with GNNs, affect downstream task performance. For processing text, the most popular tool today is undoubtedly the large language model (LLM). (This paper considers language models pre-trained on large corpora, from BERT to GPT-4, and uses "LLM" to refer to all of them.) Compared with bag-of-words features such as TF-IDF, LLMs have the following potential advantages.
Given the diversity of LLMs, the goal of this paper is to design suitable frameworks for different kinds of LLMs. Regarding the integration of LLMs and GNNs, the paper first divides LLMs into two categories: embedding-visible and embedding-invisible. LLMs such as ChatGPT, which can only be accessed through an API, belong to the latter. For embedding-visible LLMs, the paper considers three paradigms:
For embedding-visible LLMs, one can first use them to generate text embeddings and then feed those embeddings to a GNN as initial features, thereby fusing the two kinds of models. For embedding-invisible LLMs such as ChatGPT, however, how to apply their strong capabilities to graph learning tasks becomes a challenge.
To address these problems, this paper proposes a framework for applying LLMs to graph learning tasks, as shown in Figures 1 and 2 below. The first mode, LLMs-as-Enhancers, uses the capabilities of LLMs to enhance the original node attributes, which are then fed into a GNN to improve downstream performance. For embedding-visible LLMs, feature-level enhancement is adopted, combining the language model and the GNN with either a cascading or an iterative (GLEM, ICLR 2023) optimization scheme. For embedding-invisible LLMs, text-level enhancement is adopted, using the LLM to expand the original node attributes. Considering the zero-shot learning and reasoning abilities of LLMs such as ChatGPT, the paper further explores representing node attributes and structure as prompts and letting the LLM directly generate predictions, a paradigm referred to as LLMs-as-Predictors. In the experiments, the paper mainly uses node classification as the task of study; the limitations of this choice, and the possibility of extending to other tasks, are discussed at the end. Below, following the structure of the paper, we briefly share some interesting findings under each mode.
First, the paper studies the mode of using LLMs to generate text embeddings that are then fed into GNNs. Under this mode, depending on whether the LLM is embedding-visible, feature-level and text-level enhancement are proposed. For feature-level enhancement, the optimization between the language model and the GNN is further considered, subdividing it into a cascading structure and an iterative structure. The two enhancement methods are introduced below.
For feature-level enhancement, the paper mainly considers three factors: the language model, the GNN, and the optimization method. In terms of language models, the paper considers pre-trained language models represented by DeBERTa, open-source sentence embedding models represented by Sentence-BERT, commercial embedding models represented by text-ada-embedding-002, and open-source large models represented by LLaMA. For these language models, the paper mainly examines how model type and parameter scale affect downstream performance.
From the GNN perspective, the paper mainly considers how the message passing mechanism in GNN design affects downstream tasks. The paper selects the representative models GCN, GraphSAGE, and GAT; for the OGB datasets, it selects RevGAT and SAGN, which rank near the top of the current leaderboards. MLP is also included to gauge the downstream quality of the raw embeddings.
From the optimization perspective, the paper examines the cascading and iterative structures. For the cascading structure, text embeddings are produced directly by the language model. For smaller models that can be fine-tuned, the paper considers text-based fine-tuning and structure-based self-supervised training (GIANT, ICLR 2022). Either way, one ends up with a language model that is used to generate text embeddings; in this process, the training of the language model and of the GNN are separate. For the iterative structure, the paper examines GLEM (ICLR 2023), which uses EM and variational inference to jointly train the GNN and the language model in an iterative fashion.
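The cascading structure can be sketched in a few lines: a frozen text encoder produces node features, and a GNN layer consumes them. Everything here is a stand-in: the token-hashing `encode_texts` merely imitates an embedding-visible LLM such as Sentence-BERT, and the "GNN" is a single mean-aggregation layer.

```python
import numpy as np

def encode_texts(texts, dim=8):
    """Stand-in for an embedding-visible LLM encoder: hash tokens into a
    fixed-size bag-of-words vector (a real pipeline would call the LLM here)."""
    embs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for tok in text.lower().split():
            embs[i, hash(tok) % dim] += 1.0
    return embs

def cascade(texts, adj):
    """Cascading structure: frozen text encoder -> one mean-aggregation GNN layer."""
    X = encode_texts(texts)                   # step 1: LLM embeddings (fixed)
    deg = adj.sum(axis=1, keepdims=True)      # step 2: GNN consumes them
    return adj @ X / np.maximum(deg, 1)
```

The key property of the cascade is visible in the code: the encoder and the aggregation are decoupled, so the language model never receives gradients from the GNN, in contrast to the iterative GLEM-style training.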
In the experiments, the paper selects several representative, commonly used TAG datasets; the detailed settings can be found in our paper. Below we first show the experimental results for this part (due to space limits, results on two large graphs are shown), then briefly discuss some interesting findings.
The experimental results suggest several interesting conclusions.
First, GNNs show very different effectiveness for different text embeddings. A striking example occurs on the Products dataset: with MLP as the classifier, the embeddings from the fine-tuned DeBERTa-base are much better than TF-IDF. Yet once a GNN is applied, the gap becomes tiny, and with SAGN, TF-IDF actually performs better. This phenomenon may be related to over-smoothing and over-correlation in GNNs, but there is no complete explanation yet, making it an interesting research topic.
Second, using a sentence embedding model as the encoder cascaded with a GNN yields very good downstream performance. On Arxiv in particular, simply cascading Sentence-BERT with RevGAT comes close to GLEM and even surpasses GIANT, which uses self-supervised training. Note that this is not because a larger language model is used: the Sentence-BERT here is the MiniLM version, with even fewer parameters than the BERT used by GIANT. One possible reason is that Sentence-BERT, trained on Natural Language Inference (NLI), provides implicit structural information; formally, NLI bears some resemblance to link prediction. This is of course only a preliminary conjecture that needs further investigation. The result is also suggestive: when considering pre-trained models on graphs, could one directly pre-train a language model, leveraging the more mature recipes of language model pre-training, and perhaps obtain better results than pre-training a GNN? Meanwhile, OpenAI's paid embedding model brings only a small improvement over open-source models on node classification.
Third, compared with a DeBERTa without fine-tuning, LLaMA achieves better results, but there is still a sizable gap to sentence embedding models. This suggests that model type may matter more than parameter count. For DeBERTa, the paper uses the [CLS] token as the sentence vector; for LLaMA, it uses llama-cpp-embedding from LangChain, whose implementation uses [EOS] as the sentence vector. Prior work has explained why [CLS] performs poorly without fine-tuning, mainly due to its anisotropy, which leads to poor separability. In our experiments, at high labeling rates, LLaMA's text embeddings achieve decent downstream performance, which indirectly suggests that increasing model size may alleviate this problem to some extent.
Feature-level enhancement yields some interesting results, but it still requires the language model to be embedding-visible. For embedding-invisible models such as ChatGPT, text-level enhancement can be used instead. For this part, the paper first studies a recent arXiv paper, Explanations as Features (TAPE), whose idea is to use LLM-generated explanations of predictions as augmented attributes; with ensembling it reached first place on the OGB Arxiv leaderboard. The paper also proposes an LLM-based knowledge enhancement method, Knowledge-Enhanced Augmentation (KEA), whose core idea is to treat the LLM as a knowledge base, mine the knowledge-related key information in the text, and generate more detailed explanations, mainly to compensate for the limited knowledge inside smaller language models. The two models are illustrated below.
To test the effectiveness of the two methods, the paper follows the experimental setup of the first part. Considering the cost of using LLMs, the experiments are run on the two small graphs Cora and Pubmed. For the LLM, we use gpt-3.5-turbo, better known as ChatGPT. First, to better understand text-level enhancement and the effectiveness of TAPE, we run a detailed ablation study on TAPE.
In the ablation study, we mainly consider the following questions.
The results show that the pseudo labels heavily depend on the LLM's own zero-shot prediction ability (discussed in detail in the next section); at low labeling rates, they may even drag down the ensembled performance. Therefore, in subsequent experiments, the paper uses only the original attributes TA and the explanations E. Second, sentence encoders achieve better results than fine-tuned pre-trained models at low labeling rates, so the paper adopts the sentence embedding model e5. An additional interesting phenomenon: on Pubmed, with the augmented features, the fine-tuning-based methods achieve very good performance. A possible explanation is that the model mainly learns a "shortcut" to the LLM's predictions, so TAPE's performance would be highly correlated with the LLM's own prediction accuracy. Next, we compare the effectiveness of TAPE and KEA.
In the results, both KEA and TAPE improve over the original features. KEA does better on Cora, while TAPE is more effective on Pubmed. After the discussion in the next section, it will become clear that this is related to the LLM already being a strong predictor on Pubmed. Compared with TAPE, KEA does not rely on the LLM's predictions, so its performance is more stable across datasets. Beyond these two datasets, this kind of text-level enhancement has many more application scenarios. Smaller pre-trained language models such as BERT or T5 usually lack ChatGPT-level reasoning ability and cannot understand domain content such as code or formatted text as well as ChatGPT. For such scenarios, a large model like ChatGPT can first transform the original content; a smaller model trained on the transformed data then offers faster and cheaper inference. Moreover, when some labeled samples are available, fine-tuning can capture dataset-specific idiosyncrasies better than in-context learning.
In this part, the paper further considers whether the GNN can be discarded entirely, letting the LLM generate effective predictions through prompt design. Since the paper focuses on node classification, a simple baseline is to treat it as a text classification task. Based on this idea, the paper first designs simple prompts to test how well the LLM performs without any graph structure, mainly considering zero-shot and few-shot settings, and also testing chain-of-thought prompting.
The results are shown in the figure below. The LLM's performance varies drastically across datasets. On Pubmed, the LLM's zero-shot performance even surpasses GNNs, while on Cora, Arxiv, and others there is still a sizable gap. Note that for the GNNs here, on Cora, CiteSeer, and Pubmed, 20 samples per class are selected as the training set, while Arxiv and Products have many more training samples. In contrast, the LLM's predictions are zero-shot or few-shot, whereas GNNs have no zero-shot ability and also perform poorly with few samples. Of course, the input length limit also prevents the LLM from including more in-context examples.
Analyzing the results, in some cases the predictions the LLM gets "wrong" are actually reasonable. An example is shown in Figure 12. Many papers are themselves cross-disciplinary, so when the LLM reasons with its own commonsense knowledge, its predictions do not always match the annotators' preferences. This raises a question worth pondering: is the single-label setting reasonable?
In addition, the LLM performs worst on Arxiv, which is inconsistent with the conclusion in TAPE, so we compare the two prompts. TAPE's prompt is shown below.
Abstract: <abstract text> \n Title: <title text> \n Question: Which arXiv CS sub-category does this paper belong to? Give 5 likely arXiv CS sub-categories as a comma-separated list ordered from most to least likely, in the form "cs.XX", and provide your reasoning. \n \n Answer:
Interestingly, TAPE does not even specify in the prompt which categories exist in the dataset, instead directly exploiting the LLM's internal knowledge about arXiv. Strangely, this small change dramatically improves the LLM's predictions, which raises the suspicion of test-set label leakage. As a high-quality corpus, arXiv data is very likely included in the pre-training of various LLMs, and TAPE's prompt may help the LLM better recall that pre-training corpus. This reminds us to rethink the soundness of evaluation: the accuracy here may reflect neither the quality of the prompt nor the ability of the language model, but merely the LLM's memorization. Both issues concern dataset evaluation and are very valuable future directions.
Further, the paper considers whether structural information can also be included in the prompt in textual form, and tests several ways to represent structured information. Specifically, we tried using natural language such as "connected" to express edge relations, and implicitly expressing edge relations by summarizing the information of neighboring nodes.
The results show that the following implicit representation is the most effective.
Paper: <paper content>
Neighbor Summary: <neighbor summary>
Instruction: <task instruction>
Specifically, mimicking the idea of GNNs, we sample second-order neighbors, feed their text content to the LLM, and ask it to produce a summary that serves as structure-related information; an example is shown in Figure 13.
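Assembling this prompt is mostly string layout; a minimal sketch, with the caveat that in the paper an LLM first condenses the sampled neighbor texts into a summary, while here we simply concatenate them as a stand-in (`build_prompt` and its parameters are illustrative names, not the paper's code):

```python
import random

def build_prompt(node_text, neighbor_texts, instruction, k=3, seed=0):
    """Assemble the implicit-structure prompt: sample up to k (e.g. 2-hop)
    neighbor texts, then lay out the Paper / Neighbor Summary / Instruction
    fields. A real pipeline would replace the join with an LLM summarization."""
    rng = random.Random(seed)
    sampled = rng.sample(neighbor_texts, min(k, len(neighbor_texts)))
    summary = " ".join(sampled)  # stand-in for the LLM-generated summary
    return (f"Paper: {node_text}\n"
            f"Neighbor Summary: {summary}\n"
            f"Instruction: {instruction}")
```

Sampling (rather than including every neighbor) keeps the prompt within the LLM's context window, mirroring neighbor sampling in GNNs such as GraphSAGE.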
The paper tests the effectiveness of this prompt on several datasets; the results are shown in Figure 14. On the four datasets other than Pubmed, it brings improvements over the structure-free setting, demonstrating the effectiveness of the method. The paper further analyzes why this prompt fails on Pubmed.
On Pubmed, a sample's label often appears verbatim in the sample's text attributes; an example is shown below. Because of this property, good results on Pubmed can be obtained by learning this "shortcut", and the LLM's particularly strong performance on this dataset may stem from exactly that. In this case, adding the summarized neighbor information may actually make it harder for the LLM to pick up the shortcut, so performance drops.
Title: Predictive power of sequential measures of albuminuria for progression to ESRD or death in Pima Indians with type 2 diabetes. … (content omitted here)
Ground truth label: Diabetes Mellitus Type 2
进一步地，在一些邻居与本身标签不同的异配(heterophilous)点上，LLM同GNN一样会受到邻居信息的干扰，从而输出错误的预测。
GNN的异配性也是一个很有意思的研究方向，大家也可以参考我们的论文Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All?
从上文的讨论中可以看到，在一些情况下LLM可以取得良好的零样本预测性能，这使得它有代替人工为样本生成标注的潜力。本文初步探索了利用LLM生成标注，然后用这些标注训练GNN的可能性。
针对这一问题，有两个需要研究的点
最后，简要讨论一下本文的局限性，以及一些有意思的后续方向。首先，需要说明的是本文主要针对的还是节点分类这个任务，而这个pipeline要扩展到更多的图学习任务上还需要更多的研究，从这个角度来说标题或许也有一些overclaim。另外，也有一些场景下无法获取有效的节点属性。比如，金融交易网络中，很多情况下用户节点是匿名的，这时如何构造能够让LLM理解的有意义的prompt就成为了新的挑战。
其次，如何降低LLM的使用成本也是一个值得考虑的问题。在文中，讨论了利用LLM进行增强，而这种增强需要使用每个节点作为输入，如果有N个节点，那就需要与LLM有N次交互，有很高的使用成本。在实验过程中，我们也尝试了像Vicuna这类开源的模型，但是生成的内容质量相比ChatGPT还是相去甚远。另外，基于API对ChatGPT进行调用目前也无法批处理化，所以效率也很低。如何在保证性能的情况下降低成本并提升效率，也是值得思考的问题。
Finally, an important issue is the evaluation of LLMs. The paper has discussed the potential test-set leakage problem and the questionable single-label setting. For the first, a straightforward idea is to use data outside the LLM's pre-training corpus, but that requires continually refreshing datasets and producing correct human annotations. For the second, a possible remedy is a multi-label setting. For paper-classification datasets like arXiv, high-quality multi-label annotations can be generated from arXiv's own categories, but in the general case producing correct annotations remains a hard problem.
[1] Zhao J, Qu M, Li C, et al. Learning on large-scale text-attributed graphs via variational inference[J]. arXiv preprint arXiv:2210.14709, 2022.
[2] Chien E, Chang W C, Hsieh C J, et al. Node feature extraction by self-supervised multi-scale neighborhood prediction[J]. arXiv preprint arXiv:2111.00064, 2021.
[3] He X, Bresson X, Laurent T, et al. Explanations as Features: LLM-Based Features for Text-Attributed Graphs[J]. arXiv preprint arXiv:2305.19523, 2023.
Graphs are a fundamental data structure denoting pairwise relationships between entities across various domains, e.g., the web, genes, and molecules. Machine learning on graphs, typified by Graph Neural Networks, has become increasingly popular in recent years. In this blog, we introduce some basic concepts of machine learning on graphs. We hope it gives you inspiration on:
What is a graph? Why do we need graphs? How can graph-related problems be solved with machine learning techniques?
How can you relate your specific task to graphs and view it as a graph problem?
How can you utilize existing techniques to solve your specific task?
Before going deep into technical details, we first provide some motivation by reviewing the history of the Graph Neural Network (GNN). GNNs emerged as a response to two significant challenges. The first came from the data mining domain, where researchers were exploring ways to extend deep learning techniques to structured network data, such as the World Wide Web, relational databases, and citation networks. The second arose from the science domain, where researchers were attempting to apply deep learning to practical scientific problems such as single-cell analysis, brain network analysis, and molecular property prediction. To meet these practical challenges, the GNN community has grown rapidly, with researchers collaborating across fields well beyond data mining.
A graph is a data formulation widely utilized to describe pairwise relations between nodes. Mathematically, a graph can be denoted as \(\mathcal{G}=\left \{\mathcal{V}, \mathcal{E} \right \}\), where \(\mathcal{V}= \left \{v_1, v_2, \cdots, v_N \right \}\) is a set of \(N=\left | \mathcal{V} \right |\) nodes and \(\mathcal{E}= \left \{e_1, e_2, \cdots, e_M \right \}\) is a set of \(M=\left | \mathcal{E} \right |\) edges describing the connections between nodes. An edge \(e=(v_1, v_2)\) indicates that there is an edge from node \(v_1\) to node \(v_2\). Moreover, nodes and edges can carry features \(X_V\in \mathbb{R}^{N\times d}\) and \(X_E\in \mathbb{R}^{M\times d}\), respectively.
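As a minimal sketch of these definitions, here is a toy undirected graph stored both as an edge list \(\mathcal{E}\) and as a dense adjacency matrix, with random node features standing in for \(X_V\) (variable names are illustrative):

```python
import numpy as np

# A toy graph with N=4 nodes and undirected edges, stored both as an
# edge list and as a dense adjacency matrix A (A[i, j] = 1 iff (v_i, v_j) in E).
N = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

A = np.zeros((N, N), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1        # undirected: symmetric adjacency

degrees = A.sum(axis=1)          # node degrees
X = np.random.default_rng(0).random((N, 8))  # node features X_V in R^{N x d}, d = 8
```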
The main advantage of the graph formulation is its universal representation ability: a graph can be a natural representation for arbitrary data. In the data mining domain, much data is naturally represented as a graph. Examples are shown in Figure 1.
A social network [1] can be represented as a graph. Each node represents one user. Each edge indicates that a relationship exists between two users, e.g., friendship or a domestic relationship.
Transport Network [2] can be represented as a graph. Each node represents one station. Each edge indicates that a route exists between two stations.
Web Network [3] can be represented as a graph. Each node represents one web page. Each edge indicates that a hyperlink exists between two pages.
Figure 1: (a) Social Network; (b) Transport Network; (c) Web Network.
Moreover, the graph also generalizes to different domains. In computer vision, an image can be viewed as a grid graph. In natural language processing, a sentence can be viewed as a path graph. In AI4Science, graphs adapt easily to a wide range of scientific problems. More concrete examples are shown in Figure 2.
A brain network [4] can be represented as a graph. Nodes represent brain regions, and edges represent connections between them. Connections can be structural, such as axonal projections, or functional, such as correlated activity between brain regions. Brain network graphs can be constructed at different scales, ranging from individual neurons and synapses to large-scale brain regions and networks.
A gene-gene network [5] can be represented as a graph. Nodes represent genes, and edges represent interactions between them. These interactions can be based on different types of experimental evidence, such as co-expression, co-regulation, or protein-protein interactions. Gene-gene networks can be constructed at different levels of complexity, from small subnetworks involved in specific biological pathways to large-scale networks that span the entire genome.
A molecule network [6] can be represented as a graph. Chemical compounds are denoted as graphs with atoms as nodes and chemical bonds as edges. Molecular graphs can be constructed at different levels of complexity, from simple compounds such as water and carbon dioxide to complex biomolecules such as proteins and DNA.
Figure 2: (a) Gene-gene Network; (b) Brain Network; (c) Molecule Network.
The simple graph mentioned in Section [1.1] is the most basic formulation, taking only a single node type and a single edge type into consideration. However, real data may have additional features that cannot be easily handled by the simple graph formulation.
In this subsection, we will briefly describe popular complex graphs including the heterogeneous graph, bipartite graph, multidimensional graph, signed graph, hypergraph, and dynamic graph.
The bipartite graph is a special simple graph whose edges can only connect two node sets \(\mathcal{V}_1\) and \(\mathcal{V}_2\). The two node sets must satisfy: (1) no overlap, \(\mathcal{V}_1 \cap \mathcal{V}_2 = \emptyset\); (2) together they contain all nodes, \(\mathcal{V}_1 \cup \mathcal{V}_2 = \mathcal{V}\). The bipartite graph describes interactions between different kinds of objects. It is typically utilized in e-commerce systems to describe the interactions between users and items. It can also be utilized in various science problems.
The signed graph describes graphs with two edge types: positive edges and negative edges. A signed graph \(\mathcal{G}\) consists of a set of nodes \(\mathcal{V}=\{v_1, \cdots, v_N \}\) and a set of edges \(\mathcal{E}=\{e_1, \cdots, e_M \}\), together with an edge-type mapping function \(\phi_e:\mathcal{E}\to\mathcal{T}_e\), \(\mathcal{T}_e = \left \{1, -1 \right \}\), that maps each edge to its type, positive or negative. It is typically utilized in social networks like Twitter, where a positive edge indicates following and a negative edge indicates blocking or unfollowing. It can also be utilized in various science problems.
The heterogeneous graph introduces multiple node types, and consequently new relationship types, since edges can connect nodes of different types.
For example, a simple citation network can be represented with the simple graph formulation, where each node is a paper and each edge represents one paper citing another. However, the citation network becomes more complex when considering: (1) authors, who can have co-author relationships with each other and also write papers; (2) paper types, e.g., Data Mining, Artificial Intelligence, Computer Vision, and Natural Language Processing.
A heterogeneous graph \(\mathcal{G}\) consists of a set of nodes \(\mathcal{V}=\{v_1, \cdots, v_N \}\) and a set of edges \(\mathcal{E}=\{e_1, \cdots, e_M \}\). Additionally, there are two mapping functions \(\phi_n:\mathcal{V}\to\mathcal{T}_n\) and \(\phi_e:\mathcal{E}\to\mathcal{T}_e\) that map each node and each edge to their types, respectively, where \(\mathcal{T}_n\) and \(\mathcal{T}_e\) denote the sets of node and edge types.
The multidimensional graph describes multiple relationships that exist simultaneously between a pair of nodes. It differs from the signed and heterogeneous graphs in that neither of those allows multiple edges between a pair of nodes. A multidimensional graph consists of a set of \(N\) nodes \(\mathcal{V}= \{v_1, \cdots, v_N \}\) and \(D\) sets of edges \(\{\mathcal{E}_1, \cdots, \mathcal{E}_D \}\), where each edge set \(\mathcal{E}_d\) describes the \(d\)-th type of relation between nodes; the edge sets may intersect. It is typically utilized in social networks: users can "like", "retweet", and "comment" on a tweet, and each action corresponds to one relation between user and tweet. It can also be utilized in various science problems.
The hypergraph is introduced when relationships beyond a pair of nodes must be considered. A hypergraph \(\mathcal{G}\) consists of a set of \(N\) nodes \(\mathcal{V}= \{v_1, \cdots, v_N \}\) and a set of hyperedges \(\mathcal{E}\). The incidence matrix \(\mathbf{H} \in \mathbb{R}^{|\mathcal{V}|\times |\mathcal{E}| }\), rather than the adjacency matrix \(\mathbf{A}\), is utilized to describe the graph structure:
\[H_{i j} = \begin{cases} 1 & \text{if vertex } v_{i} \text{ is incident with edge } e_{j} \\ 0 & \text{otherwise.} \end{cases} \tag{1}\]It is typically utilized in academic networks, where nodes are papers: one author can publish more than one paper, which can be viewed as a hyperedge connecting multiple papers.
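A minimal sketch of the incidence matrix in Equation (1) for the academic-network example above (the paper and author lists are made up for illustration):

```python
import numpy as np

# Toy academic hypergraph: 5 papers, 2 authors; each author's publication
# list forms one hyperedge connecting all of that author's papers.
papers = ["p0", "p1", "p2", "p3", "p4"]
hyperedges = [{0, 1, 2},   # author A wrote p0, p1, p2
              {2, 3, 4}]   # author B wrote p2, p3, p4

# Incidence matrix H: H[i, j] = 1 iff paper v_i belongs to hyperedge e_j.
H = np.zeros((len(papers), len(hyperedges)), dtype=int)
for j, e in enumerate(hyperedges):
    for i in e:
        H[i, j] = 1
```

Note how paper `p2` sits in both hyperedges, something a plain adjacency matrix over papers could not express directly.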
The dynamic graph is introduced when the graph constantly evolves: new nodes and edges may be added, and existing ones may disappear. A dynamic graph \(\mathcal{G}\) consists of a set of \(N\) nodes \(\mathcal{V}= \{v_1, \cdots, v_N \}\) and a set of edges \(\mathcal{E}\), where each node and edge is associated with a timestamp indicating the time it emerged; two mapping functions \(\phi_v\) and \(\phi_e\) map each node and each edge to its timestamp, respectively. It is typically utilized in social networks such as Twitter, where new users arrive every day and users follow and unfollow others from time to time.
The knowledge graph is an important application in the graph domain. It comprises nodes and edges, where nodes \(\mathcal{V}\) represent entities (such as people, places, or objects) and edges \(\mathcal{E}\) represent relationships \(\mathcal{R}\) between these entities. These relationships can be diverse, including semantic relations (e.g., "is a" or "part of"), factual associations (e.g., "born in" or "works at"), and other contextual links. The graph-based structure allows for efficient querying and traversal of data, as well as the ability to infer new knowledge by leveraging existing connections.
A knowledge graph is thus a structured representation of information that models the relationships between entities, facts, and concepts in a comprehensive and interconnected way. It provides a flexible and efficient means of organizing, querying, and deriving insights from large volumes of data, making it a powerful tool for information retrieval and knowledge discovery. It is widely utilized in the Semantic Web, which enables machines to better understand and interact with web content by organizing information in a machine-readable format.
Remark: We briefly introduced different graph formulations in this subsection, but real-world cases can be more complicated. For example, a network in e-commerce can be a heterogeneous bipartite multidimensional graph: (1) heterogeneous: buyers and sellers can be different user types, and items also have different types; (2) bipartite: users only interact with items; (3) multidimensional: users can have different interactions with items, e.g., "buy", "add to shopping cart", and so on.
The graph formulations described in this subsection are prototypes: you can design the appropriate formulation for your own data, and then draw on recent progress for the corresponding graph type.
In this subsection, we provide a brief introduction to graph-related tasks to show how graphs can be utilized in different scenarios. We introduce the node classification, graph classification, link prediction, and graph generation tasks; most downstream tasks can be viewed as instances of these.
Node classification aims to identify which class a node belongs to by utilizing its own features, the adjacency matrix, and the features of other nodes. The task has numerous real-world applications: (1) social network analysis: nodes and edges represent individuals and social relationships; node classification can predict attributes such as interests, affiliation, or profession. (2) bioinformatics: nodes represent genes, proteins, or other biological entities, and edges represent interactions such as regulatory or metabolic relationships; node classification can predict node properties such as function, localization, or disease association. (3) cybersecurity: nodes represent computers, servers, or other network devices, and edges represent communication or access relationships; node classification can detect network attacks or anomalies such as malware, spam, or intrusion attempts.
Graph classification aims to identify which class a whole graph belongs to by exploiting both the graph structure and the node features. Image classification can be viewed as a special case: each pixel is a node with its RGB values as the node feature, and the graph structure is a grid connecting adjacent pixels. Graph classification has been broadly utilized in many real-world applications: (1) bioinformatics: classifying biological networks into categories, e.g., classifying protein-protein interaction networks by function or disease association, which can help identify potential drug targets, protein complexes, or pathways, and inform drug discovery. (2) chemistry: classifying chemical compounds into categories, e.g., by toxicity or therapeutic potential. (3) social network analysis: identifying the discussion topic of a tweet on Twitter.
Link prediction can be viewed as a binary classification task: predict whether a link exists between two nodes in the graph. It can complete the graph and uncover undiscovered relationships between nodes. Link prediction has been broadly utilized in many domains: (1) friend recommendation in social networks: Twitter can recommend users you may know or be interested in. (2) movie recommendation: Netflix recommends films you may be interested in. (3) bioinformatics: in biological networks, link prediction can estimate the likelihood of physical interactions between pairs of proteins based on their sequence similarity, domain composition, or other features, helping identify potential drug targets, protein complexes, or pathways, and informing drug discovery.
In contrast to the aforementioned tasks, graph generation solves a generative problem: given a dataset of graphs, learn to sample new graphs from the learned data distribution. Since graphs can represent many kinds of highly structured data, graph generation holds promise for design tasks in a variety of domains, such as molecular graph generation (drug and materials discovery), circuit network design, and indoor layout design.
In this section, we aim to introduce (1) Graph Neural Networks, which have become popular for learning graph representations by jointly leveraging attribute and graph structure information; (2) perspectives for understanding GNNs that connect GNN design to other domains, e.g., graph signal processing and the Weisfeiler-Lehman isomorphism test; and (3) traditional graph machine learning methods and structure-agnostic methods, which may perform even better than GNNs.
The design of the Graph Neural Network is inspired by the Convolutional Neural Network (CNN), one of the most widely used neural networks in the computer vision domain. A CNN leverages neighboring pixels to learn good representations: it extracts feature patterns by aggregating the neighboring pixels in a fixed-size receptive field, for example a receptive field with \(3\times 3\) neighboring pixels. To extend this success to graphs, researchers developed the Graph Neural Network, which raises two essential questions.
How to define the receptive field on graph since it is not a regular grid?
What feature patterns are useful on the graph?
These two questions lead to two crucial perspectives on designing Graph Neural Networks: the spectral and spatial perspectives, respectively. Before going into those details, we first define a general Graph Neural Network framework.
We introduce the general framework of GNNs for the most basic node-level task. We first recap some notation. We denote a graph as \(\mathcal{G}= \left \{ \mathcal{V}, \mathcal{E} \right \}\) (e.g., a molecule). The adjacency matrix and the associated features are denoted as \(\mathbf{A}\in \mathbb{R}^{N \times N}\) (e.g., bond type) and \(\mathbf{F}\in \mathbb{R}^{N \times d}\) (e.g., atom type), respectively, where \(N\) and \(d\) are the number of nodes and the feature dimension.
A general framework for Graph Neural Networks can be regarded as a composition of \(L\) graph filter layers, and \(L-1\) nonlinear activation layers. \(h_i\) and \(\alpha_i\) are utilized to denote the \(i\)-th graph filter layer, and activation layer, respectively. \(\mathbf{F}_i \in \mathbb{R}^{N\times d_i}\) denotes the output of the \(i\)-th graph filter layer \(h_i\). \(\mathbf{F}_0\) is initialized to be the raw node features \(\mathbf{F}\).
For an image with a regular grid structure, the receptive field is defined as the neighborhood of pixels around a central pixel. An example is illustrated in Fig. . So how can we define the receptive field on a graph with no regular structure? The answer is the neighboring nodes along the edges: the one-hop neighborhood of node \(v_i\) is \(\mathcal{N}_{v_i} = \left \{ v_j \mid (v_i, v_j) \in \mathcal{E}\right \}\). To adaptively extract neighborhood information, a large variety of spatial-based graph filters have been proposed. We introduce two typical spatial graph filter layers, GraphSAGE and GAT, in this section.
GraphSAGE [7]: The GraphSAGE model introduces a spatial-based filter that aggregates information from neighboring nodes. The hidden feature of node \(v_i\) is generated with the following steps.
Sample neighborhood nodes from the neighborhood set: \(\mathcal{N}_S(v_i)=\text{SAMPLE}(\mathcal{N}(v_i), S)\), where \(\text{SAMPLE}(\cdot)\) takes the neighborhood set as input and randomly samples \(S\) instances as output.
Extract information from the sampled neighbors: \(\mathbf{f}_i' = \text{AGGREGATE}( \left \{ \mathbf{F}_j, \forall v_j \in \mathcal{N}_S(v_i) \right \} )\), where \(\text{AGGREGATE}: \mathbb{R}^{S\times d} \to \mathbb{R}^{d}\) combines the information from the neighboring nodes.
Combine the neighborhood information with the ego information: \(\mathbf{F}'_i=\sigma \left ( [\mathbf{F}_i, \mathbf{f}'_i] \mathbf{\Theta} \right )\), where \([\cdot, \cdot]\) is the concatenation operation and \(\mathbf{\Theta}\) is the learnable parameter matrix.
The aggregation can be any set function; common aggregators include the mean and maximum aggregators, which take the element-wise mean and maximum. The sum aggregator was introduced later and has stronger expressive power.
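The three steps above can be sketched as follows. This is a toy NumPy illustration with a mean aggregator and a ReLU activation, not the reference GraphSAGE implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def graphsage_layer(F, neighbors, Theta, S=2):
    # One mean-aggregator GraphSAGE layer (toy sketch):
    # sample S neighbors, average their features, concatenate with the
    # ego feature, then apply a linear map Theta and a ReLU.
    N, d = F.shape
    out = np.empty((N, Theta.shape[1]))
    for i in range(N):
        nbrs = neighbors[i]
        sampled = rng.choice(nbrs, size=min(S, len(nbrs)), replace=False)
        f_agg = F[sampled].mean(axis=0)            # AGGREGATE (mean)
        h = np.concatenate([F[i], f_agg]) @ Theta  # combine ego + neighborhood
        out[i] = np.maximum(h, 0)                  # ReLU
    return out

d, d_out = 4, 3
F = rng.normal(size=(3, d))
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}
Theta = rng.normal(size=(2 * d, d_out))
F1 = graphsage_layer(F, neighbors, Theta)
```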
GAT [8]: The Graph Attention Network (GAT) is inspired by the self-attention mechanism. GAT adaptively aggregates the neighborhood information based on the attention score. The hidden feature for node \(v_i\) is generated with the following steps.
Generate the attention score with each neighboring node: \(e_{ij}=a(\mathbf{F}_i\mathbf{\Theta}, \mathbf{F}_j\mathbf{\Theta})=\text{LeakyReLU} (\mathbf{a}^T \left [ \mathbf{F}_i\mathbf{\Theta}, \mathbf{F}_j\mathbf{\Theta} \right ])\) for \(v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}\), where \(\mathbf{a}\) is a learnable weight vector.
Normalize the attention scores via softmax: \(\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{v_k \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \exp(e_{ik})}\)
Aggregate the weighted information from the neighborhood: \(\mathbf{F}'_i = \sum_{v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \alpha_{ij}\mathbf{F}_j\mathbf{\Theta}\)
Multi-head attention implementation: \(\mathbf{F}'_i = ||_{m=1}^M \sum_{v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \alpha_{ij}^m \mathbf{F}_j \mathbf{\Theta}^m\), where \(||\) is the concatenation operator and \(M\) is the number of heads.
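The steps above can be sketched for a single head as follows, a toy NumPy illustration (not the reference GAT implementation):

```python
import numpy as np

def gat_scores(F, i, nbrs, Theta, a):
    # Single-head GAT attention for node i (toy sketch):
    # LeakyReLU scores over {i} U N(i), followed by a softmax.
    idx = [i] + list(nbrs)
    H = F[idx] @ Theta                                  # transformed features
    hi = F[i] @ Theta
    e = np.array([np.concatenate([hi, hj]) @ a for hj in H])
    e = np.where(e > 0, e, 0.2 * e)                     # LeakyReLU (slope 0.2)
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax normalization
    return idx, alpha

rng = np.random.default_rng(1)
F = rng.normal(size=(4, 5))
Theta = rng.normal(size=(5, 3))
a = rng.normal(size=6)
idx, alpha = gat_scores(F, 0, [1, 3], Theta, a)
h_new = alpha @ (F[idx] @ Theta)                        # weighted aggregation
```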
Notice that the key difference between GAT and the self-attention mechanism is that self-attention is conducted over all nodes, while GAT is conducted only over the neighboring nodes. More discussion can be found in the next section.
Spectral-based graph filters mainly utilize spectral graph theory to develop filter operations in the spectral domain. We only provide some motivation for spectral-based graph filters, without the mathematical details.
The motivation behind spectral graph filters is that neighboring nodes in a graph should have similar representations. In the context of spectral graph theory, neighborhood similarity corresponds to the low-frequency components, i.e., signals over the graph that change slowly or gradually; high-frequency components correspond to rapid or abrupt changes. By focusing on the low-frequency components, spectral graph filters can capture the underlying smooth variations over the graph topology, which is useful for various tasks, e.g., node classification, link prediction, and graph clustering.
In other words, spectral graph filters aim to identify feature patterns that are smooth and do not vary significantly across different nodes. It corresponds to the low-frequency components of the graph structure based on spectral graph theory.
GCN [9]: We only provide a brief introduction on the formulation of the Graph Convolutional Network (GCN). A more comprehensive study can be found in Section 5.3.1 of Deep Learning on Graphs [10].
The aggregation function of GCN is defined as: \(\mathbf{F}'= \sigma( \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{F}\mathbf{\Theta}) \tag{2}\) where \(\sigma\) is the activation function, \(\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}\) is the adjacency matrix with added self-loops, \(\tilde{\mathbf{D}}\) is its degree matrix, and \(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}\) is the symmetrically normalized adjacency matrix.
The aggregation for each node can equivalently be written edge-wise as: \(\mathbf{F}'_i = \sum_{v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \frac{1}{\sqrt{\tilde{d}_i\tilde{d}_j}} \mathbf{F}_j\mathbf{\Theta}\tag{3}\) where \(\tilde{d}_i\) is the degree of node \(v_i\) with self-loops.
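A minimal NumPy sketch of one GCN layer following Equation (2) (illustrative only, with a ReLU as the activation):

```python
import numpy as np

def gcn_layer(A, F, Theta):
    # One GCN layer: add self-loops, symmetrically normalize,
    # then aggregate and apply a ReLU, matching Equation (2).
    A_tilde = A + np.eye(A.shape[0])                # A + I (self-loops)
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt       # D^{-1/2} (A+I) D^{-1/2}
    return np.maximum(A_hat @ F @ Theta, 0)         # ReLU activation

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
F = np.arange(6, dtype=float).reshape(3, 2)
Theta = np.eye(2)
out = gcn_layer(A, F, Theta)
```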
The above discussion focused on GNN designs for the simple graph with a single node and edge type. The Message Passing Neural Network (MPNN) was then proposed as a more general framework that covers the entire design space of GNNs. Concretely, MPNNs operate on graphs by (1) generating messages between nodes based on their local neighborhoods and (2) iteratively aggregating messages from neighboring nodes; in this way, MPNNs can learn powerful graph representations for various downstream tasks.
The Message Passing filter can be defined as: \(h_{i}^{\ell+1}=\phi\left(h_{i}^{\ell}, \oplus_{j \in \mathcal{N}_{i}}\left(\psi\left(h_{i}^{\ell}, h_{j}^{\ell}, e_{i j}\right)\right)\right)\tag{4}\) where \(\phi\), \(\psi\) are Multi-Layer Perceptrons (MLPs), and \(\oplus\) is a permutation-invariant local neighborhood aggregation function such as summation, maximization, or averaging.
Focusing on one particular node \(i\), the MPNN layer can be decomposed into three steps as:
Message: For each pair of linked nodes \(i\), \(j\), the network first computes a message \(m_{i j}=\psi\left(h_{i}^{\ell}, h_{j}^{\ell}, e_{i j}\right)\) The MLP \(\psi: \mathbb{R}^{2d+d_e}\to \mathbb{R}^{d}\) takes as input the concatenation of the feature vectors from the source node, target node, and edge feature.
Aggregate: At each source node \(i\), the incoming messages from all its neighbors (target node) are then aggregated as \(m_{i}=\oplus_{j \in \mathcal{N}_{i}}\left(m_{i j}\right)\)
Update: Finally, the network updates the node feature vector \(h_{i}^{\ell+1}=\phi\left(h_{i}^{\ell}, m_{i}\right)\) by concatenating the aggregated message \(m_i\) and the previous node feature vector \(h_i^{\ell}\), and passing them through an MLP \(\phi: \mathbb{R}^{2 d} \rightarrow \mathbb{R}^{d}\).
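The Message/Aggregate/Update decomposition above can be sketched as follows; here `psi` and `phi` are toy stand-ins for the MLPs of the framework, and sum is the permutation-invariant aggregator:

```python
import numpy as np

def mpnn_layer(F, edges, psi, phi):
    # Message: psi(h_i, h_j) for each edge (node i receives from node j).
    # Aggregate: permutation-invariant sum over incoming messages.
    # Update: phi(h_i, m_i) produces the new node feature.
    M = np.zeros_like(F)
    for i, j in edges:
        M[i] += psi(F[i], F[j])
    return np.array([phi(F[i], M[i]) for i in range(len(F))])

# Toy choices standing in for the MLPs psi and phi of the framework.
psi = lambda hi, hj: hj                     # message = neighbor feature
phi = lambda hi, mi: 0.5 * hi + 0.5 * mi    # mix ego feature and message

F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
edges = [(0, 1), (1, 0), (0, 2), (2, 0)]    # (receiver, sender) pairs
F_new = mpnn_layer(F, edges, psi, phi)
```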
A function \(f\) is said to be equivariant if for any transformation \(\tau\) of the input space \(X\), and any input \(x\in X\), we have: \(f(\tau(x)) = \tau(f(x))\). In other words, applying the transformation \(\tau\) to the input has the same effect as applying it to the output. A function \(f\) is said to be invariant if for any transformation \(\tau\) of the input space \(X\), and any input \(x\in X\), we have: \(f(\tau(x)) = f(x)\). In other words, applying the transformation \(\tau\) to the input does not change the output.
In the context of GNNs, we want to achieve permutation-equivariance or permutation-invariance, which means that the function should be equivariant or invariant to permutations of the input graph. We can express this mathematically by defining a permutation \(\sigma\) of the nodes of the input graph \(G=(V,E)\), and requiring that the output of the GNN is the same regardless of the permutation: \(f(G) = f(\sigma(G))\), where \(\sigma(G)\) is the graph obtained by applying the permutation \(\sigma\) to the nodes of \(G\).
The expressiveness of Graph Neural Networks is closely related to the graph isomorphism test. An expressive GNN should map isomorphic graphs to the same representation and distinguish non-isomorphic graphs with different representations.
The Weisfeiler-Lehman (WL) test is a popular heuristic for graph isomorphism, i.e., for determining whether two graphs have the same underlying structure up to a relabeling of the nodes. The intuition is that if two graphs are isomorphic, their structures should match across all hops of neighborhoods, from one-hop neighborhoods to the global structure of the entire graph. The algorithm iterates two steps: (1) aggregation: collect the multiset of neighbor node labels; (2) labeling: assign each node a new label based on its own label and the collected neighbor labels. The WL test repeats this process until convergence (node labels no longer change). Two graphs are identified as non-isomorphic if their sequences of refined label multisets differ. The WL test is widely utilized in different domains since it is efficient, with time complexity \(O(n \log n)\), where \(n\) is the number of nodes. More recently, it has been widely used to analyze the expressiveness of GNNs.
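A toy sketch of WL refinement on two small graphs; Python's built-in `hash` stands in for an injective label-compression table (real implementations use exact relabeling, and hash collisions are ignored here):

```python
def wl_multiset(adj, labels, iters=2):
    # One WL round: relabel each node by (own label, sorted neighbor labels).
    # Comparing the final label multisets distinguishes non-isomorphic graphs.
    for _ in range(iters):
        labels = {v: hash((labels[v], tuple(sorted(labels[u] for u in adj[v]))))
                  for v in adj}
    return sorted(labels.values())

triangle  = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
triangle2 = {0: [2, 1], 1: [2, 0], 2: [1, 0]}   # same graph, neighbors reordered
path      = {0: [1],    1: [0, 2], 2: [1]}
init = {0: 0, 1: 0, 2: 0}                       # uniform starting labels

same_tri_path = wl_multiset(triangle, init) == wl_multiset(path, init)
same_tri_tri  = wl_multiset(triangle, init) == wl_multiset(triangle2, init)
```

With uniform starting labels, the triangle and the 3-node path are separated after one round, while the two triangle encodings remain indistinguishable, as expected.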
Graph Neural Networks and Transformer architectures are typically two popular model architectures to leverage the context information. Connections can be found between those two architectures.
One self-attention block in the Transformer can be defined as: \(\begin{array}{c} h_{i}^{\ell+1}=\operatorname{Attention}\left(Q^{\ell} h_{i}^{\ell}, K^{\ell} h_{j}^{\ell}, V^{\ell} h_{j}^{\ell}\right), \\ i.e., h_{i}^{\ell+1}=\sum_{j \in \mathcal{S}} w_{i j}\left(V^{\ell} h_{j}^{\ell}\right), \end{array}\) where \(w_{i j}=\operatorname{softmax}_{j}\left(Q^{\ell} h_{i}^{\ell} \cdot K^{\ell} h_{j}^{\ell}\right)\), \(j\in \mathcal{S}\) denotes the set of words in the sentence \(\mathcal{S}\), and \(Q^{\ell}, K^{\ell}, V^{\ell}\) are learnable linear weights denoting the Query, Key, and Value of the attention, respectively. One update of each word embedding can be viewed as a weighted aggregation of all word embeddings in the sentence. An illustration of the self-attention block in the Transformer is shown in Fig. 4(b).
One Graph Neural Network block can be defined as follows: \(h_{i}^{\ell+1}=\sigma\left(U^{\ell} h_{i}^{\ell}+\sum_{j \in \mathcal{N}(i)}\left(V^{\ell} h_{j}^{\ell}\right)\right),\tag{5}\) where \(U^{\ell}, V^{\ell}\) are learnable transformation matrices of the GNN layer and \(\sigma\) is the non-linear activation function. One update of the hidden representation \(h_i\) of node \(i\) at layer \(\ell\) can be viewed as a weighted aggregation of the neighboring nodes' representations, \(j\in \mathcal{N}(i)\).
An illustration of GNN block is shown in Fig. 4(a)
Figure 4: (a) GNN block; (b) Transformer block.
The key difference between the Graph Neural Network and the Transformer is that a GNN aggregates only over the neighboring nodes, while a Transformer aggregates over all words in the sentence; in other words, a Transformer can be viewed as a GNN operating on a fully-connected word graph. Both architectures aim to learn good representations by incorporating context information: the Transformer treats all words in a sentence as useful context, while the GNN treats only the neighboring nodes as useful.
Graph signal denoising [12] offers a new perspective that creates a uniform understanding of representative aggregation operations.
Graph signal denoising recovers a clean signal from a noisy one by solving the following optimization problem: \(\arg \min_F \mathcal{L}=\|F-S\|_F^2 + c \cdot \text{tr}(F^TLF) \tag{6}\) where \(S\in \mathbb{R}^{N\times d}\) is a noisy signal (the input features) on graph \(\mathcal{G}\) and \(F\in \mathbb{R}^{N\times d}\) is the clean signal, assumed to be smooth over \(\mathcal{G}\).
The first term guides \(F\) to stay close to \(S\), while the second term \(\text{tr}(F^TLF)\) is a Laplacian regularization encouraging \(F\) to be smooth over \(\mathcal{G}\), with \(c > 0\) mediating the trade-off. Adopting the unnormalized Laplacian \(L = D - A\) (with a binary adjacency matrix \(A\)), the second term can be written in an edge-centric way as: \(c \sum_{(i,j)\in \mathcal{E}} \|F_i-F_j\|_2^2\tag{7}\) which encourages connected nodes to share similar features.
We show the connection between graph signal denoising and GCN as an example here. The gradient of \(\mathcal{L}\) with respect to \(F\), evaluated at \(F = S\), is \(\left.\frac{\partial \mathcal{L}}{\partial F}\right|_{F = S} = 2cLS.\) Hence, one step of gradient descent for the graph signal denoising problem in Equation (6) can be described as: \(F \leftarrow S - b\left. \frac{\partial \mathcal{L}}{\partial F} \right|_{F = S} = S - 2bcLS = (1-2bc)S + 2bc\tilde{A}S, \tag{8}\) where the last equality uses the normalized Laplacian \(L = I - \tilde{A}\). When the stepsize \(b=\frac{1}{2c}\) and \(S=X'\), we have \(F \leftarrow \tilde{A}X'\), which is the same as the aggregation operation of GCN. This provides a new perspective that understands existing GNNs as a tradeoff between preserving the original features and smoothing over the neighborhood. Moreover, it inspires us to derive new Graph Neural Networks from different graph signal processing methods.
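The derivation can be checked numerically. The sketch below (plain numpy, assuming the renormalized adjacency \(\tilde{A}\) with self-loops and the normalized Laplacian \(L = I - \tilde{A}\)) verifies that one gradient step with \(b = 1/(2c)\) reduces to \(\tilde{A}S\):

```python
import numpy as np

np.random.seed(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
A_hat = A + np.eye(3)                         # add self-loops
D_inv_sqrt = np.diag(A_hat.sum(1) ** -0.5)
A_tilde = D_inv_sqrt @ A_hat @ D_inv_sqrt     # renormalized adjacency
L = np.eye(3) - A_tilde                       # normalized Laplacian

S = np.random.randn(3, 2)                     # noisy input signal
c = 0.7
b = 1.0 / (2.0 * c)                           # the special stepsize

grad_at_S = 2 * c * L @ S                     # dL/dF evaluated at F = S
F = S - b * grad_at_S                         # one gradient-descent step
assert np.allclose(F, A_tilde @ S)            # matches GCN aggregation
```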
A new physics-inspired perspective is to understand a Graph Neural Network as a discrete dynamical system of particles [13]. Each node in the graph corresponds to one particle, while each edge represents a pair-wise interaction between nodes. Positive and negative interactions between nodes can be interpreted as attraction and repulsion between particles, respectively.
To view a Graph Neural Network as a discrete dynamical system, one regards the layer-by-layer forward pass of the input as the evolution of the input under a system of differential equations: each discrete time step of the dynamical system corresponds to one layer of the forward process.
Gradient flow is a special type of evolution equation of the form \(\dot{X}(t)=- \nabla \mathcal{E}(X(t))\tag{9}\) where \(\mathcal{E}\) is an energy functional, which can differ across GNNs. The gradient flow makes \(\mathcal{E}\) decrease monotonically during the evolution.
A simple GNN can be viewed as the gradient flow of the Dirichlet energy \(\mathcal{E}^{\text{DIR}} =\frac{1}{2} \text{trace}(X^TLX)\tag{10}\) The Dirichlet energy measures the smoothness of the features on the graph. In the limit \(t\to \infty\), the node features become so smooth that all the nodes become the same, meaning the system loses the information contained in the input features. This phenomenon is called ‘oversmoothing’ in the GNN literature.
To design better Graph Neural Networks that overcome drawbacks like oversmoothing, we can parametrize an energy and derive a GNN as its discretized gradient flow. This offers better interpretability and leads to more effective architectures.
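A small numerical sketch of the oversmoothing picture above (plain numpy; a non-parametric aggregation stands in for a trained GNN layer, which is an assumption): repeated aggregation drives the Dirichlet energy monotonically toward zero, i.e., node features become indistinguishable.

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)     # path graph on 4 nodes
A_hat = A + np.eye(4)
D_inv_sqrt = np.diag(A_hat.sum(1) ** -0.5)
A_tilde = D_inv_sqrt @ A_hat @ D_inv_sqrt     # renormalized adjacency
L = np.eye(4) - A_tilde                       # normalized Laplacian

def dirichlet_energy(X):
    return 0.5 * np.trace(X.T @ L @ X)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))
energies = [dirichlet_energy(X)]
for _ in range(100):
    X = A_tilde @ X                           # one "layer" of aggregation
    energies.append(dirichlet_energy(X))

# energy is monotonically non-increasing and collapses toward 0: oversmoothing
assert all(e1 <= e0 + 1e-12 for e0, e1 in zip(energies, energies[1:]))
assert energies[-1] < 1e-4
```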
Dynamic programming on graphs is a technique that solves problems by breaking them down into smaller subproblems and finding optimal solutions to those subproblems. This approach can be used to solve a wide range of problems on graphs, including shortest-path, maximum-flow, and minimum-spanning-tree problems. It shares a similar idea with the aggregation operation in GNNs, which recursively combines information from neighboring nodes to update the representation of a given node. In dynamic programming, the combination of information is typically done by recursively solving subproblems and building up a solution to a larger problem. Similarly, in GNN aggregation, neighboring node information is combined through various aggregation functions (e.g., mean, max, sum), and the updated node representation is then passed to subsequent layers in the network. In both cases, the goal is to efficiently compute a global solution by leveraging local information from neighboring nodes. However, vanilla GNNs cannot solve most dynamic programming problems, e.g., the shortest-path algorithm and the generalized Bellman-Ford algorithm, without capturing the underlying logic and structure of the corresponding problem. To empower GNNs with reasoning ability in dynamic programming, multiple operators have been proposed to generalize the operations of dynamic programming to neural networks, e.g., the sum generalizes to a commutative summation operator \(\oplus\) and the product generalizes to a Hadamard product operator \(\otimes\). GNNs can then be extended with different dynamic programming algorithms with improved generalization ability. A simple example of a Graph Neural Network extended to the Bellman-Ford algorithm can be found in Figure 4.
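As a minimal illustration of the analogy (a plain Bellman-Ford in Python using the min/+ semiring, not the learned \(\oplus\)/\(\otimes\) operators themselves), each relaxation round is a message-passing "layer": every node receives messages `dist[u] + w` along incoming edges and aggregates them with `min`.

```python
INF = float("inf")
# a hypothetical directed, weighted toy graph: (u, v) -> edge weight
edges = {(0, 1): 4.0, (0, 2): 1.0, (2, 1): 2.0, (1, 3): 1.0}

def bellman_ford(n, edges, source):
    dist = [INF] * n
    dist[source] = 0.0
    for _ in range(n - 1):                  # n-1 rounds of "layers"
        new = dist[:]
        for (u, v), w in edges.items():
            # message from u to v is dist[u] + w ("product" = +);
            # aggregation over messages is min ("sum" = min)
            new[v] = min(new[v], dist[u] + w)
        dist = new
    return dist

dist = bellman_ford(4, edges, source=0)
assert dist == [0.0, 3.0, 1.0, 4.0]   # 0 -> 2 -> 1 -> 3 is the shortest route
```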
Graph Neural Networks are well recognized as a powerful method for machine learning on graphs. However, GNNs are still not the dominant method in the graph domain. Traditional machine learning methods on graphs and non-graph methods still show advantages over Graph Neural Networks; they hold an important position in graph research and inspire the design of new Graph Neural Networks. In this section, we will introduce some important machine learning methods beyond GNNs, including graph kernel methods for graph classification, label propagation for node classification, and heuristic methods for link prediction.
Label propagation is a simple but effective method for node classification on graphs. It is a semi-supervised learning technique that leverages the idea that nodes connected in a graph are likely to share the same label or class. For example, it can be applied to a network of people with two labels, "interested in cricket" and "not interested in cricket", where we only know the interests of a few people and aim to predict the interests of the remaining unlabeled nodes.
The procedure of label propagation is as follows. Let \(A\) be the \(n \times n\) adjacency matrix of the graph, where \(A_{ij}\) is 1 if there is an edge between nodes \(i\) and \(j\), and 0 otherwise. Let \(Y\) be the \(n \times c\) matrix of node labels, where \(Y_{ij}\) is 1 if node \(i\) belongs to class \(j\), and 0 otherwise. Let \(F\) be the \(n \times c\) matrix of label distributions, where \(F_{ij}^{(t)}\) is the probability of node \(i\) belonging to class \(j\) at iteration \(t\).
At each iteration \(t\), the label distribution \(F^{(t)}\) is updated based on the label distributions of the neighboring nodes as follows:
\(F^{(t)}=D^{-1}AF^{(t-1)}\tag{11}\) where \(D\) is the diagonal degree matrix of the graph, with \(D_{ii} = \sum_j A_{ij}\).
After a certain number of iterations or when the label distributions converge, the labels of the nodes are assigned according to the label distribution with the highest probability:
\[Y_i = \arg\max_j F^{(t)}_{ij}\tag{12}\]The propagation is repeated until the labels converge to a stable state or a stopping criterion is met.
Equivalently, the propagation after \(t\) steps can be written in closed form: \(\hat{\mathbf{Y}}=(\mathbf{D}^{-1}\mathbf{A})^t\mathbf{Y}\tag{13}\) where \(\mathbf{D}\) and \(\mathbf{A}\) are the degree matrix and adjacency matrix, respectively, and \(t\) is the number of propagation steps. \(\mathbf{Y}=\begin{bmatrix} \mathbf{Y}_l \\ \mathbf{0} \end{bmatrix}\) stacks the known labels \(\mathbf{Y}_l\) with zeros for the unlabeled nodes, and \(\mathbf{D}^{-1}\mathbf{A}\) is the transition matrix.
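A minimal numpy sketch of the iteration above, on a hypothetical toy graph of two communities joined by a single edge, with one labeled node per community (known labels are clamped after every propagation step, a common variant):

```python
import numpy as np

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
D_inv = np.diag(1.0 / A.sum(axis=1))
P = D_inv @ A                                  # row-stochastic transition matrix

labeled = {0: 0, 5: 1}                         # node -> known class
F = np.zeros((6, 2))
for node, cls in labeled.items():
    F[node, cls] = 1.0

for _ in range(50):
    F = P @ F                                  # propagate label distributions
    for node, cls in labeled.items():          # clamp the known labels
        F[node] = 0.0
        F[node, cls] = 1.0

pred = F.argmax(axis=1)
assert list(pred) == [0, 0, 0, 1, 1, 1]        # each community adopts its seed label
```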
A graph kernel method measures the similarity between two graphs with a kernel function that corresponds to an inner product in a reproducing kernel Hilbert space (RKHS). Kernel methods are widely utilized in Support Vector Machines; they allow us to model higher-order features of the original feature space without computing the coordinates of the data in a higher-dimensional space. Graph kernel methods face an additional challenge beyond general kernel methods: how to encode similarity of graph structure. The design of graph kernel methods therefore focuses on finding suitable graph patterns for measuring similarity. We will briefly introduce subgraph patterns and path patterns in graph kernels.
Graph kernels based on subgraphs aim to find shared subgraphs between graphs: two graphs sharing more subgraphs are more similar. The subgraph set can be defined by graphlets, the induced, non-isomorphic subgraphs of node size \(k\). An illustration can be found in Fig. 3. A pattern count vector \(\mathbf{f}\) is calculated, where the \(i^{\text{th}}\) component denotes the frequency with which subgraph pattern \(i\) occurs.
The graph kernel can then be defined as: \(\mathcal{K}_{\text{GK}}(\mathcal{G}, \mathcal{G}')= \left \langle \mathbf{f}^{\mathcal{G}}, \mathbf{f}^{\mathcal{G}'} \right \rangle\tag{14}\) where \(\mathcal{G}\) and \(\mathcal{G}'\) are two graphs, and \(\left \langle \cdot, \cdot \right \rangle\) denotes the Euclidean dot product.
Graph kernels based on paths decompose a graph into paths. They take the co-occurrence of walks on two graphs to calculate similarity. Different from subgraph-based methods, which focus on the graph structure, path-based methods also take the node labels in the graph into consideration. The shortest-path variant counts all shortest paths in graph \(\mathcal{G}\), denoted as triplets \(p_i=(l_s^i, l_e^i, n_k )\), where \(n_k\) is the length of the path, and \(l_s^i\) and \(l_e^i\) are the labels of the starting and ending vertices, respectively.
Similarly, the graph kernel can be defined as: \(\mathcal{K}_{\text{GK}}(\mathcal{G}, \mathcal{G}')= \left \langle \mathbf{f}^{\mathcal{G}}, \mathbf{f}^{\mathcal{G}'} \right \rangle\tag{15}\) where the \(i^{\text{th}}\) component of \(\mathbf{f}\) denotes the frequency with which triplet \(i\) occurs.
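A toy sketch of the subgraph idea with size-3 graphlets (for \(k=3\), the number of edges in the induced subgraph uniquely identifies the isomorphism type, which keeps the counting trivial; real graphlet kernels handle larger \(k\) and normalize the counts):

```python
import itertools
import numpy as np

def graphlet3_counts(A):
    # Count induced 3-node subgraphs, bucketed by edge count (0, 1, 2, or 3).
    n = A.shape[0]
    counts = np.zeros(4)
    for i, j, k in itertools.combinations(range(n), 3):
        edges = int(A[i, j] + A[i, k] + A[j, k])
        counts[edges] += 1
    return counts

def graphlet_kernel(A1, A2):
    # Kernel value = dot product of the pattern-count vectors.
    return float(graphlet3_counts(A1) @ graphlet3_counts(A2))

triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
path3    = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

assert list(graphlet3_counts(triangle)) == [0, 0, 0, 1]  # one triangle pattern
assert list(graphlet3_counts(path3)) == [0, 0, 1, 0]     # one 2-edge path pattern
assert graphlet_kernel(triangle, path3) == 0.0           # no shared patterns
```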
Heuristic methods, e.g., Common Neighbors, utilize the graph structure to estimate the likelihood of the existence of links. We will briefly introduce some basic heuristic methods, including common neighbors, the Jaccard score, preferential attachment, and the Katz index. Let \(\Gamma(x)\) denote the set of neighbor nodes of \(x\), and let \(x\) and \(y\) denote two different nodes.
Common Neighbors (CN): The Common Neighbors algorithm considers two nodes with more overlapping neighbors to be more likely to be connected. It calculates the size of the intersection of the neighbor sets of nodes \(x\) and \(y\): \(f_{\text{CN}}(x,y)=| \Gamma(x) \cap \Gamma(y) |\tag{16}\)
Jaccard score: The Jaccard score can be viewed as a normalized Common Neighbors score, where the normalization factor is the size of the union of the neighbor sets: \(f_{\text{Jaccard}}(x,y)=\frac{| \Gamma(x) \cap \Gamma(y) |}{| \Gamma(x) \cup \Gamma(y) |}\tag{17}\)
Preferential attachment (PA): Preferential attachment considers that nodes with higher degrees are more likely to be connected. It calculates the product of the node degrees: \(f_{\text{PA}}(x,y)=| \Gamma(x) | \times | \Gamma(y) |\tag{18}\)
Katz index: The Katz index takes higher-order neighborhoods into consideration, in contrast to the above algorithms based on the one-hop neighborhood. It considers that nodes connected by more short paths are more likely to be connected, and calculates a weighted sum over all the walks between \(x\) and \(y\): \(f_{\text{Katz}}(x,y)= \sum_{l=1}^{\infty}\beta^l |\text{walks}^{\left \langle l \right \rangle }(x,y)|\tag{19}\) where \(\beta\) is a decaying factor between 0 and 1, which gives smaller weight to longer paths, and \(|\text{walks}^{\left \langle l \right \rangle }(x,y)|\) counts the walks of length \(l\) between \(x\) and \(y\).
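The four heuristics can be sketched in a few lines of numpy (assuming a binary, symmetric adjacency matrix; the Katz series is summed in closed form as \((I-\beta A)^{-1} - I\), which is valid when \(\beta\) is below the reciprocal of the largest eigenvalue of \(A\)):

```python
import numpy as np

def common_neighbors(A, x, y):
    return float(np.minimum(A[x], A[y]).sum())   # |Γ(x) ∩ Γ(y)| for binary A

def jaccard(A, x, y):
    inter = np.minimum(A[x], A[y]).sum()
    union = np.maximum(A[x], A[y]).sum()
    return float(inter / union)

def preferential_attachment(A, x, y):
    return float(A[x].sum() * A[y].sum())

def katz(A, x, y, beta=0.05):
    # closed form of the series: sum_l beta^l A^l = (I - beta A)^{-1} - I
    n = A.shape[0]
    K = np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)
    return float(K[x, y])

# toy 4-node graph: a square 0-1-2-3 plus the diagonal 0-2
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

assert common_neighbors(A, 1, 3) == 2.0          # shared neighbors: nodes 0 and 2
assert jaccard(A, 1, 3) == 1.0                   # Γ(1) = Γ(3) = {0, 2}
assert preferential_attachment(A, 1, 3) == 4.0
assert katz(A, 1, 3) > katz(A, 1, 3, beta=0.01)  # larger beta weights longer walks more
```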
In this section, we will first introduce some general tips for applying graph machine learning to scientific discovery, followed by two successful examples in molecular science and social science.
If your task focuses on a single large graph, you may run into out-of-memory issues. We suggest (1) utilizing sampling strategies, and (2) using fewer propagation layers so as not to involve too many neighbors.
If your task focuses on many small graphs, time efficiency may be an issue, since GNNs can be very slow on mini-batched graph tasks.
Features matter: if your graph nodes do not have features, you can construct features manually. Some suggested features are the node degree, Laplacian eigenvectors, and DeepWalk embeddings.
Feature normalization may heavily influence the performance of GNN models.
Adding self-loops may provide additional gains for your model.
The performance on a single data split may not be reliable; try different data splits for reliable results.
If your data does not naturally have a graph structure, it may not be necessary to construct a graph manually just to apply GNN methods.
GNNs are permutation-equivariant neural networks. They may not work well on tasks that require other geometric properties, or where nodes carry additional identity information.
Algorithm 1: Circular fingerprints
Input: molecule, radius R, fingerprint length S
Initialize: fingerprint vector f ← 0_S
for each atom a in molecule:
    r_a ← g(a)                      # lookup atom features
for L = 1 to R:
    for each atom a in molecule:
        r_1 ... r_N = neighbors(a)
        v ← [r_a, r_1, ..., r_N]    # concatenate
        r_a ← hash(v)               # hash function
        i ← mod(r_a, S)             # convert to index
        f_i ← 1                     # write 1 at index
Return: binary vector f
Algorithm 2: Neural graph fingerprints
Input: molecule, radius R, hidden weights H_1^1 ... H_R^5, output weights W_1 ... W_R
Initialize: fingerprint vector f ← 0_S
for each atom a in molecule:
    r_a ← g(a)                      # lookup atom features
for L = 1 to R:
    for each atom a in molecule:
        r_1 ... r_N = neighbors(a)
        v ← r_a + Σ_{i=1}^{N} r_i   # sum
        r_a ← σ(v H_L^N)            # smooth function
        i ← softmax(r_a W_L)        # sparsify
        f ← f + i                   # add to fingerprint
Return: real-valued vector f
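A runnable Python sketch in the spirit of Algorithm 1 (the "molecule" is a hypothetical hand-written adjacency list; a real implementation would parse structures with a cheminformatics toolkit such as RDKit):

```python
import numpy as np

def circular_fingerprint(atoms, neighbors, R, S):
    """Hashed circular fingerprint: atoms maps id -> symbol,
    neighbors maps id -> list of neighbor ids."""
    f = np.zeros(S, dtype=int)
    r = {a: hash(sym) for a, sym in atoms.items()}    # initial atom features g(a)
    for _ in range(R):
        new_r = {}
        for a in atoms:
            # combine own feature with (sorted) neighbor features, then hash
            v = (r[a],) + tuple(sorted(r[n] for n in neighbors[a]))
            new_r[a] = hash(v)
            f[new_r[a] % S] = 1                       # set the indexed bit
        r = new_r
    return f

# ethanol-like toy graph: C-C-O
atoms = {0: "C", 1: "C", 2: "O"}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
fp = circular_fingerprint(atoms, neighbors, R=2, S=64)
assert fp.shape == (64,) and fp.sum() >= 1
```

Replacing `hash` with a learned smooth transformation, and the bit-set with a softmax write, turns this sketch into Algorithm 2.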
Molecules are among the most common applications of graph neural networks, especially message passing neural networks. Molecules are naturally graph objects, and GNNs provide a compact way to learn representations on molecular graphs. This line of work was opened up by the seminal NEF work [16], which built a neat connection between the process of constructing the most commonly used structure representation (molecular fingerprints) and graph convolutions, as shown in Algorithms 1 and 2. It is worth noting that the commonly used string encoding for molecules (SMILES, the Simplified Molecular Input Line Entry System) can be considered a parsing tree (an implicit graph representation) defined by a grammar.
There are mainly two branches of problems that have been explored extensively with graph representations and graph neural networks: (1) predictive tasks and (2) generative tasks. A predictive task answers a specific question about certain molecules, such as the toxicity or energy of any given molecule. This is particularly beneficial for applications like virtual screening, which would otherwise require experiments to obtain the properties of molecules. A generative task, on the other hand, aims to design and discover new molecules with certain interesting properties, which is also called molecular inverse design. For predictive tasks, graph representation provides an efficient and effective way to encode the structure of molecules, leading to better performance on downstream tasks of interest. For generative tasks, graph representation enables us to design the generative process in a more flexible way, since a graph representation can be mapped to a molecule deterministically.
Another research hot spot in modeling molecules with graphs is molecular pre-training, which arises from real-world applications. As the chemical space is gigantic (estimated to be \(10^{23}\) to \(10^{60}\) for small drug-like molecules), the explored areas are very limited. However, we have much more access to molecular structures without property annotations. This motivates research into leveraging unlabeled molecular structures to learn general and transferable representations that can be fine-tuned on any task, even with only a small amount of labeled data.
Last but not least, the work we briefly discussed above is mostly about small drug-like molecules. However, graph representations are much more widely applied to a variety of molecules, such as proteins and RNAs (large bio-molecules), crystal structures and materials (with periodicity), etc. Also, we mainly focus on 2D graph representations in this blog; we defer discussions of 3D graph representations to a later blog.
Graphs are naturally well-suited as a mathematical formalism for describing and understanding social systems, which usually involve a number of people and their interpersonal relationships or interactions. The most well-known practice in this regard is the concept of social networks, where each person is represented by a vertex (node), and the interaction or relationship between two persons, if any, is represented by an edge (link).
The practice of using graphs to study social systems dates back to the 1930s when Jacob Moreno, a pioneering psychiatrist and educator, proposed the graphic representation of a person’s social link, known as the sociogram [20]. The approach was mathematically formulated in the 1950s and became common in social science later in the 1980s.
Zachary’s karate club. To motivate the study of social networks, it is worth introducing Zachary’s karate club [21] as an example to start with. Zachary’s karate club refers to a university karate club studied by Wayne Zachary in the 1970s. The club had 34 members. If two members interacted outside the club, Zachary created an edge between their corresponding nodes in the social network representation. Figure 10 shows the resulting social network. What makes this social network interesting is that during Zachary’s study, a conflict arose between two senior members (nodes 1 and 34 in the figure) of the club. All other members had to choose sides between the two senior members, essentially leading to a split of the club into two subgroups (i.e., “communities”). As the figure shows, there are two communities of nodes, centered at nodes 1 and 34 respectively.
Zachary further analyzed this network, and found that the exact split of club members could be identified purely from the structure of the social network. Briefly speaking, Zachary ran a min-cut algorithm on the collected social network. The min-cut algorithm essentially returns a group of edges as the “bottleneck” of the whole social network; the nodes on different sides of the “bottleneck” are assigned to different splits. It turned out that Zachary was able to precisely identify the community membership of all nodes except node 9 (which indeed lies right on the boundary, as the figure shows). This example is often used to illustrate that social networks (graphs) are a powerful formalism for revealing the underlying organizational truths of social systems.
Important domains of study. The research of social networks grew rapidly in the past few decades, and has spawned many branches. Exhausting all those branches would certainly go beyond the scope and capacity of this blog. Here we briefly survey a few of the most influential ones in the following.
Static structure. The first step towards understanding social networks is to analyze their static structural properties. The effort involves the development of scientific measures to quantify those properties, and the empirical measurement of them on real-world social networks. Generally speaking, a social network can be analyzed at local and global levels.
At the local level, node centrality measures the “importance” of a person with respect to the whole network. Popular examples include degree centrality, betweenness centrality [22], closeness centrality [23], eigenvector centrality [24], PageRank centrality [25], etc. These measures differ by the aspects of social importance they emphasize. For example, the eigenvector centrality \(x_i\) of a person (node) \(i\) is defined in a recursive manner as: \(x_i = \frac{1}{\lambda} \sum_{j\in\mathcal{N}(i)} x_j\) where \(\lambda\) is the largest eigenvalue of the adjacency matrix of the social network (and is guaranteed to be a real, positive number). This centrality measure is underpinned by the principle that a person’s role is considered more important if that person has connections with more important people. We refer interested readers to [24] for more details.
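The recursive definition can be computed by power iteration; the sketch below (plain numpy, toy 4-node graph) finds the principal eigenvector, whose entries satisfy the recursion above:

```python
import numpy as np

# toy non-bipartite graph: hub node 0 connected to 1, 2, 3, plus edge 1-2
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)

x = np.ones(4)
for _ in range(100):
    x = A @ x                      # each node sums its neighbors' scores
    x = x / np.linalg.norm(x)      # renormalize

lam = float(x @ A @ x)             # Rayleigh quotient ≈ largest eigenvalue
assert x.argmax() == 0             # the high-degree hub is most central
assert np.allclose(A @ x, lam * x, atol=1e-6)   # x_i = (1/λ) Σ_j x_j holds
```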
Besides node centrality, another example of a local measurement is the clustering coefficient [26], which measures the tendency of “triadic closure” around a center node \(i\): \(c_i = \frac{|\{e_{jk} \in E: j,k\in \mathcal{N}(i)\}|}{k_i(k_i-1)/2}\) where \(k_i\) is the degree of node \(i\): the numerator counts the edges among \(i\)’s neighbors, and the denominator is the total number of neighbor pairs.
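A direct implementation of this formula (pure Python, with the adjacency given as a dict of neighbor sets):

```python
import itertools

def clustering_coefficient(adj, i):
    # fraction of pairs of i's neighbors that are themselves connected
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    closed = sum(1 for u, v in itertools.combinations(nbrs, 2) if v in adj[u])
    return closed / (k * (k - 1) / 2)

# triangle 0-1-2 plus a leaf node 3 attached to node 0
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
assert clustering_coefficient(adj, 0) == 1 / 3   # only pair (1,2) of 3 is closed
assert clustering_coefficient(adj, 1) == 1.0
assert clustering_coefficient(adj, 3) == 0.0
```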
At the global level, network distances and modularity are two measures for characterizing the macro structure of a social network. Popular network distance measures include shortest-path distances, random-walk-based distances, and the (physics-inspired) resistance distance. Conceptually, they may be viewed as quantifiers of the “difficulty” of traveling along the edges of the social network from one node to another. Modularity often accompanies the important task of community detection for social networks. It measures the strength of division of a social network into groups or clusters of well-connected people.
Dynamic structure. Real-world social interactions often involve time-evolving processes. Therefore, many studies on social networks explicitly incorporate temporal information into the modeling. The task of link prediction, for example, has often been introduced in attempts to model the evolution of a social network. The task predicts whether a link will appear between two people at some (given) future time, thereby predicting the evolution of the social network. Another area where dynamic structures of social networks are often discussed is when they are used to model face-to-face social interactions. Some of the most recent works in this regard abstract people’s interaction traits, such as eye movement, eye gazing, and “speaking to” or “listening to” relationships, into attribute-rich dynamic links. It is believed that these dynamic interactions carry crucial information about the social event and people’s personalities. Therefore, using a temporal graph that explicitly models these interactions greatly helps the analysis of social interactions of this kind. For example, in [27], researchers found that using a temporal graph to build prediction models helps machines achieve state-of-the-art accuracy in identifying lying, dominance, and nervousness of people interacting with each other in a role-playing game.
Information flow. Sometimes the structure of social networks is not the ultimate target of interest to researchers. Instead, people care about the fact that their opinions and decision-making processes are often affected by their social interactions with friends and acquaintances. Therefore, social networks are often regarded as the infrastructure on which information flows and opinions propagate. It is thus crucial to know how social networks of different structures can affect the spreading of information. A long line of work, for example, has focused on modeling the so-called opinion dynamics on social networks. Research in this area has seen successful applications in viral marketing [28], international negotiations [29], and resource allocation [30].
There are many opinion dynamics models, all of which are essentially mathematical models that describe how people’s opinions on some matter, represented as numerical values, dynamically affect each other following mathematical rules that rely on the network structure. Some of the most popular opinion dynamics models include the voter model [31], the Sznajd model [32], the Ising model [33], the Hegselmann-Krause (HK) model [34], and the Friedkin-Johnsen (FJ) model [35]. Here we introduce the Friedkin-Johnsen model as an example. The FJ model is not only a popular subject of study among social scientists in recent years, but is also to date the only model whose predictions of opinion changes have been confirmed by a sustained line of human-subject experiments. The basic assumption of the FJ model is that each person \(i\) in the social network holds two opinions: an internal opinion \(s_i\) that is always fixed, and an external opinion \(z_i\) that evolves in adaptation to \(i\)’s internal opinion and its neighbors’ external opinions. The evolution of the external opinion along time steps follows the rule: \(\begin{aligned} z^{0}_i &= s_i\\ z^{t+1}_i &= \frac{s_i+\sum_{j\in N_i}a_{ij}z^t_j}{1+\sum_{j\in N_i}a_{ij}} \end{aligned}\)
where \(N_i\) is the set of neighbors of node \(i\), and \(a_{ij}\) is the interaction strength between persons \(i\) and \(j\).
One very elegant property of the FJ model is that the expressed opinions eventually reach a closed-form equilibrium: \(\begin{aligned} z^{\infty} = (I+L)^{-1}s \end{aligned}\) where \(z^{\infty}, s\in \mathbb{R}^{|V|}\) are the external and internal opinion vectors, respectively, and \(L\) is the (weighted) graph Laplacian. This closed-form equilibrium brings tremendous convenience to the many follow-up works [36,37,38,39] that further define indices of, for example, polarization, disagreement, and conflict on the equilibrium opinions.
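The equilibrium can be verified numerically; the sketch below (plain numpy, toy 3-node path with unit interaction strengths) iterates the FJ update and compares against \((I+L)^{-1}s\):

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)        # interaction strengths a_ij
L = np.diag(A.sum(axis=1)) - A                # weighted graph Laplacian
s = np.array([1.0, 0.0, -1.0])                # fixed internal opinions

z = s.copy()
for _ in range(200):
    # FJ update, vectorized over all nodes at once
    z = (s + A @ z) / (1.0 + A.sum(axis=1))

z_closed = np.linalg.solve(np.eye(3) + L, s)  # (I + L)^{-1} s
assert np.allclose(z, z_closed, atol=1e-8)    # iteration reaches the equilibrium
```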
Graph Neural Networks Foundations, Frontiers, and Applications
Graph Signal Processing: Overview, Challenges, and Applications
[1] Linton Freeman. The development of social network analysis. A Study in the Sociology of Science, 1(687):159–167, 2004.
[2] Michael GH Bell, Yasunori Iida, et al. Transportation network analysis. 1997.
[3] Jon Kleinberg and Steve Lawrence. The structure of the web. Science, 294(5548):1849–1850, 2001.
[4] Ed Bullmore and Olaf Sporns. The economy of brain network organization. Nature reviews neuroscience, 13(5):336–349, 2012.
[5] Kristel Van Steen. Travelling the world of gene–gene interactions. Briefings in bioinformatics, 13(1):1–19, 2012.
[6] Minoru Kanehisa, Susumu Goto, Miho Furumichi, Mao Tanabe, and Mika Hirakawa. Kegg for representation and analysis of molecular networks involving diseases and drugs. Nucleic acids research, 38(suppl_1):D355–D360, 2010.
[7] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
[8] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[9] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[10] Yao Ma and Jiliang Tang. Deep learning on graphs. Cambridge University Press, 2021.
[11] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations.
[12] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1202–1211, 2021.
[13] Francesco Di Giovanni, James Rowbottom, Benjamin P Chamberlain, Thomas Markovich, and Michael M Bronstein. Graph neural networks as gradient flows. arXiv preprint arXiv:2206.10991, 2022.
[14] Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. What can neural networks reason about? In International Conference on Learning Representations.
[15] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 1365–1374, 2015.
[16] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems, 28, 2015.
[17] Tian Xie and Jeffrey C Grossman. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical review letters, 120(14):145301, 2018.
[18] Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022.
[19] Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. Digress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734, 2022.
[20] Jacob Levy Moreno. Who shall survive?: A new approach to the problem of human interrelations. 1934.
[21] Wayne W Zachary. An information flow model for conflict and fission in small groups. Journal of anthropological research, 33(4):452–473, 1977.
[22] Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977.
[23] Alex Bavelas. Communication patterns in task-oriented groups. The journal of the acoustical society of America, 22(6):725–730, 1950.
[24] Mark EJ Newman. The mathematics of networks. The new palgrave encyclopedia of economics, 2(2008):1–12, 2008.
[25] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1-7):107–117, 1998.
[26] Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.
[27] Yanbang Wang, Pan Li, Chongyang Bai, and Jure Leskovec. Tedic: Neural modeling of behavioral patterns in dynamic social interaction networks. In Proceedings of the Web Conference 2021, pages 693–705, 2021.
[28] Wei Chen, Yajun Wang, and Siyu Yang. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 199–208, 2009.
[29] Carmela Bernardo, Lingfei Wang, Francesco Vasca, Yiguang Hong, Guodong Shi, and Claudio Altafini. Achieving consensus in multilateral international negotiations: The case study of the 2015 paris agreement on climate change. Science Advances, 7(51):eabg8068, 2021.
[30] Noah E Friedkin, Anton V Proskurnikov, Wenjun Mei, and Francesco Bullo. Mathematical structures in group decision-making on resource allocation distributions. Scientific reports, 9(1):1377, 2019.
[31] Richard A Holley and Thomas M Liggett. Ergodic theorems for weakly interacting infinite systems and the voter model. The annals of probability, pages 643–663, 1975.
[32] Katarzyna Sznajd-Weron and Jozef Sznajd. Opinion evolution in closed community. International Journal of Modern Physics C, 11(06):1157–1165, 2000.
[33] Sergey N Dorogovtsev, Alexander V Goltsev, and José Fernando F Mendes. Ising model on networks with an arbitrary distribution of connections. Physical Review E, 66(1):016104, 2002.
[34] Hegselmann Rainer and Ulrich Krause. Opinion dynamics and bounded confidence: models, analysis and simulation. 2002.
[35] Noah E Friedkin and Eugene C Johnsen. Social influence and opinions. Journal of Mathematical Sociology, 15(3-4):193–206, 1990.
[36] Cameron Musco, Christopher Musco, and Charalampos E Tsourakakis. Minimizing polarization and disagreement in social networks. In Proceedings of the 2018 world wide web conference, pages 369–378, 2018.
[37] Christopher Musco, Indu Ramesh, Johan Ugander, and R Teal Witter. How to quantify polarization in models of opinion dynamics. arXiv preprint arXiv:2110.11981, 2021.
[38] Xi Chen, Jefrey Lijffijt, and Tijl De Bie. Quantifying and minimizing risk of conflict in social networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1197–1205, 2018.
[39] Shahrzad Haddadan, Cristina Menghini, Matteo Riondato, and Eli Upfal. Repbublik: Reducing polarized bubble radius with link insertions. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 139–147, 2021.
This blog will change from time to time, since I am still learning these topics.
Last update: 2023/05/25
Let me first narrow down the topic a little bit, and just take graph research as an example.
Research papers can be categorized into different types; I definitely prefer some while not focusing on others. Nonetheless, the others are also good research.
A simple bird's-eye view is as follows:
For myself, I ask the following questions:
Does the paper follow the common evaluation setting?
If I am familiar with the research topic, the first thing I do is check whether the experimental setting uses some tricks. This typically happens in some graph tasks. There can be specific reasons, but in most cases the main reason is that the method does not work well otherwise. This also helps avoid taking at face value the overclaims in many papers, e.g., claiming to learn good graph representations for various downstream tasks while only presenting experiments on node classification.
Does the paper try to define a new problem?
I think a paper can be good simply because it defines a novel problem; it can be very hard to define a meaningful task from a real-world scenario. Let me take some examples.
If the paper proposes a new solution to an existing problem, why is that method necessary?
In most cases I tolerate a weak solution if the paper defines a new problem. However, if it only focuses on solving an existing problem, there are many A+B-type papers. Typically in the graph domain, once you see this year's ICML, NeurIPS, and ICLR, you know what next year's KDD and WWW will look like. As far as I can see, I still do not know the reason for using diffusion models for some classification tasks in the graph domain. Such papers are good for junior students to learn how to write a paper and make a method work; nonetheless, there will always be someone doing the same thing as you, and if you do not implement the idea, it will very likely be done by someone else within a year. I would rather focus on more important topics that push the domain further.
If the paper is a theoretically inspired idea, what is the key underlying intuition? Is there a toy example explaining it?
]]>About leadership
About graph community
About my research
About my life
Matrix factorization is an important technique for learning representations. Its main applications are three-fold.
For low-rank approximation with missing data, matrix factorization can be formulated as: \(\min_{U,V} ||W \odot (M-UV)||\) where $W$ is the mask over observed entries and $M$ is the input matrix. However, this optimization is ill-posed and easily overfits, so additional assumptions have to be made; the most important one is low rank. It can be enforced in a hard way through the factorization $UV$ itself, or as a soft constraint via the nuclear norm. The nuclear norm, however, is somewhat computationally expensive.
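As a concrete sketch of the masked objective above, here is a minimal gradient-descent solver in numpy. All names and the synthetic data are illustrative, not taken from any of the papers discussed; the hard rank constraint is what makes the recovery possible.

```python
import numpy as np

def masked_mf(M, W, rank=2, lr=0.01, steps=5000, seed=0):
    """Gradient descent on ||W ⊙ (M - U V)||_F^2.

    W is a 0/1 mask of observed entries; restricting U, V to a small
    rank is the hard constraint that makes the problem well-posed."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((rank, m))
    for _ in range(steps):
        R = W * (U @ V - M)      # residual on observed entries only
        U, V = U - lr * R @ V.T, V - lr * U.T @ R
    return U, V

# Synthetic rank-2 matrix with roughly 30% of its entries hidden.
rng = np.random.default_rng(1)
M = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
Wmask = (rng.random(M.shape) < 0.7).astype(float)
U, V = masked_mf(M, Wmask)
rel_err = np.linalg.norm(Wmask * (M - U @ V)) / np.linalg.norm(Wmask * M)
```

With the rank matched to the ground truth, the observed-entry residual shrinks to near zero.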
Why matrix factorization? The data can often be represented with a linear low-dimensional structure, and this structure does not change much across conditions, which makes it more robust. How to find this kind of subspace becomes the key problem.
**Attention: check which $L$ is actually used, normalized or not.**
Let the input data feature matrix be $X=(x_1, \cdots , x_n)\in \mathbb{R}^{p\times n}$. PCA aims to find the optimal low-dimensional subspace with principal directions $U=(u_1, \cdots , u_k)\in \mathbb{R}^{p\times k}$ and projected data points $V=(v_1, \cdots , v_n)\in \mathbb{R}^{n\times k}$. It minimizes the reconstruction error with the following loss function:
\(\min _{U, V}\left\|X-U V^{T}\right\|_{F}^{2} \text { s.t. } V^{T} V=I\)
(I used to confuse the projection directions with the projected data points.)
Note that the data is assumed to be already centered.
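A minimal numpy check of this PCA formulation on synthetic centered data (the SVD route is the standard solution): taking $V$ as the top-$k$ right singular vectors (so $V^TV=I$) and $U=XV$, the residual $\|X-UV^T\|_F^2$ equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 5, 40, 2
X = rng.standard_normal((p, n))
X -= X.mean(axis=1, keepdims=True)   # PCA assumes centered data

# V: projected data points with V^T V = I (top-k right singular vectors);
# U = X V then spans the principal subspace.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T
U = X @ V

err = np.linalg.norm(X - U @ V.T, "fro") ** 2
tail = float(np.sum(s[k:] ** 2))     # discarded spectral energy
```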
In the above discussion, the input data $X$ is the only available vector data for learning a representation, while manifold learning and graph embedding methods only take the graph adjacency matrix $W$ into consideration. There is a lack of methods that use both the graph structure $W$ and the node features $X$.
We first answer the question of how these methods can benefit from each other.
In most cases, feature-based matrix factorization on $X$ only covers the linear case, while manifold learning captures local information (local linear -> global nonlinear relationships) and can find data lying on a nonlinear manifold. (In some cases there is no $W$, and the graph is constructed from feature similarity.)
To take both PCA and Laplacian embedding into one framework, the objective is as follows: \(\begin{array}{l} \min _{U, Q} J=\left\|X-U Q^{T}\right\|_{F}^{2}+\alpha \operatorname{Tr}\left(Q^{T}(D-W) Q\right) \\ \text { s.t. } Q^{T} Q=I \end{array}\) The solution is: \(\begin{array}{l} Q^{*}=\left(v_{1}, v_{2}, \cdots, v_{k}\right) \\ U^{*}=X Q^{*} \end{array}\) where $\left(v_{1}, v_{2}, \cdots, v_{k}\right)$ are the eigenvectors corresponding to the $k$ smallest eigenvalues of $G = -X^TX + \alpha L$
(What I do not fully understand here is the matrix differentiation step.)
We first solve for $U$ with $Q$ fixed, then substitute back to find $Q$.
Finally, to balance the two terms easily and keep them at the same scale, the data term is normalized by $\lambda_n$, the largest eigenvalue of $X^TX$, and the graph term by $\epsilon_n$, the largest eigenvalue of the Laplacian matrix $L$.
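The closed-form solution above is easy to verify numerically. A small sketch (synthetic data and an illustrative path-graph Laplacian; variable names are mine) compares the eigenvector solution against a random orthonormal $Q$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k, alpha = 4, 12, 2, 0.5
X = rng.standard_normal((p, n))

# Illustrative path-graph Laplacian L = D - W over the n data points.
Wadj = np.zeros((n, n))
for i in range(n - 1):
    Wadj[i, i + 1] = Wadj[i + 1, i] = 1.0
L = np.diag(Wadj.sum(1)) - Wadj

def J(Q):
    U = X @ Q   # optimal U for a fixed orthonormal Q
    return np.linalg.norm(X - U @ Q.T, "fro") ** 2 + alpha * np.trace(Q.T @ L @ Q)

# Closed form: Q* = eigenvectors of the k smallest eigenvalues of G = -X^T X + alpha L.
G = -X.T @ X + alpha * L
_, vecs = np.linalg.eigh(G)   # eigh returns eigenvalues in ascending order
Q_star = vecs[:, :k]

Q_rand, _ = np.linalg.qr(rng.standard_normal((n, k)))   # random orthonormal baseline
```

Plugging $U=XQ$ back in gives $J(Q)=\|X\|_F^2+\operatorname{Tr}(Q^TGQ)$, so the smallest eigenvectors of $G$ are optimal.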
**A question: how to connect this with graph neural networks; I think duality may be a route.**
However, the above formulation is not robust; a robust version with a weaker norm is: \(\begin{array}{l} \min _{U, Q} J=\left\|X-U Q^{T}\right\|_{2,1}+\alpha \operatorname{Tr}\left(Q^{T}(D-W) Q\right) \\ \text { s.t. } Q^{T} Q=I \end{array}\) To put it into a form feasible for the Augmented Lagrange Multiplier method, it can be rewritten as: \(\begin{array}{l} \min _{U, Q, E}\|E\|_{2,1}+\alpha \operatorname{Tr} Q^{T}(D-W) Q \\ \text { s.t. } E=X-U Q^{T}, Q^{T} Q=I \end{array}\) I do not know much about proximal operators and the Augmented Lagrange Multiplier method, and I also need more background on gradients of norms.
Then it can be written with the two augmented terms of $E-X+U Q^{T}$ as: \(\begin{array}{l} \min _{U, Q, E}\|E\|_{2,1}+\operatorname{Tr} C^{T}\left(E-X+U Q^{T}\right) \\ \quad+\frac{\mu}{2}\left\|E-X+U Q^{T}\right\|_{F}^{2}+\alpha \operatorname{Tr} Q^{T} L Q \\ \text { s.t. } Q^{T} Q=I \end{array}\) With $E$ fixed, the problem is the same as the original one, while with $U$ and $Q$ fixed, the problem becomes: \(\min _{E}\|E\|_{2,1}+\frac{\mu}{2}\|E-A\|_{F}^{2}\) where $A = X-UQ^T-C/\mu$. Treating each column as a group, this splits into $n$ independent problems: \(\min _{e_{i}}\left\|e_{i}\right\|+\frac{\mu}{2}\left\|e_{i}-a_{i}\right\|^{2}\) Then the constraint parameters are updated as \(\begin{array}{l} C=C+\mu\left(E-X+U Q^{T}\right) \\ \mu=\rho \mu \end{array}\) (A big question here: do we want the projected data to be as orthogonal as possible, or the chosen coordinate basis?)
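The column-wise subproblem $\min_{e_i}\|e_i\|+\frac{\mu}{2}\|e_i-a_i\|^2$ has a well-known closed form, block soft-thresholding. A small sketch (function name is mine), checked against random perturbations, since the objective is convex:

```python
import numpy as np

def prox_group(a, mu):
    """Closed-form minimizer of ||e|| + (mu/2) ||e - a||^2:
    block soft-thresholding, applied per column of E."""
    norm = np.linalg.norm(a)
    if norm <= 1.0 / mu:
        return np.zeros_like(a)      # whole group is shrunk to zero
    return (1.0 - 1.0 / (mu * norm)) * a

rng = np.random.default_rng(0)
mu = 2.0
a = rng.standard_normal(5)
e_star = prox_group(a, mu)

def f(e):
    return np.linalg.norm(e) + mu / 2.0 * np.linalg.norm(e - a) ** 2

# By convexity e_star should beat any nearby point, e.g. random perturbations.
best_perturbed = min(f(e_star + 0.1 * rng.standard_normal(5)) for _ in range(100))
```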
(Does it matter whether $L$ is normalized?)
Similarly, there is another work on PCA-based networks; however, the representation constraint is not the same as the original one. The objective is: \(\begin{array}{l} \min _{Z, W} J=\left\|X-Z W^{T}\right\|_{F}^{2}+\alpha \operatorname{Tr}\left(Z^{T}\tilde{L} Z\right) \\ \text { s.t. } W^{T} W=I \end{array}\) Notice that the smoothness term is different: it smooths the representation, not the coordinates.
The solution steps are similar to the above, and the solution is: \(\begin{aligned} W^{*} &=\left(\mathbf{w}_{1}, \mathbf{w}_{2}, \ldots, \mathbf{w}_{k}\right) \\ Z^{*} &=(I+\alpha \tilde{L})^{-1} X W^{*} \end{aligned}\) where $\mathbf{w}_{1}, \mathbf{w}_{2}, \ldots, \mathbf{w}_{k}$ are the eigenvectors corresponding to the $k$ largest eigenvalues of the matrix $X^T(I+\alpha \tilde{L})^{-1}X$
So the questions are:
(It seems the current analysis only applies when the middle matrix is positive semi-definite.)
(Is there any essential difference between matrix factorization and directly optimizing the corresponding objective? It seems there may be none.)
The optimization takes the form: \(\begin{array}{l} \min _{U, Y} J=\left\|A-UY\right\|_{F}^{2}+\alpha \operatorname{Tr}\left(Y\tilde{L} Y^T\right) \\ \text { s.t. } U^{T} U=I \end{array}\) The solution is then any optimal $(U, Y)$ pair, since $(UQ, Q^{T}Y)$ gives the same result for any orthogonal $Q$.
The formulation can also be written without the factorization: \(\min_{\operatorname{rank}(X)\le r}\left\|A-X\right\|_{F}^{2}+\alpha \operatorname{Tr}\left(X\Phi X^T\right)\) where $X=UY$; this form alone does not directly yield the final result.
As $\Phi$ is typically a Laplacian matrix, which is positive semi-definite in most cases, we can take a square root and write $\Phi+I = B^TB$, where the $I$ comes from the first term.
Then \(f(X)=\|A\|_F^2-2\operatorname{tr}(XBB^{-1}A^T)+\|XB\|_F^2=\|A\|_F^2+\|XB-AB^{-1}\|_F^2-\|AB^{-1}\|_F^2\)
The rest of the method, an alternating iteration algorithm, is similar to the above. The algorithm obtains its subproblem solutions by SVD, which is not the key point of our analysis.
Different from the above methods, robust variants are more resistant to outliers, while the original PCA, which uses the L2 norm, may be affected since it is based on a Gaussian noise assumption. The L1 norm seems more robust, but the resulting objective is harder to optimize. The methods below try to address this situation.
This paper mainly adds a convex trace-norm regularization, and the augmented Lagrange multiplier method is used as the optimizer.
The problem, still in the matrix factorization framework, is: \(\min_{U,V} ||W \odot (M-UV)||_1 \ \text{s.t.}\ U^{T}U = I\) The constraint avoids too many degenerate pairs in the final result. To obtain a low-rank smooth objective, a nuclear norm regularization term is added: \(\min_{U,V} ||W \odot (M-UV)||_1+\lambda ||V||_* \ \text{s.t.}\ U^{T}U = I\) The problem is then sent to an ALM solver; introduce $E=UV$: \(\begin{aligned} f(E, U, V, L, \mu)=&\|W \odot(M-E)\|_{1}+\lambda\|V\|_{*}+\\ &\langle L, E-U V\rangle+\frac{\mu}{2}\|E-U V\|_{F}^{2}, \end{aligned}\) Then $U$ is solved via Orthogonal Procrustes, $V$ via singular value shrinkage, and $E$ via absolute value shrinkage.
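The two shrinkage operators named above have simple closed forms. A hedged sketch (function names are mine, not the paper's): absolute value shrinkage soft-thresholds entries, singular value shrinkage soft-thresholds the spectrum.

```python
import numpy as np

def soft_threshold(X, tau):
    """Absolute value shrinkage: prox of tau * ||.||_1, applied element-wise."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def sv_shrink(Y, tau):
    """Singular value shrinkage: prox of tau * ||.||_*
    (soft-threshold the singular values, keep the singular vectors)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 4))
Z = sv_shrink(Y, tau=1.0)
```

The shrinkage also reduces rank: any singular value below the threshold is zeroed out.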
Note first that the methods below do not assume an additional graph structure for learning; the graph is constructed from feature similarity.
The graph can improve the clustering property due to the graph smoothness assumption on the low-rank matrix.
Different from the above work, this paper abandons explicit matrix factorization in favor of an additive framework.
The proposed model is as follows: \(\begin{array}{l} \min _{L, S}\|L\|_{*}+\lambda\|S\|_{1}+\gamma \operatorname{tr}\left(L \Phi L^{T}\right) \text {, } \\ \text { s.t. } X=L+S, \end{array}\) where $S$ is the sparse error and $L$ is the low-rank approximation of $X$. The final term defines smoothness over the graph structure.
The problem can then be rewritten as follows, introducing an auxiliary variable $W$ with the constraint $L=W$ so that the ALM solver applies. \(\begin{array}{l} \min _{L, S}\|L\|_{*}+\lambda\|S\|_{1}+\gamma \operatorname{tr}\left(W \Phi W^{T}\right) \text {, } \\ \text { s.t. } X=L+S, L=W \end{array}\) Then, for each constraint, a Lagrange multiplier is introduced: $Z_1\in \mathbb{R}^{p\times n}$ and $Z_2\in \mathbb{R}^{p\times n}$,
Then the problem can be transformed into: \(\begin{aligned} (L, S, W)^{k+1} &=\underset{L, S, W}{\operatorname{argmin}}\|L\|_{*}+\lambda\|S\|_{1}+\gamma \operatorname{tr}\left(W \Phi W^{T}\right) \\ &+\left\langle Z_{1}^{k}, X-L-S\right\rangle+\frac{r_{1}}{2}\|X-L-S\|_{F}^{2} \\ &+\left\langle Z_{2}^{k}, W-L\right\rangle+\frac{r_{2}}{2}\|W-L\|_{F}^{2}, \\ Z_{1}^{k+1} &=Z_{1}^{k}+r_{1}\left(X-L^{k+1}-S^{k+1}\right), \\ Z_{2}^{k+1} &=Z_{2}^{k}+r_{2}\left(W^{k+1}-L^{k+1}\right), \end{aligned}\)
Then the problem can be solved by: \(\begin{array}{l} L^{k+1}=\operatorname{prox}_{\frac{1}{\left(r_{1}+r_{2}\right)}}\|L\|_{*}\left(\frac{r_{1} H_{1}^{k}+r_{2} H_{2}^{k}}{r_{1}+r_{2}}\right), \\ S^{k+1}=\operatorname{prox}_{\frac{\lambda}{r_{1}}}\|S\|_{1}\left(X-L^{k+1}+\frac{Z_{1}^{k}}{r_{1}}\right) \\ W^{k+1}=r_{2}\left(\gamma \Phi+r_{2} I\right)^{-1}\left(L^{k+1}-\frac{Z_{2}^{k}}{r_{2}}\right) \end{array}\) Assuming that a p-nearest-neighbors graph is available, there are several ways to construct the neighborhoods.
Similar to the above paper, this paper gives more detail on how a graph based on feature similarity can enhance performance.
This method introduces graph smoothness on both samples and features, and it can exhibit clear clusters under certain theoretical conditions.
The objective is as follows \(\begin{array}{l} \min _{U, S}\|S\|_{1}+\gamma_{1} \operatorname{tr}\left(U \mathcal{L}_{1} U^{\top}\right)+\gamma_{2} \operatorname{tr}\left(U^{\top} \mathcal{L}_{2} U\right), \\ \text { s.t. } X=U+S, \end{array}\) where $U$ is not constrained to be a low-dimensional representation.
The optimization, with the two graph constraints, uses the Fast Iterative Soft-Thresholding Algorithm (FISTA).
The graph is constructed as: \(A_{i j}=\left\{\begin{array}{ll} \exp \left(-\frac{\left\|\left(x_{i}-x_{j}\right)\right\|_{2}^{2}}{\sigma^{2}}\right) & \text { if } x_{j} \text { is connected to } x_{i} \\ 0 & \text { otherwise. } \end{array}\right.\) The two graphs are based on sample similarity and feature similarity respectively; how can they give us more information?
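A minimal sketch of this construction (a symmetrized k-nearest-neighbor graph with Gaussian weights; the function and parameter names are mine):

```python
import numpy as np

def knn_gaussian_graph(X, k=3, sigma=1.0):
    """Symmetrized k-NN graph: A_ij = exp(-||x_i - x_j||^2 / sigma^2)
    if x_j is among the k nearest neighbours of x_i, else 0."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    A = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[1:k + 1]                   # skip index 0 = the point itself
        A[i, nn] = np.exp(-d2[i, nn] / sigma ** 2)
    return np.maximum(A, A.T)                             # make the graph undirected

rng = np.random.default_rng(0)
pts = rng.standard_normal((10, 2))
A = knn_gaussian_graph(pts, k=3)
```

Applying it to the data matrix gives the sample graph; applying it to the transposed matrix gives the feature graph.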
The feature graph provides a basis for the data that is well aligned with the covariance matrix $C$.
The sample graph provides an embedding with an interpretation similar to PCA. In short, the Laplacian matrices play a role similar to the PCA-based method.
Therefore, the low-rank matrix should be representable as a linear combination of the feature and sample eigenvectors. The error is bounded by the gap between eigenvalues: \(\begin{array}{c} \phi\left(U^{*}-X\right)+\gamma_{1}\left\|U^{*} \bar{Q}_{k_{1}}\right\|_{F}^{2}+\gamma_{2}\left\|\bar{P}_{k_{2}}^{\top} U^{*}\right\|_{F}^{2} \\ \leq \phi(E)+\gamma\left\|X^{*}\right\|_{F}^{2}\left(\frac{\lambda_{k_{1}}}{\lambda_{k_{1}+1}}+\frac{\omega_{k_{2}}}{\omega_{k_{2}+1}}\right) \end{array}\)
Deep matrix factorization is similar to a DNN: whereas the original matrix factorization always uses two factors, $X=X_1X_2$, it uses the form $X=\prod_{i=1}^NX_i$.
The product graph is given as the Cartesian product of $\mathcal{G}_1$ and $\mathcal{G}_2$, whose Laplacian matrix can be represented as \(L_{\mathcal{G}_1 \Box \mathcal{G}_2 } = L_1 \otimes I + I \otimes L_2\) Functions on it are defined by the eigenvectors of the two individual Laplacian matrices, $\Phi$ and $\Psi$.
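This Kronecker-sum structure implies that the product graph's spectrum consists of all pairwise sums of the factor eigenvalues, which is what makes the product-graph eigenbasis separable. A small numpy check on two illustrative path graphs:

```python
import numpy as np

def path_laplacian(n):
    """Combinatorial Laplacian L = D - A of a path graph on n nodes."""
    A = np.diag(np.ones(n - 1), 1)
    A = A + A.T
    return np.diag(A.sum(1)) - A

L1, L2 = path_laplacian(3), path_laplacian(4)

# Cartesian product Laplacian: L = L1 ⊗ I + I ⊗ L2.
L_prod = np.kron(L1, np.eye(4)) + np.kron(np.eye(3), L2)

# Its eigenvalues are all pairwise sums lambda_i + mu_j of the factor spectra.
ev_prod = np.sort(np.linalg.eigvalsh(L_prod))
ev_sum = np.sort((np.linalg.eigvalsh(L1)[:, None] + np.linalg.eigvalsh(L2)[None, :]).ravel())
```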
The functional map is defined as $C=\Phi^TX\Psi$, which maps between the function spaces of $\mathcal{G}_1$ and $\mathcal{G}_2$; it can also be viewed as a signal on the product graph. The following property holds: \(\alpha = \Phi^Tx=C\Psi^Ty=C\beta\) for $x=\Phi\alpha$ and $y=\Psi\beta$.
The optimization objective is as follows: \(\min_X E_{data}(X)+\mu E_{dir}(X) \ \text{s.t.}\ \operatorname{rank}(X) \lt r\) The Dirichlet energy is \(E_{dir}(X)=\operatorname{tr}(X^TL_rX)+\operatorname{tr}(XL_cX^T)\) We then decompose $X$ as $X=AZB^T$, where $Z$ is the signal on the latent product graph.
The three factors can be further factorized as: \(\begin{array}{l} \boldsymbol{Z}=\boldsymbol{\Phi}^{\prime} \boldsymbol{C} \boldsymbol{\Psi}^{\prime \top} \\ \boldsymbol{A}=\Phi \boldsymbol{P} \Phi^{\prime \top} \\ B=\boldsymbol{\Psi} Q \boldsymbol{\Psi}^{\prime \top} \end{array}\) The objective can then be transformed into \(\min_{P,C,Q} ||(\Phi PCQ^T\Psi^T-M)||_F^2+tr(QC^TP^T\Lambda_rPCQ^T)+tr(PCQ^T\Lambda_cQC^TP^T)\)
Once we have two graphs, it is natural to think about the correlation between them, which is a function on the product graph.
It tries to give a unified view of geometric matrix completion and graph-regularized dimensionality reduction.
We use the form $X=\Phi C\Psi^T$.
Matrix factorization establishes basis consistency, since the low-dimensional representation of $X$ can be represented in the span of $\Psi$ and $\Phi$.
It then requires correspondence between the eigenvalues: \(E_{reg}=||C\Lambda_r-\Lambda_cC||^2\) where the $\Lambda$ are the graph eigenvalue matrices.
The motivation of this paper is to connect matrix-factorization-based graph embedding methods with GNNs. This way, it does not need to load the whole graph at once, but can use sampling to get the embedding of each node.
However, this paper has an important drawback: there is no discussion of the feature space. The only input is the graph structure.
This paper aims to analyze the connection between GCN and MF, approximating GCN with MF only, and uses unitization and co-training to learn a node classification model.
The analysis is done at the last layer. The original GCN can be written as \(\mathbf{H}^{(-1)}=\sigma\left(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(-1)} \mathbf{W}^{(-1)}\right)\) where $\mathbf{H}^{(-1)}$ is the hidden representation at the last layer.
Written node-wise: \(h_{i}^{(-1)}=\sum_{j \in I} \frac{1}{\sqrt{\left(d_{i}+1\right)\left(d_{j}+1\right)}} \mathbf{A}_{i, j} h_{j}^{(-1)}+\frac{1}{d_{i}+1} h_{i}^{(-1)}\) Notice that something subtle happens here: the last-layer representation $h_{i}^{(-1)}$ should not actually be identical on the left-hand and right-hand sides.
Then it can be rewritten as: \(h_{i}^{(-1)}=\sum_{j \in I} \frac{1}{d_i}\sqrt{\frac{\left(d_{j}+1\right)}{\left(d_{i}+1\right)}}A_{i,j}h_j^{(-1)}\) The distance function that GCN tries to optimize then becomes: \(l_{s}=\sum_{i \in I} \text { Cosine }\left(h_{i}^{(-1)}, \sum_{j \in I} \frac{1}{d_{i}} \sqrt{\frac{d_{i}+1}{d_{j}+1}} \mathbf{A}_{i, j} h_{j}^{(-1)}\right)\) If we choose negative samples randomly, the optimal representation can be written as \(VV^T = \log(|G|D^{-1}AD^{-1})-\log(k)\) I do not think this form is very meaningful, since it is the same as LINE; what is the difference then? Another question: for node embedding methods, does it matter whether we use one embedding or two? The answer is that it does not matter much.
Also, I think it is important to clarify the difference from, and connection to, graph embedding methods.
]]>During my time back in Beijing I have had new discussions with many people; let me record some recent thoughts.
Over the past two years, AI has received more and more attention and achieved things that once seemed impossible. Academia has also seen a trend unlike any previous disciplinary craze; a simple example: the number of AI papers and submissions has grown explosively. I guess there are roughly two reasons. 1. AI has been applied fairly successfully in many directions, which draws in researchers from other fields. 2. As a black-box technology, AI has a low barrier to entry: since frameworks with automatic differentiation such as torch appeared, proposing a new algorithm is, in a sense, stacking building blocks. An extreme example around me: a student from a non-CS major went, in a very short time, from not knowing how to code to submitting a paper to a top venue (which is, of course, also impressive).
One view I share with Mu Li: over 99% of the new AI papers each year are research-training papers. Under such a craze, what is missing? That is the question I keep thinking about.
In wave after wave of AI enthusiasm, applications have penetrated every field. OpenAI makes big news every few months: AI can write code, AI has been combined with nuclear reactor control, and models such as AlphaFold and GPT-3 have achieved unprecedented success. In the fields AI traditionally covers, such as NLP and CV, it has now reached every corner of life and created enormous wealth for society. In recommender systems, for instance, a 1% accuracy improvement can bring in tens of millions in revenue.
Guided by this mindset, improving performance and leaderboard-chasing became an inevitable trend, from the Netflix data mining competition, to the annual KDD Cup, to countless Kaggle competitions large and small. The joint attention of industry and academia lets AI bring us a better life; this is a scenario everyone wants to see, and we look forward to even better development of the field.
But in AI's progress, what people most like to mention is whether their model's performance works; theoretical contributions seem to be downplayed. Here I re-examine this from the perspective of classical mathematical modeling.
A new method should involve the following steps:
Let us map the typical solution pipeline of an AI paper onto these steps and see which parts deserve discussion:
Deep learning often talks about "learning" some factor: backpropagation alone is treated as learning, and BP is left to solve an optimization problem in pursuit of generalization, in the hope that the corresponding factor is learned and the ultimate goal of generalization is reached. We then validate the resulting model on some popular datasets.
But here I want to raise a few questions:
For the first question, by the no-free-lunch theorem, my answer is negative in the vast majority of cases.
I often run into the second question in practice: I design a method that seems entirely reasonable, yet it is useless in actual experiments. Why? My understanding is that BP and the optimizer are doing model selection for us; even the choice of random seed, the hidden dimension, and the network depth may decide the outcome. It feels like the sophons in The Three-Body Problem locking down human science: some basic design choices of neural networks have locked down the research space of deep learning.
For the third question, my answer is that we do not know. Deep learning mostly validates model effectiveness after the fact; when studying a problem we often skip the assumption step entirely, or the assumptions are vague, taken-for-granted guesses. If a transformer works better than an RNN, it must be because the transformer learned global information; but what is global information? Why do we need it? How does a transformer reflect the "global information" assumption? In many scenarios these questions go unanswered.
Looking at disciplines historically, many have developed starting from applications. Convex optimization, the field most related to AI, actually profited from World War II: after solving all kinds of practical problems, it eventually converged into a real science with theoretical foundations and philosophical ideas. I sincerely hope that AI, with so many participants and so much investment, can follow the same path.
One way to picture the relation between application and theory: an application can be seen as the whole data space, while theory is the essential manifold underlying that space. How, then, can theory influence practice? I see two angles:
Under the joint influence of theory and application, AI will move toward philosophy, from perceptual computation to higher-order intelligence: from correlation to causation, from look to do to imagine (what if).
One sentence to summarize this blog: let AI return to science, to essential, mathematical, philosophical thinking, to sustain the vitality of the field and let AI move toward the depths of consciousness.
(Being able to step out of the technical details, abstract over the techniques, and rethink the problem from the problem space feels great, 2333333)
A scattered, half-formed angle: how should we understand current trends in machine learning from the perspective of classical physics? I have not figured it out.
After talking with Ensheng, who works on software engineering + AI, I saw another way to define AI's future: AI as an assistive system, where what matters is no longer model accuracy but serving as a tool that aids development and maintenance and improves the developer experience. Put abstractly, this restores the human's leading role in problem solving, with more attention to evaluation in real scenarios.
]]>Node classification is the most well-known topic in the graph domain; it aims to distinguish the type of each node on a graph. In this field, people also study fundamental limitations of GNNs. The main challenge is that we expect a GNN to become more powerful with more layers and more parameters. For example, it is easy to build a CNN with more than 100 layers, but GNNs usually cannot go that deep. To build deeper GNNs with more parameters, people try to understand and explain this problem and propose solutions.
We aim to answer the following research questions about deeper GNNs:
A word up front: this part is somewhat more difficult than my earlier reviews of graph classification, heterophilous graphs, and domain adaptation, with many advanced topics on GNNs. I will try my best to understand and write about it. This will not be the last version of this blog; I aim to go beyond it.
TODO
In the study of deeper GNNs, various problems have been proposed by different papers. We first introduce them quickly.
Among them, oversmoothing is the main recent focus. Accordingly, the following content is organized from these perspectives:
In this section, we mainly focus on the two most widely used theoretical understandings of the oversmoothing problem, with and without the non-linear activation function.
Suppose that a graph $\mathcal{G}$ has $k$ connected components $\{C_i \}_{i=1}^k$, and the indicator vector for the $i$-th component is denoted by $\mathbf{1}^{(i)}\in \mathbb{R}^n$. This vector indicates whether a vertex is in the component $C_i$:
\(\mathbf{1}_{j}^{(i)}=\left\{\begin{array}{l}
1, v_{j} \in C_{i} \\
0, v_{j} \notin C_{i}
\end{array}\right.\)
Theorem 1: If a graph has no bipartite components, then for any $\mathbf{w}\in \mathbb{R}^n$ and $\alpha \in (0,1]$,
\(\begin{array}{l}
\lim _{m \rightarrow+\infty}\left(I-\alpha L_{r w}\right)^{m} \mathbf{w}=\left[\mathbf{1}^{(1)}, \mathbf{1}^{(2)}, \ldots, \mathbf{1}^{(k)}\right] \theta_{1} \\
\lim _{m \rightarrow+\infty}\left(I-\alpha L_{s y m}\right)^{m} \mathbf{w}=D^{\frac{1}{2}}\left[\mathbf{1}^{(1)}, \mathbf{1}^{(2)}, \ldots, \mathbf{1}^{(k)}\right] \theta_{2},
\end{array}\)
where $\theta_1 \in \mathbb{R}^k$, $\theta_2 \in \mathbb{R}^k$, i.e. they converge to linear combinations of $\{\mathbf{1}^{(i)}\}^k_{i=1}$ and $\{D^{\frac{1}{2}}\mathbf{1}^{(i)}\}^k_{i=1}$ respectively, which correspond to the eigenspace of eigenvalue 0.
An intuition for the proof is as follows:
We see that, without considering the transformation part, information is lost until only the degree and connected-component information remains.
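Theorem 1 is easy to observe numerically. A small sketch (toy graph, variable names mine) iterates $(I-\alpha L_{rw})^m\mathbf{w}$ on a graph with two triangle components, which are non-bipartite, so the theorem applies:

```python
import numpy as np

# Two connected components, each a triangle (so neither is bipartite).
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
L_rw = np.eye(6) - np.diag(1.0 / A.sum(1)) @ A   # random-walk Laplacian

alpha = 0.8
P = np.eye(6) - alpha * L_rw
w = np.random.default_rng(0).standard_normal(6)
for _ in range(200):
    w = P @ w
# w is now numerically constant within each connected component:
# only the component-membership information survives.
```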
Let $\lambda_2$ denote the second largest eigenvalue of the transition matrix $\tilde{T} = D^{-1}A$ of a non-bipartite graph, let $p(t)$ be the probability distribution vector, and let $\pi$ be the stationary distribution. If the walk starts from vertex $i$, $p_i(0) = 1$, then after $t$ steps, for every vertex we have: \(\left|p_{j}(t)-\pi_{j}\right| \leq \sqrt{\frac{d_{j}}{d_{i}}} \lambda_{2}^{t}\) TODO: check this theory, read GDC. (Which paper is the above proof from? I forgot.)
This paper considers the expressivity of GNNs, a fundamental topic in deep learning; recall that a two-layer MLP can approximate any non-linear function.
Taking the non-linear transformation into account, this paper finds that as the number of layers goes to infinity, the output exponentially falls into the set of signals carrying only connected-component and node-degree information (a subspace invariant under the dynamics), matching the result above.
The key assumption is that the weights of the non-linear transformation satisfy conditions determined by the spectrum of the (augmented) normalized Laplacian. The speed of approaching the invariant subspace is $O((s\lambda)^L)$, where $s$ is the largest singular value of the weight matrices $W$ and $\lambda$ is the relevant eigenvalue of the Laplacian matrix.
For a linear operator $P:\mathbb{R}^N\to \mathbb{R}^M$ and a subset $V \subset \mathbb{R}^N$, we denote the restriction of $P$ to $V$ by $P|_V$. Let $P \in \mathbb{R}^{N \times N}$ be the symmetric adjacency matrix. For $M \le N$, let $U$ be an $M$-dimensional subspace of $\mathbb{R}^N$, taken to be the invariant eigenvector subspace of the GNN dynamics. Then the restricted linear map can be written $P|_{U^\perp}: U^\perp \to U^\perp$, and the operator norm $\lambda$ of $P|_{U^\perp}$ equals $\lambda = \sup_\mu|g(\mu)|$ where $g$ is the polynomial.
The subspace $\mathcal{M}\subset \mathbb{R}^{N\times C}$ is defined by a basis and weight vectors as: \(\mathcal{M}:=U \otimes \mathbb{R}^{C}=\left\{\sum_{m=1}^{M} e_{m} \otimes w_{m} \mid w_{m} \in \mathbb{R}^{C}\right\}\) The distance between a representation and the subspace is: \(d_{\mathcal{M}}(X):=\inf \left\{\|X-Y\|_{\mathrm{F}} \mid Y \in \mathcal{M}\right\}\) i.e. the smallest Frobenius distance to the subspace.
The maximum singular value of the transformation $W_{lh}$ is denoted by $s_{lh}$, and $s_l = \prod_{h=1}^Hs_{lh}$.
(A question: why study this problem on a subspace?)
For any $X\in \mathbb{R}^{N\times C}$, the non-linear operation $\sigma$ decreases the distance $d_{\mathcal{M}}$.
This theorem can be proved from three basic lemmas.
Lemma 1:
\(d_{\mathcal{M}}\left(PX\right) \leq \lambda d_{\mathcal{M}}(X)\)
Given the subspace $\mathcal{M}$ with basis $(e_m)_{m\in[M]}$, any $X \in \mathbb{R}^{N\times C}$ can be written as $X=\sum_{m=1}^{N} e_{m} \otimes w_{m}$, and the distance to the subspace is $d^2_{\mathcal{M}}(X)=\sum^N_{m=M+1}||w_m||^2$.
Then $PX$ can be written as \(\begin{aligned} P X &=\sum_{m=1}^{N} P e_{m} \otimes w_{m} \\ &=\sum_{m=1}^{M} P e_{m} \otimes w_{m}+\sum_{m=M+1}^{N} P e_{m} \otimes w_{m} \\ &=\sum_{m=1}^{M} P e_{m} \otimes w_{m}+\sum_{m=M+1}^{N} e_{m} \otimes\left(\lambda_{m} w_{m}\right) \end{aligned}\) (the first term lies in $\mathcal{M}$ and contributes nothing to the distance).
The second term is a linear combination of the eigenvectors, so the distance can be written as \(\begin{aligned} d_{\mathcal{M}}^{2}(P X) &=\sum_{m=M+1}^{N}\left\|\lambda_{m} w_{m}\right\|^{2} \\ & \leq \lambda^{2} \sum_{m=M+1}^{N}\left\|w_{m}\right\|^{2} \\ &=\lambda^{2} d_{\mathcal{M}}^{2}(X) \end{aligned}\) where $\lambda$ is the supremum of the $|\lambda_m|$.
Lemma 2: \(d_{\mathcal{M}}\left(XW_{lh}\right) \leq s_{lh} d_{\mathcal{M}}(X)\) The proof is the same as for the first lemma; whether the matrix ($W$ or $P$) multiplies from the right or from the left does not matter much, since $P$ is symmetric.
Lemma 3
\(d_{\mathcal{M}}\left(\sigma(X)\right) \leq d_{\mathcal{M}}(X)\)
The proof of Lemma 3 differs from the first two, since the activation function is element-wise, not vector-wise. First, we express the basis over both the node dimension $N$ and the channel dimension $C$: let $(e_c')_{c\in [C]}$ be the standard basis of $\mathbb{R}^C$, giving the basis $(e_n \otimes e_c')_{c\in [C], n\in [N]}$, so that any $X$ can be decomposed accordingly.
Then \(\begin{aligned} d_{\mathcal{M}}^{2}(X) &=\sum_{n=M+1}^{N}\left\|\sum_{c=1}^{C} a_{n c} e_{c}^{\prime}\right\|^2 \\ &=\sum_{n=M+1}^{N} \sum_{c=1}^{C} a_{n c}^{2} \\ &=\sum_{c=1}^{C}\left(\sum_{n=1}^{N} a_{n c}^{2}-\sum_{n=1}^{M} a_{n c}^{2}\right) \\ &=\sum_{c=1}^{C}\left(\left\|X_{\cdot c}\right\|^{2}-\sum_{n=1}^{M}\left\langle X_{\cdot c}, e_{n}\right\rangle^{2}\right) \end{aligned}\) and the distance after activation is $d_{\mathcal{M}}^{2}(\sigma(X)) = \sum_{c=1}^{C}\left(\left\|X_{\cdot c}^+\right\|^{2}-\sum_{n=1}^{M}\left\langle X_{\cdot c}^+, e_{n}\right\rangle^{2}\right)$
TODO: add the proof of this part.
Then, for a GNN with $\mathcal{M}$ the eigenspace of the largest eigenvalues, the output falls exponentially into that subspace when $s\lambda < 1$.
Actually, the former theorem can also be viewed as an extension of the PageRank (Markov) setting.
The fundamental theorem is: any Markov process on finitely many states converges to a unique stationary (equilibrium) distribution if it is irreducible and aperiodic.
A Markov chain is described by an initial distribution $\pi_0$ over the state space $S$; each step transitions from the current state according to the transition probability matrix $P\in \mathbb{R}^{n\times n}$.
A stationary distribution is an unchanged state: \(\tilde{\pi} = \tilde{\pi} P\)
A Markov chain can have 0, 1, or $\infty$ stationary distributions; to guarantee a unique one, the following properties should be satisfied:
PageRank with random walks is an algorithm with the Markov property on graphs; we will detail its different versions in another blog.
The range of “neighboring” nodes that a node’s representation draws from strongly depends on the graph structure, analogous to the spread of a random walk.
The basic analysis tool is sensitivity analysis (the influence distribution), inspired by PageRank.
The motivation is that the influence between nodes is heavily affected by the graph structure. For example, with the same number of steps but nodes at different positions, the reachable neighborhoods differ significantly.
These differences make us ask whether a large or a small neighborhood is better. The answer is neither: what we need is locality that adapts to each node.
To quantify how a neighbor influences other nodes, the influence distribution is proposed; it gives insight into how large a neighborhood a node draws information from.
The influence distribution is defined as follows. For a simple graph $G = (V, E)$, let $h^{(0)}_x$ be the input feature and $h^{(k)}_x$ the learned hidden feature of node $x \in V$ at the $k$-th (last) layer of the model. The influence score $I(x, y)$ of node $x$ by any node $y \in V$ is the sum of the absolute values of the entries of the Jacobian matrix $\frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}}$. We define the influence distribution $I_x$ of $x \in V$ by normalizing the influence scores: $I_x(y)=I(x,y)/ \sum_z I(x, z)$, or
\(I_{x}(y)=e^{T}\left[\frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}}\right] e /\left(\sum_{z \in V} e^{T}\left[\frac{\partial h_{x}^{(k)}}{\partial h_{z}^{(0)}}\right] e\right)\)
where $e$ is the all-ones vector.
**The finding is:**
Influence distributions of common aggregation schemes are closely connected to the random walk distribution, which has a limiting (stationary) distribution when the graph is non-bipartite.
**Theorem**
Given a $k$-layer GCN with averaging aggregation, assume that all paths in the computation graph of the model are activated with the same probability of success $\rho$. Then the influence distribution $I_x$ for any node $x \in V$ is equivalent, in expectation, to the $k$-step random walk distribution on $\tilde{G}$ starting at node $x$.
It is proved as follows:
The one-step derivative can be described by the non-linear activation indicator, the degree, and the weights. Then \(\begin{aligned} \frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}} &=\sum_{p=1}^{\Psi}\left[\frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}}\right]_{p} \\ &=\sum_{p=1}^{\Psi} \prod_{l=k}^{1} \frac{1}{\widetilde{\operatorname{deg}}\left(v_{p}^{l}\right)} \cdot \operatorname{diag}\left(1_{f_{v_{p}^{l}}^{(l)}>0}\right) \cdot W_{l} \end{aligned}\) where $\Psi$ is the total number of paths, and the product runs over the nodes on each path.
For a single entry, it can be rewritten as: \(\left[\frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}}\right]_{p}^{(i, j)}=\prod_{l=k}^{1} \frac{1}{\tilde{\operatorname{deg}}\left(v_{p}^{l}\right)} \sum_{q=1}^{\Phi} Z_{q} \prod_{l=k}^{1} w_{q}^{(l)}\) $Z_q$ indicates whether the activation fires. The simplification made here is the assumption that each activation fires with some probability independent of the weights and input. Then the non-linearity can be averaged away: \(\mathbb{E}\left [ \left[\frac{\partial h_{x}^{(k)}}{\partial h_{y}^{(0)}}\right]_{p} \right ] = \rho \cdot \prod^1_{l=k}W_l \cdot \left(\sum_{p=1}^{\Psi} \prod_{l=k}^{1} \frac{1}{\widetilde{\operatorname{deg}}\left(v_{p}^{l}\right)}\right)\) The random walk probability is exactly the last term; in effect, the aggregation is just a random walk.
The distribution changes slightly for the symmetric GCN form, being normalized by $(\widetilde{\operatorname{deg}}(x)\widetilde{\operatorname{deg}}(y))^{-\frac{1}{2}}$.
(What role does $W$ play here?)
Then we can unify GCN with the random walk: both share the same stationary distribution.
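A minimal sketch of this random-walk view (toy graph, names mine): the $k$-step walk distribution on $\tilde{G}$, the graph with self-loops, forgets its start node as $k$ grows and approaches the degree-proportional stationary distribution, which is exactly the oversmoothing limit.

```python
import numpy as np

# Toy connected graph; GCN-style self-loops give \tilde{A} = A + I.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_t = A + np.eye(4)
deg = A_t.sum(1)
T = A_t / deg[:, None]          # row-stochastic transition matrix

def k_step_distribution(x, k):
    """Distribution of a k-step random walk started at node x."""
    p = np.zeros(len(deg))
    p[x] = 1.0
    for _ in range(k):
        p = p @ T
    return p

# The walk forgets its start node: p -> pi with pi_i proportional to deg_i.
pi = deg / deg.sum()
p_long = k_step_distribution(0, 100)
```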
Adapting to different local neighborhood ranges enables better structure-aware representations.
The model is very simple:
It determines the importance of the different ranges after looking at all of them; max-pooling picks the layer with the maximum influence.
Inspired by JKNet, which gives a unifying view via random walks (PageRank): the random walk converges to a stationary distribution (oversmoothing) regardless of the starting node. (Need to check what the stationary distribution is.)
Thus, to keep the connection to the original node, it is natural to use personalized PageRank, which gives the walk a chance to return to the root node, preserving locality and avoiding oversmoothing. This allows the network to use a much larger neighborhood range.
Personalized PageRank takes the form:
$\boldsymbol{\pi}_{\mathrm{ppr}}\left(\boldsymbol{i}_{x}\right)=(1-\alpha) \hat{\tilde{A}} \boldsymbol{\pi}_{\mathrm{ppr}}\left(\boldsymbol{i}_{x}\right)+\alpha \boldsymbol{i}_{x}.$
The solution will be:
\(\pi_{ppr}(i_x) = \alpha(I_n - (1-\alpha)\hat{\tilde{A}})^{-1}i_x\)
With the stationary distribution, the stationary hidden representation can be written as:
\(Z_{APPNP} = \text{softmax}\left(\alpha(I_n - (1-\alpha)\hat{\tilde{A}})^{-1}H\right)\)
where $H = f_\theta(X)$. The transformation is thus naturally separated from the aggregation; this allows a much larger range without changing the neural network (possible benefits are discussed in later papers).
However, $(I_n - (1-\alpha)\hat{\tilde{A}})^{-1}$ is a dense matrix and computationally expensive. An approximate version is topic-sensitive PageRank via power iteration: \(\begin{aligned} \boldsymbol{Z}^{(0)} &=\boldsymbol{H}=f_{\theta}(\boldsymbol{X}) \\ \boldsymbol{Z}^{(k+1)} &=(1-\alpha) \hat{\tilde{A}} \boldsymbol{Z}^{(k)}+\alpha \boldsymbol{H} \\ \boldsymbol{Z}^{(K)} &=\operatorname{softmax}\left((1-\alpha) \hat{\tilde{A}} \boldsymbol{Z}^{(K-1)}+\alpha \boldsymbol{H}\right) \end{aligned}\)
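A minimal numpy sketch of this power iteration (the toy normalization and names are mine, not the paper's); iterating long enough recovers the closed-form PPR solution:

```python
import numpy as np

def appnp_propagate(A_hat, H, alpha=0.1, K=10):
    """APPNP power iteration: Z^{k+1} = (1-alpha) * A_hat @ Z^k + alpha * H.

    A_hat: (n, n) symmetrically normalized adjacency with self-loops.
    H:     (n, c) predictions f_theta(X); the final softmax is omitted.
    """
    Z = H.copy()
    for _ in range(K):
        Z = (1 - alpha) * A_hat @ Z + alpha * H
    return Z
```

Since the spectral radius of $\hat{\tilde{A}}$ is 1, the iteration is a contraction and converges to $\alpha(I_n-(1-\alpha)\hat{\tilde{A}})^{-1}H$.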
As JKNet proposes max-pooling to select among layers, GPRGNN uses learnable parameters on a generalized PageRank (GPR) for the layer selection. (How the original GPRGNN is optimized is not easy to understand.)
GPR was first proposed for graph clustering, and takes the form:
$\sum_{k=0}^{\infty} \gamma_{k} \tilde{\mathbf{A}}_{\mathrm{sym}}^{k} \mathbf{H}^{(0)}=\sum_{k=0}^{\infty} \gamma_{k} \mathbf{H}^{(k)}.$
Clustering of the graph is performed locally by thresholding the GPR score.
Other PageRank variants can be viewed as specific choices of GPR; APPNP corresponds to the fixed weights $\gamma_k = \alpha(1-\alpha)^k$.
Learnable $\gamma_{k}$ give the model the ability to use long- or short-range information adaptively.
The final form is similar to APPNP:
\[\begin{aligned} \boldsymbol{H}^{(0)} &=\boldsymbol{H}=f_{\theta}(\boldsymbol{X}) \\ \boldsymbol{H}^{(k)} &=\hat{\tilde{A}} \boldsymbol{H}^{(k-1)} \\ \boldsymbol{Z} &= \sum_{k=0}^K\gamma_k\boldsymbol{H}^{(k)}\\ \boldsymbol{\hat{P}} &=\operatorname{softmax}\left(\boldsymbol{Z}\right) \end{aligned}\]We can see that different graphs behave differently: heterophilous graphs require more information from further neighborhoods.
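A toy numpy sketch of the GPR combination, treating the $\gamma_k$ as plain floats rather than learned parameters:

```python
import numpy as np

def gpr_propagate(A_hat, H, gammas):
    """Generalized PageRank: Z = sum_k gamma_k * A_hat^k @ H.
    gammas[k] weights the k-hop features; in GPRGNN they are learned."""
    Z = gammas[0] * H
    Hk = H
    for g in gammas[1:]:
        Hk = A_hat @ Hk          # next hop: H^{(k)} = A_hat @ H^{(k-1)}
        Z = Z + g * Hk
    return Z
```

With $\gamma_k = \alpha(1-\alpha)^k$ this reduces to a truncated APPNP.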
The filter of GPRGNN is: $g_{\gamma, K}(\lambda)=\sum_{k=0}^{K} \gamma_{k} \lambda^{k}$
Assume $\sum_k \gamma_k = 1$; the $\gamma_k$ may be negative. If $\gamma_k > 0$ the filter is low-pass; if $\gamma_k < 0$, high-pass.
Lemma 1: read.
TODO: lemma 2
Assume the graph $G$ is connected and the training set contains nodes from each of the classes. Also assume that $k'$ is large enough that the over-smoothing effect occurs for $H^{(k)}, \forall k \ge k'$, and that these terms dominate the contribution to the final output $Z$. Then the gradients of the $\gamma_k$ are identical in sign for all $k \ge k'$.
This means that when oversmoothing happens, the $\gamma_k$ for large $k$ are driven toward 0.
A vector $x\in \mathbb{R}^n$ defined on the vertices of the graph is a graph signal. The basic operations are:
The general variational form of the graph signal is: \(\min{\Delta(u)} \text{ subject to } (u, u)_{\tilde{D}}=1,\; (u, u_j)_{\tilde{D}}=0,\; j \in \{1, \cdots, i-1\}\) The solution satisfies: \(Lu = \lambda \tilde{D}u\) The generalized eigenvalue corresponds to the frequency of the graph signal.
The Fourier basis is given by these generalized eigenvectors.
With this background, we can revisit GNNs from the graph-signal-processing view. The answer: with informative features, a GNN only performs a low-pass filter for denoising, without any essential role for the non-linearity.
The motivating question of this paper: why and when do graph neural networks work well for vertex classification?
The experiments verify this by adding noise of different levels and frequencies and training an MLP.
The feature is reconstructed from the first $k$ frequency components, roughly $\hat{X}_k= U[:, :k]\,U[:, :k]^T X$ (up to the $\tilde{D}^{-\frac{1}{2}}$ normalization).
TODO: add more experimental explanation.
GNNs effectively provide low-frequency, smoothed data.
TODO: some review on the complexity and generalization
Under the assumption that the feature is composed of the true feature $\bar{x}$ and noise $z(i)$, we have
Lemma 5 Suppose Assumption 4. For any $0 < \delta < 1/2$, with probability at least $1 − \delta$, we have \(\left\|\bar{X}-\tilde{A}_{r w}^{k} X\right\|_{D} \leq \sqrt{k \epsilon}\|\bar{X}\|_{D}+O(\sqrt{\log (1 / \delta) R(2 k)}) \mathbb{E}\left[\|Z\|_{D}\right]\) where $R(2k)$ is a probability that a random walk with a random initial vertex returns to the initial vertex after $2k$ steps.
The first term is the bias induced by the filter; the second is the variance from the original noise. The bias increases with more hops of the adjacency matrix at a rate of $O(\sqrt{\epsilon})$, while the variance decreases like $O(1/\operatorname{deg}^{k/2})$.
Then the optimal $k$ is given by: suppose that $\mathbb{E}[\|Z\|_D] \le \rho \|\bar{X}\|_D$ for some $\rho = O(1)$. Let $k^* = O(\log(\log(1/\delta)\rho/\epsilon))$, and suppose there exist constants $C_d$ and $\bar{d} > 1$ such that $R(2k) \le C_d/\bar{d}^k$ for $k \le k^*$. Then, by choosing $k = k^*$, the right-hand side of (6) is $\tilde{O}(\sqrt{\epsilon})$.
TODO: find the proof.
Understanding of GNNs:
Similar to the heterophily works, Scattering GCN uses band-pass filtering of graph signals, since the low-pass signal only captures local activation patterns. The neural pathways encode higher-order forms of regularity in graphs, carried by higher-frequency signal.
The filter is defined by the lazy random walk matrix: \(P=\frac{1}{2}(I_n+WD^{-1})\) where $x_t = P^t x$ gives the low-frequency part used in the geometric GNN.
The wavelets are then defined as: \(\left\{\begin{array}{l} \boldsymbol{\Psi}_{0}:=\boldsymbol{I}_{n}-\boldsymbol{P} \\ \boldsymbol{\Psi}_{k}:=\boldsymbol{P}^{2^{k-1}}-\boldsymbol{P}^{2^{k}}=\boldsymbol{P}^{2^{k-1}}\left(\boldsymbol{I}_{n}-\boldsymbol{P}^{2^{k-1}}\right), \quad k \geq 1 \end{array}\right.\) The geometric scattering transform is defined as: \(U_px=\boldsymbol{\Psi}_{k_m}\left|\boldsymbol{\Psi}_{k_{m-1}}\cdots\left|\boldsymbol{\Psi}_{k_1}x\right|\cdots\right|\) i.e., a cascade of wavelet filters with the element-wise absolute value as the non-linearity.
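A small numpy sketch of the lazy-walk wavelets (the graph and helper names are mine); note the filters telescope, so $\sum_{k\le J}\Psi_k = I - P^{2^J}$:

```python
import numpy as np

def lazy_walk(A):
    """Lazy random walk P = (I + A D^{-1}) / 2 for a symmetric adjacency A."""
    d_inv = 1.0 / A.sum(axis=0)
    return 0.5 * (np.eye(A.shape[0]) + A * d_inv)  # column j scaled by 1/d_j

def wavelets(P, J):
    """Diffusion wavelets Psi_0 = I - P, Psi_k = P^{2^{k-1}} - P^{2^k}."""
    mats = [np.eye(P.shape[0]) - P]
    Pk = P                      # holds P^{2^{k-1}}
    for _ in range(1, J + 1):
        mats.append(Pk - Pk @ Pk)
        Pk = Pk @ Pk
    return mats
```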
Then all features are combined together with a residual connection and a cutoff frequency.
The theory part only uses specific graphs (e.g. cyclic or bipartite) that GCN cannot distinguish but the scattering channels can, as evidence of better expressivity.
S2GC tries to extract the higher frequencies (via self-loops and layer selection) with a modified Markov diffusion kernel, enlarging the receptive field of the GNN. It is similar in spirit to APPNP, with another explanation and solution.
It is similar to the shortest-path kernel we introduced before, focusing on co-occurrence along a Markov chain: \(d_{i j}(K)=\left\|\mathbf{Z}(K)\left(\mathbf{x}_{i}(0)-\mathbf{x}_{j}(0)\right)\right\|_{2}^{2}\) where $Z(K)=\frac{1}{K}\sum_{k=1}^KT^k$ and $T$ is the transition matrix (normalized adjacency) $T=A' = (D + I)^{-1/2} ( A + I ) (D + I)^{-1/2}$.
It reduces to the simple form \(\hat{Y}=\text{softmax}(\frac{1}{K}\sum_{k=0}^K\tilde{T}^kXW)\) with the Laplacian regularization: \(\min{ h^TLh +\frac{1}{2}||h_i - x_i||_2^2} = \min{\frac{1}{2}\left(\sum_{i, j=1}^{n} \widetilde{\mathbf{A}}_{i j}\left\|\frac{\mathbf{h}_{i}}{\sqrt{d_{i}}}-\frac{\mathbf{h}_{j}}{\sqrt{d_{j}}}\right\|_{2}^{2}\right)+\frac{1}{2}\left(\sum_{i=1}^{n}\left\|\mathbf{h}_{i}-\mathbf{x}_{i}\right\|_{2}^{2}\right)}\) Then add the self-loop term: \(\hat{Y}=\operatorname{softmax}\left(\frac{1}{K} \sum_{k=1}^{K}\left((1-\alpha) \widetilde{\mathbf{T}}^{k} \mathbf{X}+\alpha \mathbf{X}\right) \mathbf{W}\right)\)
Theorem 1: $N(\tilde{T}^0)\subseteq N(\tilde{T}^1)\subseteq \cdots \subseteq N(\tilde{T}^K)$ — smaller neighborhoods are contained in the larger ones.
Theorem 2: the energy at the largest-$k$ (infinite-range) receptive field does not dominate the total energy of the filter (a different perspective from oversquashing).
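A toy sketch of the S2GC propagation above (the softmax and weight matrix $W$ are omitted; the helper name is mine):

```python
import numpy as np

def s2gc_features(T_tilde, X, K=4, alpha=0.05):
    """S2GC: (1/K) * sum_{k=1..K} ((1-alpha) * T^k X + alpha * X)."""
    out = np.zeros_like(X)
    Tk_X = X
    for _ in range(K):
        Tk_X = T_tilde @ Tk_X             # T^k X, computed incrementally
        out += (1 - alpha) * Tk_X + alpha * X
    return out / K
```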
TODO reading
A GNN stacks different orders of neighborhoods sequentially: it first aggregates the first-order neighbors, then the second order, recursively, as follows.
The focus is then on the drawbacks of this procedure, and on the best way to learn from the multi-hop neighborhood.
Oversquashing is also a problem in RNNs, where ever-growing information is squeezed into a fixed-size representation space. In GNNs, the number of propagated messages grows exponentially with distance, so messages from distant nodes may fail to get through, breaking long-range dependencies.
Experiments show that GNNs consistently underfit when fitting tree-structured graphs. Empirically, GNNs overfit short-range signal rather than the long-range information squashed in the bottleneck.
The solution: add a direct connection between distant nodes. An easy version is a fully-connected final GNN layer. (This is also the reason why Graphormer can help.)
Other ablations find that a larger hidden dimension brings no significant improvement.
Even making half the layers fully connected helps a lot.
Not all direct interactions are necessary in the absence of graph structure.
Given this sequential behavior, AdaBoost is a good fit for the sequential relationship between different orders: an RNN-like GCN with iterative updating of the node weights.
our AdaGCN also follows this direction by choosing an appropriate f in each layer rather than directly deepen GCN layers
The base classifier is designed as: \(Z^l = f_\theta(\hat{A}^lX)\) with only a linear transformation.
AdaBoost (the multi-class SAMME variant) is defined as: \(\begin{aligned} e r r^{(l)} &=\sum_{i=1}^{n} w_{i} \mathbb{I}\left(c_{i} \neq f_{\theta}^{(l)}\left(x_{i}\right)\right) / \sum_{i=1}^{n} w_{i} \\ \alpha^{(l)} &=\log \frac{1-e r r^{(l)}}{\operatorname{err}^{(l)}}+\log (K-1) \end{aligned}\) where $K$ is the number of classes. \(w_{i} \leftarrow w_{i} \cdot \exp \left(\alpha^{(l)} \cdot \mathbb{I}\left(c_{i} \neq f_{\theta}^{(l)}\left(x_{i}\right)\right)\right), i=1, \ldots, n\) The difference is that the same form of $f_\theta$ is used at every layer, but with different parameters.
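A minimal sketch of one SAMME round as used here, with $K$ the number of classes (the toy setup is mine):

```python
import numpy as np

def samme_step(w, y_true, y_pred, K):
    """One multi-class AdaBoost (SAMME) round: weighted error, classifier
    weight alpha, and the node-weight update."""
    miss = (y_true != y_pred).astype(float)
    err = (w * miss).sum() / w.sum()
    alpha = np.log((1 - err) / err) + np.log(K - 1)
    w = w * np.exp(alpha * miss)          # up-weight the misclassified nodes
    return w / w.sum(), alpha
```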
Inspired by residual connections in CV, GNNs also have specially designed residual connections; JKNet and S2GC can be viewed as dense connections in GNNs. Residual connections in GNNs, however, only slow the performance degradation rather than enhance performance, so a new perspective is needed.
GCNII proposes two simple yet effective techniques: initial residual and identity mapping.
The final form: \(\mathbf{H}^{(\ell+1)}=\sigma\left(\left(\left(1-\alpha_{\ell}\right) \tilde{\mathbf{P}} \mathbf{H}^{(\ell)}+\alpha_{\ell} \mathbf{H}^{(0)}\right)\left(\left(1-\beta_{\ell}\right) \mathbf{I}_{n}+\beta_{\ell} \mathbf{W}^{(\ell)}\right)\right)\) Initial residual is somewhat similar to feature-similarity preservation. (A question: with a large embedding size, use an MLP layer?)
This differs from APPNP, which is a linear combination: GCNII is genuinely deep, with non-linear transformations.
Identity mapping is similar to a residual connection, but it also interacts with the non-linearity and the initial residual. It is hard to find much difference.
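A single GCNII layer in numpy, straight from the formula above (ReLU as $\sigma$; the toy setup is mine):

```python
import numpy as np

def gcnii_layer(P_tilde, H, H0, W, alpha=0.1, beta=0.5):
    """H^{l+1} = relu(((1-a) P H + a H0) ((1-b) I + b W)):
    initial residual (alpha) plus identity mapping (beta)."""
    support = (1 - alpha) * P_tilde @ H + alpha * H0
    out = support @ ((1 - beta) * np.eye(W.shape[0]) + beta * W)
    return np.maximum(out, 0.0)
```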
Theory part
Theorem 1. Assume the self-looped graph $\tilde{G}$ is connected. Let $\mathbf{h}^{(K)} = \left( \frac{I_n+\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}}{2}\right)^K x$ denote the representation obtained by applying a $K$-layer renormalized graph convolution with residual connection to a graph signal $x$. Let $\lambda_{\tilde{G}}$ denote the spectral gap of the self-looped graph $\tilde{G}$, that is, the least nonzero eigenvalue of the normalized Laplacian $\tilde{L} = I_n - \tilde{D}^{-1/2}\tilde{A} \tilde{D}^{-1/2}$. We have:
1) As $K$ goes to infinity, $\mathbf{h}^{(K)}$ converges to $\pi = \frac{\langle\tilde{D}^{1/2}\mathbf{1},\, x\rangle}{2m+n}\cdot \tilde{D}^{1/2}\mathbf{1}$, where $\mathbf{1}$ denotes the all-one vector.
2) The convergence rate is determined by \(\mathbf{h}^{(K)}=\pi \pm\left(\sum_{i=1}^{n} x_{i}\right) \cdot\left(1-\frac{\lambda_{\tilde{G}}^{2}}{2}\right)^{K} \cdot \mathbf{1}\) The key steps of the proof are:
Split the original $h^{(K)}$ into a linear combination of basis vectors (which represent random walks): \(\tilde{\mathbf{D}}^{1 / 2} \mathbf{x}=\left(\mathbf{D}+\mathbf{I}_{n}\right)^{1 / 2} \mathbf{x}=\sum_{i=1}^{n}\left(\mathbf{x}(i) \sqrt{d_{i}+1}\right) \cdot \mathbf{e}_{\mathbf{i}}\)
Use the lemma: let $\mathbf{p}^{(K)}_i = \left(\frac{I_n+\tilde{A} \tilde{D}^{-1}}{2}\right)^K\mathbf{e}_i$ be the $K$-th transition probability vector from node $i$ on the connected self-looped graph $\tilde{G}$, and let $\lambda_{\tilde{G}}$ denote its spectral gap. The $j$-th entry of $\mathbf{p}_{i}^{(K)}$ can be bounded by \(\left|\mathbf{p}_{i}^{(K)}(j)-\frac{d_{j}+1}{2 m+n}\right| \leq \sqrt{\frac{d_{j}+1}{d_{i}+1}}\left(1-\frac{\lambda_{\tilde{G}}^{2}}{2}\right)^{K} .\)
Theorem 2
Consider the self-looped graph $\tilde{G}$ and a graph signal $x$. A $K$-layer GCNII can express a $K$-order polynomial filter $\left(\sum_{l=0}^K\theta_l\tilde{L}^l\right)x$ with arbitrary coefficients $\theta$.
Too many assumptions; not so convincing.
The two most popular regularization methods are batch norm and dropout. However, dropout does not work well on graph architectures, and batch norm cannot capture the relations between nodes.
Various methods have been proposed to enhance generalization, reduce overfitting, and reduce oversmoothing.
DropEdge is a natural extension of dropout. There are two ways to view it:
DropEdge has two steps: randomly remove a fraction of edges, then re-normalize the perturbed adjacency matrix.
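The two steps (sample edges, then re-normalize) can be sketched as follows (a symmetric toy version; the names are mine):

```python
import numpy as np

def drop_edge(A, p, rng):
    """DropEdge: drop each undirected edge with probability p, keep the matrix
    symmetric, then re-normalize with self-loops."""
    iu = np.triu_indices_from(A, k=1)
    keep = rng.random(len(iu[0])) >= p
    A_drop = np.zeros_like(A)
    A_drop[iu] = A[iu] * keep
    A_drop = A_drop + A_drop.T
    A_t = A_drop + np.eye(A.shape[0])     # add self-loops before normalizing
    d = A_t.sum(axis=1)
    return A_t / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
```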
**Proof 1** (the proof is not great)
\[\hat{l}(\mathcal{M}, \epsilon) \le \hat{l}(\mathcal{M}', \epsilon)\]The $\epsilon$-smoothing layer is defined as the first layer whose hidden representation comes within $\epsilon$ of the subspace: $l^{*}(\mathcal{M}, \epsilon):=\min _{l}\left\{d_{\mathcal{M}}\left(\boldsymbol{H}^{(l)}\right)<\epsilon\right\}$
The relaxed $\epsilon$-smoothing layer is an upper bound on the $\epsilon$-smoothing layer: \(\hat{l}(\mathcal{M}, \epsilon):= \left\lceil\frac{\log \left(\epsilon / d_{\mathcal{M}}(\boldsymbol{X})\right)}{\log s \lambda}\right\rceil\) where $s$ is the largest singular value of the weight matrices and $\lambda$ is the second-largest eigenvalue of $\hat{A}$.
We also need to adopt some concepts from Lovász et al. (1993) in proving Theorem 1. Consider the graph $G$ as an electrical network, where each edge represents a unit resistance. The effective resistance $R_{st}$ from node $s$ to node $t$ is defined as the total resistance between them. According to Corollary 3.3 and Theorem 4.1 (i) in Lovász et al. (1993), we can connect $\lambda$ and $R_{st}$ for each connected component via the commute time, as the following inequality.
Removing any edge causes $R_{st}$ to increase; if the removal splits the graph into two components, more dimensions of information are retained.
Taking the idea of DropEdge further, GRAND proposes more dropout-based tricks together with a consistency (contrastive-style) loss. GRAND has similar advantages, plus a new one: enhanced robustness.
GRAND is two-fold:
Each node's features can be randomly dropped either partially (dropout) or entirely (DropNode).
Decoupled propagation and transformation: $\bar{A}=\sum^K_{k=0}\frac{1}{K+1}\hat{A}^k$, followed by an MLP.
An additional consistency regularization across the different views: \(\mathcal{L}_{\text {con }}=\frac{1}{S} \sum_{s=1}^{S} \sum_{i=0}^{n-1}\left\|\overline{\mathbf{Z}}_{i}^{\prime}-\widetilde{\mathbf{Z}}_{i}^{(s)}\right\|_{2}^{2}\) where $\bar{Z}_i'$ is the (sharpened) mean prediction over the views.
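A sketch of the consistency loss (without the sharpening step GRAND applies to the mean prediction):

```python
import numpy as np

def consistency_loss(Z_list):
    """GRAND-style consistency: squared distance of each view's prediction
    to the mean prediction across the S views."""
    Z_bar = np.mean(Z_list, axis=0)
    return sum(np.sum((Z_bar - Z) ** 2) for Z in Z_list) / len(Z_list)
```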
The theorem is somewhat trivial: it is an interpretation of these losses as regularization terms.
From the perspective of batch norm, a normalization-style regularization for GNNs is proposed, which prevents all node embeddings from becoming too similar.
The idea is somewhat similar to feature-similarity preservation.
Analysis
The understanding: most GNNs perform a special form of Laplacian smoothing, which makes node features more similar to one another. The key idea is to keep the total pairwise feature distance constant across layers, which in turn leaves distant pairs with less similar features, preventing feature mixing across clusters.
Two measurements are proposed: \(\begin{array}{l} \text { row-diff }\left(\mathbf{H}^{(k)}\right)=\frac{1}{n^{2}} \sum_{i, j \in[n]}\left\|\mathbf{h}_{i}^{(k)}-\mathbf{h}_{j}^{(k)}\right\|_{2} \\ \text { col-diff }\left(\mathbf{H}^{(k)}\right)=\frac{1}{d^{2}} \sum_{i, j \in[d]}\left\|\mathbf{h}_{\cdot i}^{(k)} /\right\| \mathbf{h}_{\cdot i}^{(k)}\left\|_{1}-\mathbf{h}_{\cdot j}^{(k)} /\right\| \mathbf{h}_{\cdot j}^{(k)}\left\|_{1}\right\|_{2} \end{array}\) Row-diff, the average of all pairwise distances between node features, quantifies node-wise oversmoothing; col-diff quantifies feature-wise smoothness.
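The two measures in numpy, a direct transcription of the formulas:

```python
import numpy as np

def row_diff(H):
    """Average pairwise Euclidean distance between node representations."""
    d = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)
    return d.sum() / (H.shape[0] ** 2)

def col_diff(H):
    """Average pairwise distance between L1-normalized feature columns."""
    Hn = H / np.abs(H).sum(axis=0, keepdims=True)
    d = np.linalg.norm(Hn[:, :, None] - Hn[:, None, :], axis=0)
    return d.sum() / (H.shape[1] ** 2)
```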
It is actually quite strange that col-diff decreases; think about what this measurement really captures.
The reason why row-difference changes so sharply is still under discussion.
A new interpretation: graph-regularized least squares \(\min _{\overline{\mathbf{X}}} \sum_{i \in \mathcal{V}}\left\|\overline{\mathbf{x}}_{i}-\mathbf{x}_{i}\right\|_{\tilde{\mathbf{D}}}^{2}+\sum_{(i, j) \in \mathcal{E}}\left\|\overline{\mathbf{x}}_{i}-\overline{\mathbf{x}}_{j}\right\|_{2}^{2}\) where $\bar{X}\in \mathbb{R}^{N\times d}$ and $\|z_i\|_{\tilde{\mathbf{D}}}^{2} = z_i^T\tilde{\mathbf{D}}z_i$, with the closed-form solution $\bar{X}=(2I - \tilde{A}_{rw})^{-1}X$.
What really matters is preserving the corresponding representation space. We should not only smooth within the same cluster, but also keep distant disconnected pairs apart: \(\min _{\overline{\mathbf{X}}} \sum_{i \in \mathcal{V}}\left\|\overline{\mathbf{x}}_{i}-\mathbf{x}_{i}\right\|_{\tilde{\mathbf{D}}}^{2}+\sum_{(i, j) \in \mathcal{E}}\left\|\overline{\mathbf{x}}_{i}-\overline{\mathbf{x}}_{j}\right\|_{2}^{2} - \lambda \sum_{(i, j) \notin \mathcal{E}}\left\|\overline{\mathbf{x}}_{i}-\overline{\mathbf{x}}_{j}\right\|_{2}^{2}\) The total distance should stay the same: \(\sum_{(i, j) \in \mathcal{E}}\left\|\dot{\mathbf{x}}_{i}-\dot{\mathbf{x}}_{j}\right\|_{2}^{2}+\sum_{(i, j) \notin \mathcal{E}}\left\|\dot{\mathbf{x}}_{i}-\dot{\mathbf{x}}_{j}\right\|_{2}^{2}=\sum_{(i, j) \in \mathcal{E}}\left\|\mathbf{x}_{i}-\mathbf{x}_{j}\right\|_{2}^{2}+\sum_{(i, j) \notin \mathcal{E}}\left\|\mathbf{x}_{i}-\mathbf{x}_{j}\right\|_{2}^{2}\) To avoid high computational cost, the computation uses: \(\operatorname{TPSD}(\tilde{\mathbf{X}})=\sum_{i, j \in[n]}\left\|\tilde{\mathbf{x}}_{i}-\tilde{\mathbf{x}}_{j}\right\|_{2}^{2}=2 n^{2}\left(\frac{1}{n} \sum_{i=1}^{n}\left\|\tilde{\mathbf{x}}_{i}\right\|_{2}^{2}-\left\|\frac{1}{n} \sum_{i=1}^{n} \tilde{\mathbf{x}}_{i}\right\|_{2}^{2}\right)\) A further simplification is $\operatorname{TPSD}(\tilde{\mathbf{X}}) = \operatorname{TPSD}(\tilde{\mathbf{X}}^c) = 2n\|\tilde{\mathbf{X}}^c\|^2_F$, where $\tilde{X}^c = \tilde{X} - \operatorname{mean}(\tilde{X})$ is the centered representation.
The final procedure is center-and-rescale: \(\begin{array}{l} \tilde{\mathbf{x}}_{i}^{c}=\tilde{\mathbf{x}}_{i}-\frac{1}{n} \sum_{i=1}^{n} \tilde{\mathbf{x}}_{i} \\ \dot{\mathbf{x}}_{i}=s \cdot \frac{\tilde{\mathbf{x}}_{i}^{c}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n}\left\|\tilde{\mathbf{x}}_{i}^{c}\right\|_{2}^{2}}}=s \sqrt{n} \cdot \frac{\tilde{\mathbf{x}}_{i}^{c}}{\sqrt{\left\|\tilde{\mathbf{X}}^{c}\right\|_{F}^{2}}} \end{array}\)
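The center-and-rescale step in numpy; afterwards the mean squared row norm is exactly $s^2$, so the total pairwise squared distance is pinned across layers:

```python
import numpy as np

def pair_norm(X, s=1.0):
    """PairNorm: center node rows, then rescale so that
    (1/n) * sum_i ||x_i||^2 == s^2."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return s * np.sqrt(X.shape[0]) * Xc / np.linalg.norm(Xc)
```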
GroupNorm gives another interpretation of oversmoothing: the inter-class distance becomes smaller than the intra-class distance. It is also somewhat similar to DiffPool.
The main thing PairNorm ignores is same-group nodes without connections. Nodes within the same community/class should be similar to facilitate classification, while different classes should be separated in the embedding space from a global perspective.
The challenges are:
Analysis measurement
The group ratio is defined as: \(R_{\mathrm{Group}}=\frac{\frac{1}{(C-1)^{2}} \sum_{i \neq j}\left(\frac{1}{\left|\boldsymbol{L}_{i} \| \boldsymbol{L}_{j}\right|} \sum_{h_{i v} \in \boldsymbol{L}_{i}} \sum_{h_{j v^{\prime}} \in \boldsymbol{L}_{j}}\left\|h_{i v}-h_{j v^{\prime}}\right\|_{2}\right)}{\frac{1}{C} \sum_{i}\left(\frac{1}{\left|\boldsymbol{L}_{i}\right|^{2}} \sum_{h_{i v}, h_{i v^{\prime}} \in \boldsymbol{L}_{i}}\left\|h_{i v}-h_{i v^{\prime}}\right\|_{2}\right)}\) Instance information gain is defined as the mutual information: \(G_{\text {Ins }}=I(\mathcal{X} ; \mathcal{H})=\sum_{x_{v} \in \mathcal{X}, h_{v} \in \mathcal{H}} P_{\mathcal{X H}}\left(x_{v}, h_{v}\right) \log \frac{P_{\mathcal{X} \mathcal{H}}\left(x_{v}, h_{v}\right)}{P_{\mathcal{X}}\left(x_{v}\right) P_{\mathcal{H}}\left(h_{v}\right)}\) The representation
The metrics proposed in this line of work are tailored to the problems the method already solves, rather than designed to find new insight.
GroupNorm: normalize the node embeddings group by group; within each group, embeddings are rescaled to be more similar.
The two steps are:
The empirical study has the findings:
skip connection group:
Normalization
Node-wise scaling: $\operatorname{norm}\left(\mathbf{x}_{i} ; p\right)=\frac{\mathbf{x}_{i}}{\operatorname{std}\left(\mathbf{x}_{i}\right)^{\frac{1}{p}}}$
Mean subtraction: $\operatorname{norm}\left(\mathbf{x}_{(k)}\right)=\mathbf{x}_{(k)}-\mathbb{E}\left[\mathbf{x}_{(k)}\right]$
Standard normalization: $\operatorname{norm}\left(\mathbf{x}_{(k)}\right)=\gamma \cdot \frac{\mathbf{x}_{(k)}-\mathbb{E}\left[\mathbf{x}_{(k)}\right]}{\operatorname{std}\left(\mathbf{x}_{(k)}\right)}+\beta$
observation:
Drop observation
The difference between a GNN and a basic MLP is the aggregation (neighborhood size). In a traditional GNN, a larger neighborhood means more parameters, which leads to overfitting. Many papers propose decoupling transformation and aggregation. So what is the key reason for oversmoothing?
The key factor compromising the performance is the entanglement of representation transformation and propagation; decoupling them is the key component.
where $\|\cdot\|$ denotes the Euclidean norm.
The smoothness score decays quickly in a well-trained GCN; the disentangled model, however, does not come down quickly under linear propagation alone. The disentangled architecture is: \(\begin{aligned} Z &=\operatorname{MLP}(X) \\ X_{o u t} &=\operatorname{softmax}\left(\widehat{A}^{k} Z\right) \end{aligned}\)
Nothing new: repeated application of $D^{-1}A$ or $D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ converges to a fixed vector.
DAGNN's design utilizes an adaptive adjustment mechanism that balances information from local and global neighborhoods for each node: \(\begin{array}{ll} Z=\operatorname{MLP}(X) & \in \mathbb{R}^{n \times c} \\ H_{\ell}=\widehat{A}^{\ell} Z, \ell=1,2, \cdots, k & \in \mathbb{R}^{n \times c} \\ H=\operatorname{stack}\left(Z, H_{1}, \cdots, H_{k}\right) & \in \mathbb{R}^{n \times(k+1) \times c} \\ S=\sigma(H s) & \in \mathbb{R}^{n \times(k+1) \times 1} \\ \widetilde{S}=\operatorname{reshape}(S) & \in \mathbb{R}^{n \times 1 \times(k+1)} \\ X_{\text {out }}=\operatorname{softmax}(\text { squeeze }(\widetilde{S} H)) & \in \mathbb{R}^{n \times c} \end{array}\) where $s\in \mathbb{R}^{c \times 1}$ is a trainable projection vector and $c$ is the number of classes.
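The table above as a numpy sketch (sigmoid gate, softmax omitted; `mlp` and the toy inputs are placeholders of mine):

```python
import numpy as np

def dagnn_forward(A_hat, X, s, k, mlp):
    """DAGNN: MLP first, then k hops of propagation, then a learned
    per-node, per-hop gate s that mixes all the hop representations."""
    Z = mlp(X)                                  # (n, c)
    hops = [Z]
    for _ in range(k):
        hops.append(A_hat @ hops[-1])
    H = np.stack(hops, axis=1)                  # (n, k+1, c)
    S = 1.0 / (1.0 + np.exp(-(H @ s)))          # (n, k+1, 1) gate scores
    return (S * H).sum(axis=1)                  # (n, c); softmax omitted
```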
However, another point of view is proposed (the solution is spectral): the transformation layers learn to counteract oversmoothing during training. The understanding is: an untrained GCN does oversmooth, but the learning procedure learns to fight against it; the failure mode is that the model is not trained well enough against the oversmoothness.
This paper propose the understanding:
The forward procedure optimizes the smoothness objective; here is a learning-rate analysis: \(\begin{aligned} \nabla_{X} &=\frac{\partial R(X)}{\partial X}=\frac{1}{2} \frac{\partial \frac{\operatorname{Tr}\left(X^{\top} \Delta X\right)}{\operatorname{Tr}\left(X^{\top} X\right)}}{\partial X}=\frac{\left(\Delta-I \frac{\operatorname{Tr}\left(X^{\top} \Delta X\right)}{\operatorname{Tr}\left(X^{\top} X\right)}\right) X}{\operatorname{Tr}\left(X^{\top} X\right)} \\ X_{m i d} &=X-\eta \nabla_{X}=\frac{(2-\Delta) X}{2-\frac{\operatorname{Tr}\left(X^{\top} \Delta X\right)}{\operatorname{Tr}\left(X^{\top} X\right)}}=\frac{\left(I+D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right)}{2-\frac{\operatorname{Tr}\left(X^{\top} \Delta X\right)}{\operatorname{Tr}\left(X^{\top} X\right)}} X \end{aligned}\)
The backward pass serves the classification loss, which reduces the oversmoothness.
TODO read the proof
the connected nodes share similar representations ("similar" meaning the scale of each feature channel is approximately proportional to the square root of its degree). Why is there no higher-order degree term here?
Solution
Mean subtraction, which makes the dynamics approach the second-largest eigenvalue.
The most interesting part of this paper is that it may reject the previous paper's perspective: even a pure MLP falls into the oversmoothing problem. This paper studies:
Smoothness measurement. The stationary state: \(\hat{\mathrm{A}}_{i, j}^{\infty}=\frac{\left(d_{i}+1\right)^{r}\left(d_{j}+1\right)^{1-r}}{2 M+N}\) Node smoothness level: the similarity with the initial and stationary states: \(\begin{array}{c} \alpha=\operatorname{Sim}\left(\mathbf{x}_{v}^{k}, \mathbf{x}_{v}^{0}\right) \\ \beta=\operatorname{Sim}\left(\mathbf{x}_{v}^{k}, \mathbf{x}_{v}^{\infty}\right) \\ N S L_{v}(k)=\alpha *(1-\beta) \end{array}\)
The number of transformations is $D_t$; the number of propagations is $D_p$.
The experiment doubles the propagation and the transformation separately to test the performance.
When the aggregation is doubled (e.g. 8 layers with 16 propagations), the performance does not change much, so oversmoothing is not the main culprit. Also, with less smoothness the performance does not change much; but with more parameters, performance indeed drops.
So do more parameters cause overfitting?
Both the train and test accuracy of GCN drop, so it is not overfitting (overfitting would keep the train accuracy near 100%); it is underfitting.
The entangled model with residual connections degrades much more slowly, while the disentangled model degrades more quickly.
An MLP without residual connections also drops into this state when more layers are stacked.
The performance decays without residual connections.
This paper provides two quantitative measurements for analysis: MAD for smoothness (similarity between nodes), and MADGap for oversmoothness, which measures the information-to-noise ratio (inter-class vs. intra-class).
With these findings, the paper proposes MADGap regularization and AdaEdge, which removes inter-class edges.
The smoothness is measured by cosine distance: \(D_{i j}=1-\frac{\boldsymbol{H}_{i,:} \cdot \boldsymbol{H}_{j,:}}{\left|\boldsymbol{H}_{i,:}\right| \cdot\left|\boldsymbol{H}_{j,:}\right|} \quad i, j \in[1,2, \cdots, n]\) The observation is that at low layers, the information-to-noise ratio is larger within local neighborhoods.
The MAD value of deep GNN layers gets close to 0.
MADGap is defined by \(\text{MADGap}=\text{MAD}^{rmt}-\text{MAD}^{neb}\) where $rmt$ denotes remote node pairs in the graph topology and $neb$ neighboring pairs.
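A simplified sketch of MAD and MADGap via cosine distances (the paper averages over non-zero mask entries per node; the masks here just pick neighbor vs. remote pairs):

```python
import numpy as np

def mad(H, mask=None):
    """Mean average (cosine) distance over the node pairs selected by mask."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    D = 1.0 - Hn @ Hn.T                       # pairwise cosine distances
    if mask is None:
        mask = np.ones_like(D)
    return (D * mask).sum() / mask.sum()

def mad_gap(H, neighbor_mask):
    """MADGap = MAD over remote pairs minus MAD over neighboring pairs."""
    return mad(H, 1.0 - neighbor_mask) - mad(H, neighbor_mask)
```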
The regularization is defined as the MADGap
Evaluating Deep Graph Neural Networks
On Provable Benefits of Depth in Training Graph Convolutional Networks (Towrite)
ADAPTIVE UNIVERSAL GENERALIZED PAGERANK GRAPH NEURAL NETWORK
Lipschitz Normalization for Self-Attention Layers with Application to Graph Neural Networks (Towrite)
SIMPLE SPECTRAL GRAPH CONVOLUTION
ADAGCN: ADABOOSTING GRAPH CONVOLUTIONAL NETWORKS INTO DEEP MODELS
DIRECTIONAL GRAPH NETWORKS (graph classification) not so related
Graph Neural Networks Inspired by Classical Iterative Algorithms (Towrite)
Two Sides of the Same Coin: Heterophily and Oversmoothing in Graph Convolutional Neural Networks (Towrite)
Bag of Tricks for Training Deeper Graph Neural Networks: A Comprehensive Benchmark Study
Training Graph Neural Networks with 1000 Layers (Toread, not so related)
Optimization of Graph Neural Networks: Implicit Acceleration by Skip Connections and More Depth (toread)
GRAND: Graph Neural Diffusion(toread)
ON THE BOTTLENECK OF GRAPH NEURAL NETWORKS AND ITS PRACTICAL IMPLICATIONS
Revisiting “Over-smoothing” in Deep GCNs
Simple and Deep Graph Convolutional Networks
DROPEDGE: TOWARDS DEEP GRAPH CONVOLUTIONAL NETWORKS ON NODE CLASSIFICATION
PAIRNORM: TACKLING OVERSMOOTHING IN GNNS
Measuring and Relieving the Over-smoothing Problem for Graph Neural Networks from the Topological View
Continuous Graph Neural Networks (toread)
Towards Deeper Graph Neural Networks
GRAPH NEURAL NETWORKS EXPONENTIALLY LOSE EXPRESSIVE POWER FOR NODE CLASSIFICATION
MEASURING AND IMPROVING THE USE OF GRAPH INFORMATION IN GRAPH NEURAL NETWORKS
Optimization and Generalization Analysis of Transduction through Gradient Boosting and Application to Multi-scale Graph Neural Networks (toread)
Graph Random Neural Networks for Semi-Supervised Learning on Graphs
scattering GCN: Overcoming Oversmoothness in Graph Convolutional Networks
Towards Deeper Graph Neural Networks with Differentiable Group Normalization
Bayesian Graph Neural Networks with Adaptive Connection Sampling (toread)
Predict then Propagate: Graph Neural Networks meet Personalized PageRank
Representation Learning on Graphs with Jumping Knowledge Networks
DeepGCNs: Can GCNs Go as Deep as CNNs? (image, not so related)
Revisiting Graph Neural Networks: All We Have is Low-Pass Filters
Intuitively, the desirable representation of node features does not necessarily need many nonlinear transformations $f$ applied to it. This is simply because the feature of each node is normally a one-dimensional sparse vector rather than a multi-dimensional data structure, e.g., an image, which intuitively needs a deep convolutional network to extract high-level representations for vision tasks. This insight has been empirically demonstrated in many recent works, showing that a two-layer fully-connected neural network is a better choice in implementations.