2023 has quietly passed. It was also the first year of my life in which I could sense the world of those top-tier geniuses, and I now know that I am, in truth, a researcher of fairly ordinary talent. If that is so, withdrawing from this competition might be the best choice. The contradiction is that I also have a strong urge to stand alongside these people, and at times I envy the PhD students at elite schools, whose world differs from mine: brilliant people gathered together, striving side by side. In 2024, I hope to keep persisting through the discouragement, work toward becoming a researcher with insight, and wait for opportunity and miracles to arrive. Just as Zweig wrote in Decisive Moments in History, Rouget de Lisle, the soldier who had almost never shown any musical talent, glimpsed the world of the immortal greats during the single short night in which he composed La Marseillaise. I hope to keep honing my six parts of skill at every moment, stay ready for the three parts of luck to arrive, and finally seize the one remaining part of inspiration to leap through the dragon gate, so that for one moment I may glimpse a world that does not belong to me, glimpse the bearing of those people, even if the rest of my life is left to struggle in mediocrity.
2024: please be taciturn, please forget everything past, please keep striving for excellence.
In the coming year, what will the future of the graph domain be, and why do we need it?
What is the advantage of a graph?
Graph can
For the graph domain, the essential question is how to find the graph structure. For instance, a neural network can be a graph; nonetheless, how we define the graph results in different properties. A neural network can be viewed as a graph, e.g., as a DAG or a computation graph, but such views are less informative. How to define a better graph as the network structure could be a good direction (the recent ICLR round has two good submissions).
Graphs can serve many different uses, especially for System II intelligence. Here are a few examples:
Why do GNNs work?
Why do we need graphs?
How do we build a meaningful graph from non-graph data?
How do we remove the structure from graph data?
The mountain light and the forms of things play in the spring sunshine; do not think of turning back merely because of a passing shade. Even on a clear day with no sign of rain, deep among the clouds your clothes will still be dampened.
Today was quite an interesting day: the paper I co-authored with Zhikai finally passed 100 stars, and the rebuttal of another collaborative paper nudged the scores up a bit. Thinking about it, Zhikai and I have been collaborating for a long time now; over dinner tonight we briefly looked back on how this year brought us, step by step, to today. I suddenly want to tell a few stories from this year of exploring the graph domain together with Zhikai, who is the first classmate to truly collaborate with me (on a complete piece of research). This is just a casual record of my daily life. I am grateful that Zhikai appeared in it: the PhD journey is dull, but having a few friends around feels really good.
My first contact with Zhikai was when he joined our group chat two days before coming to the US. I had long heard from JT (my advisor) that two new students would enroll in the spring, but nobody knew who they were. As soon as he joined the chat I added him as a friend and asked whether he needed anything before arriving. I remember pulling him into the MSU Chinese student group so he could look for a suitable sublease, and sending him the Michigan Flyer address so he would know how to get from Detroit to Lansing. But my first meeting with Zhikai was not particularly comfortable: at Monday's group meeting JT asked whether the new student had arrived, and that afternoon Zhikai showed up (I remember he barely said a few words). JT's comment was "so independent" (at the time only Haoyu and I had reached out to Zhikai; everyone felt the new student was full of mystery). And so our lab gained another friend.
Like most students working on graphs, the first thing Zhikai did upon joining the group was read up on spectral graph theory and graph signal processing. "Go back to old school. When I was a PhD student…", plus the classic line that Juanhui and Haoyu never went deep into this direction: every newcomer seems to hear this speech once. Carrying JT's expectations, Zhikai began his reading and presentations. And what was I doing then? It was the second most painful stretch of my PhD (the first being right now). As a typical "30%" (a certain personality type given to leaps of thought), my first PhD project was not so much bold as outright fanciful; but I had just started my PhD, young and arrogant, and did not see it that way. GNNs are so bad, surely I can design an MLP better than any GNN! I tried all kinds of objectives and all kinds of training schemes (and have never trusted bilevel optimization since), yet could not even catch a shadow of the GNN. To break the deadlock, my project direction would make a 180° turn roughly every two weeks, from synthetic design to test aggregation; looking at the slides from that time still makes me cringe. I remember one evening, because the slides kept changing direction and had thoroughly confused all the collaborators, JT talked with me on Zoom for over an hour, then another hour over Slack after I got home, plus half an hour of back-and-forth with Wei and Haoyu in the lab. I also remember the snowstorm that night, the trees glittering with ice crystals, and thinking that perhaps this project really could not be done. (I am also grateful that JT consistently stuck with the project through that period.)
After endless mountains and rivers with seemingly no way out, a village appeared amid shady willows and bright blossoms. Two days later an idea suddenly popped into my head: since I wanted an MLP to beat a GNN so badly, why not directly compare MLPs and GNNs? It deviated slightly from the original research track, but it also seemed able to reach the final destination. (A few weeks later, on the night before a group meeting, another flash of inspiration produced the example that appears in the paper's introduction; my verdict on that first paper is that it was truly a life-shortening work fueled by inspiration.) JT felt the new route was workable but still insisted the original direction was meaningful, so he asked Zhikai to help me with some experiments and carry my original direction forward. (I remember we even came up with a rather abstract heterophily label propagation idea with Zhikai.) Zhikai's experimental results were, unsurprisingly, terrible, and that direction eventually fizzled out, while Zhikai and I began collaborating, with him running some baselines for me. That paper was eventually accepted at NeurIPS, but I do not consider it a good collaboration. In hindsight, during that first collaboration I did not really understand the skills of collaborating and communicating, and I truly did make Zhikai do a great deal of manual labor (hyperparameter tuning).
The days went on like this. One day, when the experiments were nearly done but the deadline was still far away (writing that paper took a full dedicated month, omg), the child of my dad's classmate (an exchange student at UMich) was coming to MSU to visit me. To pick him up, Zhiyu and I left home a bit later than usual, and since waiting around was a bit boring I opened YouTube and happened upon Yao Fu's talk, and the two of us watched it together in the living room. My understanding of GPT was still incomplete then, but I was lucky to have many friends lighting the way: Haonan's history of hardware development, and Jieyu pointing out the bright road of prompting back at NeurIPS. Far-sighted friends like these are truly my treasure. Amid the apprehension and unease, this was the first time I really sat down and took LLMs seriously. A new technology, a new direction; everything felt so novel and exciting, as though everything was a brand-new beginning. That afternoon, after taking the kid to feed the squirrels, I slumped exhausted into a chair in the lab and decided that such a great technology as GPT had to be used on graphs. I sent Zhikai that message along with the talk and asked whether he wanted to work on this direction. He was half-bewildered; and that, it seems, was the starting point of everything.
Before my one-on-one with JT the following week, I asked Zhikai again whether he wanted to do this. The next morning, after we discussed the feasibility of the plan, I remember talking it through with JT at the whiteboard that same day; he thought it was worth a try. In hindsight, our idea was indeed not very novel, but the process was far harder than imagined: hardly anyone had thought about how graph data, with its irregular structure, could be combined with a sequence model like GPT. Aside from the two BERT-based works by Jianan and Eli Chien, there seemed to be no other work to reference. Worse still, few datasets had the text features we needed to work with; nor were we NLP students, so we did not really understand how to use GPT, and we were unfamiliar with calling the API. Even JT told us, "I think this idea is weird but you guys can have a try," and "We do not change our direction, it is how high we climb on the mountain." I remember replying, "but this is a new generation."
Amid the uncertainty we inched forward; with no dataset available, Zhikai dug through old literature and hand-built one himself. The turning point came two or three days after NeurIPS ended: while bidding on papers I saw Xiaoxin's graph LLM paper and realized they were also working on LLMs. More astonishing still, they had already prepared several standard datasets. Since we could not see the original text, we even guessed the paper came from the MSRA group where Jianan had interned. Once we got Xiaoxin's open-sourced datasets, our experiments finally filled out; unexpectedly, the paper ended up as a technical report with an assortment of miscellaneous findings. Completely beyond our expectation, the paper quickly drew attention (of course, among all my papers it is also the one I promoted hardest), and the GitHub repo's stars kept climbing. Two months after the paper went out, we realized we might have bet right.
Building on the observations of the first paper, we naturally found one correct way to use LLMs in the graph domain, though we never expected to have to submit so quickly. The part where my friendship with Zhikai was forged was probably the 72 hours we never left the lab, racing day and night to finish the paper, napping on the desk when exhausted. On the last night neither of us could hold on: while editing his section I suddenly saw him go offline on Overleaf, walked over to his desk to see what was going on, and found him dozing in his seat without even lowering his head. Half an hour later the same happened to me: writing away, my eyes closed, and I fell asleep with my hands still on the keyboard. The moment the paper was submitted on time, the two of us were like two brothers crawling out of the trenches at the end of a war, and went home and slept flat for a day. Unlike past deadline crunches, this time I found myself thinking: is it over just like that? It is so good to have someone working alongside me; may all the hardship prove worth it.
I have always felt Zhikai is someone who warms up slowly. Over almost a year of daily contact, we gradually became friends who talk about everything, from the reserve of his first bowl of soup at my place to the happy hours in my Lansing kitchen. We may not have traveled to many places, but in research we have truly seen quite a few landscapes. Perhaps there is nothing to be proud of in our kind of A+B research; it has none of the epic sweep of a startup's success, and it has weathered plenty of criticism; but the joy of each step forward is real, something we felt firsthand. A steadfast and efficient executor, and the surprises he has brought into my life.
Thank you, Zhikai, for shaping who I am today: a bolder PhD student, less confined in thinking, better at communicating with people. I hope everything in the rest of my PhD can be like this: some surprises amid the ordinary, with a few close friends, actually happening. I also hope each of our future works will be better than the last. It may all look natural from the outside, but perhaps the story behind it was not simple either. Even on a clear day with no sign of rain, deep among the clouds your clothes will still be dampened.
In an overly busy life, one day I suddenly managed to carve out some living (today was the UMich vs. MSU football game; MSU got shut out completely). I didn't go to campus to work; just jotting down some recent things.
If you like, you are also welcome to listen to my cover of "Orion" (the Orionid meteor shower is tonight): https://www.bilibili.com/audio/au4129029?type=3&spm_id_from=333.999.0.0
"Seven-Character Quatrain: Revising Saigō Takamori's Poem for My Father"
A child resolves to leave his hometown; should he fail to make a name in learning, he vows never to return.
Why must one's bones be buried in the soil of home? In life, there is nowhere without green mountains.
What's made me happy lately is that working from home feels better and better: at home I can look after my health while staying productive. Liang-ge and I each have an office and a bedroom. We've added quite a few things to the home recently: Cao Bo gave me a guitar, and I've started using Liang-ge's rowing machine every day, so I'm noticeably sturdier now. The money tree we bought when moving keeps growing better and better, leaf pressing on leaf, practically brawling; counting them, it has put out four or five new leaves, so even when everything outside withers in winter, there will be some green at home.
In the morning I worked at my desk reading papers while the fragrance of braised ribs wafted over; the whole house had a kind of rich coziness, though while working I had to check now and then whether the pot was sputtering.
In the afternoon Liang-ge and I went for haircuts. We thought Saturday was the cheap day, but it turned out to be Tuesday; not a lot of money, but for a financially strapped PhD student, not nothing either. Still, life needs its little improvements. We walked to the Chinese supermarket across the road. I've been strictly controlling my diet this semester and wanted to buy something to indulge, but I still want to stay healthy, so in the end I bought only two golden pears and a box of soy milk. I was delighted to spot the corn porridge mix I love, and bought a little millet too; it will surely be warm and comforting in winter. I had wanted millet and yam porridge, but yams really do look a bit like mochi, and the supermarket had no yams. I've genuinely minded my diet this year: quit cola and coffee, and meals are mostly yogurt and salad.
The Orionid meteor shower is tonight. After dinner, on a whim, I recorded a cover of "Orion." I hadn't played and recorded in a long time; after fifty-odd takes I finally pieced together a version. While recording, many old memories surfaced and kept making me forget what I was playing. Later I turned off all the living-room lights, leaving only a distant desk lamp shining toward me, and the recording went a little better.
After recording I wanted to share it with friends, and found I now have a fair number of friends to share joy with. I remembered that the time I loved listening to "Orion" most was in my senior year of high school, studying until past 1 a.m. every night while my parents kept me company in the living room until midnight before going to bed. To stay up with me, they would usually take a first nap at 7 p.m., a habit they keep to this day. In the dead of night, alone, I loved listening to this song on my Xiaomi 2.
Come watch tonight's Orionid meteor shower with me (photo borrowed from Cao Bo).
At 10 p.m. I had two more meetings. At the seminar everyone dared to think so big; I felt my grasp of the frontier is still too shallow. Adjusting my schedule to rest a bit earlier; took a melatonin tablet and sank into sleep.
Do you still remember how the night fell back then? Saying nothing at all, as if from the sky, an ache light as a fingertip.
Have you attained the life you hoped for, the sound of the tide in your dreams? And how did they slip through your fingers, like wind blowing across the wilderness?
Lingering feeling, drifting yellow, the quiet hours; dawn and dusk, where is my home?
The world is in fog; those people say "come," then disappear. I have never seen clearly this maze, all the wrong turns taken.
Those who have died linger in the night sky, lighting lamps for you. Sometimes you ride the wind; sometimes you sink.
Sometimes there is a rainbow at midnight; sometimes you sing; sometimes you are silent; sometimes you gaze at the sky.
Lingering feeling, drifting yellow, the quiet hours; dawn and dusk, where is my home? Lingering feeling, drifting yellow, the quiet hours; dawn and dusk, where is my home?
A quick note on what came to mind while listening to this song:
The nights back then: after those wild evening self-study sessions in high school, whooping and hollering on the way home, water fights. The northern daytime sun was clamorous; only at night came the cool breeze.
Have you attained the life you hoped for? Evidently I am someone of rather shallow insight, but thankfully my knowledge and actions are aligned, so it hasn't been that painful. Life now is hard, but it is far from the clamor of the world, and there is something to look forward to.
"I have never seen clearly this maze, all the wrong turns taken." Last week JT told me: don't regret your past foolishness; when you truly understand, it only proves you've grown old.
"Those who have died linger in the night sky, lighting lamps for you. Sometimes you ride the wind, sometimes you sink, sometimes there is a rainbow at midnight." It has been a year since that friend of mine left. The last message I sent him was a photo of a rainbow over Lansing.
Lingering feeling, drifting yellow, the quiet hours; dawn and dusk, where is my home?
Winter has come to Lansing. From the new home I no longer walk the old dusty, noisy road; I pass through tree shade every day, and even downtown now feels like a bustling place. A life cut off from the world. The new home is so quiet you could hear a pin drop.
Where is my home?
A graph is a basic data structure we learn in the Data Structures and Algorithms course. It naturally represents each instance as a node and each edge as a pairwise relationship, making it a natural representation for arbitrary data. For instance, in computer vision an image can be viewed as a grid graph; in natural language processing a sentence can be viewed as a path graph; and in AI4Science, graphs adapt readily to a wide range of scientific problems.
Graph Neural Networks (GNNs) were proposed to bring the strong capability of neural networks to graph-structured data. GNN architectures have found wide applicability across a myriad of contexts, with graph data drawn from diverse sources such as social networks, citation networks, transportation networks, financial networks, and chemical molecules. Nonetheless, there is no consistently winning solution across all datasets, owing to the varied concepts these graphs encode. For instance, GCN may work well on particular social networks while falling short on molecular graphs, because it cannot capture certain key patterns in those graphs.
Motivated by this problem, in this blog we focus on the question: when do GNNs work, and when do they not? In light of this question, we
The insights gleaned from this understanding will serve as a catalyst for model development on novel graph datasets, thereby fostering the wider adoption of GNNs in emerging applications.
Our typical findings are:
We will dive deep into both the underlying graph data mechanism and model mechanism in this blog and provide a full picture of Graph Neural Networks for node classification.
If you have any questions about this blog, feel free to email haitaoma@msu.edu
In this section, we provide a brief introduction to the task, models, and data properties we focus on, along with the main analysis tools we utilize.
The semi-supervised node classification (SSNC) task is to predict the categories or labels of unlabeled nodes based on the graph structure and the labels of a few labeled nodes. Message-passing methods over the connections in the graph are normally used to make educated guesses about the labels of the unlabeled nodes. The task has wide applications in inferring node attributes, social influence prediction, traffic prediction, air quality prediction, and so on.
Graph neural networks learn node representations by aggregating and transforming information over the graph structure. There are different designs and architectures for the aggregation and transformation, which leads to different graph neural network models.
We will mainly introduce GCN, a fundamental yet representative model. For a particular node, GCN aggregates the transformed features from its neighbors via an averaging process.
$\textbf{Graph Convolutional Network (GCN).}$
From a local perspective of node $i$, GCN's operation can be written as a feature-averaging process:
\[\mathbf{h}_i = \frac{1}{d_i}\sum_{j \in \mathcal{N}(i)}\mathbf{Wx}_j\]where $\mathbf{h}_i$ denotes the aggregated feature, $d_i$ denotes the degree of node $i$, and $\mathcal{N}(i)$ denotes the neighbors of node $i$, i.e., $d_i = \left| \mathcal{N}(i) \right|$. $\mathbf{W} \in \mathbb{R}^{l \times l}$ is a parameter matrix that transforms the features, while $\mathbf{x}_j$ denotes the initial feature of node $j$. Notably, the weight-transformation step will not be the focus of our paper, since it is common throughout deep learning; we focus on the aggregation step. The key rationale for aggregation is the assumption that neighboring nodes are similar to the center node, known as the homophily assumption. Under this assumption, aggregation can exploit the similarity to achieve a smooth yet discriminative representation.
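The averaging step above can be sketched in a few lines of plain Python. This is a minimal illustration on a toy graph: the adjacency lists, features, and identity "weight matrix" are all illustrative assumptions, not the actual experimental setup.

```python
# One GCN mean-aggregation step: h_i = (1/d_i) * sum_{j in N(i)} W @ x_j.
def gcn_aggregate(adj, X, W):
    H = []
    for i in range(len(adj)):
        neighbors = adj[i]
        # Transform each neighbor's feature by W, then average over the degree.
        agg = [0.0] * len(W)
        for j in neighbors:
            for r in range(len(W)):
                agg[r] += sum(W[r][c] * X[j][c] for c in range(len(W)))
        H.append([v / len(neighbors) for v in agg])
    return H

# Toy homophilous graph: nodes 0,1 (class A) are linked, nodes 2,3 (class B) are linked.
adj = [[1], [0], [3], [2]]
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity transform, so only aggregation matters

H = gcn_aggregate(adj, X, W)
print(H)  # same-class nodes keep identical, discriminative features
```

With homophilous edges, averaging over neighbors leaves the two classes perfectly separated, matching the intuition above.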
Recent works reveal that different graph properties, e.g., degree and shortest-path length, can influence the effectiveness of GNNs. Among them, homophily and heterophily are recognized as the most important properties, and they are the key focus of this paper. People generally believe that neighboring nodes are similar to the center node, which is called homophily; aggregation can therefore benefit from neighborhoods to achieve a smoother and more discriminative representation.
Homophily. If all edges only connect nodes with the same label, this property is called homophily, and the graph is called a homophilous graph.
In Fig.1, the number denotes the label and different colors denote distinct features. All nodes with similar features are connected by edges and also share the same label, illustrating perfect homophily.
Heterophily. If all edges only connect nodes with different labels, this property is called heterophily, and the graph is called a heterophilous graph. Fig.2 below shows a heterophilous graph. In this toy example, each node with label 0 (1) only connects to nodes with label 1 (0).
Graph Homophily Ratio.
Given a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ and node label vector $y$, the edge homophily ratio is defined as the fraction of edges that connect nodes with the same labels. Formally, we have:
\[h(\mathcal{G}, \{y_i; i \in \mathcal{V}\}) = \frac{1}{\left| \mathcal{E} \right| } \sum_{(j, k) \in \mathcal{E}} \mathbb{I}(y_j = y_k)\]where $\left| \mathcal{E} \right|$ is the number of edges in the graph and $\mathbb{I}(\cdot)$ denotes the indicator function.
A graph is typically considered to be highly homophilous when $0.5 \le h(\cdot) \le 1$. On the other hand, a graph with a low edge homophily ratio ($0 \le h(\cdot) < 0.5$) is considered to be heterophilous.
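The edge homophily ratio above is straightforward to compute from an edge list. Here is a small sketch; the toy edge list and labels are illustrative assumptions.

```python
# Edge homophily ratio: fraction of edges whose endpoints share a label.
def edge_homophily(edges, labels):
    same = sum(1 for (j, k) in edges if labels[j] == labels[k])
    return same / len(edges)

labels = [0, 0, 1, 1]
edges = [(0, 1), (1, 2), (2, 3)]  # one cross-class edge among three
h = edge_homophily(edges, labels)
print(h)  # 2 of 3 edges are intra-class
```

With $h = 2/3 > 0.5$, this toy graph would be considered homophilous under the threshold stated above.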
Node Homophily Ratio.
Node homophily ratio is defined as the proportion of a node’s neighbors sharing the same label as the node. It is formally defined as:
\[h_i = \frac{1}{d_i} \sum_{j \in \mathcal{N}(i)} \mathbb{I}(y_j = y_i)\]where $\mathcal{N} (i)$ denotes the neighbor node set of $v_i$ and $d_i = | \mathcal{N}(i) |$ is the cardinality of this set.
Similarly, node $i$ is considered to be homophilic when $h_i \ge 0.5$, and is considered heterophilic otherwise. Moreover, this ratio can be easily extended to higher-order cases $h_{i}^{(k)}$ by considering $k$-order neighbors $\mathcal{N}_k(v_i)$.
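The node-level counterpart follows the same pattern, computed per node over its neighbor list. The toy graph below is an illustrative assumption.

```python
# Node homophily ratio h_i: fraction of node i's neighbors sharing its label.
def node_homophily(adj, labels, i):
    nbrs = adj[i]
    return sum(1 for j in nbrs if labels[j] == labels[i]) / len(nbrs)

adj = {0: [1, 2], 1: [0], 2: [0]}
labels = [0, 0, 1]
h0 = node_homophily(adj, labels, 0)
print(h0)  # 1 of node 0's 2 neighbors shares its label
```

By the convention above, node 0 with $h_0 = 0.5$ sits exactly on the homophilic boundary, while node 2 ($h_2 = 0$) is heterophilic.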
To examine whether GNNs perform well, we focus on whether a GNN can encode a discriminative representation. An ideal discriminative representation can be described by: (1) Cohesion: nodes with the same label are mapped to similar representations; (2) Separation: nodes with different labels are mapped to dissimilar representations. Fig.3 below illustrates an example of high cohesion and good separation, where each color indicates one class. Each cluster belongs to a single class while different clusters are distant from each other. We can then expect a simple linear classifier to achieve high performance, which indicates an ideal representation.
In this section, we show that GNNs can actually do better: homophily, where nodes connect with similar ones, is not a necessity for the success of GNNs. GNNs can still work on various heterophilous datasets (where nodes connect with dissimilar ones). To establish this, we examine whether a GNN can achieve a discriminative representation in different settings.
In this subsection, we examine when GCN can map nodes with the same label to similar embeddings. We first experiment with the toy graph examples, the homophily and heterophily graphs shown in Section 2.3. In particular, we examine node representations from different classes after the GNN.
GCN under homophily: The aggregation process for the homophily graph is shown in Fig. 4, where node color and number represent node features and labels, respectively. After mean aggregation, all nodes of class 1 are blue and all nodes of class 2 are red, indicating good discriminative ability.
GCN under heterophily: The aggregation process for the heterophily graph is shown in Fig. 5, where node color and number represent node features and labels, respectively. We can observe a color alternation: before aggregation, all nodes of class 1 are blue and those of class 2 are red; after mean aggregation, all nodes of class 1 are red and those of class 2 are blue. Nonetheless, this alternation does not hurt discriminative ability: nodes of the same class still share a color while nodes of different classes have different colors, indicating good discriminative ability.
More rigorously, we provide a theoretical understanding of which kinds of graphs benefit from GNNs and how. A GNN can perform well on graphs satisfying:
The rigorous theoretical analysis is shown as follows (you can skip this part if the math is too heavy!):
Consider a graph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, \{ \mathcal{F}_{c}, c \in \mathcal{C} \}, \{ \mathcal{D}_{c}, c \in \mathcal{C}\} \}$.
For any node $i\in \mathcal{V}$, the expectation of the pre-activation output of a single GCN operation is given by \(\mathbb{E}[{\bf h}_i] = {\bf W}\left( \mathbb{E}_{c\sim \mathcal{D}_{y_i}, {\bf x}\sim \mathcal{F}_c } [{\bf x}]\right).\)
and for any $t>0$, the probability that the distance between the observation ${\bf h}_i$ and its expectation is larger than $t$ is bounded by
\[\mathbb{P}\left( \|{\bf h}_i - \mathbb{E}[{\bf h}_i]\|_2 \geq t \right) \leq 2 \cdot l\cdot \exp \left(-\frac{ deg(i) t^{2}}{ 2\rho^2({\bf W}) B^2 l}\right)\]where $l$ denotes the feature dimensionality and $\rho({\bf W})$ denotes the largest singular value of ${\bf W}$, $B\geq\max _{i, j}|\mathbf{X}[i, j]|$.
We can then conclude rigorously that the intra-class distance (the distance between $\mathbf{h}_i$ and its same-class expectation $\mathbb{E}[\mathbf{h}_i]$) in the GCN embedding is small with high probability, owing to the sampling from the neighborhood distribution $\mathcal{D}_{y_i}$. Notably, the key step in the proof is Hoeffding's inequality. Details can be found in the paper.
To further verify the validity of the theoretical results, we provide more empirical evidence as follows. In particular, we manually add synthetic edges to control the homophily ratio of a graph and examine how the performance varies.
When adding synthetic heterophilous edges to a homophilous graph, there are two typical things to control:
As we insert heterophilous edges, the graph homophily ratio will also continuously decrease. The results are plotted in Fig.6.
Each point in Fig.6 represents the performance of the GCN model, and the corresponding value on the $x$-axis denotes the homophily ratio. The point with homophily ratio $h=0.81$ denotes the original Cora graph, i.e., $K=0$.
The observations are shown as follows:
The experiment verifies our findings: if the neighborhood follows a similar distribution, GCN can still perform well under extreme heterophily. However, if we introduce noise into the neighborhood distribution, the effectiveness of GCN is no longer guaranteed.
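The edge-injection procedure in this experiment can be sketched as follows: repeatedly add cross-class edges to a homophilous toy graph and watch the edge homophily ratio fall. The random pairing, seed, and toy labels are illustrative assumptions, not the exact protocol used on Cora.

```python
import random

def edge_homophily(edges, labels):
    return sum(labels[u] == labels[v] for u, v in edges) / len(edges)

def add_hetero_edges(edges, labels, k, rng):
    """Append k random edges whose endpoints carry different labels."""
    out = list(edges)
    nodes = list(range(len(labels)))
    while k > 0:
        u, v = rng.sample(nodes, 2)
        if labels[u] != labels[v]:
            out.append((u, v))
            k -= 1
    return out

rng = random.Random(0)
labels = [0] * 10 + [1] * 10
# Two intra-class chains: 18 edges, all homophilous.
edges = [(i, i + 1) for i in range(9)] + [(i, i + 1) for i in range(10, 19)]
print(edge_homophily(edges, labels))   # fully homophilous to start
noisy = add_hetero_edges(edges, labels, 18, rng)
print(edge_homophily(noisy, labels))   # halved after injecting 18 cross-class edges
```

Sweeping the number of injected edges traces out the homophily-ratio axis of Fig.6.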
In Section 3, we discussed scenarios where GNNs do well, covering both homophilous and heterophilous graphs. All the above analyses take a graph-level (global) perspective, verifying that GNNs achieve an overall performance gain. However, when we look closer from a node-level (local) perspective, we find an overlooked vulnerability of GNNs.
Instead of taking a graph-level view, the following analyses focus on nodes in the same graph but with different properties. We first plot the distribution of the node homophily ratio on different datasets, shown in Fig.7. We include two homophilous graphs and two heterophilous ones; additional results on ten different datasets can be found in the original paper. The $h$ in brackets indicates the graph homophily ratio, and $h_{node}$ on the $x$-axis denotes the node homophily ratio. We can clearly observe that:
Equipped with the analysis of node-level data patterns, we then investigate how GNN performs on nodes with different patterns. In particular, we compare GCN with MLP-based models since they only take the node features as the input, ignoring the structural patterns. If GCN performs worse than MLP, it indicates the vulnerability of GNNs. Experimental results are illustrated in Fig. 8.
We can observe that:
Similar to Section 3.2, we first analyze a similar toy example. This time, instead of considering GNNs under homophily and heterophily separately, we take the homophily and heterophily patterns together into consideration. The illustration is shown in Fig. 9, where node color and number represent node features and labels, respectively.
We can observe that when considering the homophily and heterophily together:
The above observations on the toy model show that a GNN cannot serve the homophilous and heterophilous nodes well at the same time. We then ask which of the two a GNN will learn well: the answer is whichever forms the majority of the training set.
Motivated by the toy example, we then provide a rigorous theoretical understanding from the node level. We find two keys to test performance:
The following theorem is based on PAC-Bayes analysis, showing that both a large aggregated-feature distance and a large homophily-ratio difference between train and test nodes lead to worse performance. (You can skip this part if the math is too heavy!)
The theory aims to bound the generalization gap between the expected margin loss $\mathcal{L}_{m}^{0}$ on the test subgroup $V_m$ for a margin $0$ and the empirical margin loss $\hat{\mathcal{L}}_{\text{tr}}^{\gamma}$ on the training subgroup $V_{\text{tr}}$ for a margin $\gamma$. Such losses are standard in PAC-Bayes analysis. The formulation is shown as follows:
Theorem (Subgroup Generalization Bound for GNNs):
Let $\tilde{h}$ be any classifier in the classifier family $\mathcal{H}$ with parameters $\{ \tilde{W}_{l} \}_{l=1}^{L}$.
For any $0< m \le M$, $\gamma \ge 0$, and a sufficiently large number of training nodes $N_{\text{tr}}=|V_{\text{tr}}|$, there exists $0<\alpha<\frac{1}{4}$ such that, with probability at least $1-\delta$ over the sample of $y^{\text{tr}} = \{ y_i \}_{i \in V_{\text{tr}}}$, we have:
\[\mathcal{L}_m^0(\tilde{h}) \le \mathcal{L}_\text{tr}^{\gamma}(\tilde{h}) + O\left( \underbrace{\frac{K\rho}{\sqrt{2\pi}\sigma} (\epsilon_m + |h_\text{tr} - h_m|\cdot \rho)}_{\textbf{(a)}} + \underbrace{\frac{b\sum_{l=1}^L\|\widetilde{W}_l\|_F^2}{(\gamma/8)^{2/L}N_\text{tr}^{\alpha}}(\epsilon_m)^{2/L}}_{\textbf{(b)}} + \mathbf{R} \right)\]The bound is related to three terms:
(a) shows that both a large homophily-ratio difference $|h_{\text{tr}} - h_m|$ and a large aggregated-feature distance $\epsilon_m = \max_{j\in V_m}\min_{i\in V_{\text{tr}}} \|g_i(X, G)-g_j(X, G)\|_2$ between the test node subgroup $V_m$ and the training nodes $V_{\text{tr}}$ lead to a large generalization error. $\rho= \|\mu_1 - \mu_2 \|$ denotes the original feature separability, independent of structure, and $K$ is the number of classes.
(b) further amplifies the effect of the aggregated-feature distance $\epsilon_m$, leading to a large generalization error.
(c) $\mathbf{R}$ is a term independent of the aggregated-feature distance and the homophily-ratio difference, given by $\frac{1}{N_\text{tr}^{1-2\alpha}} + \frac{1}{N_\text{tr}^{2\alpha}} \ln\frac{LC(2B_m)^{1/L}}{\gamma^{1/L}\delta}$, where $B_m= \max_{i\in V_\text{tr}\cup V_m}\|g_i(X,G)\|_2$ is the maximum feature norm. $\mathbf{R}$ vanishes as the training size $N_\text{tr}$ grows.
Our theory suggests that both homophily ratio difference and aggregated feature distance to training nodes are key factors contributing to the performance disparity. Typically, nodes with large homophily ratio differences and aggregated feature distance to training nodes lead to performance degradation.
To further verify the validity of the theoretical results, we provide empirical evidence of the performance disparity. In particular, we compare the performance of node subgroups divided by both the homophily-ratio difference and the aggregated-feature distance to training nodes. For a test node $i$, we measure the node disparity by
We then sort test nodes by $s_1$ and $s_2$ and divide them into 5 equal-sized subgroups accordingly. We include popular GNN models: GCN, SGC (Simplified Graph Convolution), GAT (Graph Attention Network), GCNII (GCN with initial residual and identity mapping), and GPRGNN (Generalized PageRank Graph Neural Network). The performance of the different node subgroups is presented in Fig.9. We note a clear test-accuracy degradation as the differences in aggregated features and homophily ratios increase.
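The subgroup construction here is a simple sort-and-bin: order test nodes by a disparity score and split them into 5 equal bins. A minimal sketch, with synthetic scores as an illustrative assumption:

```python
# Split node ids into n_bins equal-sized subgroups, ordered by increasing
# disparity score (e.g. homophily-ratio difference to the training set).
def equal_bins(scores, n_bins=5):
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // n_bins
    return [order[b * size:(b + 1) * size] for b in range(n_bins)]

scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0]
bins = equal_bins(scores)
print(bins[0])   # lowest-disparity subgroup
print(bins[-1])  # highest-disparity subgroup
```

Per-subgroup accuracy is then computed within each bin; the trend across bins is what Fig.9 visualizes.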
We then conduct an ablation study that considers only the aggregated-feature distance or only the homophily ratio, in Figures 10 and 11, respectively. The decreasing tendency disappears on many datasets; only combining the two factors provides a comprehensive and accurate understanding of the reasons for GNN performance disparity.
Inspired by the findings, we investigate the effectiveness of deeper GNN models on SSNC tasks.
Deeper GNNs enable each node to capture more complex, higher-order graph structure than vanilla GCN while mitigating the over-smoothing problem, and they empirically exhibit overall performance improvements. Nonetheless, on which structural patterns deeper GNNs excel, and the reason for their effectiveness, remain unclear.
To investigate this problem, we compare vanilla GCN with different deeper GNNs, including GPRGNN, APPNP, and GCNII, on node subgroups with varying homophily ratios. Experimental results are shown in Fig.11. We observe that deeper GNNs primarily surpass GCN on minority node subgroups, with slight performance trade-offs on the majority node subgroups. We conclude that the effectiveness of deeper GNNs is mainly attributable to improved discriminative ability on minority nodes.
Having identified where deeper GNNs excel, the reason why the gains appear primarily in the minority node group remains elusive. Since the superiority of deeper GNNs stems from capturing higher-order information, we further investigate how higher-order homophily-ratio differences vary on the minority nodes, denoted as $|h_u^{(k)}-h_v^{(k)}|$, where node $u$ is the test node and node $v$ is the training node closest to $u$. We concentrate on the minority nodes $V_{\text{mi}}$ in terms of the default one-hop homophily ratio $h_u$ and examine how $\sum_{u\in V_{\text{mi}}} |h_u^{(k)}-h_v^{(k)}|$ varies with the order $k$.
Experimental results are shown in Fig.12, where a decreasing trend in the homophily-ratio difference is observed as the number of neighborhood hops increases. The smaller homophily-ratio difference leads to smaller generalization error and better performance.
In this blog, we investigated when GNNs work and when they do not. We found that the effectiveness of vanilla GCN is not limited to homophilous graphs; nonetheless, a vulnerability hides beneath the success of GNNs. We offer some suggestions before you build your own solution to a graph problem.
Some questions remain for future work:
[1] Ma, Yao, and Jiliang Tang. "Deep learning on graphs." Cambridge University Press, 2021.
[2] Ma, Yao, Xiaorui Liu, Neil Shah, and Jiliang Tang. "Is homophily a necessity for graph neural networks?." arXiv preprint arXiv:2106.06134 (2021).
[3] Mao, Haitao, Zhikai Chen, Wei Jin, Haoyu Han, Yao Ma, Tong Zhao, Neil Shah, and Jiliang Tang. "Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All?." arXiv preprint arXiv:2306.01323 (2023).
[4] Kipf, Thomas N., and Max Welling. "Semi-supervised classification with graph convolutional networks." arXiv preprint arXiv:1609.02907 (2016).
[5] Hamilton, Will, Zhitao Ying, and Jure Leskovec. "Inductive representation learning on large graphs." Advances in Neural Information Processing Systems 30 (2017).
[6] Xu, Keyulu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. "How powerful are graph neural networks?." arXiv preprint arXiv:1810.00826 (2018).
[7] Fan, Wenqi, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. "Graph neural networks for social recommendation." In The World Wide Web Conference, pp. 417-426. 2019.
[8] Zhu, Jiong, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. "Beyond homophily in graph neural networks: Current limitations and effective designs." Advances in Neural Information Processing Systems 33 (2020): 7793-7804.
Paper:
https://arxiv.org/abs/2307.03393
代码地址：https://github.com/CurryTang/Graph-LLM
Graphs are an important form of structured data with broad application scenarios. In the real world, graph nodes are often associated with attributes in text form. Taking the product graph from e-commerce (the OGBN-Products dataset) as an example, each node represents a product on the site, and the product description can serve as the node's attribute. In the graph learning literature, graphs whose node attributes are text are commonly called text-attributed graphs (TAGs). TAGs are very common in graph machine learning research; for example, the paper-citation datasets most widely used in graph learning all belong to this family. Beyond the structural information of the graph itself, the text attributes of the nodes also provide important textual information, so one must jointly account for the structure, the text, and the interplay between the two. In past research, however, the importance of the text has often been overlooked. For example, the standard datasets shipped with popular libraries such as PyG and DGL (e.g., the classic Cora dataset) do not provide the raw text attributes, only bag-of-words features in embedded form. Meanwhile, the commonly used GNNs focus more on modeling the graph topology and lack an understanding of node attributes.
Compared with previous work, this paper mainly studies how to better handle the text, and how different text embeddings, once combined with GNNs, affect downstream performance. For handling text, the most popular tools today are undoubtedly large language models (LLMs); the paper considers language models pretrained on large corpora, ranging from BERT to GPT-4, and uses "LLM" to refer to them collectively. Compared with bag-of-words features such as TF-IDF, LLMs have the following potential advantages.
Given the diversity of LLMs, the goal is to design suitable frameworks for different kinds of LLMs. For the problem of fusing LLMs with GNNs, the paper first divides LLMs into embedding-visible and embedding-invisible ones; LLMs such as ChatGPT, which can only be accessed through an API, belong to the latter. Then, for embedding-visible LLMs, three paradigms are considered:
For embedding-visible LLMs, one can first use them to generate text embeddings and then feed those embeddings to a GNN as initial features, fusing the two model families. For embedding-invisible LLMs such as ChatGPT, however, applying their powerful capabilities to graph learning tasks becomes a challenge.
To address these problems, the paper proposes a framework for applying LLMs to graph learning tasks, as shown in Figures 1 and 2 below. The first mode, LLMs-as-Enhancers, uses the LLM's capability to enhance the original node attributes, which are then fed into a GNN to improve downstream performance. For embedding-visible LLMs, feature-level enhancement is adopted, combining the language model with the GNN through either a cascading or an iterative (GLEM, ICLR 2023) optimization scheme. For embedding-invisible LLMs, text-level enhancement is adopted: the LLM expands the original node attributes. Considering the zero-shot learning and reasoning abilities of LLMs exemplified by ChatGPT, the paper further explores using prompts to represent node attributes and graph structure and letting the LLM directly generate predictions, a paradigm called LLMs-as-Predictors. The experiments mainly use node classification as the task of study; the limitations of this choice and the possibility of extending to other tasks are discussed at the end. Following the structure of the paper, here is a brief tour of the interesting findings under each mode.
First, the paper studies the mode of generating text embeddings with an LLM and feeding them into a GNN. Depending on whether the LLM is embedding-visible, feature-level and text-level enhancement are proposed. For feature-level enhancement, the optimization between the language model and the GNN is further subdivided into a cascading structure and an iterative structure. The two enhancement methods are introduced below.
For feature-level enhancement, three factors are considered: the language model, the GNN, and the optimization method. On the language-model side, the paper considers pretrained language models represented by Deberta, open-source sentence-embedding models represented by Sentence-BERT, commercial embedding models represented by text-ada-embedding-002, and open-source large models represented by LLaMA. These models are examined in terms of model type and parameter scale and their effect on downstream tasks.
On the GNN side, the paper mainly considers how the message-passing design affects downstream tasks, selecting the representative models GCN, GraphSAGE, and GAT; for the OGB datasets, it selects the leaderboard-topping models RevGAT and SAGN. An MLP is also included to gauge the downstream quality of the raw embeddings.
On the optimization side, cascading and iterative structures are examined. For the cascading structure, the text embedding is produced directly by the language model; for models small enough to fine-tune, both text-based fine-tuning and structure-based self-supervised training (GIANT, ICLR 2022) are considered. Either way, one ends up with a language model used to generate text embeddings; the training of the language model and of the GNN are separate. For the iterative structure, the paper examines GLEM (ICLR 2023), which uses EM and variational inference to co-train the GNN and the language model iteratively.
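The cascading structure above can be sketched as two decoupled stages: a frozen text encoder produces node features, and a GNN-style aggregation then consumes them. In this minimal sketch, the toy hash-based `encode` function stands in for a real sentence-embedding model (an illustrative assumption), and one mean-aggregation step stands in for the GNN.

```python
def encode(text, dim=4):
    """Stand-in for a frozen sentence encoder: deterministic bag-of-chars features."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec

def mean_aggregate(adj, X):
    """One GNN-style mean-propagation step over the cascaded features."""
    return {i: [sum(X[j][d] for j in nbrs) / len(nbrs)
                for d in range(len(X[i]))]
            for i, nbrs in adj.items()}

texts = {0: "graph neural networks", 1: "graph transformers", 2: "cooking recipes"}
adj = {0: [1], 1: [0], 2: [0]}  # toy citation edges

X = {i: encode(t) for i, t in texts.items()}  # stage 1: frozen LM encoding
H = mean_aggregate(adj, X)                    # stage 2: GNN aggregation
print(len(H))
```

The key property of the cascade is visible here: stage 1 never sees the graph, and stage 2 never sees the text, which is exactly what the iterative structure (GLEM) relaxes.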
For the experiments, several representative, commonly used TAG datasets are selected; the detailed setup can be found in our paper. Below are the results for this part (given limited space, results on two large graphs are shown), followed by a brief discussion of some interesting findings.
Several interesting conclusions emerge from the experimental results.
First, GNNs show markedly different effectiveness on different text embeddings. A striking example occurs on the Products dataset: with an MLP as the classifier, the embeddings of the fine-tuned pretrained language model Deberta-base are far better than TF-IDF; yet once a GNN is applied, the gap becomes very small, and with SAGN, TF-IDF actually performs better. This phenomenon may relate to over-smoothing and over-correlation in GNNs, but there is as yet no complete explanation, which makes it an interesting research question.
Second, using a sentence-embedding model as the encoder and cascading it with a GNN yields very good downstream performance. On Arxiv in particular, simply cascading Sentence-BERT with RevGAT approaches the performance of GLEM and even surpasses GIANT with its self-supervised training. Note that this is not because a larger language model was used: the Sentence-BERT here is the MiniLM version, with even fewer parameters than the BERT used by GIANT. One possible reason is that Sentence-BERT, trained on the Natural Language Inference (NLI) task, provides implicit structural information; in form, NLI bears some resemblance to link prediction. This is still a very preliminary conjecture, and firm conclusions require further study. The result is also suggestive: when considering pretrained models for graphs, could one directly pretrain a language model, leveraging the far more mature recipes of language-model pretraining, and perhaps obtain even better results than pretraining a GNN? Meanwhile, OpenAI's paid embedding model offers only a small improvement over open-source models on node classification.
Third, compared with a Deberta that has not been fine-tuned, LLaMA achieves better results but still lags clearly behind sentence-embedding models, suggesting that the model type may matter more than the parameter count. For Deberta, the [CLS] token is used as the sentence vector; for LLaMA, the llama-cpp embedding in langchain is used, whose implementation takes [EOS] as the sentence vector. Prior work has explained why [CLS] performs poorly without fine-tuning: its anisotropy yields poor separability. Experimentally, under high labeling rates, LLaMA's text embeddings achieve decent downstream performance, suggesting from the side that increasing the parameter count may alleviate this problem to some degree.
Feature-level enhancement yields some interesting results, but it still requires the language model to be embedding-visible. For embedding-invisible models such as ChatGPT, text-level enhancement can be used instead. Here the paper first studies a recent Arxiv preprint, Explanations as Features (TAPE), whose idea is to use LLM-generated explanations of predictions as augmented attributes; with ensembling, it reached first place on the OGB Arxiv leaderboard. The paper also proposes an LLM-based knowledge augmentation, Knowledge-Enhanced Augmentation (KEA), whose core idea is to treat the LLM as a knowledge base: mine the knowledge-relevant key information in the text and generate more detailed explanations, mainly to compensate for the limited knowledge held by smaller language models. Diagrams of the two approaches are shown below.
To test the effectiveness of the two methods, the experimental setup of the first part is reused. Considering the cost of LLM calls, experiments are conducted on the two small graphs Cora and Pubmed. For the LLM, we use gpt-3.5-turbo, the familiar ChatGPT. First, to better understand text-level enhancement and the effectiveness of TAPE, we carry out a detailed ablation study of TAPE.
The ablation study mainly considers the following questions:
The results show that the pseudo-labels depend heavily on the LLM's own zero-shot prediction ability (discussed in detail in the next section) and may actually drag down the ensembled performance at low labeling rates. Subsequent experiments therefore use only the original attributes TA and the explanations E. Moreover, sentence encoders achieve better results than fine-tuned pretrained models at low labeling rates, so the sentence-embedding model e5 is used. An additional interesting phenomenon: on Pubmed, with the augmented features, fine-tuning-based methods perform extremely well. One possible explanation is that the model mostly learns a "shortcut" to the LLM's predictions, so TAPE's performance would correlate strongly with the LLM's own prediction accuracy. Next, we compare the effectiveness of TAPE and KEA.
In the results, both KEA and TAPE improve over the original features. KEA works better on Cora, while TAPE is more effective on Pubmed; the next section's discussion shows this relates to the LLM's already strong predictive performance on Pubmed. Because KEA does not depend on the LLM's predictions, its performance is more stable across datasets. Beyond these two datasets, this kind of text-level enhancement has wider applications. Smaller pretrained models such as BERT or T5 usually lack ChatGPT-level reasoning and cannot match ChatGPT's grasp of material from other domains such as code or formatted text. In such scenarios, a large model like ChatGPT can transform the original content, and a smaller model trained on the transformed data enjoys faster inference and lower cost. Moreover, when a certain amount of labeled data is available, fine-tuning captures dataset-specific information better than in-context learning.
This part further asks whether we can drop the GNN entirely and let the LLM generate effective predictions via prompt design. Since node classification is the main focus, a simple baseline is to treat it as text classification. Based on this idea, the paper first designs simple prompts to test how far an LLM can go without using any graph structure, considering zero-shot and few-shot settings and also testing chain-of-thought prompting.
The results are shown in the figure below. LLM performance varies enormously across datasets. On Pubmed, the LLM's zero-shot performance even exceeds the GNN's; on Cora, Arxiv, and others, there is a substantial gap. Note that for the GNNs here, on Cora, CiteSeer, and Pubmed, 20 samples per class are used as the training set, while Arxiv and Products have many more training samples. By contrast, the LLM's predictions are zero- or few-shot, whereas GNNs have no zero-shot ability and perform poorly with few samples. Of course, the input-length limit also prevents the LLM from including more in-context samples.
通过对实验结果进行分析，在某些情况下LLM预测错的结果也是比较合理的。一个例子如图12所示。可以看到，很多论文本身也是交叉领域的，因此预测时LLM通过自身的常识性信息进行推理，有时并不能与标注的偏好匹配到一起。这也是值得思考的问题：这种单标签的设定是合理的吗？
此外，在Arxiv数据集上LLM的表现最差，这与TAPE中的结论并不一致，因此需要比较一下两者的prompt有什么差异。TAPE使用的prompt如下所示。
Abstract: <abstract text> \n Title: <title text> \n Question: Which arXiv CS sub-category does this paper belong to? Give 5 likely arXiv CS sub-categories as a comma-separated list ordered from most to least likely, in the form "cs.XX", and provide your reasoning. \n \n Answer:
Interestingly, TAPE does not even specify in the prompt which categories exist in the dataset; it directly exploits the knowledge about arxiv stored inside the LLM. Strangely, this small change dramatically improves the LLM's predictions, which raises the suspicion of test-set label leakage. As a high-quality corpus, arxiv data are very likely included in the pre-training of various LLMs, and TAPE's prompt may help the LLM recall those pre-training texts. This reminds us to rethink the soundness of evaluation: the accuracy here may reflect neither the quality of the prompt nor the capability of the language model, but merely the LLM's memorization. Both of these issues concern dataset evaluation and are valuable future directions.
Going further, the paper also considers whether structural information can be incorporated into the prompt in textual form, and tests several ways to represent structure. Specifically, we tried expressing edge relations explicitly with natural language such as "connected", as well as expressing them implicitly by summarizing the neighboring nodes.
The results show that the following implicit representation is the most effective.
Paper: <paper content>
Neighbor Summary: <neighbor summary>
Instruction: <task instruction>
Concretely, mimicking the idea of GNNs, we sample the 2-hop neighbors of a node, feed their text into the LLM, and ask it to produce a summary that serves as the structure-related information. An example is shown in Figure 13.
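This pipeline can be sketched as below; `llm` is a placeholder for any chat-completion call (e.g., gpt-3.5-turbo), and the prompt wording is illustrative rather than the paper's exact prompt:

```python
import random

def sample_two_hop(node, adj, k=5):
    """Sample up to k nodes from the 1- and 2-hop neighborhood of `node`."""
    one_hop = adj[node]
    two_hop = {m for n in one_hop for m in adj[n]} - {node}
    pool = list(set(one_hop) | two_hop)
    return random.sample(pool, min(k, len(pool)))

def classify_with_structure(node, adj, texts, llm):
    """Summarize sampled neighbors with the LLM, then classify the node."""
    nbrs = sample_two_hop(node, adj)
    summary = llm("Summarize the topics of these papers:\n"
                  + "\n".join(texts[n] for n in nbrs))
    # The implicit structure prompt: Paper / Neighbor Summary / Instruction.
    return llm(f"Paper: {texts[node]}\n"
               f"Neighbor Summary: {summary}\n"
               "Instruction: Based on the paper and its neighbor summary, "
               "predict the paper's category.")

# Demo with a stub LLM so the sketch runs without API access.
adj = {0: [1], 1: [0, 2], 2: [1]}
texts = {0: "A paper on GNNs.", 1: "A paper on LLMs.", 2: "A paper on graphs."}
prediction = classify_with_structure(0, adj, texts, llm=lambda p: "stub")
```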
The paper tests the effectiveness of this prompt on several datasets, with results shown in Figure 14. On all four datasets other than Pubmed, it yields some improvement over the structure-free setting, demonstrating the effectiveness of the method. The paper then analyzes why this prompt fails on Pubmed.
On Pubmed, the label often appears verbatim in the sample's text attribute. An example is shown below. Because of this property, good results on Pubmed can be obtained by learning this "shortcut", and the LLM's particularly strong performance on this dataset may stem from exactly that. In this case, appending the summarized neighbor information may instead make it harder for the LLM to capture the shortcut, hence the performance drop.
Title: Predictive power of sequential measures of albuminuria for progression to ESRD or death in Pima Indians with type 2 diabetes. … (content omitted here)
Ground truth label: Diabetes Mellitus Type 2
Furthermore, on heterophilous nodes whose neighbors carry different labels, the LLM, just like a GNN, is misled by the neighbor information and outputs wrong predictions.
Heterophily in GNNs is itself a very interesting research direction; see also our paper Demystifying Structural Disparity in Graph Neural Networks: Can One Size Fit All?
As the discussion above shows, the LLM can achieve good zero-shot prediction performance in some cases, giving it the potential to replace human annotation. The paper makes a preliminary exploration of generating labels with an LLM and then training a GNN on those labels.
Two points need to be studied for this problem:
Finally, we briefly discuss the limitations of this work and some interesting follow-up directions. First, the paper mainly targets the node classification task; extending this pipeline to more graph-learning tasks requires further research, so from this angle the title may be somewhat overclaimed. Also, in some scenarios no useful node attributes are available. For example, in financial transaction networks, user nodes are often anonymous, and constructing meaningful prompts that an LLM can understand becomes a new challenge.
Second, reducing the cost of using LLMs is worth considering. The augmentation discussed in the paper takes every node as input: with N nodes, N interactions with the LLM are required, which is very expensive. In our experiments we also tried open-source models such as Vicuna, but the quality of the generated content still falls far behind ChatGPT. Moreover, API calls to ChatGPT currently cannot be batched, so efficiency is also low. How to cut costs and improve efficiency while preserving performance is another question worth thinking about.
Finally, an important issue is the evaluation of LLMs. The paper has discussed the possible test-set leakage problem and the questionable single-label setting. For the first problem, a simple idea is to use data outside the LLM's pre-training corpus, but this requires continually updating datasets and producing correct human annotations. For the second problem, one possible remedy is a multi-label setting. For paper-classification datasets like arxiv, high-quality multi-label annotations can be generated from arxiv's own categories, but in more general cases, producing correct annotations remains a hard problem.
[1] Zhao J, Qu M, Li C, et al. Learning on large-scale text-attributed graphs via variational inference[J]. arXiv preprint arXiv:2210.14709, 2022.
[2] Chien E, Chang W C, Hsieh C J, et al. Node feature extraction by self-supervised multi-scale neighborhood prediction[J]. arXiv preprint arXiv:2111.00064, 2021.
[3] He X, Bresson X, Laurent T, et al. Explanations as Features: LLM-Based Features for Text-Attributed Graphs[J]. arXiv preprint arXiv:2305.19523, 2023.
A graph is a fundamental data structure that denotes pairwise relationships between entities across various domains, e.g., the web, genes, and molecules. Machine learning on graphs, typified by Graph Neural Networks, has become increasingly popular in recent years. In this blog, we introduce some basic concepts of machine learning on graphs. We hope it gives you inspiration on:
What is a graph? Why do we need graphs? How do we solve graph-related problems with machine learning techniques?
How to correlate your specific task with the graph and view it as a graph problem?
How to utilize existing techniques to solve your specific task?
Before going deep into the technical details, we first provide some motivation by introducing the history of the development of Graph Neural Networks (GNNs). GNNs emerged as a response to two significant challenges. The first came from the data mining domain, where researchers were exploring ways to extend deep learning techniques to structured network data, such as the World Wide Web, relational databases, and citation networks. The second arose from the science domain, where researchers were attempting to apply deep learning to practical problems such as single-cell analysis, brain network analysis, and molecular property prediction. To meet these practical challenges, the GNN community has grown rapidly, with researchers collaborating across fields beyond data mining.
The graph is a data formulation widely utilized to describe pairwise relations between nodes. Mathematically, a graph can be denoted as \(\mathcal{G}=\left \{\mathcal{V}, \mathcal{E} \right \}\), where \(\mathcal{V}= \left \{v_1, v_2, \cdots, v_N \right \}\) is a set of \(N=\left | \mathcal{V} \right |\) nodes and \(\mathcal{E}= \left \{e_1, e_2, \cdots, e_M \right \}\) is a set of \(M=\left | \mathcal{E} \right |\) edges describing the connections between nodes. \(e=(v_1, v_2)\) indicates an edge from node \(v_1\) to node \(v_2\). Moreover, nodes and edges can have corresponding features \(X_V\in \mathbb{R}^{N\times d}\) and \(X_E\in \mathbb{R}^{M\times d}\), respectively.
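The notation above maps directly onto code. A minimal sketch, using an adjacency matrix and random features purely for illustration:

```python
import numpy as np

# A toy directed graph G = {V, E} with N = 4 nodes and M = 4 edges.
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
N, M = len(nodes), len(edges)

# Adjacency matrix A: A[i, j] = 1 iff there is an edge from v_i to v_j.
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = 1

# Node features X_V (N x d) and edge features X_E (M x d); random here.
d = 8
X_V = np.random.randn(N, d)
X_E = np.random.randn(M, d)
```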
The main advantage of the graph formulation is its universal representation ability.
Universality means that a graph can be a natural representation for arbitrary data. In the data mining domain, much data can be naturally represented as a graph; examples are shown in Figure 1.
A social network [1] can be represented as a graph. Each node represents one user. Each edge indicates that a relationship exists between two users, e.g., friendship or a domestic relationship.
Transport Network [2] can be represented as a graph. Each node represents one station. Each edge indicates that a route exists between two stations.
Web Network [3] can be represented as a graph. Each node represents one web page. Each edge indicates that a hyperlink exists between two pages.
Figure 1: (a) Social Network; (b) Transport Network; (c) Web Network.
Moreover, the graph can also generalize to different domains. In the computer vision domain, an image can be viewed as a grid graph. In the natural language processing domain, a sentence can be viewed as a path graph. In AI4Science, the graph adapts easily to a wide range of scientific problems. More concrete examples are shown in Figure 2.
A brain network [4] can be represented as a graph. Nodes represent brain regions, and edges represent connections between them. Connections can be structural, such as axonal projections, or functional, such as correlated activity between brain regions. Brain network graphs can be constructed at different scales, ranging from individual neurons and synapses to large-scale brain regions and networks.
A gene-gene network [5] can be represented as a graph. Nodes represent genes, and edges represent interactions between them. These interactions can be based on different types of experimental evidence, such as co-expression, co-regulation, or protein-protein interactions. Gene-gene networks can be constructed at different levels of complexity, from small subnetworks involved in specific biological pathways to large-scale networks that span the entire genome.
A molecule [6] can be represented as a graph. Chemical compounds are denoted as graphs with atoms as nodes and chemical bonds as edges. Molecular graphs can be constructed at different levels of complexity, from simple compounds such as water and carbon dioxide to complex biomolecules such as proteins and DNA.
Figure 2: (a) Gene-gene Network; (b) Brain Network; (c) Molecule Network.
The simple graph mentioned in Section 1.1 is the most basic formulation, taking only a single node type and a single edge type into consideration. However, different data may have additional characteristics that cannot be easily handled by the simple graph formulation.
In this subsection, we will briefly describe popular complex graphs including the heterogeneous graph, bipartite graph, multidimensional graph, signed graph, hypergraph, and dynamic graph.
The bipartite graph is a special simple graph in which edges can only connect two node sets \(\mathcal{V}_1\) and \(\mathcal{V}_2\). The two node sets must (1) not overlap: \(\mathcal{V}_1 \cap \mathcal{V}_2 = \emptyset\), and (2) contain all nodes: \(\mathcal{V}_1 \cup \mathcal{V}_2 = \mathcal{V}\). The bipartite graph describes interactions between two kinds of objects. It is typically utilized in e-commerce systems to describe the interactions between users and items. It can also be utilized in various science problems.
The signed graph describes a graph with two edge types: positive edges and negative edges. A signed graph \(\mathcal{G}\) consists of a set of nodes \(\mathcal{V}=\{v_1, \cdots, v_N \}\) and a set of edges \(\mathcal{E}=\{e_1, \cdots, e_M \}\), together with an edge-type mapping function \(\phi_e:\mathcal{E}\to\mathcal{T}_e\) that maps each edge to its type, where \(\mathcal{T}_e = \left \{1, -1 \right \}\) indicates positive or negative. It is typically utilized in social networks like Twitter, where a positive edge indicates following and a negative edge indicates blocking or unfollowing. It can also be utilized in various science problems.
The heterogeneous graph introduces more node types into the graph. New relationship types also appear, since edges can exist between different node types.
For example, a simple citation network can be represented with the simple graph formulation, where each node represents a paper and each edge represents one paper citing another. However, the citation network becomes more complex when considering: (1) authors, who can have co-author relationships with each other and write papers; (2) paper types, since papers can belong to different areas, e.g., Data Mining, Artificial Intelligence, Computer Vision, and Natural Language Processing.
A heterogeneous graph \(\mathcal{G}\) consists of a set of nodes \(\mathcal{V}=\{v_1, \cdots, v_N \}\) and a set of edges \(\mathcal{E}=\{e_1, \cdots, e_M \}\). Additionally, there are two mapping functions \(\phi_n:\mathcal{V}\to\mathcal{T}_n\) and \(\phi_e:\mathcal{E}\to\mathcal{T}_e\) that map each node and each edge to its type, respectively, where \(\mathcal{T}_n\) and \(\mathcal{T}_e\) denote the sets of node and edge types.
The multidimensional graph describes multiple relationships that exist simultaneously between a pair of nodes. It differs from the signed graph and the heterogeneous graph in that neither of those allows multiple edges between a pair of nodes. A multidimensional graph consists of a set of \(N\) nodes \(\mathcal{V}= \{v_1, \cdots, v_N \}\) and \(D\) sets of edges \(\{\mathcal{E}_1, \cdots, \mathcal{E}_D \}\). Each edge set \(\mathcal{E}_d\) describes the \(d\)-th type of relation between nodes, and intersections between different edge sets are allowed. It is typically utilized in social networks: users can "like", "retweet", and "comment" on a tweet, and each action corresponds to one relationship between user and tweet. It can also be utilized in various science problems.
The hypergraph is introduced when relationships beyond a pair of nodes must be considered. A hypergraph \(\mathcal{G}\) consists of a set of \(N\) nodes \(\mathcal{V}= \{v_1, \cdots, v_N \}\) and a set of hyperedges \(\mathcal{E}\). The incidence matrix \(\mathbf{H} \in \mathbb{R}^{|\mathcal{V}|\times |\mathcal{E}| }\) is utilized to describe the graph structure instead of the adjacency matrix \(\mathbf{A}\):
\[H_{i j} = \begin{cases} 1 & \text{if vertex } v_{i} \text{ is incident with edge } e_{j} \\ 0 & \text{otherwise.} \end{cases} \tag{1}\]It is typically utilized in academic networks, where nodes are papers: one author publishing more than one paper can be viewed as a hyperedge connecting those papers.
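Equation (1) can be made concrete with a tiny example; a minimal sketch with made-up authorship data:

```python
import numpy as np

# Hypergraph with 4 papers (nodes) and 2 authors (hyperedges):
# author 0 wrote papers {0, 1, 2}; author 1 wrote papers {2, 3}.
hyperedges = [[0, 1, 2], [2, 3]]
N, E = 4, len(hyperedges)

# Incidence matrix H: H[i, j] = 1 iff node v_i is incident with hyperedge e_j.
H = np.zeros((N, E))
for j, members in enumerate(hyperedges):
    for i in members:
        H[i, j] = 1

# Column sums give hyperedge sizes; row sums give node degrees.
edge_sizes = H.sum(axis=0)
node_degrees = H.sum(axis=1)
```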
The dynamic graph is introduced when the graph constantly evolves: new nodes and edges may be added, and some existing nodes and edges may disappear. A dynamic graph \(\mathcal{G}\) consists of a set of \(N\) nodes \(\mathcal{V}= \{v_1, \cdots, v_N \}\) and a set of edges \(\mathcal{E}\), where each node and each edge is associated with a timestamp indicating when it emerged. Two mapping functions \(\phi_v\) and \(\phi_e\) map each node and each edge to its timestamp, respectively. It is typically utilized in social networks such as Twitter, where new users register every day and users follow and unfollow others from time to time.
The knowledge graph is an important application in the graph domain. It comprises nodes and edges, where nodes \(\mathcal{V}\) represent entities (such as people, places, or objects) and edges \(\mathcal{E}\) represent relationships \(\mathcal{R}\) between these entities. These relationships can be diverse, including semantic relations (e.g., "is a" or "part of"), factual associations (e.g., "born in" or "works at"), and other contextual links. The graph-based structure allows efficient querying and traversal of data, as well as the ability to infer new knowledge by leveraging existing connections.
A knowledge graph is a structured representation of information that models the relationships between entities, facts, and concepts in a comprehensive and interconnected way. It provides a flexible and efficient means of organizing, querying, and deriving insights from large volumes of data, making it a powerful tool for information retrieval and knowledge discovery. It is widely utilized in the Semantic Web, which enables machines to better understand and interact with web content by organizing information in a machine-readable format.
Remark: In this subsection, we briefly introduced different graph formulations. However, real-world cases can be more complicated. For example, a network in e-commerce can be a heterogeneous bipartite multidimensional graph, corresponding to the following scenario: (1) heterogeneous: customers and purchasers can be different user types, and items also have different types; (2) bipartite: users only interact with items; (3) multidimensional: users can have different interactions with an item, e.g., "buy", "add to shopping cart", and so on.
The graph formulations described in this subsection are more like prototypes. You can design a suitable graph formulation for your own data, and it is often easy to learn from the recent progress on the graph type closest to your data.
In this subsection, we provide a brief introduction to graph-related tasks to show how graphs can be utilized in different scenarios. We introduce node classification, graph classification, link prediction, and graph generation; most downstream tasks can be viewed as instances of these tasks.
Node classification aims to identify which class a node belongs to by utilizing its own features, the adjacency matrix, and the features of other nodes. The task has numerous real-world applications: (1) Social network analysis: nodes represent individuals and edges represent social relationships; node classification can predict various attributes, such as interests, affiliation, and profession. (2) Bioinformatics: nodes represent genes, proteins, or other biological entities, and connections represent interactions such as regulatory or metabolic relationships; node classification can predict node properties such as function, localization, or disease association. (3) Cybersecurity: nodes represent computers, servers, or other network devices, and connections represent communication or access relationships; node classification can detect various network attacks or anomalies, such as malware, spam, or intrusion attempts.
Graph classification aims to identify which class a graph belongs to by exploiting both the graph structure and the node features. Image classification can be viewed as a special case: each pixel is a node with its RGB values as node features, and the graph structure is a grid connecting adjacent pixels. Graph classification has been broadly utilized in many real-world applications: (1) Bioinformatics: classifying biological networks into categories, e.g., classifying protein-protein interaction networks by function or disease association, which can help identify potential drug targets, protein complexes, or pathways and inform drug discovery. (2) Chemistry: classifying chemical compounds, e.g., by their toxicity or therapeutic potential. (3) Social network analysis: identifying the discussion topic of a tweet on Twitter.
Link prediction can be viewed as a binary classification task: predicting whether a link exists between two nodes in the graph. It can complete the graph and uncover undiscovered relationships between nodes. Link prediction has been broadly utilized in many domains: (1) Friend recommendation in social networks: Twitter can recommend people you may know or be interested in. (2) Movie recommendation: Netflix recommends films you may be interested in. (3) Bioinformatics: in biological networks, link prediction can estimate the likelihood of physical interactions between pairs of proteins based on their sequence similarity, domain composition, or other features, helping to identify potential drug targets, protein complexes, or pathways and informing drug discovery.
In contrast to the aforementioned tasks, graph generation solves a generative problem: given a dataset of graphs, learn to sample new graphs from the learned data distribution. As graphs can represent many kinds of highly structured data, graph generation holds promise for design tasks in a variety of domains, such as molecular graph generation (drug and materials discovery), circuit network design, and indoor layout design.
In this section, we introduce (1) Graph Neural Networks, which have become popular for learning graph representations by jointly leveraging attribute and graph structure information; (2) perspectives for understanding GNNs that connect their design to other domains, e.g., graph signal processing and the Weisfeiler-Lehman isomorphism test; and (3) traditional graph machine learning methods and structure-agnostic methods, which may perform even better than GNNs.
The design of Graph Neural Networks is inspired by Convolutional Neural Networks (CNNs), among the most widely used neural networks in the computer vision domain, which leverage neighboring pixels to learn good representations. Concretely, CNNs extract different feature patterns by aggregating the neighboring pixels in a fixed-size receptive field, for example a \(3\times 3\) neighborhood. To extend the strengths of CNNs to graphs, researchers developed Graph Neural Networks. Two essential problems arise in developing them:
How to define the receptive field on graph since it is not a regular grid?
What feature patterns are useful on the graph?
Those two questions lead to the two crucial perspectives on designing Graph Neural Networks: the spectral and spatial perspectives, respectively. Before going into the details, we first define a general Graph Neural Network framework.
We introduce the general framework of GNNs for the most basic node-level task. We first recap some notation. We denote a graph as \(\mathcal{G}= \left \{ \mathcal{V}, \mathcal{E} \right \}\) (e.g., a molecule). The adjacency matrix and the associated features are denoted as \(\mathbf{A}\in \mathbb{R}^{N \times N}\) (e.g., bond type) and \(\mathbf{F}\in \mathbb{R}^{N \times d}\) (e.g., atom type), respectively. \(N\) and \(d\) are the number of nodes and the feature dimension, respectively.
A general framework for Graph Neural Networks can be regarded as a composition of \(L\) graph filter layers, and \(L-1\) nonlinear activation layers. \(h_i\) and \(\alpha_i\) are utilized to denote the \(i\)-th graph filter layer, and activation layer, respectively. \(\mathbf{F}_i \in \mathbb{R}^{N\times d_i}\) denotes the output of the \(i\)-th graph filter layer \(h_i\). \(\mathbf{F}_0\) is initialized to be the raw node features \(\mathbf{F}\).
For an image with a regular grid structure, the receptive field is defined as the neighborhood pixels around a central pixel (an example is illustrated in Fig. ). So how do we define the receptive field on a graph, which has no unified regular structure? The answer is the neighboring nodes along the edges: the one-hop neighborhood of node \(v_i\) can be defined as \(\mathcal{N}_{v_i} = \left \{ v_j \mid (v_i, v_j) \in \mathcal{E}\right \}\). To adaptively extract neighborhood information, a large variety of spatial-based graph filters have been proposed. We introduce two typical spatial graph filter layers, GraphSAGE and GAT, in this section.
GraphSAGE [7]: The GraphSAGE model proposed in () introduces a spatial-based filter that aggregates information from neighboring nodes. The hidden feature for node \(v_i\) is generated with the following steps.
Sample neighborhood nodes from the neighborhood set: \(\mathcal{N}_S(v_i)=\text{SAMPLE}(\mathcal{N}(v_i), S)\), where \(\text{SAMPLE}(\cdot)\) takes the neighborhood set as input and randomly samples \(S\) instances as output.
Extract the information from the sampled neighbors: \(\mathbf{f}_i' = \text{AGGREGATE}( \left \{ \mathbf{F}_j, \forall v_j \in \mathcal{N}_S(v_i) \right \} )\), where \(\text{AGGREGATE}: \mathbb{R}^{S\times d} \to \mathbb{R}^{d}\) is a function that combines the information from the neighboring nodes.
Combine the neighborhood information with the ego information: \(\mathbf{F}'_i=\sigma \left ( [\mathbf{F}_i, \mathbf{f}'_i] \mathbf{\Theta} \right )\), where \([\cdot, \cdot]\) is the concatenation operation and \(\mathbf{\Theta}\) are the learnable parameters.
The aggregation can be any set function; common choices include the mean and maximum aggregators, which take the element-wise mean and maximum. The sum aggregator was later introduced by () with stronger expressive power.
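The three steps above can be sketched in a few lines of numpy. This is an illustrative implementation with the mean aggregator and random weights standing in for learned parameters, not a faithful reproduction of the original paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def graphsage_layer(F, neighbors, Theta, S=2):
    """One GraphSAGE layer with mean aggregation (illustrative sketch).

    F: (N, d) node features; neighbors: list of neighbor index lists;
    Theta: (2d, d_out) learnable weights; S: number of sampled neighbors.
    """
    out = []
    for i, nbrs in enumerate(neighbors):
        # 1) Sample S neighbors (with replacement, for simplicity).
        sampled = rng.choice(nbrs, size=S)
        # 2) Aggregate neighbor features with the mean aggregator.
        f_agg = F[sampled].mean(axis=0)
        # 3) Concatenate ego and neighbor info, transform, apply ReLU.
        out.append(np.maximum(np.concatenate([F[i], f_agg]) @ Theta, 0))
    return np.stack(out)

N, d, d_out = 4, 8, 16
F = rng.standard_normal((N, d))
neighbors = [[1, 2], [0, 2, 3], [0, 1], [1]]
Theta = rng.standard_normal((2 * d, d_out))
H = graphsage_layer(F, neighbors, Theta)
```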
GAT [8]: The Graph Attention Network (GAT) is inspired by the self-attention mechanism and adaptively aggregates the neighborhood information based on attention scores. The hidden feature for node \(v_i\) is generated with the following steps.
Generate the attention score with each neighborhood node: \(e_{ij} = a(\mathbf{F}_i\mathbf{\Theta}, \mathbf{F}_j\mathbf{\Theta})=\text{LeakyReLU} (\mathbf{a}^T \left [ \mathbf{F}_i\mathbf{\Theta}, \mathbf{F}_j\mathbf{\Theta} \right ]), \; v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}\), where \(a\) is a shared attention function parameterized by the learnable vector \(\mathbf{a}\).
Normalize the attention scores via softmax: \(\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{v_k \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \exp(e_{ik})}\).
Aggregate the weighted information from the neighborhood: \(\mathbf{F}'_i = \sum_{v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \alpha_{ij}\mathbf{F}_j\mathbf{\Theta}\).
Multi-head attention implementation: \(\mathbf{F}'_i = \|_{m=1}^M \sum_{v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \alpha_{ij}^m \mathbf{F}_j \mathbf{\Theta}^m\), where \(\|\) is the concatenation operator and \(M\) is the number of heads.
Notice that the key difference between GAT and the standard self-attention mechanism is that self-attention is conducted over all the nodes, whereas GAT is conducted only over the neighboring nodes. More discussion can be found in the next section.
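A single-head version of these steps can be sketched as follows; random weights stand in for learned parameters, so this is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def gat_layer(F, neighbors, Theta, a):
    """One single-head GAT layer (illustrative sketch).

    F: (N, d) node features; neighbors: list of neighbor index lists;
    Theta: (d, d_out) weights; a: (2 * d_out,) attention vector.
    """
    Z = F @ Theta  # shared linear transform
    out = []
    for i, nbrs in enumerate(neighbors):
        idx = list(nbrs) + [i]  # neighbors plus the node itself
        # Unnormalized scores e_ij = LeakyReLU(a^T [z_i, z_j]).
        e = np.array([np.concatenate([Z[i], Z[j]]) @ a for j in idx])
        e = np.where(e > 0, e, 0.2 * e)  # LeakyReLU
        alpha = np.exp(e) / np.exp(e).sum()  # softmax over the neighborhood
        out.append(alpha @ Z[idx])  # weighted aggregation of neighbors
    return np.stack(out)

N, d, d_out = 4, 8, 16
F = rng.standard_normal((N, d))
neighbors = [[1, 2], [0, 2, 3], [0, 1], [1]]
H = gat_layer(F, neighbors, rng.standard_normal((d, d_out)),
              rng.standard_normal(2 * d_out))
```

By construction, the attention weights over each node's neighborhood sum to 1.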
Spectral-based graph filters mainly utilize spectral graph theory to develop filtering operations in the spectral domain. We only provide the motivation for spectral-based graph filters here, without the mathematical details.
The motivation behind spectral graph filters is that neighboring nodes in a graph should have similar representations. In spectral graph theory, neighborhood similarity corresponds to the low-frequency components, i.e., variations over the graph structure that occur slowly or gradually; conversely, high-frequency components correspond to rapid or abrupt changes. By focusing on the low-frequency components, spectral graph filters can capture the underlying smooth variations over the graph topology, which is useful for various tasks, e.g., node classification, link prediction, and graph clustering.
In other words, spectral graph filters aim to extract feature patterns that are smooth, i.e., do not vary significantly across neighboring nodes; these correspond to the low-frequency components of the graph signal in spectral graph theory.
GCN [9]: We only provide a brief introduction to the formulation of the Graph Convolutional Network (GCN); a more comprehensive treatment can be found in Section 5.3.1 of Deep Learning on Graphs [10].
The aggregation function of GCN is defined as: \(\mathbf{F}'= \sigma( \tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{F}\mathbf{\Theta}) \tag{2}\) where \(\sigma\) is the activation function and \(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}\) is the symmetrically normalized adjacency matrix.
The corresponding per-edge aggregation can be written as: \(\mathbf{F}'_i = \sum_{v_j \in \mathcal{N}(v_i) \cup \left \{ v_i \right \}} \frac{1}{\sqrt{\tilde{d}_i\tilde{d}_j}} \mathbf{F}_j\mathbf{\Theta}\tag{3}\) where \(\tilde{d}_i\) is the degree of node \(v_i\).
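Equation (2) is only a few lines of numpy; a minimal sketch with self-loops added and random weights standing in for learned parameters:

```python
import numpy as np

def gcn_layer(A, F, Theta):
    """One GCN layer: ReLU(D~^{-1/2} (A + I) D~^{-1/2} F Theta) (sketch)."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                    # add self-loops
    d = A_tilde.sum(axis=1)                    # degrees d~_i
    D_inv_sqrt = np.diag(d ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_hat @ F @ Theta, 0)    # ReLU activation

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
F = rng.standard_normal((4, 8))
H = gcn_layer(A, F, rng.standard_normal((8, 16)))
```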
The above discussion focused on GNN designs for the simple graph with a single node and edge type. The Message Passing Neural Network (MPNN) was then proposed as a more general framework covering the entire design space of GNNs. Concretely, MPNNs operate on graphs by (1) generating messages between nodes based on their local neighborhoods and (2) iteratively aggregating messages from neighboring nodes, allowing them to learn powerful graph representations for various downstream tasks.
The Message Passing filter can be defined as: \(h_{i}^{\ell+1}=\phi\left(h_{i}^{\ell}, \oplus_{j \in \mathcal{N}_{i}}\left(\psi\left(h_{i}^{\ell}, h_{j}^{\ell}, e_{i j}\right)\right)\right)\tag{4}\) where \(\phi\), \(\psi\) are Multi-Layer Perceptrons (MLPs), and \(\oplus\) is a permutation-invariant local neighborhood aggregation function such as summation, maximization, or averaging.
Focusing on one particular node \(i\), the MPNN layer can be decomposed into three steps as:
Message: For each pair of linked nodes \(i\), \(j\), the network first computes a message \(m_{i j}=\psi\left(h_{i}^{\ell}, h_{j}^{\ell}, e_{i j}\right)\). The MLP \(\psi: \mathbb{R}^{2d+d_e}\to \mathbb{R}^{d}\) takes as input the concatenation of the feature vectors of the source node, the target node, and the edge.
Aggregate: At each source node \(i\), the incoming messages from all its neighbors (target node) are then aggregated as \(m_{i}=\oplus_{j \in \mathcal{N}_{i}}\left(m_{i j}\right)\)
Update: Finally, the network updates the node feature vector \(h_{i}^{\ell+1}=\phi\left(h_{i}^{\ell}, m_{i}\right)\) by concatenating the aggregated message \(m_i\) with the previous node feature vector \(h_i^{\ell}\) and passing them through an MLP \(\phi: \mathbb{R}^{2 d} \rightarrow \mathbb{R}^{d}\).
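The three steps can be sketched as below, using sum as the permutation-invariant aggregator and random weights in place of learned ones; an illustrative implementation, not a faithful reproduction of any particular MPNN:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, W2):
    """A tiny two-layer MLP with ReLU, standing in for psi and phi."""
    return np.maximum(x @ W1, 0) @ W2

def mpnn_layer(h, edges, e_feat, params):
    """One MPNN layer: message -> aggregate (sum) -> update (sketch)."""
    W1m, W2m, W1u, W2u = params
    N, d = h.shape
    m = np.zeros((N, d))
    for k, (i, j) in enumerate(edges):
        # Message: psi on [h_i, h_j, e_ij]; Aggregate: summed at node i.
        m[i] += mlp(np.concatenate([h[i], h[j], e_feat[k]]), W1m, W2m)
    # Update: phi on the concatenation [h_i, m_i].
    return mlp(np.concatenate([h, m], axis=1), W1u, W2u)

N, d, d_e = 4, 8, 3
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
h = rng.standard_normal((N, d))
e_feat = rng.standard_normal((len(edges), d_e))
params = (rng.standard_normal((2 * d + d_e, 32)), rng.standard_normal((32, d)),
          rng.standard_normal((2 * d, 32)), rng.standard_normal((32, d)))
h_next = mpnn_layer(h, edges, e_feat, params)
```

Swapping the `+=` for a max or mean gives the other common aggregators.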
A function \(f\) is said to be equivariant if for any transformation \(\tau\) of the input space \(X\), and any input \(x\in X\), we have: \(f(\tau(x)) = \tau(f(x))\). In other words, applying the transformation \(\tau\) to the input has the same effect as applying it to the output. A function \(f\) is said to be invariant if for any transformation \(\tau\) of the input space \(X\), and any input \(x\in X\), we have: \(f(\tau(x)) = f(x)\). In other words, applying the transformation \(\tau\) to the input does not change the output.
In the context of GNNs, we want to achieve permutation-equivariance or permutation-invariance, which means that the function should be equivariant or invariant to permutations of the input graph. We can express this mathematically by defining a permutation \(\sigma\) of the nodes of the input graph \(G=(V,E)\), and requiring that the output of the GNN is the same regardless of the permutation: \(f(G) = f(\sigma(G))\), where \(\sigma(G)\) is the graph obtained by applying the permutation \(\sigma\) to the nodes of \(G\).
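Permutation invariance is easy to check numerically. A minimal sketch, using neighbor-sum aggregation followed by a sum readout as an example of an invariant graph-level function:

```python
import numpy as np

rng = np.random.default_rng(0)

def graph_readout(A, F):
    """Sum-aggregate neighbor features, then sum-pool over all nodes.

    Both steps are permutation-compatible, so the result is invariant to
    any relabeling of the nodes.
    """
    return (A @ F).sum(axis=0)

N, d = 5, 4
A = rng.integers(0, 2, size=(N, N)).astype(float)
A = np.triu(A, 1)
A = A + A.T  # random undirected graph
F = rng.standard_normal((N, d))

# Apply a random node permutation sigma to both A and F via matrix P.
perm = rng.permutation(N)
P = np.eye(N)[perm]
A_perm, F_perm = P @ A @ P.T, P @ F

# Invariance: f(sigma(G)) == f(G).
invariant = np.allclose(graph_readout(A, F), graph_readout(A_perm, F_perm))
```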
The expressiveness of Graph Neural Networks is highly related to the graph isomorphism test: an expressive GNN should map isomorphic graphs to the same representation and distinguish non-isomorphic graphs with different representations.
The Weisfeiler-Lehman (WL) test is a popular graph isomorphism test used to determine whether two graphs are isomorphic, i.e., whether they have the same underlying structure and differ only in node labeling. The intuition is that if two graphs are isomorphic, their structures should be similar across all hops of neighborhoods, from one-hop neighborhoods to the global structure of the entire graph. The algorithm iterates two steps: (1) aggregation: collect the multiset of neighbor node labels; (2) labeling: assign each node a new label based on that multiset. The WL test repeats this process until convergence (node labels stop changing). If the two graphs produce different sequences of refined labelings, they are guaranteed to be non-isomorphic (the converse does not hold in general). The WL test is widely utilized across domains since it is efficient, with time complexity \(O(n \log n)\), where \(n\) is the number of nodes. More recently, it has been widely utilized for analyzing the expressiveness of GNNs.
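The aggregation and labeling steps can be sketched as a short color-refinement routine; this runs a fixed number of iterations rather than checking convergence, which is enough for small examples:

```python
def wl_labels(adj, labels, iters=3):
    """1-WL color refinement (sketch): returns the sorted multiset of labels.

    adj: dict node -> list of neighbors; labels: initial node labels.
    """
    labels = list(labels)
    for _ in range(iters):
        # New label = hash of (own label, sorted multiset of neighbor labels).
        labels = [hash((labels[i], tuple(sorted(labels[j] for j in adj[i]))))
                  for i in range(len(adj))]
    return sorted(labels)

# Two isomorphic triangles yield the same multiset of refined labels...
tri_a = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
tri_b = {0: [2, 1], 1: [2, 0], 2: [1, 0]}
# ...while a 3-node path (different structure) does not.
path = {0: [1], 1: [0, 2], 2: [1]}
```

Identical label multisets do not certify isomorphism; the test only certifies non-isomorphism when they differ.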
Graph Neural Networks and Transformer architectures are typically two popular model architectures to leverage the context information. Connections can be found between those two architectures.
\(\begin{array}{c} h_{i}^{\ell+1}=\operatorname{Attention}\left(Q^{\ell} h_{i}^{\ell}, K^{\ell} h_{j}^{\ell}, V^{\ell} h_{j}^{\ell}\right), \\ i . e ., h_{i}^{\ell+1}=\sum_{j \in \mathcal{S}} w_{i j}\left(V^{\ell} h_{j}^{\ell}\right), \end{array}\) where \(w_{i j}=\operatorname{softmax}_{j}\left(Q^{\ell} h_{i}^{\ell} \cdot K^{\ell} h_{j}^{\ell}\right)\), \(j\in \mathcal{S}\) ranges over the set of words in the sentence \(\mathcal{S}\), and \(Q^{\ell}, K^{\ell}, V^{\ell}\) are learnable linear weights denoting the Query, Key, and Value of the attention, respectively. One update of a word embedding can be viewed as a weighted aggregation of all word embeddings in the sentence. An illustration of the self-attention block in the Transformer is shown in Fig. 4(b).
One Graph Neural Network block can be defined as follows: \(h_{i}^{\ell+1}=\sigma\left(U^{\ell} h_{i}^{\ell}+\sum_{j \in \mathcal{N}(i)}\left(V^{\ell} h_{j}^{\ell}\right)\right),\tag{5}\) where \(U^{\ell}, V^{\ell}\) are learnable transformation matrices of the GNN layer and \(\sigma\) is the non-linear activation function. One update of the hidden representation \(h_i\) of node \(i\) at layer \(\ell\) can be viewed as a weighted aggregation of the representations of the neighboring nodes \(j\in \mathcal{N}(i)\).
An illustration of the GNN block is shown in Fig. 4(a).
Figure 4: (a) GNN block; (b) Transformer block.
The key difference between a Graph Neural Network and a Transformer is that the GNN only aggregates over neighboring nodes, while the Transformer aggregates over all words in the sentence. In other words, a Transformer can be viewed as a GNN aggregating over a fully-connected word graph. Both architectures aim to learn good representations by incorporating context information: the Transformer treats all words in a sentence as useful, while the GNN treats only the neighboring nodes as useful.
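A minimal numpy sketch makes the contrast between the two updates concrete. ReLU for \(\sigma\), a single unscaled attention head, and the toy shapes are simplifying assumptions of this sketch:

```python
import numpy as np

def gnn_layer(H, A, U, V):
    """One GNN block, eq. (5): h_i' = sigma(U h_i + sum_{j in N(i)} V h_j).

    H: (n, d) node features; A: (n, n) binary adjacency;
    U, V: (d, d) learnable matrices; sigma = ReLU (an assumption)."""
    return np.maximum(0.0, H @ U.T + A @ (H @ V.T))

def attention_layer(H, Q, K, V):
    """One self-attention block: a weighted aggregation over ALL tokens,
    i.e. a GNN on the fully-connected graph with learned edge weights."""
    scores = (H @ Q.T) @ (H @ K.T).T                 # (n, n) pairwise logits
    W = np.exp(scores - scores.max(axis=1, keepdims=True))
    W = W / W.sum(axis=1, keepdims=True)             # softmax over j
    return W @ (H @ V.T)                             # convex combination of rows
```

The only structural difference is the mask: `A` restricts aggregation to graph neighbors, while the softmax weights `W` are dense.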
Graph signal denoising [12] offers a new perspective that creates a unified understanding of representative aggregation operations.
Graph signal denoising aims to recover a clean signal from the original noisy signal. It can be defined as solving the following optimization problem: \(\arg \min_F \mathcal{L}=||F-S||_F^2 + c \cdot \text{tr}(F^TLF) \tag{6}\) where \(S\in \mathbb{R}^{N\times d}\) is a noisy signal (the input features) on graph \(\mathcal{G}\) and \(F\in \mathbb{R}^{N\times d}\) is the clean signal, assumed to be smooth over \(\mathcal{G}\).
The first term guides \(F\) to stay close to \(S\), while the second term \(tr(F^TLF)\) is the Laplacian regularization, which encourages \(F\) to be smooth over \(\mathcal{G}\), mediated by \(c > 0\). Assuming we adopt the unnormalized Laplacian matrix \(L = D - A\) (with the adjacency matrix \(A\) assumed to be binary), the second term can be written in an edge-centric way as: \(c \sum_{(i,j)\in \mathcal{E}} ||F_i-F_j||_2^2\tag{7}\) which leads connected nodes to share similar features.
We show the connection between graph signal denoising and GCN as an example. The gradient of \(\mathcal{L}\) with respect to \(F\), evaluated at \(F = S\), is \(\left.\frac{\partial \mathcal{L}}{\partial F}\right|_{F = S} = 2cLS.\) Hence, one step of gradient descent with stepsize \(b\) for the graph signal denoising problem in Equation (6) can be described as: \(F \leftarrow S - b\left.\frac{\partial \mathcal{L}}{\partial F}\right|_{F = S} = S - 2bcLS = (1-2bc)S + 2bc\tilde{A}S,\tag{8}\) where the normalized Laplacian \(L = I - \tilde{A}\) is used. When the stepsize \(b=\frac{1}{2c}\) and \(S = X'\), we have \(F \leftarrow \tilde{A}X'\), which is exactly the aggregation operation of GCN. This provides a new perspective for understanding existing GNNs as a tradeoff between preserving the original features and smoothing over the neighborhood. Moreover, it inspires us to derive new Graph Neural Networks from different graph signal processing methods.
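The derivation can be checked numerically. The sketch below (a toy 3-node graph with self-loops and symmetric normalization, both assumptions of the sketch) verifies that one gradient step with \(b=1/(2c)\) reproduces \(\tilde{A}X\):

```python
import numpy as np

# Verify: one gradient-descent step on ||F - X||_F^2 + c * tr(F^T L F),
# taken at F = X with stepsize b = 1/(2c) and L = I - A_tilde,
# equals the GCN aggregation A_tilde @ X.
A = np.array([[1., 1., 0.],          # toy adjacency with self-loops (assumption)
              [1., 1., 1.],
              [0., 1., 1.]])
X = np.arange(6.0).reshape(3, 2)     # toy node features
c = 1.0

d = A.sum(axis=1)
A_tilde = np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)  # symmetric normalization
L = np.eye(3) - A_tilde              # normalized Laplacian
b = 1.0 / (2.0 * c)

F = X - b * (2.0 * c * L @ X)        # one gradient-descent step at F = X
# F equals A_tilde @ X exactly: the step cancels X and leaves the aggregation.
```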
A new physics-inspired perspective is to understand a Graph Neural Network as a discrete dynamical system of particles [13]. Each node in the graph corresponds to one particle, while each edge represents a pair-wise interaction between nodes. Positive and negative interactions between nodes can be interpreted as attraction and repulsion between particles, respectively.
To view a Graph Neural Network as a discrete dynamical system, one can regard the layer-by-layer forward pass as the evolution of the input under a system of differential equations. Each discrete time step of the dynamical system corresponds to one layer of forward propagation.
Gradient flow is a special type of evolution equation of the form \(\dot{X}(t)=- \nabla \mathcal{E}(X(t))\tag{9}\) where \(\mathcal{E}\) is an energy functional, which can differ across GNNs. The gradient flow makes \(\mathcal{E}\) monotonically decrease during the evolution.
A simple GNN can be viewed as the gradient flow of the Dirichlet energy \(\mathcal{E}^{\text{DIR}} =\frac{1}{2} \text{trace}(X^TLX)\tag{10}\) The Dirichlet energy measures the smoothness of the features on the graph. In the limit \(t\to \infty\), the node features become so smooth that all nodes share the same representation, which means the system loses the information contained in the input features. This phenomenon is called "oversmoothing" in the GNN literature.
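The oversmoothing effect is easy to reproduce: repeatedly applying a simple aggregation (row-normalized \(D^{-1}A\) here, chosen as an assumption of this sketch) drives the Dirichlet energy toward zero on a toy graph:

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], float)
A = A + np.eye(4)                      # add self-loops
d = A.sum(axis=1)
P = A / d[:, None]                     # row-normalized propagation D^{-1} A
L = np.diag(d) - A                     # unnormalized Laplacian

X = np.random.default_rng(0).standard_normal((4, 2))
energies = []
for _ in range(50):
    energies.append(float(np.trace(X.T @ L @ X)))  # Dirichlet energy
    X = P @ X                          # one 'simple GNN' propagation step

# After many steps the energy is near 0 and all rows of X are (almost) equal:
# the node representations have oversmoothed into a single point.
```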
To design better Graph Neural Networks that overcome drawbacks like oversmoothing, we can parametrize an energy and derive a GNN as its discretized gradient flow. This offers better interpretability and leads to more effective architectures.
Dynamic programming on graphs is a technique that solves problems by breaking them down into smaller subproblems and finding optimal solutions to those subproblems. This approach can be used to solve a wide range of problems on graphs, including shortest path, maximum flow, and minimum spanning tree problems. Such an approach shares a similar idea with the aggregation operation in GNNs, which recursively combines information from neighboring nodes to update the representation of a given node. Both GNN aggregation and dynamic programming on graphs combine information from neighboring nodes to update the representation of a given node. In dynamic programming, the combination is typically done by recursively solving subproblems and building up a solution to the larger problem. Similarly, in GNN aggregation, neighboring node information is combined through aggregation functions (e.g., mean, max, sum), and the updated node representation is then passed to subsequent layers in the network. In both cases, the goal is to efficiently compute a global solution by leveraging local information from neighboring nodes. However, vanilla GNNs cannot solve most dynamic programming problems, e.g., the shortest-path algorithm and the generalized Bellman-Ford algorithm, without capturing the underlying logic and structure of the corresponding problem. To empower GNNs with reasoning ability for dynamic programming, multiple operators have been proposed to generalize the operations in dynamic programming to neural networks, e.g., the sum generalizes to a commutative summation operator \(\oplus\), and the product generalizes to a Hadamard product operator \(\otimes\). GNNs can then be extended with different dynamic programming algorithms with improved generalization ability. A simple example of a Graph Neural Network extending the Bellman-Ford algorithm can be found in Figure 4.
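As a concrete illustration of the operator analogy, single-source shortest paths can be written as rounds of message passing with a (min, +) update, the counterpart of a GNN's (sum, product) aggregation. The edge-list format below is an illustrative assumption:

```python
import math

def bellman_ford_mp(n, edges, src):
    """Shortest paths via the DP recurrence
        d[v] = min(d[v], min over incoming edges (u, v, w) of d[u] + w),
    i.e. a 'min' aggregation over neighbor messages 'd[u] + w'.

    n: number of nodes; edges: list of directed (u, v, weight) tuples."""
    d = [math.inf] * n
    d[src] = 0.0
    for _ in range(n - 1):             # n-1 rounds of message passing
        for u, v, w in edges:
            if d[u] + w < d[v]:        # aggregate: keep the minimum message
                d[v] = d[u] + w
    return d
```

A GNN that mimics this computation must replace its sum aggregator with a min-style operator; otherwise it cannot represent the recurrence.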
Graph Neural Networks are well-recognized as powerful methods for machine learning on graphs. However, GNNs are still not the dominant method in the graph domain: traditional machine learning methods on graphs, as well as non-graph methods, still show advantages over Graph Neural Networks. They continue to hold an important position in graph research and inspire the design of new Graph Neural Networks. In this section, we introduce some important machine learning methods on graphs beyond GNNs, including graph kernel methods for graph classification, label propagation for node classification, and heuristic methods for link prediction.
Label propagation is a simple but effective method for node classification on graphs. It is a semi-supervised learning technique that leverages the idea that nodes connected in a graph are likely to share the same label. For example, it could be applied to a network of people with two labels, "interested in cricket" and "not interested in cricket": we only know the interests of a few people and aim to predict the interests of the remaining unlabeled nodes.
The procedure of label propagation is as follows. Let \(A\) be the \(n \times n\) adjacency matrix of the graph, where \(A_{ij}\) is 1 if there is an edge between nodes \(i\) and \(j\), and 0 otherwise. Let \(Y\) be the \(n \times c\) matrix of node labels, where \(Y_{ij}\) is 1 if node \(i\) belongs to class \(j\), and 0 otherwise. Let \(F\) be the \(n \times c\) matrix of label distributions, where \(F_{ij}^{(t)}\) is the probability of node \(i\) belonging to class \(j\) at iteration \(t\).
At each iteration \(t\), the label distribution \(F^{(t)}\) is updated based on the label distributions of the neighboring nodes as follows:
\(F^{(t)}=D^{-1}AF^{(t-1)}\tag{11}\) where \(D\) is the diagonal degree matrix of the graph, with \(D_{ii} = \sum_j A_{ij}\).
After a certain number of iterations or when the label distributions converge, the labels of the nodes are assigned according to the label distribution with the highest probability:
\[Y_i = \arg\max_j F^{(t)}_{ij}\tag{12}\] This process is repeated until the labels converge to a stable state or a stopping criterion is met.
Compactly, the propagation can be written as \(\hat{\mathbf{Y}}=(\mathbf{D}^{-1}\mathbf{A})^t\mathbf{Y}\tag{13}\) where \(\mathbf{D}\) and \(\mathbf{A}\) are the degree matrix and the adjacency matrix, respectively, \(t\) is the number of propagation steps, \(\mathbf{Y}=\begin{bmatrix} \mathbf{Y}_l \\ \mathbf{0} \end{bmatrix}\) is the matrix of node labels (zero rows for unlabeled nodes), and \(\mathbf{D}^{-1}\mathbf{A}\) is the transition matrix.
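The procedure, with the common choice of clamping the known labels after every step (an assumption of this sketch; some variants clamp only at the end), can be implemented in a few lines:

```python
import numpy as np

def label_propagation(A, labeled_mask, Y, t=20):
    """Label propagation: repeatedly apply the transition matrix D^{-1} A
    and clamp the known labels.

    A: (n, n) adjacency matrix (no isolated nodes assumed here);
    Y: (n, c) one-hot labels, zero rows for unlabeled nodes;
    labeled_mask: boolean array, True for labeled (clamped) nodes."""
    P = A / A.sum(axis=1, keepdims=True)   # transition matrix D^{-1} A
    F = Y.astype(float)
    for _ in range(t):
        F = P @ F                          # propagate label distributions
        F[labeled_mask] = Y[labeled_mask]  # clamp known labels
    return F.argmax(axis=1)                # eq. (12): pick the top class
```

On a 4-node path with the endpoints labeled 0 and 1, the two middle nodes inherit the label of their nearer endpoint.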
Graph kernel methods measure the similarity between two graphs with a kernel function that corresponds to an inner product in a reproducing kernel Hilbert space (RKHS). Kernel methods are widely utilized in Support Vector Machines: they allow us to model higher-order features of the original feature space without computing the coordinates of the data in the higher-dimensional space. Graph kernel methods face an additional challenge beyond general kernel methods: how to encode the similarity of graph structures. The design of graph kernel methods therefore focuses on finding suitable graph patterns to measure similarity. We briefly introduce subgraph patterns and path patterns for graph kernels.
Graph kernels based on subgraphs aim to find shared subgraphs between graphs: two graphs sharing more subgraphs are more similar. The subgraph set can be defined by graphlets, which are induced, non-isomorphic subgraphs of node size \(k\). An illustration can be found in Fig. 3. A pattern count vector \(\mathbf{f}\) is calculated, whose \(i^{\text{th}}\) component denotes the frequency with which subgraph pattern \(i\) occurs.
The graph kernel can then be defined as: \(\mathcal{K}_{\text{GK}}(\mathcal{G}, \mathcal{G}')= \left \langle \mathbf{f}^{\mathcal{G}}, \mathbf{f}^{\mathcal{G}'} \right \rangle\tag{14}\) where \(\mathcal{G}\) and \(\mathcal{G}'\) are two graphs and \(\left \langle \cdot, \cdot \right \rangle\) denotes the Euclidean dot product.
Graph kernels based on paths decompose a graph into paths and use the co-occurrence of random walks on two graphs to calculate their similarity. Different from subgraph-based methods, which focus on graph structure, random-walk-based methods also take node labels into consideration. The method counts all shortest paths in graph \(\mathcal{G}\), denoted as triplets \(p_i=(l_s^i, l_e^i, n_k)\), where \(n_k\) is the length of the path and \(l_s^i\) and \(l_e^i\) are the labels of the starting and ending vertices, respectively.
Similarly, the graph kernel can be defined as: \(\mathcal{K}_{\text{GK}}(\mathcal{G}, \mathcal{G}')= \left \langle \mathbf{f}^{\mathcal{G}}, \mathbf{f}^{\mathcal{G}'} \right \rangle\tag{15}\) where the \(i^{\text{th}}\) component of \(\mathbf{f}\) denotes the frequency with which triplet \(i\) occurs.
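A tiny sketch of the subgraph-pattern kernel, restricted to 3-node graphlets (triangles and wedges) for brevity; this restriction, and the dict-of-sets graph format, are illustrative assumptions:

```python
import numpy as np
from itertools import combinations

def graphlet3_vector(adj):
    """Count size-3 graphlet patterns (an illustrative subset):
    f[0] = number of triangles, f[1] = number of wedges (2-paths).

    adj: dict node -> set of neighbor nodes."""
    f = np.zeros(2)
    for a, b, c in combinations(adj, 3):
        edges = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if edges == 3:
            f[0] += 1          # triangle
        elif edges == 2:
            f[1] += 1          # wedge
    return f

def graphlet_kernel(adj1, adj2):
    """K(G, G') = <f_G, f_G'>, eq. (15), on the 3-node pattern counts."""
    return float(graphlet3_vector(adj1) @ graphlet3_vector(adj2))
```

A triangle and a 3-node path share no 3-node pattern, so their kernel value is zero.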
Heuristic methods, e.g., Common Neighbors, utilize the graph structure to estimate the likelihood of the existence of links. We briefly introduce some basic heuristic methods, including Common Neighbors, the Jaccard score, preferential attachment, and the Katz index. Let \(\Gamma(x)\) denote the neighbor set of node \(x\), and let \(x\) and \(y\) denote two different nodes.
Common Neighbors (CN): The Common Neighbors heuristic considers two nodes with more overlapping neighbors to be more likely connected. It calculates the size of the intersection of the neighbor sets of nodes \(x\) and \(y\): \(f_{\text{CN}}(x,y)=| \Gamma(x) \cap \Gamma(y) |\tag{16}\)
Jaccard score: The Jaccard score can be viewed as a normalized Common Neighbors score, where the normalization factor is the size of the union of the neighbor sets: \(f_{\text{Jaccard}}(x,y)=\frac{| \Gamma(x) \cap \Gamma(y) |}{| \Gamma(x) \cup \Gamma(y) |}\tag{17}\)
Preferential attachment (PA): Preferential attachment algorithms consider that nodes with higher degrees are more likely to be connected. Preferential attachment calculates the product of node degrees. \(f_{\text{PA}}(x,y)=| \Gamma(x) | \times | \Gamma(y) |\tag{18}\)
Katz index: The Katz index takes higher-order neighborhoods into consideration, in contrast to the above algorithms based on one-hop neighborhoods. It considers nodes connected by more short paths to be more likely connected, and calculates a weighted sum over all walks between \(x\) and \(y\): \(f_{\text{Katz}}(x,y)= \sum_{l=1}^{\infty}\beta^l |\text{walks}^{\left \langle l \right \rangle }(x,y)|\tag{19}\) where \(\beta\) is a decaying factor between 0 and 1, which gives smaller weights to longer paths, and \(|\text{walks}^{\left \langle l \right \rangle }(x,y)|\) counts the walks of length \(l\) between \(x\) and \(y\).
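The four heuristics translate directly into code. Here `G` maps each node to its neighbor set, and the Katz sum is truncated at `max_len`; both are illustrative assumptions of this sketch:

```python
import numpy as np

def cn(G, x, y):                       # Common Neighbors, eq. (16)
    return len(G[x] & G[y])

def jaccard(G, x, y):                  # Jaccard score, eq. (17)
    union = G[x] | G[y]
    return len(G[x] & G[y]) / len(union) if union else 0.0

def pa(G, x, y):                       # Preferential Attachment, eq. (18)
    return len(G[x]) * len(G[y])

def katz(A, x, y, beta=0.05, max_len=10):
    """Katz index, eq. (19), truncated at walk length `max_len`.
    A: adjacency matrix (array-like)."""
    A = np.asarray(A, float)
    score, Al = 0.0, np.eye(len(A))
    for l in range(1, max_len + 1):
        Al = Al @ A                    # A^l counts walks of length l
        score += beta ** l * Al[x, y]
    return score
```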
In this section, we first introduce some general tips for applying graph machine learning in scientific discovery, followed by two successful examples in molecular science and social science.
If your task focuses on a single large graph, you may run into out-of-memory issues. We suggest (1) utilizing sampling strategies, and (2) using fewer propagation layers so as not to involve too many neighbors.
If your task focuses on many small graphs, time efficiency may be an issue: GNNs can be slow on mini-batched graph-level tasks.
Features matter: if your graph nodes have no features, you can construct them manually. Suggested features include node degree, Laplacian eigenvectors, and DeepWalk embeddings.
feature normalization may heavily influence the performance of GNN models.
Adding self-loops may provide additional gains for your model.
Performance on a single data split may not be reliable. Try different data splits for reliable results.
If your data does not naturally have a graph structure, it may not be worthwhile to construct one manually just to apply GNN methods.
GNNs are permutation-equivariant neural networks. They may not work well on tasks requiring other geometric properties, or on nodes associated with other information.
Algorithm 1: Circular fingerprints [16]
Input: molecule, radius R, fingerprint length S
Initialize: fingerprint vector f ← 0_S
for each atom a in molecule:
    r_a ← g(a)                      # lookup atom features
for L = 1 to R:
    for each atom a in molecule:
        r_1 ... r_N = neighbors(a)
        v ← [r_a, r_1, ..., r_N]    # concatenate
        r_a ← hash(v)               # hash function
        i ← mod(r_a, S)             # pseudo-random index
        f_i ← 1                     # set one bit
Return: binary vector f
Algorithm 2: Neural graph fingerprints [16]
Input: molecule, radius R, fingerprint length S, hidden weights H_1^1 ... H_R^5, output weights W_1 ... W_R
Initialize: fingerprint vector f ← 0_S
for each atom a in molecule:
    r_a ← g(a)                      # lookup atom features
for L = 1 to R:
    for each atom a in molecule:
        r_1 ... r_N = neighbors(a)
        v ← r_a + Σ_{i=1}^{N} r_i   # sum
        r_a ← σ(v H_L^N)            # smooth function
        i ← softmax(r_a W_L)        # sparsify
        f ← f + i                   # add to fingerprint
Return: real-valued vector f
Molecules are one of the most common applications for graph neural networks, especially message passing neural networks. Molecules are naturally graph objects, and GNNs provide a compact way to learn representations on molecular graphs. This line of work was opened up by the seminal work NEF [16], which built a neat connection between the process of constructing the most commonly used structure representation (molecular fingerprints, Algorithm 1) and graph convolutions (Algorithm 2). It is worth noting that the commonly used string encoding for molecules (SMILES, the Simplified Molecular Input Line Entry System) can be considered a parsing tree (an implicit graph representation) defined by its grammar.
There are mainly two branches of problems that have been explored extensively with graph representations and graph neural networks: (1) predictive tasks and (2) generative tasks. A predictive task answers a specific question about certain molecules, such as the toxicity or energy of any given molecule. This is particularly beneficial for applications like virtual screening, which would otherwise require experiments to obtain molecular properties. A generative task, on the other hand, aims to design and discover new molecules with certain desired properties, which is also called molecular inverse design. For predictive tasks, graph representations provide an efficient and effective way to encode the structure of molecules and lead to better performance on downstream tasks of interest. For generative tasks, graph representations enable us to design the generative process in a more flexible way, since a graph representation can be mapped to a molecule deterministically.
Another research hotspot in modeling molecules with graphs is molecular pre-training, which arises from real-world applications. As the chemical space is gigantic (estimated to be \(10^{23}\) to \(10^{60}\) for small drug-like molecules), the explored areas are very limited. However, we have much more access to molecular structures without property annotations. This motivates research into leveraging unlabeled molecular structures to learn general and transferable representations that can be fine-tuned on any task, even with only a small amount of labeled data.
Last but not least, the work briefly discussed above is mostly about small drug-like molecules. However, graph representations are applied much more widely, to a variety of molecules such as proteins and RNAs (large bio-molecules), crystal structures and materials (with periodicity), etc. Also, we mainly focus on 2D graph representations in this blog; we defer discussion of 3D graph representations to a later blog.
Graphs are naturally well-suited as a mathematical formalism for describing and understanding social systems, which usually involve a number of people and their interpersonal relationships or interactions. The most well-known practice in this regard is the concept of social networks, where each person is represented by a vertex (node), and the interaction or relationship between two persons, if any, is represented by an edge (link).
The practice of using graphs to study social systems dates back to the 1930s when Jacob Moreno, a pioneering psychiatrist and educator, proposed the graphic representation of a person’s social link, known as the sociogram [20]. The approach was mathematically formulated in the 1950s and became common in social science later in the 1980s.
Zachary’s karate club. To motivate the study of social networks, it is worth introducing Zachary’s karate club [21] as a starting example. Zachary’s karate club refers to a university karate club studied by Wayne Zachary in the 1970s. The club had 34 members. If two members interacted outside the club, Zachary created an edge between their corresponding nodes in the social network representation. Figure 10 shows the resulting social network. What makes this social network interesting is that during Zachary’s study, a conflict arose between two senior members of the club (node 1 and node 34 in the figure). All other members had to choose sides between the two senior members, essentially leading to a split of the club into two subgroups (i.e., “communities”). As the figure shows, there are two communities of nodes, centered at nodes 1 and 34 respectively.
Zachary further analyzed this network and found that the exact split of club members could be identified purely from the structure of the social network. Briefly speaking, Zachary ran a min-cut algorithm on the collected social network. The min-cut algorithm essentially returns a group of edges forming the “bottleneck” of the whole social network; the nodes on different sides of the “bottleneck” are determined to belong to different splits. It turned out that Zachary was able to precisely identify the community membership of every node except node 9 (which indeed lies right on the boundary, as the figure shows). This example is often used to suggest that social networks (graphs) are a powerful formalism for revealing the underlying organizational truths of social systems.
Important domains of study. The research on social networks grew rapidly in the past few decades and has spawned many branches. Exhausting all those branches would certainly go beyond the scope and capacity of this blog; here we briefly survey a few of the most influential ones.
Static structure. The first step towards understanding social networks is to analyze their static structural properties. The effort involves the development of scientific measures to quantify those properties, and the empirical measurement of them on real-world social networks. Generally speaking, a social network can be analyzed at local and global levels.
At the local level, node centrality measures the “importance” of a person with respect to the whole network. Popular examples include degree centrality, betweenness centrality [22], closeness centrality [23], eigenvector centrality [24], PageRank centrality [25], etc. These measures differ in the aspects of social importance they emphasize. For example, the eigenvector centrality \(x_i\) of a person (node) \(i\) is defined recursively as: \(x_i = \frac{1}{\lambda} \sum_{j\in\mathcal{N}(i)} x_j\) where \(\lambda\) is the largest eigenvalue of the adjacency matrix of the social network (and is guaranteed to be a real, positive number). This centrality measure is underpinned by the principle that a person is considered more important if that person has connections with more important people. We refer interested readers to [24] for more details.
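The recursive definition is exactly the fixed point of power iteration on the adjacency matrix. A toy sketch; the iteration count and normalization scheme are implementation assumptions:

```python
import numpy as np

def eigenvector_centrality(A, iters=200):
    """Power iteration for x = (1/lambda) * A x.

    A: (n, n) symmetric adjacency of a connected, non-bipartite graph
    (non-bipartiteness ensures the iteration converges)."""
    x = np.ones(len(A))
    for _ in range(iters):
        x = A @ x                      # each node accumulates neighbors' scores
        x = x / np.linalg.norm(x)      # normalize (implicitly divides by lambda)
    return x
```

On a small graph with a well-connected hub, the hub receives the highest centrality score.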
Besides node centrality, another example of a local measurement is the clustering coefficient [26], which measures the tendency of “triadic closure” around a center node: \(c_i = \frac{|\{e_{jk} \in E: j,k\in \mathcal{N}(i)\}|}{k_i(k_i-1)/2}\) where \(k_i\) is the degree of node \(i\); the coefficient is the fraction of pairs of \(i\)’s neighbors that are themselves connected.
At the global level, network distances and modularity are two measures for characterizing the macro structure of a social network. Popular network distance measures include shortest-path distances, random-walk-based distances, and the (physics-inspired) resistance distance. Conceptually, they may be viewed as quantifying the “difficulty” of traveling along the edges of the social network from one node to another. Modularity often accompanies the important task of community detection in social networks: it measures the strength of division of a social network into groups or clusters of well-connected people.
Dynamic structure. Real-world social interactions often involve time-evolving processes, so many studies on social networks explicitly incorporate temporal information into the modeling. The task of link prediction, for example, has often been introduced to model the evolution of a social network: it predicts whether a link will appear between two people at some (given) future time, thereby predicting the network’s evolution. Another area where dynamic structures of social networks are often discussed is the modeling of face-to-face social interactions. Some of the most recent works in this regard abstract people’s interaction traits, such as eye movement, eye gazing, and “speaking to” or “listening to” relationships, into attribute-rich dynamic links. It is believed that these dynamic interactions carry crucial information about the social event and people’s personalities, so a temporal graph that explicitly models these interactions greatly helps the analysis of such social interactions. For example, in [27], researchers found that prediction models built on a temporal graph achieve state-of-the-art accuracy in identifying lying, dominance, and nervousness of people interacting with each other in a role-playing game.
Information flow. Sometimes the structure of a social network is not the ultimate target of interest to researchers. Instead, people care about the fact that their opinions and decision-making processes are often affected by social interactions with friends and acquaintances. Social networks are therefore often regarded as the infrastructure on which information flows and opinions propagate, and it is crucial to know how social networks of different structures affect the spreading of information. A long line of works, for example, has focused on modeling so-called opinion dynamics on social networks. Research in this area has seen successful applications in viral marketing [28], international negotiations [29], and resource allocation [30].
There are many opinion dynamics models, all of which are essentially mathematical models that describe how people’s opinions on some matter, represented as numerical values, dynamically affect each other following mathematical rules that rely on the network structure. Some of the most popular opinion dynamics models include the voter model [31], the Sznajd model [32], the Ising model [33], the Hegselmann-Krause (HK) model [34], the Friedkin-Johnsen (FJ) model [35], etc. Here we introduce the Friedkin-Johnsen model as an example. The FJ model is not only a popular object of study among social scientists in recent years, but is also to date the only model whose predictions of opinion changes have been confirmed by a sustained line of human-subject experiments. The FJ model assumes two opinions held by each person \(i\) in the social network: an internal opinion \(s_i\) that is always fixed, and an external opinion \(z_i\) that evolves in adaptation to \(i\)’s internal opinion and its neighbors’ external opinions. The evolution of the external opinion over time steps follows the rule: \(\begin{aligned} z^{0}_i &= s_i\\ z^{t+1}_i &= \frac{s_i+\sum_{j\in N_i}a_{ij}z^t_j}{1+\sum_{j\in N_i}a_{ij}} \end{aligned}\)
where \(N_i\) is the neighbors of node \(i\), \(a_{ij}\) is the interaction strength between persons \(i\) and \(j\).
One very elegant property of the FJ model is that the expressed opinions eventually reach a closed-form equilibrium: \(\begin{aligned} z^{\infty} = (I+L)^{-1}s \end{aligned}\) where \(z^{\infty}, s\in \mathbb{R}^{|V|}\) are the opinion vectors and \(L\) is the (weighted) Laplacian of the social network. This closed-form equilibrium brings tremendous convenience to many follow-up works [36,37,38,39] that further define indices of, for example, polarization, disagreement, and conflict on the equilibrium opinions.
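The update rule and the closed-form equilibrium can be cross-checked numerically. The toy weights and the iteration count below are assumptions of this sketch:

```python
import numpy as np

def fj_equilibrium(A, s):
    """Closed-form FJ equilibrium: z_inf = (I + L)^{-1} s,
    where L = D - A is the weighted graph Laplacian."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.solve(np.eye(len(A)) + L, s)

def fj_simulate(A, s, steps=500):
    """Iterate z_i <- (s_i + sum_j a_ij * z_j) / (1 + sum_j a_ij).
    The map is a contraction, so it converges to the equilibrium."""
    z = s.astype(float).copy()
    deg = A.sum(axis=1)
    for _ in range(steps):
        z = (s + A @ z) / (1.0 + deg)
    return z
```

On any weighted network, the simulated opinions converge to the closed-form solution; the fixed-point condition \((I + D)z = s + Az\) rearranges to \((I + L)z = s\).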
Graph Neural Networks Foundations, Frontiers, and Applications
Graph Signal Processing: Overview, Challenges, and Applications
[1] Linton Freeman. The development of social network analysis. A Study in the Sociology of Science, 1(687):159–167, 2004.
[2] Michael GH Bell, Yasunori Iida, et al. Transportation network analysis. 1997.
[3] Jon Kleinberg and Steve Lawrence. The structure of the web. Science, 294(5548):1849–1850, 2001.
[4] Ed Bullmore and Olaf Sporns. The economy of brain network organization. Nature reviews neuroscience, 13(5):336–349, 2012.
[5] Kristel Van Steen. Travelling the world of gene–gene interactions. Briefings in bioinformatics, 13(1):1–19, 2012.
[6] Minoru Kanehisa, Susumu Goto, Miho Furumichi, Mao Tanabe, and Mika Hirakawa. Kegg for representation and analysis of molecular networks involving diseases and drugs. Nucleic acids research, 38(suppl_1):D355–D360, 2010.
[7] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
[8] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[9] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[10] Yao Ma and Jiliang Tang. Deep learning on graphs. Cambridge University Press, 2021.
[11] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations.
[12] Yao Ma, Xiaorui Liu, Tong Zhao, Yozen Liu, Jiliang Tang, and Neil Shah. A unified view on graph neural networks as graph signal denoising. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1202–1211, 2021.
[13] Francesco Di Giovanni, James Rowbottom, Benjamin P Chamberlain, Thomas Markovich, and Michael M Bronstein. Graph neural networks as gradient flows. arXiv preprint arXiv:2206.10991, 2022.
[14] Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S Du, Ken-ichi Kawarabayashi, and Stefanie Jegelka. What can neural networks reason about? In International Conference on Learning Representations.
[15] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 1365–1374, 2015.
[16] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. Advances in neural information processing systems, 28, 2015.
[17] Tian Xie and Jeffrey C Grossman. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Physical review letters, 120(14):145301, 2018.
[18] Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn. Science, 378(6615):49–56, 2022.
[19] Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. Digress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734, 2022.
[20] Jacob Levy Moreno. Who shall survive?: A new approach to the problem of human interrelations. 1934.
[21] Wayne W Zachary. An information flow model for conflict and fission in small groups. Journal of anthropological research, 33(4):452–473, 1977.
[22] Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977.
[23] Alex Bavelas. Communication patterns in task-oriented groups. The journal of the acoustical society of America, 22(6):725–730, 1950.
]]>This blog will change from time to time, as I am still learning about these topics.
Last updated: 2023/05/25
Let me first narrow down the topic a little and take graph research as an example.
Research papers fall into different types. I prefer some and pay less attention to others; nonetheless, the latter can still be good research.
A simple bird's-eye view is as follows:
When reading a paper, I ask the following questions:
Does the paper follow the common evaluation setting?
If I am familiar with the research topic, the first thing I do is check whether the experimental setting contains tricks. This happens often in certain graph tasks. There can be legitimate reasons for a nonstandard setting, but in most cases the real reason is that the method does not work well under the standard one. This check also guards against taking a paper's overclaims at face value, e.g., claiming to learn good graph representations for various downstream tasks while only reporting experiments on node classification.
Does the paper try to define a new problem?
A paper can be good simply because it defines a novel problem; it can be very hard to distill a meaningful task from a real-world scenario.
If the paper proposes a new solution to an existing problem, why is the new method necessary?
In most cases I am tolerant of a weak solution if the paper defines a new problem. If a paper only focuses on solving an existing problem, however, it is often an A+B-type paper. In the graph domain especially, once you have seen this year's ICML, NeurIPS, and ICLR, you can predict what will appear at the next KDD and WWW. As far as I can tell, I still do not see a good reason to use a diffusion model for a classification task in the graph domain. Such papers are useful for junior students learning how to write a paper and make a method work; nonetheless, there will always be someone doing the same thing as you, and if you do not implement the idea, it will very likely be done by someone else within a year. I would rather focus on more important topics that push the field further.
If the paper presents a theoretically inspired idea, what is the key intuition underlying it? Is there a toy example explaining that intuition?
]]>That I could meet the best of you in the best years of my life.
Thank you for appearing in my life.
Every encounter with you was a chance to grow,
you are like a shooting star,
adding a faint glow to my dim world.
We drank and laughed, wild and unrestrained,
the flames of youth colliding,
and I have countless reasons
to want to keep these nights.
Thank you for appearing in my life.
Over the years, one after another of you has drifted away from my side.
I tried to offer you my advice,
but life needs no advice;
some things
cannot be understood until they are lived.
I both hope and do not hope
that you will run headlong into the wall.
We must part,
for freedom, for ideals, for livelihood, for equality, or at the gentle wave of the era's great hand.
Thank you for appearing in my life.
Ours is a lovely, youthful, inspiring friendship, neither servile nor haughty,
between you and me, the most equal and sincere of bonds.
We never compare whose life stands higher,
yet together we can admire the fall of a single leaf.
Commands and advice born of unequal power clamor and crowd my life.
Yes, life holds many helpless moments and too much hardship,
but you have added to my life a window for discovering beauty.
After lighting up my starry sky, you will leave.
My dear friend, thank you for appearing in my life.
Do you still know the morning?
Last night you said you wished the curtain of night would never lift.
Do you remember that night's starry sky?
I have never forgotten that sea, that original aspiration.
Under the blue sky, the best years of my life.
Perhaps we will never meet again in this lifetime.
In the season of farewells, haven't we already cried?
Your laughter from those days still surfaces in my mind,
the restless sound of youth in the warm summer wind.
]]>A note before the text: this piece was written on October 22, 2022, exactly two months after I arrived in the United States. The strangeness and discomfort of just arriving, things crashing down on my head one after another until I could barely breathe; last weekend I went to northern Michigan for a brief respite.
This Saturday I gave myself a short day off, though plenty of things still cling to me. A quick list: 1. building the experiment framework; 2. reading related papers; 3. the AI foundation homework and project; 4. reviewing for LoG and ICLR; 5. the specific-topic project assignment; 6. writing the AI4science101 documentation; 7. reading financial reports; 8. an idea discussion on Sunday. Even this half-day of rest feels like a luxury. I have slept a great deal these two days because it has truly been exhausting; yet even with plenty of sleep, being entangled in trivialities from the moment I get out of bed is not a pleasant feeling.
The sun outside is just right now, and I have written the piece below. I hope it brings you the most genuine emotion, an understanding of ideals, and a stirring toward the future.
To what can this wandering life be likened? To a wild goose alighting on snowy mud: by chance it leaves a few claw prints in the mud, then flies off, caring not whether east or west. The old monk is dead, a new stupa built; on the crumbling wall no trace of the old inscription remains. Do you still remember the rugged road of those past days, the travelers weary, the lame donkey braying?
Evening. The Guangzhou sun no longer blazes without restraint; veiled in thin mist, the light bows its head, falling through the hall of Guangzhou Baiyun International Airport onto each of our faces. On an ordinary day I would laugh and say: the Tyndall effect! But today, this may be my last night on home soil. The plane hums past, a train through the sky, a forty-hour journey. At this moment I am directly above Guangzhou; night has fallen, the city sheds the day's heat and its nightlife begins, while I, above the city, say farewell to the land that bore and raised me for twenty-two years. Perhaps I can only become a cloud in the sky, drifting now and then across your heart; do not be startled, nor delighted. I will only follow the warm wind of my ideals: wherever it blows me, there I will fly.
Noon. Cars whir past, and I sit inside one; Lansing's sun hides behind a mist of water vapor. I want to roll down the window and lean my face out, to feel the wind turbines, the cornfields, to be struck full in the chest by hillsides of red autumn leaves. A human life is scarcely a hundred years; the seasons slip away and the grasses wither and flourish. Yet in autumn, life blooms with all its might, merging with the sound of the wind, stretching across a thousand years of seasons. Through seas turned to mulberry fields, through fireworks gone cold, can one still feel that breath-holding awe of first beholding nature's uncanny craft?
Morning. A drizzling rain. The boat casts off its ropes, the whistle sounding low; the rain keeps falling while I stand on deck, cold drops striking my body. The distant hills, red, yellow, and green interlaced, flicker in and out of view. The water opens wide, splitting heaven and earth; mountains and waters meet, until seas dry and stones crumble. I hold a leaf in my hand, tracing its veins, and suddenly recall a line from a Pu Shu song: "waiting for the poplar leaves to fall, eyes unblinking." Not since childhood have I had such a mood. My fingers loosen slightly, and the leaf, carried by the lake wind, is gone in an instant, who knows where. Perhaps in this life its beauty would have lingered only in my eyes; now it is kept in these words.
Deep in the night, in my dream, my dearest father and mother sit across the dinner table, and I tell them: I understand the meaning of life better than before. Don't worry, I will have a life of my own.
Evening. The glow of sunset fills my eyes. Zhiyu and I bought groceries for the weekend at Meijer and drove home, the stereo playing "Those Flowers": wordless, only silence, this weary life. Sitting in the car, we did not know where we should be heading. The road ahead is out of reach, the road behind recedes. Life is fickle; I thought of my friends. Some, at a flower-like age, seem about to wither before they have ever bloomed; some I may need a lifetime to miss. Two warm tears welled in my eyes. Have you seen those flowers of mine? Are my flowers doing well?
Early morning. I open my eyes and pull up the blinds; sunlight pours in. Outside my window stands a tree, its fallen leaves covering the ground, drifting onto my windowsill, keeping me silent company every day. I suddenly remember the lone ginkgo outside my dorm at UESTC. Chengdu's sky was always overcast; my dorm was on the first floor, sunless in every season. On the balcony stool, yellow leaves fell, and I watched the tree scatter handful after handful of leaves into my heart. Four years later I left, leaving not a trace of having lived there. Is that ginkgo tree doing well?
After the wild goose has flown away, beyond a few chance claw prints left in the snowy mud, who cares whether it headed east or west? Claw prints in snowy mud: once passed, it vanishes without a trace, and the prints themselves leave none. You and I met on the sea of night; you have your direction, and I have mine. Remember it if you will, or better still forget: the light we cast upon each other in that meeting, a summer-limited encounter, leaves no mark once the snow falls. Your prospects are vast; this is not the end. Failure runs through everything, and life is fickle, but that too is not the end. You may not remember me; life's meetings are accidents, but I know all the better how precious it was to meet you: the hard days past, the warm memories. Carry the longing and the gratitude, and keep flying. Thank you for appearing in the twenty-second year of my life!
]]>