

Research 
Biology, Genetics and Genomics
The work
in the Zhang lab in
this broad field is mainly computational and falls into the
category of Data
Sciences (or Big Data). We aim to find and understand
biological “stories”
through building models and identifying patterns from large
quantities of
genomescale biological data. In the process, we integrate the
bench work of
molecular biology and develop novel methods and efficient
software tools along
the way. In this broad area, we pursue two overarching themes,
network
systems biology and gene regulation via noncoding
RNA genes. We are
interested in basic biological questions (e.g., the biogenesis
of small
noncoding RNAs, transcriptional and posttranscriptional gene
regulation) and
their applications to complex diseases (e.g., Alzheimer’s
disease (AD) and
psoriasis) and plant stress response (e.g., rice blast and
rice bacterial
blight).
Networkbased,
allelespecific
genomewide association study
(GWAS): GWAS and
singlegene approaches have
been popular in identifying risk factors of Mendelian
diseases, such as
familial Alzheimer's disease (AD). Despite a tremendous amount
of effort,
however, these methods fail to decipher the missing
inheritability of complex
diseases, such as lateonset AD (or simply AD), diabetes, and
psoriasis. A key
factor that impairs conventional GWAS on polygenic diseases is
the lack of
effective ways to identify multiple genomic loci or genes
whose interactions or
associations underlie the manifestation of disease phenotypes.
For example, a
bruteforce search of possible associations among only three
markers within one
million single nucleotide polymorphisms (SNPs) requires
examining a prohibitive
number of 1.7×10^{17} trios. In the Zhang lab, we took
a network
perspective toward this problem and focused on developing
novel enabling
techniques for network GWAS. One network GWAS method recently
developed in the
lab is called BlocBuster (Climer,
Templeton and
Zhang, PLOS Comp. Biol., 10(9):e1003766, 2014),
which
exploits the fact that there exist genetic architectures of
markers underlying
a complex trait. Such architectures are best represented as a
network where
correlated markers can be regarded as forming network modules
or communities.
We also used a new similarity measure to accommodate genetic
heterogeneity in
the population in order to make the network GWAS analysis
allelespecific,
robust, and rigorous. One application of the network GWAS
method to the HapMap population
SNPs data revealed a giant 284SNPs
divergent yinyang haplotype pattern encompassing gephyrin,
a gene with remarkably diverse functions related to
translation initiation,
regulation of synaptic protein synthesis, and mTOR
signaling, to name a few (Climer,
Templeton and
Zhang, Nature Comm., 6:6534, 2015; see
the first figure
below), suggest a strong positive selection on the region over
gephyrin, rapid evolution of
world populations, and
possible distinct molecular mechanisms involving gephyrin.
The discovery of this large yinyang pattern, which was missed
by all existing
methods, demonstrates the power of the network GWAS approach.
The second application
of the network method to GWAS data of psoriasis, an autoimmune
skin disorder,
helped reveal a module of 17 tightly correlated SNPs within
the Major
Histocompatibility Complex (MHC) on human chromosome 6 (Climer,
Templeton and
Zhang, PLOS Comp. Biol., 10(9):e1003766, 2014;
see also the
second figure below). While some of the 17 SNPs have been
indicated in previous
studies, our result showed that their specific alleles as a
whole provide a
stronger discriminative power to distinguish psoriasis
subjects from normal
controls. In short, the Zhang lab is the first to
systematically pursue the
notion of network GWAS to push the envelopment of conventional GWAS to a new
dimension and to
develop an effective enabling technique to approach complex
disease.
Networkbased
transcriptome
analysis: Manifestation
of complex diseases or traits is often the result of aberrant
gene expression
variations in disease states. Analysis of genomewide gene
expression
variations, or transcriptome analysis, can thus provide
insights into disease
mechanisms. In the same vein as network GWAS, the Zhang lab
has taken a
coexpression network approach to transcriptome analysis of
complex diseases by
exploiting the rationale of “guilty by association”. One
realization of this
idea is the HQCut method for
constructing and
analyzing coexpression networks using gene expression data (Ruan,
Dean
and Zhang, BMC Syst. Biol., 4:136, 2010;
Ruan
and Zhang, Physics Review E, 77:016104, 2008).
The central piece of
the method is a heuristic that combines a spectral clustering
technique that
maps gene expression data to a latent, lower dimensional
space, by which the
noise of low information content in the data can be
effectively removed. As a
result, clustering can be more accurately and efficiently
performed. This
network transcriptome method has been used in our own research
and by many
other labs. In particular, using this method, we were able to
unravel intricate
relationships among coexpressed genes in AD, diabetes, and
cardiovascular
diseases (Ray, Ruan and Zhang, Genome Biol.,
9(10):R148, 2008),
showing that many genes that are known to be involved in these
metabolic
diseases were also aberrantly expressed in the AD brain and
more interestingly,
appeared in one module of coexpressed genes (see the module
and genes in
yellow in the figure below). This result support
the
notion that AD is a "diabetes" of the brain. In addition,
using this
method we identified brainregion specific patterns of gene
expression
variation in the AD brain (Ray and
Zhang, BMC Syst.
Biol., 4:136, 2010) and revealed the genetics of
gene expression (or eQTL) in AD (Webster,
et al., Am. J.
Human Genetics, 84(4):44558, 2009). In short, our
work on network
biology has clearly demonstrated the power and rigor that
computational biology
can bring to complex biology problems.
Biogenesis
of
small noncoding RNAs: Gene
expression variations across disease states and the normal
condition are often
due to aberrant gene regulation where small noncoding RNAs (sncRNAs),
particularly microRNAs (miRNAs) and small interfering RNAs
(siRNAs), play
essential roles. Identification of sncRNAs
and
analysis of their biogenesis and regulatory functions are key
steps toward
elucidation of aberrant gene regulation. In my lab, we have
developed effective
methods for identification of miRNAs and siRNAs as well as
methods for
analyzing their biogenesis and potential functions in gene
regulation and
diseases. In particular, we developed an innovative method to
show that most
miRNAs in animals (e.g., human and mouse) and plants (e.g.,
rice and
Arabidopsis) are transcribed by RNA polymerase II (Zhou,
Ruan, Wang and Zhang, PLOS
Comp. Biol.,
3(3):e37, 2007). The analysis was done by first finding
cisregulatory
elements (motifs) in the promoter regions of miRNAs using our
WordSpy genomewide motif finding
method (Wang
and
Zhang, Genome Biol., 7(6):R49, 2006),
constructing a model of
miRNA transcription using the motifs, and then applying the
model to
distinguish miRNAs from proteincoding genes. By combining
large quantities of
sequencing data with bench experiments, we also showed that a
miRNA precursor
might generate multiple miRNAs (Zhang,
et
al., Genome Biol., 11(8):R81, 2010; see an
example of miR822 in
Arabidopsis in the first figure below) or a miRNA plus a siRNA
that regulates
DNA methylation, suggesting a model of dual biogenesis of
miRNA and siRNA (Chellappan,
et
al., Nucleic Acids Res., 38(20):688394, 2010;
see the
second figure below)
Moreover,
microRNAs may also
arise from the same gemomic loci
as other noncoding
RNAs, e.g., tRNAs and snoRNAs.
Combining computational and bench molecular biology work, we
identified and
confirmed that miRNAs can originate from the loci where tRNAs
were encoded in mouse and murine herpesvirus (Reese,
et al., J. Virology,
84(19):103453, 2010; see the first figure below for
some examples in
mouse). In addition, our extensive computational analysis of
large quantities
of profiling data revealed that there exist a huge number of
isoforms of
microRNAs in many eukaryotic organisms, including human (Xia and
Zhang, Nucleic
Acids Res., 42(3):142741, 2014; also see the
second figure below for
an example), some of which may also be functional in complex
diseases such as
psoriasis (Xia,
Joyce,
Bowcock and Zhang, Human Mol. Genetics,
22(4):73748, 2013). In
summary, our work has broadened our view on the biogenesis,
diversity and
function of small noncoding RNAs.
Regulatory
functions
of small noncoding RNAs: We
expanded our work to examine the potential roles of sncRNAs
in complex disease and plant stress response. For example,
using the miRNA
transcriptome modeling as a tool, we were able to identify
cold and UVBlight
responsive miRNAs in model plant Arabidopsis thaliana, most of
which were also
experimentally validated (Zhou,
Wang and Zhang, Mol. Syst. Biol., 3:13, 2007; Zhou, et
al., Biochim. Biophys. Acta,
1779(11):7808, 2008). This is the first
computational miRNA gene function annotation method by
modeling the possible
transcriptional regulators of microRNAs and their potential
targets (see the
first figure below for an example of putative Auxin signaling
pathways mediated
by microRNAs in response to UVB light stress). Furthermore,
combining
deepsequencing profiling, computational analysis, and
experimental validation,
we characterized the functions of microRNAs in psoriatic and
normal human skin
(Joyce
and et al., Human
Mol. Genetics, 20(20):402540, 2011; see some
microRNA aberrant
expression patterns in the second figure below and microR135b
expressed in
psoriatic skin in the third figure below), in viral infection
(Reese,
et al., J.
Virology, 84(19):103453, 2010), and plant stress
responses, for
example, in Aribidopsis (Zhang,
et al., Human
Mol. Genet., 20:402540, 2011) and cassava (Zeng, et
al., Nucleic
Acids Res., 38(3):98195, 2010. Our work has
broadened our view of the
biogenesis, diversity and functions of sncRNAs
in
diseases and plants. In summary, we have done some pioneering
work in
developing and using computational methods to study basic
biological problems
related to small noncoding RNA gene regulation that were
traditionally
approached through bench work.
Research
 Machine
learning/datamining
In
recent years, in order to
support the research activities in computational biology and
genomics, we have
been focusing on developing effective computational methods
for finding and
analyzing modular structures in large (biological) networks.
Identification
of
network modules in complex networks: The
network GWAS and coexpression network approaches described
above rely heavily
on accurately and efficiently finding network modules.
Nevertheless, finding
structures in large networks is a challenging problem in
machine learning and
datamining. First, network structural properties can be
characterized in
several different ways, by node connectivity, by properties on
links (such as
relationships among connected individuals), or by semantics of
nodes and links
(such as the social roles that individuals play in a society
network). In our
research, we considered these different types of information
for network module
finding and developed effective novel methods for finding
modules of nodes (Jin,
Chen,
He and Zhang, Proc. AAAI15), modules
of links (He,
Liu,
Jin and Zhang, Proc. AAAI15), and
modules that are defined
by network structural information and node semantics (He,
et al., Proc AAAI17).
We adopted a few different
machine learning techniques in developing our methods,
including nonnegative
matrix factorization (Wang, et al., Proc.
AAAI16), stochastic modeling (He,
Liu,
Jin and Zhang, Proc. AAAI15), deep
learning (Yang,
et
al., Proc. IJCAI16), and Markov Random Fields (He,
et
al., Proc. AAAI18).
Applications
of network module analysis: One of
the applications of this line
of research is the networkbased GWAS, as discussed in (Climer,
Templeton and Zhang, PLOS Comp. Biol.,
10(9):e1003766, 2014) and (Climer,
Templeton
and Zhang, Nature Comm., 6:6534, 2015).
Another
area of application is networkbased transcriptome analysis
for elucidation of
gene regulation, which has been studied in the Zhang lab, as
discussed in (Ray, Ruan
and Zhang, Genome Biol., 9(10):R148, 2008), (Ruan, Dean and Zhang, BMC Syst.
Biol., 4:136, 2010)
and (Ruan
and Zhang, Physics Review E, 77:016104, 2008).
Research
interests
 Heuristic search, planning and optimization
The
Zhang has a long history of
research on heuristic search, planning and optimization. The
work has made
significant contributions to the understanding of the nature
and complexity of
many search techniques, such as bestfirst search and
depthfirst
branchandbound, and combinatorial optimization problems,
such as the
Traveling Salesman Problem and Constraint Satisfaction.
Asymptotic
optimality
of linearspace search algorithms: Heuristic
search is one of the
cornerstone of AI. The bestfirst search (BFS) algorithm, with
the wellknown
A* algorithm as a special case, is time optimal. However, it
requires space
that is exponential in the search depth of the underlying
search tree, which is
impractical for large combinatorial optimization problems,
such as the
Traveling Salesman Problem (TSP). In contrast, a linearspace
search (LSS)
algorithm, with depthfirst search and depthfirst
branchandbound (DFBnB) as
special cases, runs linearly in search depth with
a penalty of exploring more nodes in the search space. Due to
their limited
space requirement, LSS algorithms, including DFBnB,
iterative deepening A* (IDA*), and recursive bestfirst search
(RBFS), are the
most popular techniques for large and difficult search
problems in practice.
Understanding the expected complexity of LSS algorithms was an
important and
challenging problem.Based on a
random search tree
model proposed by Judea Pearl and Richard Karp, our
research established the
asymptotic optimality of LSS methods,
showing that their
expected complexity is a constant factor of that of BFS (Zhand
and Korf,
Artificial Intelligence, 79(2):24192, 1995;
Zhang,
StateSpace
Search, Springer, 1999). Moreover, LSS algorithms
may run faster than
BFS on large problems, making them the algorithms for real
applications.
A
directly and early application
of our complexity results is the explanation to anomaly of the
fixed horizon
search, which was observed initially by Richard Korf
on slidingtile puzzles using DFBnB,
as illustrated
below. The anomaly states that for a fixed search depth, DFBnB
can search the trees with larger branching factors faster. In
fact, the anomaly
persists regardless which search method, e.g., ID (or IDA*) or
RBFS, shown
below.
The
insight gained by the
complexity analysis on this search anomaly helped predict the
existence of a
phase transition in heuristic search problems, discussed
below.
Phase
transitions
and structures of combinatorial search:
Our
study also revealed that the
expected complexity of BFS and LLS algorithms exhibits a sharp
and abrupt
transition from polynomial to exponential time as some
critical problem features
(e.g., the precision of heuristic function) shift (Zhand
and Korf,
Artificial Intelligence, 79(2):24192, 1995;
Zhang
and Korf, Artificial
Intelligence, 81(12):22339, 1996;
Zhang,
StateSpace
Search, Springer, 1999). This is similar to phase
transitions of spin
glasses studied in physics. A simple example of a phase
transition is water
changing from liquid to solid ice when the temperature drops
below the freezing
point. We identified that the accuracy of heuristic
information is the
controlling parameter of phase transitions. On the Judea&Karp
incremental random search tree model, the phase transition can
be summarized by
the following figure.
The control parameter of this phase transition is the product of the mean branching factor of the search tree and the probability if an edge cost has cost 0.
This analysis was extended to reveal phase transitions in combinatorial optimization problems, such as the asymmetric TSP (ATSP) (Zhang and Korf, Artificial Intelligence, 81(12):22339, 1996) and Boolean Satisfiability (Zhang, Artificial Intelligence, 158:126, 2004). For the ATSP, the control parameter is the precision of the distances between cities, which is equivalent to the number of digits used to represent distances. Intuitively, when the number of digits used is small, many distances may have a high chance to have the same value, and consequently many complete tours (visiting each and every city exactly once) may have a high chance to have the same cost as well. This also means that a large number of optimal TSP tours exist; and finding one optimal TSP tour should be relatively easy. In contrast, when no two distances between cites are the same, the number of optimal TSP tour is small, so that finding an optimal solution is difficult. This phase transition is illustrated in the figure below, where the Assignment Problem (AP) is the main computation step (heuristic) for solving the ATSP.
Interestingly,
there also exist
phase transitions in maximum Boolean satisfiability (max SAT),
which is the
optimization version of the SAT where the objective is to find
a variable
assignment that minimize the number of violated clauses (Zhang,
Artificial Intelligence, 158:126, 2004).
Note
that the phase
transitions that we studied in optimization problems, such
as finding the
minimal TSP tour and the variable assignment to minimize the
number of
unsatisfied constraints, are fundamentally different from
the phase transitions
for decision problems, such as the decision version of the
TSP and the SAT. For
optimization problems, the phase transitions appear as an
easytodifficult
transition, whereas for decision problems, the phase
transitions exhibit an
easytodifficultthentoeasy pattern.
The
computational complexity
and phase transition are also related to the backbone of a
problem, which is a
set of variables that have the same values among all solutions
(Zhang, Proc.
Intern. Conf.
on Principles and Practice of Constraint programming,
CP2001; Zhang,
JAIR, 20:47197, 2004;
Zhang,
Artificial Intelligence, 158:126, 2004).
Problem
solving
by exploiting phase transitions and backbones: The study of
phase transitions and problem
structures such as backbones has changed the way we
characterize the complexity
of combinatorial problems, beyond the notion of worstcase
complexity, and the
way we solve difficult problems. In particular, phase
transitions can not
only be used to characterize problems, but also be exploited
in problem
solving. We have done a pioneering work on developing
new efficient
algorithms for finding optimal and approximate solutions by
exploiting phase
transitions (Pemberton
and
Zhang, Artificial Intelligence, 81:297325, 1996;
Zhang,
Artificial Intelligence, 158:126, 2004). One of
the ideas is to
transform a state space that is difficult to search into one
that is easy to
explore. Another idea is to estimate and utilize backbone
structures to guide a
search. These techniques are characterized as backbone
guided search (Zhang,
Artificial Intelligence, 158:126, 2004; Zhang
and
Looks, Proc. IJCAI2005).
The
Zhang
Algorithm for the ATSP: The
asymmetric TSP (ATSP) is an important NPhard problem with
many real
applications in the areas such as scheduling. The research
largely focuses on
efficient approximation algorithms. The first approximation
method is the KanallakisPapadimitiou
local search algorithm, named after
its inventors, two eminent scientists in computer science and
operations
research. The second approach is represented by my truncated DFBnB algorithm (Zhang,
Proc. AAAI 1993 Spring Symp.
on AI and NPHard
Problems; Zhang,
StateSpace Search, Springer, 1999), which has
been referred to as the
Zhang algorithm (Johnson,
et
al.in The Traveling Salesman Problem and its Variations,
pp.44588,
2002).
Long
distance
mutual exclusion and propositional planning: A dominating
technique in planning is GraphPlan
with mutual exclusions (mutex)
to represent constraints among actions and events at the same
planning horizon
(depth). In our work, we introduced long distance mutual
exclusion (londex) to capture
constraints across multiple horizons,
developed an efficient algorithm to efficiently compute londex,
and designed and implemented our new MaxPlan
planner
using londex (Xing,
Chen
and Zhang, Proc. ICAPS06; Chen,
Huang,
Xing and Zhang, Artificial Intelligence, 173:36591,
2009).
The overall work provided a new paradigm for planning. Our MaxPlan
system won the First Place Award of the Fifth
International Planning
Competition in 2006 (Xing,
Chen
and Zhang, Proc. ICAPS06).
Transition
based
encoding for planning: The
technique of planning as Boolean satisfiability usually uses
encodings in logic
statements. We introduced a novel SAT encoding scheme (SASE)
based on the SAS+
formalism (Huang,
Chen
and Zhang, Proc. AAAI10; Huang,
Chen and
Zhang, JAIR, 43:293328, 2012). The new scheme
exploits the
structural information so that the encoding is both compact
and efficient for
planning. We proved the correctness of the new encoding and
analyzed the
rationale behind its improvement in planning performance. Our
extensive
experimental results demonstrated significant improvements of
SASE over the stateoftheart
methods in time and space. This work received the
Outstanding Paper Award
of the 24th AAAI Conference on Artificial Intelligence
Conference in 2010
(AAA10) (Huang,
Chen
and Zhang, Proc. AAAI10), chosen from a double
blind review
process of 894 submissions. Note that AAAI is one of the two
leading
international AI conferences.