[1] Ji Yang, Xinyang Yi, Derek Zhiyuan Cheng, Lichan Hong,
Yang Li, Simon Xiaoming Wang, Taibai Xu, and Ed H. Chi, "
Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations",
Companion Proc. of the Web Conference 2020 (WWW'20), Taipei, April 2020.
[Abstract] [Paper] [Citations]
Learning query and item representations is important for building large-scale recommendation systems. In many real applications with a huge catalog of items to recommend, the problem of efficiently retrieving the top k items for a user's query from a deep corpus leads to a family of factorized modeling approaches in which queries and items are jointly embedded into a low-dimensional space. In this paper, we first showcase how to apply a two-tower neural network framework, also known as a dual encoder in the natural language processing community, to improve a large-scale, production app recommendation system. Furthermore, we offer a novel negative sampling approach called Mixed Negative Sampling (MNS). In particular, unlike commonly used batch or unigram sampling methods, MNS uses a mixture of batch and uniformly sampled negatives to tackle the selection bias of implicit user feedback. We conduct extensive offline experiments on a large-scale production dataset and show that MNS outperforms other baseline sampling methods. We also conduct online A/B testing and demonstrate that the two-tower retrieval model based on MNS significantly improves retrieval quality by encouraging more high-quality app installs.
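As a rough illustration of the sampling scheme described above, the sketch below computes a sampled-softmax loss for a two-tower model in which each query scores its in-batch items plus an extra pool of uniformly sampled catalog items. All array names, shapes, and the toy data are assumptions for the sketch, and the correction of logits for item sampling probabilities used in the published method is omitted for brevity.
```python
# Minimal sketch of Mixed Negative Sampling (MNS) for a two-tower model.
# Names and shapes are illustrative, not taken from the paper's code.
import numpy as np

def mns_loss(query_emb, pos_item_emb, uniform_item_emb):
    """Sampled-softmax loss over mixed negatives.

    query_emb:        [B, d]  query-tower outputs for a batch of B examples
    pos_item_emb:     [B, d]  item-tower outputs for the clicked items
    uniform_item_emb: [B', d] item-tower outputs for B' items sampled
                      uniformly from the full catalog
    Row i of pos_item_emb is the positive for query i; the other B-1 rows act
    as in-batch negatives, and all B' uniform items are shared extra negatives.
    """
    batch_logits = query_emb @ pos_item_emb.T          # [B, B]
    uniform_logits = query_emb @ uniform_item_emb.T    # [B, B']
    logits = np.concatenate([batch_logits, uniform_logits], axis=1)

    # Softmax cross-entropy where the label for row i is column i.
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    labels = np.arange(query_emb.shape[0])
    return -log_probs[labels, labels].mean()

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
B, B_prime, d = 4, 8, 16
print(mns_loss(rng.normal(size=(B, d)),
               rng.normal(size=(B, d)),
               rng.normal(size=(B_prime, d))))
```
Mixing in uniformly sampled negatives counteracts the tendency of purely in-batch negatives to over-represent popular items, which is the selection-bias issue the abstract refers to.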
[2] Xiaolan Wang, Xin Luna Dong,
Yang Li and Alexandra Meliou, "
MIDAS: Finding the Right Web Sources to Fill Knowledge Gaps",
Proc. of the 35th IEEE International Conference on Data Engineering (ICDE'19), Macau SAR, China, April 2019.
[Abstract] [Paper] [Citations]
Knowledge bases, massive collections of facts (RDF triples) on diverse topics, support vital modern applications. However, existing knowledge bases contain very little data compared to the wealth of information on the Web. This is because the industry standard in knowledge base creation and augmentation suffers from a serious bottleneck: it relies on domain experts to identify appropriate web sources to extract data from. Efforts to fully automate knowledge extraction have failed to improve this standard: the automated systems are able to retrieve much more data from a broader range of sources, but they suffer from very low precision and recall. As a result, these large-scale extractions remain unexploited. In this paper, we present MIDAS, a system that harnesses the results of automated knowledge extraction pipelines to repair the bottleneck in industrial knowledge creation and augmentation processes. MIDAS automates the suggestion of good-quality web sources and describes what to extract from them with respect to augmenting an existing knowledge base. We make three major contributions. First, we introduce a novel concept, web source slices, to describe the contents of a web source. Second, we define a profit function to quantify the value of a web source slice with respect to augmenting an existing knowledge base. Third, we develop effective and highly scalable algorithms to derive high-profit web source slices. We demonstrate that MIDAS produces high-profit results and significantly outperforms the baselines on both real-world and synthetic datasets.
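The profit function mentioned in the abstract can be pictured with a toy computation: a slice is worth what it would newly add to the knowledge base, discounted by the effort of extracting it. The exact definition below (set difference minus a per-page cost) is an illustrative stand-in, not MIDAS's actual formulation, and the triples are fabricated.
```python
# Illustrative (not MIDAS's actual) profit computation for a "web source slice":
# value of the new facts the slice would add, minus a simple extraction cost.
def slice_profit(slice_triples, kb_triples, cost_per_page, num_pages):
    """slice_triples: (subject, predicate, object) triples extractable from the
    pages the slice covers; kb_triples: triples already in the knowledge base."""
    gain = len(slice_triples - kb_triples)     # only triples not yet in the KB count
    cost = cost_per_page * num_pages
    return gain - cost

kb = {("Aurelia_aurita", "phylum", "Cnidaria")}
candidate = {("Aurelia_aurita", "phylum", "Cnidaria"),
             ("Aurelia_aurita", "habitat", "coastal_waters"),
             ("Aurelia_aurita", "diet", "zooplankton")}
print(slice_profit(candidate, kb, cost_per_page=0.5, num_pages=2))  # 2 new facts - 1.0 cost
```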
[3] David Gold, Takeo Katsuki,
Yang Li, Xifeng Yan, Michael Regulski, David Ibberson, Thomas Holstein, Robert Steele, David Jacobs, and Ralph Greenspan, "
The Genome of the Jellyfish Aurelia and the Evolution of Animal Complexity",
Nature Ecology and Evolution, 2018.
[Abstract] [Paper] [Citations]
We present the genome of the moon jellyfish Aurelia, the first genome from a cnidarian with a medusa life stage. Our analyses suggest that gene gain and loss in Aurelia are comparable to what has been found in its morphologically simpler relatives, the anthozoan corals and sea anemones. RNA-Seq analysis does not support the hypothesis that taxonomically restricted (orphan) genes play an oversized role in the development of the medusa stage. Instead, genes broadly conserved across animals and eukaryotes play comparable roles throughout the life cycle. All life stages of Aurelia are significantly enriched in the expression of genes that are hypothesized to interact in protein networks found in bilaterian animals. Collectively, our results suggest that increased life cycle complexity in Aurelia does not correlate with an increased number of genes. This leads to two possible evolutionary scenarios: either medusozoans evolved their complex medusa life stage (with concomitant shifts into new ecological niches) primarily by re-working genetic pathways already present in the last common ancestor of cnidarians, or the earliest cnidarians had a medusa life stage, which was subsequently lost in the anthozoans. While we favor the former hypothesis, the latter is consistent with growing evidence that many of the earliest animals were more physically complex than previously hypothesized.
[4]
Yang Li, Bo Zhao, Ariel Fuxman and Fangbo Tao, "
Guess Me If You Can: Acronym Disambiguation for Enterprises",
Proc. of the 56th Annual Meeting of the Association for Computational Linguistics (ACL'18), Melbourne, Australia, July 2018.
[Abstract] [Paper] [Video] [Citations]
Acronyms are abbreviations formed from the initial components of words or phrases. In enterprises, people often use acronyms to make communication more efficient. However, acronyms can be difficult to understand for people who are not familiar with the subject matter (e.g., new employees), thereby affecting productivity. To alleviate this problem, we study how to automatically resolve the true meaning of an acronym in a given context. Acronym disambiguation for enterprises is challenging for several reasons. First, acronyms may be highly ambiguous, since an acronym used in the enterprise could have multiple internal and external meanings. Second, there are usually no comprehensive knowledge bases, such as Wikipedia, available in enterprises. Finally, the system should be generic enough to work for any enterprise. In this work we propose an end-to-end framework to tackle all these challenges. The framework takes the enterprise corpus as input and produces a high-quality acronym disambiguation system as output. Our disambiguation models are trained via distantly supervised learning, without requiring any manually labeled training examples. Therefore, our proposed framework can be deployed to any enterprise to support high-quality acronym disambiguation. Experimental results on real-world data demonstrate the effectiveness of our system.
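A minimal sketch of how distant supervision can label training contexts without manual annotation, under assumptions not taken from the paper: wherever a known candidate expansion appears verbatim in the corpus, its surrounding words are collected as a context profile for that meaning, and a new acronym occurrence is resolved to the best-matching profile. The corpus, the expansion dictionary, and the bag-of-words scoring are made up for illustration.
```python
from collections import Counter, defaultdict

def build_context_profiles(corpus, expansions, window=10):
    """corpus: list of documents (strings); expansions: acronym -> candidate
    expansions. Returns expansion -> bag-of-words context profile, collected
    wherever an expansion appears verbatim (the distant supervision signal)."""
    profiles = defaultdict(Counter)
    for doc in corpus:
        tokens = doc.lower().split()
        for cands in expansions.values():
            for exp in cands:
                exp_tokens = exp.lower().split()
                n = len(exp_tokens)
                for i in range(len(tokens) - n + 1):
                    if tokens[i:i + n] == exp_tokens:
                        ctx = tokens[max(0, i - window):i] + tokens[i + n:i + n + window]
                        profiles[exp].update(ctx)
    return profiles

def disambiguate(acronym, context, expansions, profiles):
    """Pick the candidate expansion whose profile overlaps the query context most."""
    ctx = Counter(context.lower().split())
    return max(expansions[acronym],
               key=lambda exp: sum(min(c, profiles[exp][w]) for w, c in ctx.items()))

corpus = ["we trained a machine learning model to rank apps by predicted clicks",
          "maximum likelihood estimation of the ranking model parameters"]
expansions = {"ml": ["machine learning", "maximum likelihood"]}
profiles = build_context_profiles(corpus, expansions)
print(disambiguate("ml", "we trained the ML ranking model on clicks", expansions, profiles))
```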
[5] Furong Li, Luna Dong, Anno Langen and
Yang Li, "
Knowledge Verification for Long-tail Verticals",
Proc. of the 43rd International Conference on Very Large Databases (VLDB'17), Munich, Germany, August 2017.
[Abstract] [Paper] [Citations]
Collecting structured knowledge for real-world entities has become a critical task for many applications. A big gap between the knowledge in existing knowledge repositories and the knowledge in the real world is the knowledge on tail verticals (i.e., less popular domains). Such knowledge, though not necessarily globally popular, can be personal hobbies to many people and thus collectively impactful. This paper studies the problem of knowledge verification for tail verticals; that is, deciding the correctness of a given triple.
Through a comprehensive experimental study we answer the following questions. 1) Can we find evidence for tail knowledge from an extensive set of sources, including knowledge bases, the web, and query logs? 2) Can we judge the correctness of the triples based on the collected evidence? 3) How can we further improve knowledge verification on tail verticals? Our empirical study suggests a new knowledge-verification framework, which we call Facty, that applies various kinds of evidence collection techniques followed by knowledge fusion. Facty can verify 50% of the (correct) tail knowledge with a precision of 84%, and it significantly outperforms state-of-the-art methods. Detailed error analysis on the obtained results suggests future research directions.
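The pipeline in the abstract, evidence collection followed by knowledge fusion, can be caricatured as accuracy-weighted voting over sources; the sources, weights, and threshold below are assumptions for the sketch and not Facty's actual fusion model.
```python
# Toy fusion step: accuracy-weighted voting over evidence sources.
def verify_triple(evidence, source_accuracy, threshold=0.5):
    """evidence: source -> bool (does this source support the triple?).
    source_accuracy: source -> estimated accuracy in [0, 1].
    Returns True if the accuracy-weighted vote in favor reaches the threshold."""
    support = sum(source_accuracy[s] for s, hit in evidence.items() if hit)
    total = sum(source_accuracy[s] for s in evidence)
    return total > 0 and support / total >= threshold

evidence = {"knowledge_bases": False,   # tail facts are often missing from KBs
            "web_text": True,
            "query_logs": True}
accuracy = {"knowledge_bases": 0.95, "web_text": 0.7, "query_logs": 0.6}
print(verify_triple(evidence, accuracy))   # True: web and query-log evidence outweigh the miss
```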
[6]
Yang Li, Shulong Tan, Huan Sun, Jiawei Han, Dan Roth and Xifeng Yan, "
Entity Disambiguation with Linkless Knowledge Bases",
Proc. of the 25th International World Wide Web Conference (WWW'16), Montréal, Québec, Canada, April 2016.
[Abstract] [Paper] [Slides] [Citations]
Named Entity Disambiguation is the task of disambiguating named entity
mentions in natural language text and linking them to their corresponding
entries in a reference knowledge base (e.g. Wikipedia). Such disambiguation
can help add semantics to plain text and distinguish homonymous entities.
Previous research has tackled this problem by making use of two types of
context-aware features derived from the reference knowledge base, namely,
the context similarity and the semantic relatedness. Both features heavily
rely on the cross-document hyperlinks within the knowledge base: the
semantic relatedness feature is directly measured via those hyperlinks,
while the context similarity feature implicitly makes use of those
hyperlinks to expand entity candidates' descriptions and then compares
them against the query context. Unfortunately, cross-document hyperlinks
are rarely available in many closed-domain knowledge bases, and it is very
expensive to manually add such links. Therefore, few algorithms can work
well on linkless knowledge bases. In this work, we propose the challenging
Named Entity Disambiguation with Linkless Knowledge Bases (LNED) problem
and tackle it by leveraging the useful disambiguation evidences scattered
across the reference knowledge base. We propose a generative model to
automatically mine such evidences out of noisy information. The mined
evidences can mimic the role of the missing links and help boost the LNED
performance. Experimental results show that our proposed method substantially
improves the disambiguation accuracy over the baseline approaches.
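A small sketch of the context-similarity feature discussed above: rank candidate entities by the TF-IDF cosine similarity between the mention's context and each candidate's description; in the linkless setting, the mined evidences would be appended to those descriptions to stand in for the missing hyperlinks. The entities and texts are invented, and this is only one feature, not the paper's generative model.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(mention_context, candidate_descriptions):
    """candidate_descriptions: entity name -> description text (in LNED, the
    description would be augmented with mined evidences). Returns candidates
    sorted by TF-IDF cosine similarity to the mention's context."""
    names = list(candidate_descriptions)
    docs = [mention_context] + [candidate_descriptions[n] for n in names]
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    return sorted(zip(names, sims), key=lambda pair: -pair[1])

candidates = {
    "Michael Jordan (basketball player)": "American basketball player who won six NBA titles",
    "Michael I. Jordan (professor)": "machine learning researcher known for graphical models",
}
print(rank_candidates("Jordan is a leading researcher in machine learning and graphical models",
                      candidates))
```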
[7]
Yang Li, Peter Clark, "
Answering Elementary Science Questions by Constructing Coherent Scenes Using Background Knowledge",
Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP'15), Lisboa, Portugal, September 2015.
[Abstract] [Paper] [Slides] [Video] [Citations]
Much of what we understand from text is not explicitly stated. Rather,
the reader uses his/her knowledge to fill in gaps and create a
coherent mental picture, or "scene", depicting what the text
appears to convey. The scene constitutes an understanding of
the text, and can be used to answer questions that go beyond the text.
Our goal is to answer elementary science questions, where this
requirement is pervasive: a question will often give a partial
description of a scene and ask the student about implicit information.
We show that by using a simple "knowledge graph" representation
of the question, we can leverage several large-scale linguistic
resources to provide missing background knowledge, somewhat alleviating
the knowledge bottleneck in previous approaches.
The coherence of the best resulting scene,
built from a question/answer-candidate pair, reflects the confidence
that the answer candidate is correct, and thus can
be used to answer multiple choice questions. Our experiments show that
this approach outperforms competitive algorithms
on several datasets tested. The significance of this work is thus to show
that a simple "knowledge graph" representation allows a version of
"interpretation as scene construction" to be made viable.
[8] Fangbo Tao, Bo Zhao, Ariel Fuxman,
Yang Li and Jiawei Han, "
Leveraging Pattern Semantics for Extracting Entities in Enterprises",
Proc. of the 24th International World Wide Web Conference (WWW'15), Florence, Italy, May 2015.
[Abstract] [Paper] [Citations]
Entity Extraction is a process of identifying meaningful
entities from text documents. In enterprises, extracting
entities improves enterprise efficiency by facilitating
numerous applications, including search, recommendation,
etc. However, the problem is particularly challenging in
enterprise domains for several reasons. First, the lack
of redundancy of enterprise entities makes previous web-based
systems like NELL and OpenIE ineffective: using only
high-precision/low-recall patterns, as those systems do, would
miss the majority of sparse enterprise entities, while using
more low-precision patterns in such a sparse setting introduces
drastic noise. Second, semantic drift is common in enterprises
("Blue" refers to "Windows Blue"), such that public signals from
the web cannot be directly applied to entities. Moreover, many
internal entities never appear on the web.
Sparse internal signals are the only source for discovering
them. To address these challenges, we propose an end-to-end
framework for extracting entities in enterprises, taking
an enterprise corpus and limited seeds as input to generate
a high-quality entity collection as output. We introduce the
novel concept of Semantic Pattern Graph to leverage public
signals to understand the underlying semantics of lexical
patterns, reinforce pattern evaluation using mined semantics,
and yield more accurate and complete entities. Experiments
on Microsoft enterprise data show the effectiveness of
our approach.
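The lexical-pattern side of such a framework can be sketched as classic pattern bootstrapping: harvest "left-word ___ right-word" patterns around seed entities, keep patterns supported by several distinct seeds, and use them to extract new candidates. The seeds and sentences below are invented, and the Semantic Pattern Graph itself (which uses public signals to judge pattern semantics) is not modeled here.
```python
import re
from collections import defaultdict

def extract_entities(sentences, seeds, min_seed_support=2):
    """Bootstrap new entity candidates from one-word-context patterns."""
    pattern_hits = defaultdict(set)          # pattern -> seeds it matched
    for s in sentences:
        for seed in seeds:
            for m in re.finditer(re.escape(seed), s):
                left = s[:m.start()].split()[-1:] or ["<s>"]
                right = s[m.end():].split()[:1] or ["</s>"]
                pattern_hits[(left[0], right[0])].add(seed)

    # Keep only patterns confirmed by enough distinct seeds.
    good_patterns = {p for p, hits in pattern_hits.items()
                     if len(hits) >= min_seed_support}

    candidates = set()
    for s in sentences:
        tokens = s.split()
        for i, tok in enumerate(tokens):
            left = tokens[i - 1] if i > 0 else "<s>"
            right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
            if (left, right) in good_patterns and tok not in seeds:
                candidates.add(tok)
    return candidates

sentences = ["please deploy AlphaTool to the staging cluster",
             "please deploy BetaService to the staging cluster",
             "please deploy GammaPortal to the staging cluster"]
print(extract_entities(sentences, seeds={"AlphaTool", "BetaService"}))   # {'GammaPortal'}
```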
[9] Huan Sun, Mudhakar Srivatsa, Shulong Tan,
Yang Li, Lance Kaplan, Shu Tao and Xifeng Yan, "
Analyzing Expert Behaviors in Collaborative Networks",
Proc. of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'14), New York, NY, USA, August 2014.
[Abstract] [Paper] [Slides] [Poster] [Citations]
Collaborative networks are composed of experts who cooperate
with each other to complete specific tasks, such as
resolving problems reported by customers. A task is posted
and subsequently routed in the network from an expert to
another until being resolved. When an expert cannot solve
a task, his routing decision (i.e., where to transfer a task) is
critical since it can significantly affect the completion time
of a task. In this work, we attempt to deduce the cognitive
process of task routing, and model the decision making of
experts as a generative process where a routing decision is
made based on mixed routing patterns.
In particular, we observe an interesting phenomenon that
an expert tends to transfer a task to someone whose knowledge
is neither too similar to nor too different from his own.
Based on this observation, an expertise difference based routing
pattern is developed. We formalize multiple routing
patterns by taking into account both rational and random
analysis of tasks, and present a generative model to combine
them. For a held-out set of tasks, our model not only
explains their real routing sequences very well, but also accurately
predicts their completion time. Under three different
quality measures, our method significantly outperforms all
the alternatives with more than 75% accuracy gain. In
practice, with the help of our model, hypotheses on how to
improve a collaborative network can be tested quickly and
reliably, thereby significantly easing performance improvement
of collaborative networks.
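The routing behavior observed in the paper, preferring colleagues whose expertise is neither too close nor too far, can be pictured with a toy mixture of an expertise-difference pattern and uniform random routing. The Gaussian shape, its parameters, the mixing weight, and the expert names are assumptions for illustration rather than the paper's generative model.
```python
import math

def routing_distribution(expertise_gaps, preferred_gap=0.4, sigma=0.15, mix=0.8):
    """expertise_gaps: candidate expert -> how different their expertise is
    from the current expert's (a scalar in [0, 1] for simplicity).
    Returns expert -> probability of receiving the transferred task."""
    def pattern_score(gap):                      # peaks at an intermediate gap
        return math.exp(-((gap - preferred_gap) ** 2) / (2 * sigma ** 2))

    scores = {e: pattern_score(g) for e, g in expertise_gaps.items()}
    z = sum(scores.values())
    n = len(expertise_gaps)
    # Mixture of the expertise-difference pattern and uniform random routing.
    return {e: mix * scores[e] / z + (1 - mix) / n for e in expertise_gaps}

gaps = {"alice": 0.05,   # knows almost exactly what I know
        "bob": 0.40,     # moderately different expertise
        "carol": 0.95}   # works on something entirely different
print(routing_distribution(gaps))   # bob receives most of the probability mass
```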
[10]
Yang Li, Chi Wang, Fangqiu Han, Jiawei Han, Dan Roth and Xifeng Yan, "
Mining Evidences for Named Entity Disambiguation",
Proc. of the 19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'13), Chicago, IL, USA, August 2013.
[Abstract] [Paper] [Poster] [Citations]
Named entity disambiguation is the task of disambiguating
named entity mentions in natural language text and linking
them to their corresponding entries in a knowledge base
such as Wikipedia. Such disambiguation can help enhance
readability and add semantics to plain text. It is also a
central step in constructing a high-quality information network
or knowledge graph from unstructured text. Previous
research has tackled this problem by making use of various
textual and structural features from a knowledge base.
Most of the proposed algorithms assume that a knowledge
base can provide enough explicit and useful information to
help disambiguate a mention to the right entity. However,
the existing knowledge bases are rarely complete (and likely
never will be), leading to poor performance on short queries
whose contexts are not well known. In such cases, we need to collect
additional evidences scattered across internal and external
corpora to augment the knowledge bases and enhance their
disambiguation power. In this work, we propose a generative
model and an incremental algorithm to automatically
mine useful evidences across documents. With a specific
modeling of "background topic" and "unknown entities", our
model is able to harvest useful evidences out of noisy information.
Experimental results show that our proposed
method outperforms the state-of-the-art approaches significantly:
boosting the disambiguation accuracy from 43%
(baseline) to 86% on short queries derived from tweets.
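One way to picture the incremental part of the approach, with plain word overlap standing in for the paper's generative model: resolve the mentions that can already be disambiguated confidently, fold their contexts into the winning entity's evidence, then revisit the harder mentions with the enriched profiles. The entities, profiles, and margin threshold below are made up.
```python
from collections import Counter

def overlap(ctx, profile):
    return sum(min(c, profile[w]) for w, c in Counter(ctx).items())

def incremental_disambiguate(mentions, profiles, margin=2, rounds=3):
    """mentions: list of (mention_id, context_tokens); profiles: entity -> Counter.
    Returns mention_id -> entity."""
    resolved = {}
    for _ in range(rounds):
        for mid, ctx in mentions:
            if mid in resolved:
                continue
            scored = sorted(((overlap(ctx, p), e) for e, p in profiles.items()),
                            reverse=True)
            best, runner_up = scored[0], scored[1]
            if best[0] - runner_up[0] >= margin:      # confident enough
                resolved[mid] = best[1]
                profiles[best[1]].update(ctx)         # mined evidence enriches the profile
    return resolved

profiles = {"apple_inc": Counter("iphone launch cupertino keynote".split()),
            "apple_fruit": Counter("orchard harvest cider pie".split())}
mentions = [("m1", "apple keynote unveiled the new iphone in cupertino".split()),
            ("m2", "apple shares rose after the keynote".split())]
print(incremental_disambiguate(mentions, profiles))
```
The second mention is only resolvable after the first, confident one has enriched the entity's evidence, which is the intuition behind processing mentions incrementally.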
[11]
Yang Li, Pegah Kamousi, Fangqiu Han, Shengqi Yang, Xifeng Yan and Subhash Suri, "
Memory Efficient Minimum Substring Partitioning",
Proc. of the 39th International Conference on Very Large Databases (VLDB'13), Riva del Garda, Trento, Italy, August 2013.
[Abstract] [Paper] [Slides] [Press] [Citations]
Massively parallel DNA sequencing technologies are revolutionizing
genomics research. Billions of short reads generated
at low cost can be assembled to reconstruct
whole genomes. Unfortunately, the large memory footprint
of the existing de novo assembly algorithms makes it challenging
to get the assembly done for higher eukaryotes like
mammals. In this work, we investigate the memory issue of
constructing the de Bruijn graph, a core task in leading assembly
algorithms, which often consumes several hundred gigabytes of
memory for large genomes. We propose a disk-based
partition method, called Minimum Substring Partitioning
(MSP), to complete the task using less than 10 gigabytes of
memory, without runtime slowdown. MSP breaks the short
reads into multiple small disjoint partitions so that each
partition can be loaded into memory, processed individually
and later merged with others to form a de Bruijn graph.
By leveraging the overlaps among the k-mers (substrings of
length k), MSP achieves an astonishing compression ratio: the
total size of partitions is reduced from Θ(kn) to Θ(n), where
n is the size of the short read database, and k is the length
of a k-mer. Experimental results show that our method can
build de Bruijn graphs using a commodity computer for any
large-volume sequence dataset.
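The core compression idea is easy to sketch: each k-mer belongs to the partition named by its lexicographically smallest length-p substring, and a maximal run of consecutive k-mers from a read that share that minimum substring is written out once as a single superstring rather than as separate k-mers, which is where the Θ(kn)-to-Θ(n) reduction comes from. The sketch below keeps only this routine and omits partition files, hashing of partition ids, and the later graph-merging step.
```python
def min_substring(s, p):
    """Lexicographically smallest length-p substring of s."""
    return min(s[i:i + p] for i in range(len(s) - p + 1))

def msp_partition(read, k, p):
    """Yield (partition_id, superstring) pairs covering all k-mers of the read.
    Consecutive k-mers sharing the same minimum substring are emitted as one
    compressed record instead of k characters per k-mer."""
    start = 0                                   # first k-mer of the current run
    current = min_substring(read[0:k], p)
    for i in range(1, len(read) - k + 1):
        m = min_substring(read[i:i + k], p)
        if m != current:                        # run ends: emit one record
            yield current, read[start:i - 1 + k]
            start, current = i, m
    yield current, read[start:]

read = "ACGTACGGTACCGTTAACG"
for pid, chunk in msp_partition(read, k=8, p=4):
    print(pid, chunk)
```
Each partition is later loaded into memory on its own, its k-mers are enumerated from the superstrings, and the per-partition de Bruijn graph pieces are merged.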
[12] Shulong Tan,
Yang Li, Huan Sun, Ziyu Guan, Xifeng Yan, Jiajun Bu, Chun Chen and Xiaofei He, "
Interpreting Public Sentiment Variations on Twitter",
IEEE Transactions on Knowledge and Data Engineering (TKDE), August 2013.
[Abstract] [Paper] [Citations]
Millions of users share their opinions on Twitter, making it a valuable platform for tracking and analyzing public
sentiment. Such tracking and analysis can provide critical information for decision making in various domains. Therefore it has
attracted attention in both academia and industry. Previous research mainly focused on modeling and tracking public sentiment.
In this work, we move one step further to interpret sentiment variations. We observed that emerging topics (named foreground
topics) within the sentiment variation periods are highly related to the genuine reasons behind the variations. Based on this
observation, we propose a Latent Dirichlet Allocation (LDA) based model, Foreground and Background LDA (FB-LDA), to distill
foreground topics and filter out longstanding background topics. These foreground topics can give potential interpretations of
the sentiment variations. To further enhance the readability of the mined reasons, we select the most representative tweets for
foreground topics and develop another generative model called Reason Candidate and Background LDA (RCB-LDA) to rank
them with respect to their "popularity" within the variation period. Experimental results show that our methods can effectively find
foreground topics and rank reason candidates. The proposed models can also be applied to other tasks such as finding topic
differences between two sets of documents.
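A far simpler heuristic than FB-LDA, shown only to convey the intuition of separating foreground from background: words that are over-represented in the sentiment-variation period relative to the preceding background period are candidate explanations. The tweets and the lift score are made up, and none of the LDA-based generative machinery is reproduced.
```python
from collections import Counter

def foreground_terms(foreground_tweets, background_tweets, top_n=3, smoothing=1.0):
    """Rank words by how over-represented they are in the variation period."""
    fg = Counter(w for t in foreground_tweets for w in t.lower().split())
    bg = Counter(w for t in background_tweets for w in t.lower().split())
    fg_total, bg_total = sum(fg.values()), sum(bg.values())

    def lift(w):   # smoothed foreground-to-background frequency ratio
        return ((fg[w] + smoothing) / (bg[w] + smoothing)) * (bg_total / fg_total)

    return sorted(fg, key=lift, reverse=True)[:top_n]

background = ["the traffic was terrible again today", "monday morning traffic downtown"]
foreground = ["the refinery fire smells awful", "smoke from the refinery fire everywhere"]
print(foreground_terms(foreground, background))   # e.g. ['refinery', 'fire', ...]
```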
[13]
Yang Li and Xifeng Yan, "
MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting",
arXiv.org preprint.
[Abstract] [Paper] [Software] [Citations]
A major challenge in next-generation genome sequencing (NGS) is to assemble massive overlapping short reads that are
randomly sampled from DNA fragments. To complete the assembly,
one needs to finish a fundamental task in many leading assembly
algorithms: counting the number of occurrences of k-mers (length-k
substrings in sequences). The counting results are critical for
many components in assembly (e.g., variant detection and read
error correction). For large genomes, the k-mer counting task can
easily consume a huge amount of memory, making it impossible for
large-scale parallel assembly on commodity servers.
In this paper, we develop MSPKmerCounter, a disk-based
approach, to efficiently perform k-mer counting for large genomes
using a small amount of memory. Our approach is based on a novel
technique called Minimum Substring Partitioning (MSP). MSP breaks
short reads into multiple disjoint partitions such that each partition
can be loaded into memory and processed individually. By leveraging
the overlaps among the k-mers derived from the same short
read, MSP can achieve an astonishing compression ratio so that the
I/O cost can be significantly reduced. For the task of k-mer counting,
MSPKmerCounter offers a very fast and memory-efficient solution.
Experimental results on large real-life short-read datasets
demonstrate that MSPKmerCounter can achieve better overall performance
than state-of-the-art k-mer counting approaches.
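The counting stage that would follow MSP-style partitioning can be sketched as follows: because every k-mer is routed to exactly one partition (the one named by its minimum length-p substring), each partition can be counted independently with an in-memory hash table and the results combined as disjoint unions. The file I/O and compact binary encodings of the real tool are omitted, and the data is a toy example.
```python
from collections import Counter, defaultdict

def min_substring(s, p):
    return min(s[i:i + p] for i in range(len(s) - p + 1))

def count_kmers(reads, k, p):
    partitions = defaultdict(list)              # partition id -> k-mers routed to it
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            partitions[min_substring(kmer, p)].append(kmer)

    counts = Counter()                          # on disk, this would be one pass per partition
    for kmers in partitions.values():
        counts.update(Counter(kmers))
    return counts

reads = ["ACGTACGTAC", "CGTACGTACG"]
print(count_kmers(reads, k=5, p=3).most_common(3))
```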