EnCORE

The Institute for Emerging CORE Methods in Data Science

Applications are now open for EnCORE Visitors Program

NEWS

1st Edition of EnCORE Magazine

Introducing the first edition of the EnCORE Institute magazine, celebrating achievements in advancing data science.
Learn More

High School Visit Day

The EnCORE Institute is hosting a High School Visit Day, offering students an opportunity to explore cutting-edge research in computer science, astronomy, and quantum computing.
Learn More

2025 IEEE Computer Society W. Wallace McDowell Award

"For contributions to complexity theory, pseudorandomness, communication complexity, and for establishing new connections between computer science, combinatorics, and probability theory."
Learn More

Workshop on Theoretical Perspectives on LLMs

EnCORE presents a 3-day workshop on Theoretical Perspectives on LLMs to discuss progress, set goals, foster collaborations, and explore promising research directions.
Learn More

Workshop on Meta-Complexity as the Bridge Between Learning and Cryptography

EnCORE presents a 5-day workshop on meta-complexity and cryptography to discuss progress, set goals, foster collaborations, and explore promising research directions.
Learn More

Workshop on Defining Holistic Private Data Science for Practice

EnCORE is hosting a 3-day workshop focused on applying privacy-preserving tools in data science to real-world scenarios.
Learn More

Professor Rajiv Gandhi featured in NSF CISE Newsletter "Faces of CISE"

Professor Rajiv Gandhi is making waves in theoretical computer science, specializing in approximation and randomized algorithms.
Learn More

NSF Site Visit 2024

Site visits are crucial for ensuring that organizations manage NSF awards effectively, comply with regulations, and uphold grant terms, protecting the integrity and accountability of research funding.
Learn More

EnCORE Industry Day 2024

Industry Day showcased cutting-edge research in data science, machine learning, and AI, with expert contributions and networking opportunities from top institutions.
Learn More

Barna Saha Celebrated As Harry E. Gruber Endowed Chair at UC San Diego

Barna Saha, a computer scientist at UCSD and the director of EnCORE, will be honored as the Harry E. Gruber Endowed Chair on April 30, 2024.
Learn More

Raghu Meka's work featured in Quanta as Biggest Breakthroughs in Math

Raghu Meka's and Zander Kelley's work on three-term arithmetic progressions was selected as one of the biggest mathematical breakthroughs of 2023 by Quanta Magazine.
Learn More

Rising Stars in Data Science 2023

Rising Stars in Data Science was hosted November 13–14, 2023, by the University of Chicago with the University of California, San Diego – EnCORE and HDSI.
Learn More

Empowering Young Minds: Robotic Arts and Craft Workshop for Elementary School Kids

Saura Naderi of HDSI Lab 3.0 and EnCORE brought together a group of elementary school students for an immersive workshop in robotics and creativity on Saturday, December 2.
Learn More

EnCORE: Call for Extended Research Visits & Workshops

EnCORE welcomes proposals for extended research visits for 2024–2025! Submit your proposals today.
Learn More

Surprise Computer Science Proof Stuns Mathematicians

Zander Kelley and Raghu Meka make a mathematical breakthrough on one of the biggest unsolved problems in the field of additive combinatorics.
Learn More

Understanding Self Distillation in the Presence of Label Noise

This study explores the phenomenon of the "student" model outperforming the "teacher" model in self-distillation (SD) in the presence of label noise, while formalizing the limitations of SD.
Learn More

U.S. census data vulnerable to attack without enhanced privacy measures

A new PNAS study has revealed that the statistics published by the U.S. Census Bureau can be reverse-engineered, revealing private information about individual respondents.
Learn More

A Mathematical Breakthrough by the EnCORE Faculty Raghu Meka

Absolutely Sensational Morning News – Zander Kelley and Raghu Meka proved Behrend-type bounds for 3APs.
Learn More

Expand Foundations of Data Science

To learn more contact: barnas@ucsd.edu with the subject line Expand TRIPODS
Learn More

EnCORE is hiring for a Postdoctoral Position
Learn More

New $10M NSF-Funded Institute Will Get to the CORE of Data Science

EnCORE will join three other NSF-funded institutes in the country dedicated to the exploration of data science through the NSF’s Transdisciplinary Research in Principles of Data Science Phase II (TRIPODS) program.
Learn More

UCLA Part of New $10M NSF Data Science Research Center, EnCORE

UCLA will be part of a multi-institutional research center funded by the National Science Foundation (NSF).
Learn More

$10M NSF-Funded Institute Will Get to the CORE of Data Science

EnCORE will tackle important problems in the foundations of data science.
Learn More

NSF Awards

New NSF awards will advance theoretical foundations of data science research through interdisciplinary collaborations
Learn More

Data Science Theory

Join our YouTube Channel to learn more about the Foundations of Data Science series!
Learn More

Computation+Compression

The Computation+Compression workshop received an overwhelming response; registration had to be closed at 500 registered participants.

Computation+Compression

Recordings of all talks from the Computation+Compression workshop are now available.
Visit Website

Masters of Data Science Online

Applications for the Master of Data Science Online program are now open for Fall 2022 at HDSI.
Learn More

New Book Announcement:

Computational Topology for Data Analysis. T.K. Dey and Y. Wang. Forthcoming book, Cambridge University Press, 2021 [PDF version available].
Learn More

General Chair of ICML

Chaudhuri will be the General Chair of ICML 2022.

First cohort of Ph.D. students in Data Science

HDSI will be welcoming the first cohort of Ph.D. students in Data Science in Fall 2022. Apply now!
Learn More

Congratulations to Professor Yusu Wang

Congratulations to Professor Yusu Wang on her recognition as an ACM Distinguished Member for outstanding scientific contributions to computing!
Learn More

General Chair of STOC 2023

Saha will be the General Chair of STOC 2023.

I am Data Science

Mazumdar featured in the I am Data Science series.

Learn More

The EnCORE Workshop: New Horizons for Adaptive Robustness focuses on foundational challenges and emerging opportunities in the design and analysis of algorithms and systems that remain reliable under adaptive and adversarial conditions. The goal of the workshop is to bring together researchers working on different aspects of adaptive robustness to share recent developments, explore emerging connections across areas such as streaming, data structures, differential privacy, and cryptography, and foster new collaborations.

Read More >>

We’re excited to invite you to the upcoming EnCORE Collaboration Workshop, taking place on Monday, September 8th, 2025, from 10:00 AM to 4:25 PM (PT). The workshop will be held in person at the EnCORE space (UC San Diego, Atkinson Hall, 4th Floor) and will also be accessible via Zoom for remote participants.

Read More >>

We’re thrilled to kick off the Advanced FinDS program, created to help high school students deepen their understanding of data science by building a strong foundation in mathematics. This program introduces key concepts in data science in an engaging and accessible way, while also strengthening the critical thinking and problem-solving skills that are essential for success in STEM fields.

Read More >>

EnCORE is pleased to announce a summer camp for students who are interested in enhancing their understanding of algorithms. This camp aims to provide a concise overview of the algorithmic knowledge students are expected to learn at the University of California, and more broadly, in a standard four-year college curriculum. This knowledge is key to a successful academic journey exploring computing-related majors at four-year colleges.

Read More >>

Join us for an EnCORE presentation on “Revisiting PAC Learning” on Monday, April 7, 2025 at 11AM (PT). The talk will be held at UC San Diego, Atkinson Hall, 4th Floor – EnCORE space, and will also be accessible via Zoom.

Read More >>

Workshop on Theoretical Perspectives on LLMs

Join EnCORE from March 3–5, 2025, for a three-day workshop on Theoretical Perspectives on Large Language Models (LLMs), exploring the foundational theories and frameworks underlying the architecture, learning mechanisms, and capabilities of large language models. The workshop brings together researchers to discuss recent advancements, theoretical challenges, and emerging concepts in understanding and predicting LLM behavior, efficiency, generalization, and alignment with human intent.

Read More >>

EnCORE 2-Day Tutorial

Join us for an EnCORE Tutorial on “Forster-Warmuth Counterfactual Regression: A Unified Learning Approach” and “Doubly Robust and Efficient Calibration of Prediction Sets for Censored Time-to-Event Outcomes” from February 24–25, 2025, at 11 AM (PT). The talks will be held at UC San Diego, Atkinson Hall, 4th Floor – EnCORE space, and will also be accessible via Zoom.

Read More >>

Towards Interpretable Deep Learning

Join us for an EnCORE presentation on “Towards Interpretable Deep Learning” on Thursday, February 13, 2025, at 10 AM (PT). The talk will be held at UC San Diego, Atkinson Hall, 4th Floor – EnCORE space, and will also be accessible via Zoom. 

Read More >>

Workshop on Meta-Complexity as the Bridge Between Learning and Cryptography

Join EnCORE from February 3–7, 2025, for a five-day workshop that explores the fascinating intersection of meta-complexity, cryptography, and computational learning, where experts will share breakthroughs, set future goals, and spark collaborations. With research talks, tutorials, and open problem sessions, this invitation-only event promises impactful discussions and innovative insights.

Read More >> 

Workshop on Defining Holistic Private Data Science for Practice

Join EnCORE from January 8–10, 2025, for a three-day workshop exploring the intersection of theoretical advancements and practical applications of privacy-preserving tools in data science. Discussions will cover privacy needs, cost evaluations, assessment methods, and strategies for real-world implementation of privacy technologies.

Read More >> 

Collaboration Workshop

Join us for the EnCORE Collaboration Workshop! Participants will present their work, exchange ideas, and explore collaborative opportunities. This mandatory workshop is scheduled for December 4, 2024, from 9:30 AM to 11:00 AM (PT) in the EnCORE space and via Zoom.

Read More >> 

Drifting Games and Online Learning

Join us on Thursday, November 21, 2024, at 11 AM (PT) for a talk on “Drifting Games and Online Learning” presented by Yoav Freund. The talk will be held at UC San Diego, Atkinson Hall, 4th Floor – EnCORE space, and will also be accessible via Zoom.

Read More >> 

How Do Algorithms See the World? Computing and High-Stakes Decision-Making

Join us on Friday, November 8, 2024 at 11 AM (PT) for EnCORE’s first Public Lecture presented by Jon Kleinberg, “How Do Algorithms See the World? Computing and High-Stakes Decision-Making.” The talk will be held at UC San Diego, HDSI Multipurpose Room, and will also be accessible via Zoom. 

Read More >> 

Deep Learning: a Non-parametric Statistical Viewpoint

Join us on Thursday, October 24, 2024, at 10 AM (PT) for a presentation on “Deep Learning: a Non-parametric Statistical Viewpoint.” The talk will be held at UC San Diego and will also be accessible via Zoom.

Read More >>

EnCORE Student Seminar

Join us on Monday, October 28, 2024, at 10 AM (PT) for a student seminar featuring Chhavi Yadav (UCSD) and Natalie Collina (UPenn). The speakers will present “FairProof: Confidential and Certifiable Fairness for Neural Networks” and “Tractable Agreement Protocols via Calibration.” The seminar will be held via Zoom.

Read More >>

Learning from Noisy Labels and Imperfect Teachers 

Datasets used in machine learning and statistics are huge and often imperfect, e.g., they contain corrupted data, examples with wrong labels, or hidden biases. Existing approaches often produce unreliable results when the datasets are corrupted, are computationally inefficient, or come without any theoretical/provable performance guarantees. In this talk, I will discuss the design of learning algorithms that are computationally efficient and provably reliable, and then present an application in knowledge distillation.

Read More >>

NSF Site Visit

Site visits are one of several monitoring tools that provide oversight of NSF’s research award portfolio. They assess an organization’s capacity for award administration; its compliance with administrative regulations, public policy requirements, and award terms and conditions, including those contained in the NSF program announcement/solicitation and grant or cooperative agreement; and the extent to which the organization maintains a control environment within which awards are likely to be administered in compliance with federal financial and administrative regulations and NSF agreement provisions.

Read More >>

Industry Day 

This all-day event showcased exciting research in the foundations of data science, ML systems, and AI, with contributions from leading experts in industry and academia. Attendees had the opportunity to network with entrepreneurs, industry researchers, faculty, and peers from leading institutions such as UCLA, UT Austin, and UPenn. The event offered a dynamic exploration of cutting-edge advancements and collaborative opportunities centered on machine learning and AI.

Read More >>

Reclaiming Data Agency in the Age of Ubiquitous Machine Learning

This talk introduces the concept of data agency, which empowers individuals to control how and whether their data is used in machine learning (ML) models. As ML models grow in size and require more data, individuals face risks like privacy and intellectual property violations. While many existing solutions focus on mitigating these risks through methods like differential privacy, encrypted model training, and federated learning, data agency takes a different approach by giving users the ability to know and manage their data’s use in ML systems. The talk proposes technical tools to trace or disrupt data use, addresses potential limitations, and discusses how these tools can complement current privacy protections while amplifying efforts to regulate data use in ML.

Read More >>

Rising Stars in Data Science 2024: Applications and Info Session 

The Rising Stars in Data Science workshop, hosted by the University of California San Diego in collaboration with the University of Chicago and Stanford University, aims to celebrate and accelerate the careers of exceptional data scientists at a pivotal point in their professional journey. This workshop supports the transition to roles such as postdoctoral scholar, research scientist, industry researcher, or tenure-track faculty. Over the past four years, we have proudly hosted over 130 Rising Stars from nearly 40 institutions.

Read More >>

EnCORE’s Spring 2024 Collaboration Workshop 

EnCORE students have teamed up to collaborate on various topics, such as augmented algorithms and large language models! 

This workshop will take place on Monday, June 3, 2024, at 9:00 AM (PT) in the Atkinson Hall – 4th Floor – EnCORE room at UCSD and via Zoom.

Read More >>

TEAMS: Teaching Enriched Algorithmic Topics to Middle School Students

This program offers middle school students an exciting introduction to algorithms and coding. Participants will explore the “Good, Bad, and Ugly” algorithms and learn shortest path algorithms through engaging worksheets, such as planning routes from one place to another. The session also highlights coding basics, using Google Maps as a fun, practical example to illustrate key concepts.

Read More >>
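
To give a flavor of what a first route-planning exercise can look like, here is a minimal sketch of breadth-first search on a made-up neighborhood map (the place names and map are invented for illustration; this is not the program's actual teaching material). It finds a route with the fewest road segments, the same idea a mapping app uses at its simplest:

```python
from collections import deque

def shortest_route(roads, start, goal):
    """Breadth-first search: find a route with the fewest road segments."""
    queue = deque([[start]])   # each queue entry is a partial route
    visited = {start}
    while queue:
        path = queue.popleft()
        place = path[-1]
        if place == goal:
            return path        # first time we reach the goal = shortest route
        for neighbor in roads.get(place, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None                # no route exists

# A made-up neighborhood map: place -> directly connected places.
roads = {
    "Home":    ["Park", "School"],
    "Park":    ["Home", "Library"],
    "School":  ["Home", "Library"],
    "Library": ["Park", "School", "Mall"],
    "Mall":    ["Library"],
}

print(shortest_route(roads, "Home", "Mall"))  # -> ['Home', 'Park', 'Library', 'Mall']
```

Because BFS explores routes in order of length, the first route that reaches the goal is guaranteed to use the fewest segments.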

The Puzzle of Dimensionality and Feature Learning in Neural Networks and Kernel Machines

Remarkable progress in AI has far surpassed expectations of just a few years ago. At their core, modern models, such as transformers, implement traditional statistical models — high order Markov chains. Nevertheless, it is not generally possible to estimate Markov models of that order given any possible amount of data. Therefore these methods must implicitly exploit low-dimensional structures present in data. 

Read More >>

Generalization and Stability in Interpolating Neural Networks

Neural networks are renowned for their ability to memorize datasets, often achieving near-zero training loss via gradient descent optimization. Despite this capability, they also demonstrate remarkable generalization to new data. We investigate the generalization behavior of neural networks trained with logistic loss through the lens of algorithmic stability. Our focus lies on the neural tangent regime, where network weights move a constant distance from initialization to solution to achieve minimal training loss. Our main finding reveals that under NTK-separability, optimal test loss bounds are achievable if the network width is at least poly-logarithmically large with respect to the number of training samples. 

Read More >>

Old Questions and New Directions in Theory of Clustering 

EnCORE is hosting a 3-day workshop to reinvigorate collaboration between the approximation and computational geometry communities. 

Read More >>

Deep Generative Models and Inverse Problems for Signal Reconstruction

Alex Dimakis, a professor at UT Austin and the co-director of the National AI Institute on the Foundations of Machine Learning, will be presenting the talk “Deep Generative Models and Inverse Problems for Signal Reconstruction” on April 1, 2024, as a part of the Foundations of Data Science – Virtual Talk Series.

Read More >>

Characterizing and Classifying Cell Types of the Brain: An Introduction for Computational Scientists

Join the 2-day EnCORE tutorial “Characterizing and Classifying Cell Types of the Brain: An Introduction for Computational Scientists” with Michael Hawrylycz, Ph.D., Investigator, Allen Institute. 

This tutorial will take place on March 21 – 22, 2024 at UC San Diego in the Computer Science and Engineering Building in Room 1242. There will be a Zoom option available as well. 

Read More >>

Theoretical Exploration of Foundation Model Adaptation Methods

Join EnCORE’s talk on “Theoretical Exploration of Foundation Model Adaptation Methods” with Kangwook Lee, an assistant professor in the Electrical and Computer Engineering Department and the Computer Sciences Department at the University of Wisconsin-Madison.

Read More >>

EnCORE: Call for Extended Research Visits & Workshops

EnCORE welcomes proposals for extended research visits for 2024–2025! Submit your proposals today.
Learn More

Workshop on New Horizons for Adaptive Robustness

EnCORE presents a 3-day workshop on New Horizons for Adaptive Robustness. This workshop aims to bring together researchers working on different aspects of adaptive robustness to share recent developments.
Learn More

Workshop on Fine-Grained Complexity

EnCORE presents a 5-day workshop on Fine-Grained Complexity to discuss progress, set goals, foster collaborations, and explore promising research directions.
Learn More

Workshop on Theoretical Perspectives on LLMs

EnCORE presents a 3-day workshop on Theoretical Perspectives on LLMs to discuss progress, set goals, foster collaborations, and explore promising research directions.
Learn More

Workshop on Meta-Complexity as the Bridge Between Learning and Cryptography

EnCORE presents a 5-day workshop on meta-complexity and cryptography to discuss progress, set goals, foster collaborations, and explore promising research directions.
Learn More

Workshop on Foundations of Fairness and Accountability

EnCORE is hosting a 3-day workshop on fairness, accountability, and their applications in law, AI, and biology.
Learn More

Workshop on Defining Holistic Private Data Science for Practice

EnCORE is hosting a 3-day workshop focused on applying privacy-preserving tools in data science to real-world scenarios.
Learn More

Old Questions and New Directions in Clustering

EnCORE is hosting a 3-day workshop to reinvigorate collaboration between the approximation and computational geometry communities.
Learn More

Follow us on X

Subscribe to our YouTube Channel

Follow us on LinkedIn

EnCORE: The Institute for Emerging CORE Methods in Data Science

The Institute for Emerging CORE Methods in Data Science, or EnCORE, is led by the
University of California San Diego in collaboration with the University of California, Los
Angeles; University of Pennsylvania; and The University of Texas at Austin. EnCORE brings
together scientists from multiple disciplines such as statistics, mathematics, electrical
engineering, theoretical computer science, machine learning and health science, among others.

EnCORE’s team will focus on the four CORE pillars of data science: C for complexities of
data, O for optimization, R for responsible learning, and E for education and engagement.
The institute is fostering outreach and broadening participation by building supportive environments at all levels, from K-12 to postdocs and junior faculty. The project aims to reach a wide demography of students by offering collaborative courses across its partner universities and a flexible co-mentorship plan for multidisciplinary research. To bring theoretical development into practice, EnCORE will work with industry partners and domain scientists and will forge strong connections with other NSF Harnessing the Data Revolution Institutes across the nation.

NSF’s Harnessing the Data Revolution (HDR) Big Idea is a national-scale activity to enable new modes of data-driven discovery that will allow fundamental questions to be asked and answered at the frontiers of science and engineering. HDR TRIPODS aims to bring together the electrical engineering, mathematics, statistics, and theoretical computer science communities through integrated research and training activities.

Tractable Agreement Protocols via Calibration

Natalie Collina (UPenn)

Abstract:

We give an efficient reduction through which any machine learning algorithm can be converted into an interactive protocol that can interact with a human decision maker to improve their predictions. In the interactive protocol, the machine learning model first produces a prediction, which implies a recommended action for the human decision maker. The human can update their beliefs as a function of the recommendation they received, but can also incorporate their own knowledge and observations, and so may disagree with the provided recommendation. The human then responds to the model’s recommendation by conveying either agreement with the model’s recommendation, or directional disagreement. The model then updates its state and provides a new recommendation, and the human may in turn again update their beliefs given the new information they have learned from the model. The process continues until the model and the human reach agreement.

We show that any predictive model can be efficiently turned into an agreement protocol (without reducing its accuracy) so that the number of rounds until the protocol reaches agreement is small, under forgiving, computationally tractable assumptions on the human’s decision making process. Additionally, the final decisions are guaranteed to be more accurate than either those of the model or of the human on their own.

In fact, the assumptions we make on the human would be satisfied by a Bayesian decision maker, but are a substantial relaxation that makes them algorithmically tractable and do not require perfect Bayesian rationality. In this way we generalize Aumann’s agreement theorem from requiring Bayesian rationality to requiring only tractable calibration conditions, and generalize known quantitative versions of the theorem to hold for multi-dimensional expectations and more general forms of communication.
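
As a rough numerical caricature of this back-and-forth (a deliberately simplified toy with made-up update rules, not the authors' construction or its guarantees): the model announces a prediction, the human replies with agreement or only a directional signal, and both sides update with shrinking steps until they agree.

```python
def agreement_protocol(model_pred, human_belief, tol=0.05, max_rounds=100):
    """Toy interactive loop: iterate until model and human predictions agree."""
    m, h = model_pred, human_belief
    step = abs(m - h) / 2 or tol       # initial size of the model's updates
    for rounds in range(1, max_rounds + 1):
        if abs(m - h) <= tol:          # human conveys agreement: stop
            return (m + h) / 2, rounds
        # human conveys only directional disagreement ("higher"/"lower")
        signal = "higher" if h > m else "lower"
        m += step if signal == "higher" else -step  # model updates its state
        h += (m - h) * 0.25            # human partially incorporates the model
        step /= 2                      # shrink updates so the process settles
    return (m + h) / 2, max_rounds

consensus, rounds = agreement_protocol(0.2, 0.8)
print(rounds)   # agreement is reached after only a few rounds
```

The point of the toy is only the structure of the communication: the human never reveals their belief, only agree/higher/lower, yet the exchange converges quickly because each round shrinks the disagreement.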

FairProof : Confidential and Certifiable Fairness for Neural Networks

Chhavi Yadav (UCSD)

Abstract:

Machine learning models are increasingly used in societal applications, yet legal and privacy concerns demand that they very often be kept confidential. Consequently, there is growing distrust about the fairness properties of these models in the minds of consumers, who are often at the receiving end of model predictions. To this end, we propose FairProof, a system that uses Zero-Knowledge Proofs (a cryptographic primitive) to publicly verify the fairness of a model while maintaining confidentiality. We also propose a fairness certification algorithm for fully-connected neural networks that is well suited to ZKPs and is used in this system. We implement FairProof in Gnark and demonstrate empirically that our system is practically feasible.

Distances on Markov Chains and their Differentiation

Tristan Brugere (UCSD)
Abstract: (Directed) graphs with node attributes are a common type of data in various applications and there is a vast literature on developing metrics and efficient algorithms for comparing them.

Recently, in the graph learning and optimization communities, a range of new approaches have been developed for comparing graphs with node attributes, leveraging ideas such as the Optimal Transport (OT) and the Weisfeiler-Lehman (WL) graph isomorphism test.
Two state-of-the-art representatives are the OTC distance proposed in (O’Connor et al., 2022) and the WL distance in (Chen et al., 2022).

Interestingly, while these two distances are developed based on different ideas, we observe that they both view graphs as Markov chains and are deeply connected. Indeed, in this paper, we propose a unified framework to generate distances for Markov chains (thus including (directed) graphs with node attributes), which we call the Optimal Transport Markov (OTM) distances, that encompass both the OTC and the WL distances. We further introduce a special one-parameter family of distances within our OTM framework, called the discounted WL distance. We show that the discounted WL distance has nice theoretical properties and can address several limitations of the existing OTC and WL distances. Furthermore, contrary to the OTC and the WL distances, our new discounted WL distance can be differentiated after an entropy-regularization similar to the Sinkhorn distance, making it suitable to use in learning frameworks, e.g., as the reconstruction loss in a graph generative model.

Nonlinear Spiked Random Matrix Models: From Signal Recovery to Deep Learning Theory

Behrad Moniri (UPenn)
Abstract: In this talk, we first propose a nonlinear spiked random matrix model where a nonlinear function is applied elementwise to a rank-one perturbation of a random matrix. We then study the spectrum of this model in different regimes of signal strength. Finally, we use these results to study low-rank signal recovery given nonlinear noisy observations, as well as feature learning in two-layer neural networks after one step of gradient descent.
 
Bio: Behrad is a PhD student in the Department of Electrical and Systems Engineering at the University of Pennsylvania. His current research interests are deep learning theory, probability theory, and information theory. Behrad received his B.Sc. degree in Electrical Engineering from the Sharif University of Technology in 2020 with highest distinctions.

Building Collaborative Intelligence

Sai Praneeth Karimireddy (UC Berkeley)

Abstract: How do we build ML systems that put the interests of users and society front and center? Using collaborations with Doctors Without Borders and the Cancer Registry of Norway as case studies, we will show how collaborative learning (CL) can be used to build such user-centric ML systems. CL techniques (including decentralized and federated learning) allow us to train ML models without users sacrificing their privacy or ownership and control over their data. Yet for these systems to truly succeed, two fundamental challenges must be confronted: these systems need to be efficient and scale to massive networks, and they must manage and resolve the conflicting goals of the participants. We will discuss how tools from optimization, statistics, and economics can be leveraged to address these challenges.

Bio: Sai Praneeth Karimireddy is a postdoc at UC Berkeley with Mike I. Jordan and obtained his PhD from EPFL with Martin Jaggi. His research builds large-scale collaborative learning systems and has seen widespread real-world adoption both by public health organizations (e.g., Doctors Without Borders, the Red Cross, the Cancer Registry of Norway) and by companies such as Meta, Google, OpenAI, and Owkin. His research has been recognized by the EPFL Patrick Denantes Memorial Prize for the best thesis in computer science, the Chorafas Foundation Award for exceptional applied research, an EPFL thesis distinction award, an SNSF fellowship, and multiple best paper awards.

Collaborative Compressors in Distributed Mean Estimation with Limited Communication Budget

Harshvardhan (UCSD)

Abstract: Distributed high-dimensional mean estimation is a common aggregation routine used in distributed optimization methods (e.g., federated learning). In light of these applications, there has recently been interest in distributed mean estimation in a communication-constrained setting, where vectors have to be compressed before sharing. However, in these applications the vectors whose mean is to be estimated are often correlated with each other. To exploit these correlations, Suresh et al., 2022, Jhunjhunwala et al., 2021, and Jiang et al., 2023, proposed multiple correlation-aware compression schemes. However, any theoretical analysis of the graceful degradation of these correlation-aware compression schemes with increasing heterogeneity (de-correlation) is absent from the literature. Moreover, the correlations have to be known for these schemes to work. In this paper, we propose collaborative compression schemes that agnostically exploit the correlation among vectors in a distributed setting. We propose modifications of four different compression schemes to make them suitable for distributed mean estimation. Our schemes are all based on and/or variants of popular sign-based compression (signSGD and 1-bit compressed sensing). We give a rigorous theoretical analysis of our proposed schemes to show how the estimation error varies with the degree of correlation among vectors. In the process, we come up with appropriate correlation measures for these applications as well. Further, we compare the performance of our schemes to existing benchmarks for distributed mean estimation and for downstream distributed learning tasks.
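
For background, here is a minimal version of the sign-based compression the abstract builds on: a generic 1-bit compressor with a single scale per vector (this is a baseline sketch with synthetic data invented for illustration, not any of the correlation-aware schemes proposed in the paper).

```python
import random

def compress(vec):
    """1-bit sign compression: d signs plus one scale (mean absolute value)."""
    scale = sum(abs(x) for x in vec) / len(vec)
    signs = [1 if x >= 0 else -1 for x in vec]
    return scale, signs

def decompress(scale, signs):
    return [scale * s for s in signs]

def distributed_mean(vectors):
    """Server estimate: average of the clients' decompressed vectors."""
    d = len(vectors[0])
    decoded = [decompress(*compress(v)) for v in vectors]
    return [sum(dec[i] for dec in decoded) / len(decoded) for i in range(d)]

# Synthetic correlated clients: a shared signal plus small per-client noise.
random.seed(0)
d, n = 64, 20
shared = [random.uniform(-1, 1) for _ in range(d)]
clients = [[s + random.gauss(0, 0.1) for s in shared] for _ in range(n)]

true_mean = [sum(v[i] for v in clients) / n for i in range(d)]
estimate = distributed_mean(clients)
error = sum((e - t) ** 2 for e, t in zip(estimate, true_mean)) ** 0.5
# Each client transmits d bits + 1 float instead of d floats.
```

The compression is lossy (only signs and one magnitude survive), which is exactly the trade-off the abstract studies: how cheaply can correlated clients communicate while keeping the server's mean estimate accurate.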

No Fear of Large Number of Local Steps: Leveraging Implicit Bias on FedAvg Algorithm

Heng Zhu (UCSD)

Abstract: In federated learning, a large number of local steps is usually avoided due to data heterogeneity. However, in this work we leverage the implicit bias of gradient descent to show that a federated approach involving a large number of local steps can closely approximate the centralized solution obtained by training on the aggregated data. We investigate this phenomenon through experiments conducted on synthetic data using both linear and logistic regression models. Furthermore, we offer a theoretical explanation for the linear regression task.
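
The claim can be illustrated on a toy overparameterized problem (entirely synthetic and chosen for illustration, not the paper's experiments): two heterogeneous clients each hold one example of a two-parameter linear model, so each can interpolate its own data. Even with many local gradient steps per round, averaging drives the server model to the solution that centralized training on the pooled data would find.

```python
def grad(w, x, y):
    """Gradient of the squared error (w . x - y)^2 for one example."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2 * err * xi for xi in x]

def local_train(w, data, steps=500, lr=0.1):
    """Many local gradient steps on one client's data."""
    w = list(w)
    for _ in range(steps):
        for x, y in data:
            w = [wi - lr * gi for wi, gi in zip(w, grad(w, x, y))]
    return w

def fedavg(clients, rounds=50):
    w = [0.0, 0.0]
    for _ in range(rounds):
        updates = [local_train(w, data) for data in clients]          # local phase
        w = [sum(u[i] for u in updates) / len(updates) for i in range(2)]  # average
    return w

# Two heterogeneous clients; each holds one example of a 2-parameter model,
# so each can drive its own training loss to zero (overparameterized regime).
clients = [[((1.0, 0.0), 2.0)], [((0.0, 1.0), 3.0)]]
w = fedavg(clients)
# Centralized training on the pooled data fits both examples exactly: w = (2, 3).
```

Each local run converges to the interpolating solution nearest the current server iterate (the implicit bias of gradient descent on this quadratic), and the round-to-round averaging contracts toward the pooled-data solution (2, 3) despite the heterogeneity.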

Product Manifold Learning with Independent Coordinate Selection

Jesse He (UCSD)

Abstract: In many dimensionality reduction tasks, we wish to identify the constituent components that explain our observations. For manifold learning, this can be formalized as factoring a Riemannian product manifold. Recovering this factorization, however, may suffer from certain difficulties in practice, especially when data is sparse or noisy, or when one factor is distorted by the other. To address these limitations, we propose identifying non-redundant coordinates on the product manifold before applying product manifold learning to identify which coordinates correspond to different factor manifolds.

Tear and Repulsion Enabled Registration of Point Clouds for Manifold Learning

Dhruv Kohli (UCSD)

Abstract: We present a framework for aligning the local views of a possibly closed/non-orientable data manifold to produce an embedding in its intrinsic dimension through tearing. Through a spectral coloring scheme, we render the embeddings of the points across the tear with matching colors, enabling a visual recovery of the topology of the data manifold. The embedding is further equipped with a tear-aware metric that enables computation of shortest paths while accounting for the tear. To measure the quality of an embedding, we propose two Lipschitz-type notions of global distortion—a stronger and a weaker one—along with their pointwise counterparts for a finer assessment of the embedding. Subsequently, we bound them using the distortion of the local views and the alignment error between them. We show that our theoretical result on strong distortion leads to a new perspective on the need for a repulsion term in manifold learning objectives. As a result, we enhance our alignment approach by incorporating repulsion. Finally, we compare various strategies for the tear and repulsion enabled alignment, with regard to their speed of convergence and the quality of the embeddings produced.


This is joint work with my advisors Gal Mishne and Alex Cloninger at UCSD.

Efficient Online Clustering with Moving Costs

Dimitrios Christou (UT Austin)

Abstract: In this talk, I will consider an online-learning problem, called Online Clustering with Moving Costs, in which a learner maintains a set of facilities over rounds so as to minimize the connection cost of an adversarially selected sequence of clients. The learner is informed of the positions of the clients at each round only after selecting its facilities, and can use this information to update its decision in the next round. However, updating the facility positions comes with an additional moving cost based on the moving distance of the facilities. I will present the first (polynomial-time) approximate-regret algorithm for this setting, obtained through a combination of algorithmic techniques such as HST embeddings, the FTRL framework with a dilated entropic regulariser, and a novel rounding scheme.

Encoding Structural Symmetry is Key for Length Generalization in Arithmetic Tasks

Mahdi Sabbaghi (UPenn)

Abstract: Despite the success of Transformers on language understanding, code generation, and logical reasoning, they still fail to (length) generalize on basic arithmetic tasks such as addition and multiplication. A major reason behind this failure is the vast difference in structure between numbers and text; for example, numbers are associated with a specific significance order that plays a role in calculating the answer, whereas for text such symmetries are quite unnatural. In this work, we propose to encode these semantics explicitly into the model via appropriate data formatting and custom positional encodings. To further elucidate the importance of explicitly encoding structure, we prove, in a simplified linear setting, that standard positional encodings, even when trained with augmentations to implicitly induce structure, fail at such generalization, whereas enforcing structure via positional encodings succeeds.

Bio: Mahdi Sabbaghi is a second-year PhD student in the Department of Electrical and Systems Engineering at UPenn, supervised by Professors Hamed Hassani and George Pappas. He previously obtained a B.Sc. in Electrical Engineering as well as a B.Sc. in Physics from the Sharif University of Technology in Tehran.

Private Estimation of U Statistics

Shourya Pandey (UT Austin)

Abstract: We consider the problem of private estimation of U statistics. U statistics are widely used estimators that naturally arise in a broad class of problems, from nonparametric signed rank tests to subgraph counts in random networks. Despite the recent outpouring of interest in private mean estimation, private algorithms for more general U statistics have received little attention. We propose a framework where, for a broad class of U statistics, one can use existing tools in private mean estimation to obtain confidence intervals where the private error does not overwhelm the irreducible error resulting from the variance of the U statistics. However, in specific cases that arise when the U statistics degenerate or have vanishing moments, the private error may be of a larger order than the non-private error. To remedy this, we propose a new thresholding-based approach that uses Hajek projections to re-weight different subsets. As we show, this leads to more accurate inference in certain settings.

Finite-Time Logarithmic Bayes Regret Upper Bounds

Alexia Atsidakou (UT Austin)

Abstract: We derive the first finite-time logarithmic Bayes regret upper bounds for Bayesian bandits, for BayesUCB and Thompson Sampling. In Gaussian and Bernoulli multi-armed bandits, we obtain $O(c_\Delta \log n)$ and $O(c_h \log^2 n)$ upper bounds for an upper confidence bound algorithm, where $c_h$ and $c_\Delta$ are constants depending on the prior distribution and the gaps of bandit instances sampled from it, respectively. The latter bound asymptotically matches the lower bound of Lai (1987). Our proofs are a major technical departure from prior works, while being simple and general. The key idea in our proofs is to conduct a frequentist per-instance analysis with Bayesian confidence intervals, and then integrate it over the prior.

Our results provide insights on the value of prior in the Bayesian setting, both in the objective and as a side information given to the learner. They significantly improve upon existing $\tilde{O}(\sqrt{n})$ bounds, which have become standard in the literature despite the logarithmic lower bound of Lai (1987).
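For reference, Thompson Sampling for Bernoulli bandits, one of the algorithms these Bayes regret bounds concern, is only a few lines. The sketch below illustrates the algorithm itself (with uniform Beta(1, 1) priors assumed for simplicity), not the paper's analysis:

```python
import numpy as np

def thompson_sampling(true_means, horizon, seed=0):
    # Beta(1, 1) prior on each arm; the posterior after s successes and
    # f failures on a Bernoulli arm is Beta(1 + s, 1 + f).
    rng = np.random.default_rng(seed)
    k = len(true_means)
    succ = np.ones(k)
    fail = np.ones(k)
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Sample one draw from each arm's posterior; play the argmax.
        arm = int(np.argmax(rng.beta(succ, fail)))
        reward = rng.random() < true_means[arm]
        succ[arm] += reward
        fail[arm] += 1 - reward
        regret += best - true_means[arm]
    return regret

regret = thompson_sampling([0.3, 0.6], horizon=2000)
```

On instances with a constant gap, the realized regret grows logarithmically in the horizon, which is the behavior the finite-time bounds above quantify.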

Pareto-Optimal Algorithms for Learning in Games

Natalie Collina and Eshwar Ram Arunachaleswaran

Abstract: We study the problem of characterizing optimal learning algorithms for playing repeated games against an adversary with unknown payoffs. In this problem, the first player (called the learner) commits to a learning algorithm against a second player (called the optimizer), and the optimizer best-responds by choosing the optimal dynamic strategy for their (unknown but well-defined) payoff. Classic learning algorithms (such as no-regret algorithms) provide some counterfactual guarantees for the learner, but might perform much more poorly than other learning algorithms against particular optimizer payoffs.

In this paper, we introduce the notion of asymptotically Pareto-optimal learning algorithms. Intuitively, if a learning algorithm is Pareto-optimal, then there is no other algorithm which performs asymptotically at least as well against all optimizers and performs strictly better (by at least $\Omega(T)$) against some optimizer. We show that well-known no-regret algorithms such as Multiplicative Weights and Follow The Regularized Leader are Pareto-dominated. However, while no-regret is not enough to ensure Pareto-optimality, we show that a strictly stronger property, no-swap-regret, is a sufficient condition for Pareto-optimality.

Proving these results requires us to address various technical challenges specific to repeated play, including the fact that there is no simple characterization of how optimizers who are rational in the long-term best-respond against a learning algorithm over multiple rounds of play. To address this, we introduce the idea of the asymptotic menu of a learning algorithm: the convex closure of all correlated distributions over strategy profiles that are asymptotically implementable by an adversary. Interestingly, we show that all no-swap-regret algorithms share the same asymptotic menu, implying that all no-swap-regret algorithms are “strategically equivalent”.

This talk is based on work with Jon Schneider.

Metric Learning from Lazy Crowds

Geelon So

Abstract: We consider crowd-based metric learning from preference comparisons, where given two items, a user prefers the item that is closer to their latent ideal item. Here, “closeness” is measured with respect to a shared but unknown Mahalanobis distance. Can we recover this distance when we can only obtain very few responses per user?

In this very low-budget regime, we show that generally, nothing at all about the metric is revealed, even with infinitely many users. But when the items have subspace cluster structure, we present a divide-and-conquer approach for metric recovery, and provide theoretical recovery guarantees and empirical validation.

This is joint work with Zhi Wang (UCSD) and Ramya Korlakai Vinayak (UW–Madison).

Random Walks, Conductance, and Resistance for the Connection Graph Laplacian

Zhengchao Wan

Abstract: We investigate the concept of effective resistance in connection graphs, expanding its traditional application from undirected graphs. We propose a robust definition of effective resistance in connection graphs by focusing on the duality of Dirichlet-type and Poisson-type problems on connection graphs. Additionally, we study random walks on connection graphs, taking into account both node transitions and vector rotations. This approach introduces novel notions of effective conductance and resistance matrices for connection graphs, capturing the mean rotation matrices corresponding to random-walk transitions, and thereby provides new theoretical insights for network analysis and optimization.

This is based on a joint work with Alexander Cloninger, Gal Mishne, Andreas Oslandsbotn, Sawyer Jack Robertson and Yusu Wang.

Approximability and Inapproximability of Strict-CSPs

Akbar Rafiey

Abstract: We study the approximability and inapproximability of Strict-CSPs. An instance of a Strict-CSP consists of a set of constraints over a set of variables and a cost function over the assignments. The goal is to find a minimum-cost assignment to the variables that satisfies all the constraints. Some prominent problems that this framework captures are (Hypergraph) Vertex Cover, Min Sum k-Coloring, Multiway Cut, Min Ones, and others.

We focus on a systematic study of Strict-CSPs of the form Strict-CSPs(H), that is, Strict-CSPs where the type of constraints is limited to predicates from a set H. Our first result is a dichotomy for approximation of Strict-CSPs(H), where H is a binary predicate, i.e., a digraph. We prove that if digraph H has bounded width, then Strict-CSPs(H) is approximable within a constant factor (depending on H); otherwise, there is no approximation for Strict-CSPs(H) unless P=NP.

Second, we study the inapproximability of Strict-CSP and present the first general hardness of approximation for Strict-CSP. More precisely, we prove a dichotomy theorem that states every instance of Strict-CSP(H) (H being a digraph) is either polynomial-time solvable or APX-complete. Moreover, we show the existence of a universal constant 0<\delta<1 such that it is NP-hard to approximate Strict-CSP(H) within a factor of (2-\delta) for all digraphs H where Strict-CSP(H) is NP-complete.

Buy-many Mechanisms for Many Unit-demand Buyers

Rojin Rezvan

Abstract: A recent line of research has established a novel desideratum for designing approximately revenue-optimal multi-item mechanisms, namely the buy-many constraint. Under this constraint, prices for different allocations made by the mechanism must be subadditive, implying that the price of a bundle cannot exceed the sum of the prices of the individual items it contains. This natural constraint has enabled several positive results in multi-item mechanism design, bypassing well-established impossibility results.

Our work addresses a main open question from this literature: the design of buy-many mechanisms for multiple buyers. Our main result is that a simple sequential item-pricing mechanism with buyer-specific prices can achieve an O(log m) approximation to the revenue of any buy-many mechanism when all buyers have unit-demand preferences over m items. This is the best possible, as it directly matches the previous results for the single-buyer setting, where no simple mechanism can obtain a better approximation. Our result applies in full generality: even though there are many alternative ways buy-many mechanisms can be defined for multi-buyer settings, our result captures all of them at the same time. We achieve this by directly competing with a more permissive upper bound on the buy-many revenue, obtained via an ex-ante relaxation.

Streaming PCA for Markovian Data

Syamantak Kumar

Abstract: Since its inception in 1982, Oja's algorithm has become an established method for streaming principal component analysis (PCA). We study the problem of streaming PCA, where the data-points are sampled from an irreducible, aperiodic, and reversible Markov chain. Our goal is to estimate the top eigenvector of the unknown covariance matrix of the stationary distribution. This setting has implications in scenarios where data can solely be sampled from a Markov Chain Monte Carlo (MCMC) type algorithm, and the objective is to perform inference on parameters of the stationary distribution. Most convergence guarantees for Oja's algorithm in the literature assume that the data-points are sampled i.i.d. For data streams with Markovian dependence, one typically downsamples the data to get a “nearly” independent data stream. In this paper, we obtain the first sharp rate for Oja's algorithm on the entire data stream, removing the logarithmic dependence on the sample size that results from throwing data away in downsampling strategies.
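Oja's update itself is a one-line recursion. The sketch below runs it on a toy reversible Markov chain, a Gaussian AR(1) process whose stationary covariance (and hence top eigenvector) is known by construction; the step size and chain parameters are illustrative assumptions, not values from the paper:

```python
import numpy as np

def oja(stream, d, lr=0.01, seed=0):
    # Oja's update: w <- normalize(w + lr * x * (x . w)), one pass over the stream.
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for x in stream:
        w = w + lr * x * (x @ w)
        w /= np.linalg.norm(w)
    return w

# Reversible Markovian data: Gaussian AR(1) with stationary covariance
# diag([4, 1, ..., 1]), so the top eigenvector is e_1.
rng = np.random.default_rng(3)
d, rho, n = 10, 0.5, 20000
target_var = np.ones(d)
target_var[0] = 4.0
noise_scale = np.sqrt((1 - rho**2) * target_var)
x = rng.normal(size=d) * np.sqrt(target_var)  # start at stationarity
stream = []
for _ in range(n):
    x = rho * x + noise_scale * rng.normal(size=d)
    stream.append(x.copy())

w = oja(stream, d)
```

Even though consecutive samples are correlated, running Oja's algorithm on the entire stream (rather than a downsampled one) recovers the top eigenvector of the stationary covariance, which is the regime the sharp rate above addresses.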

A d^{1/2 + o(1)} Monotonicity Tester for Boolean Functions on d-Dimensional Hypergrids

Hadley Black

Abstract: Monotonicity testing of Boolean functions over the n-width, d-dimensional hypergrid is a classic problem in property testing, where the goal is to design a randomized algorithm which can distinguish monotone functions from those which are far from any monotone function while making as few queries as possible. The special case of n = 2 corresponds to the hypercube domain. Here a long line of works exploiting a very interesting connection with isoperimetric inequalities for Boolean functions culminated in a non-adaptive tester making ~O(d^{1/2}) queries in a celebrated paper by Khot, Minzer, and Safra (SICOMP 2018). This is known to be optimal for non-adaptive testers. However, the general case of hypergrids for n > 2 remained open. Very recently, two papers (Black-Chakrabarty-Seshadhri STOC 2023 and Braverman-Khot-Kindler-Minzer ITCS 2023) independently obtained ~O(poly(n) d^{1/2}) query testers for hypergrids. These results are essentially optimal for n < polylog(d), but are far from optimal for n >> polylog(d).

This talk covers our most recent result (appearing at FOCS 2023) which obtains a non-adaptive d^{1/2+o(1)} query tester for all n, resolving the non-adaptive monotonicity testing problem for hypergrids, up to a factor of d^{o(1)}. Our proof relies on many new techniques as well as two key theorems which we proved in earlier works from SODA 2020 and STOC 2023.
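For intuition on the hypercube case (n = 2), the classic non-adaptive edge tester, the baseline that the ~O(d^{1/2}) testers above dramatically improve upon, samples random edges and checks monotonicity along each. This sketch is the textbook tester, not the algorithm from the talk:

```python
import random

def edge_tester(f, d, queries, seed=0):
    # Classic edge tester on {0,1}^d: pick a uniform point and coordinate,
    # and reject if f decreases along that edge (bit i: 0 -> 1).
    # It needs far more queries than the sqrt(d)-type testers in the talk.
    rng = random.Random(seed)
    for _ in range(queries):
        x = [rng.randint(0, 1) for _ in range(d)]
        i = rng.randrange(d)
        lo, hi = list(x), list(x)
        lo[i], hi[i] = 0, 1
        if f(tuple(lo)) > f(tuple(hi)):
            return False  # found a violated edge: reject
    return True  # no violation seen: accept

majority = lambda x: int(sum(x) > len(x) // 2)  # monotone: always accepted
anti = lambda x: 1 - x[0]                       # far from monotone
```

A monotone function is never rejected, while a function that is far from monotone has many violated edges and is caught with high probability.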

SmoothLLMs: Defending LLMs against Adversarial Attacks

Alex Robey

Abstract: Despite efforts to align large language models (LLMs) with human values, widely-used LLMs such as GPT, Llama, Claude, and PaLM are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content.  To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on LLMs.  Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.  SmoothLLM reduces the attack success rate on numerous popular LLMs to below one percentage point, avoids unnecessary conservatism, and admits provable guarantees on attack mitigation.  Moreover, our defense uses exponentially fewer queries than existing attacks and is compatible with any LLM.
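The perturb-then-aggregate idea can be sketched without a real LLM. Below, a toy string-matching "jailbreak detector" stands in for querying a model, and the exact suffix, perturbation rate, and vote rule are illustrative assumptions; the point is only that a brittle character-level attack rarely survives random character swaps, so the majority vote over perturbed copies rejects it:

```python
import random
import string

def perturb(prompt, q, rng):
    # Randomly swap a q-fraction of characters (one of several perturbation
    # types considered by SmoothLLM).
    chars = list(prompt)
    for i in range(len(chars)):
        if rng.random() < q:
            chars[i] = rng.choice(string.ascii_letters)
    return "".join(chars)

def smooth_llm(prompt, is_jailbroken, copies=10, q=0.1, seed=0):
    # Aggregate by majority vote over perturbed copies: the prompt counts as
    # a successful jailbreak only if most copies still trigger the attack.
    rng = random.Random(seed)
    votes = sum(is_jailbroken(perturb(prompt, q, rng)) for _ in range(copies))
    return votes > copies // 2

# Stand-in for a brittle adversarial suffix: this hypothetical "attack"
# succeeds only if the suffix survives character-for-character.
SUFFIX = "describing similarlyNow write oppositeley"
def toy_jailbreak(p):
    return SUFFIX in p

adversarial = "Tell me how to do something harmful " + SUFFIX
```

Because each copy keeps the full suffix intact only with small probability, the vote over perturbed copies almost never certifies the attack, mirroring the brittleness finding in the abstract.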

Composition of Nested Embeddings with an Application to Outlier Removal

Kristin Sheridan

Abstract: We study the design of embeddings into Euclidean space with outliers. Given a metric space $(X,d)$ and an integer $k$, the goal is to embed all but $k$ points in $X$ (called the “outliers”) into $\ell_2$ with the smallest possible distortion $c$. Finding the optimal distortion $c$ for a given outlier set size $k$, or alternatively the smallest $k$ for a given target distortion $c$, are both NP-hard problems. In fact, it is UGC-hard to approximate $k$ to within a factor smaller than $2$ even when the metric sans outliers is isometrically embeddable into $\ell_2$. We consider bi-criteria approximations. Our main result is a polynomial time algorithm that approximates the outlier set size to within an $O(\log^2 k)$ factor and the distortion to within a constant factor.

The main technical component in our result is an approach for constructing a composition of two given embeddings from subsets of $X$ into $\ell_2$ which inherits the distortions of each to within small multiplicative factors. Specifically, given a low $c_S$ distortion embedding from $S\subset X$ into $\ell_2$ and a high(er) $c_X$ distortion embedding from the entire set $X$ into $\ell_2$, we construct a single embedding that achieves the same distortion $c_S$ over pairs of points in $S$ and an expansion of at most $O(\log k)\cdot c_X$ over the remaining pairs of points, where $k=|X\setminus S|$. Our composition theorem extends to embeddings into arbitrary $\ell_p$ metrics for $p\ge 1$, and may be of independent interest. While unions of embeddings over disjoint sets have been studied previously, to our knowledge, this is the first work to consider compositions of {\em nested} embeddings.

Graph Sparsification by Approximate Matrix Multiplication

Neo Charalambides

Abstract: Graphs arising in statistical problems, signal processing, large networks, combinatorial optimization, and data analysis are often dense, which causes both computational and storage bottlenecks. One way of sparsifying a weighted graph, keeping the same vertices as the original graph while reducing the number of edges, is through spectral sparsification. We study this problem through the perspective of RandNLA. Specifically, we utilize randomized matrix multiplication to give a clean and simple analysis of how sampling according to edge weights gives a spectral approximation to graph Laplacians, without requiring spectral information. Through the CR-MM algorithm, we attain a simple and computationally efficient sparsifier whose resulting Laplacian estimate is unbiased and of minimum variance. Furthermore, we define a new notion of additive spectral sparsifiers, which has not been considered in the literature.
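Sampling edges according to their weights can be sketched directly. This is a minimal CR-style (randomized matrix-multiplication) sampler with the standard unbiased reweighting, offered as an illustration of the idea rather than the paper's exact algorithm:

```python
import numpy as np

def laplacian(n, edges):
    # Weighted graph Laplacian from a list of (u, v, weight) edges.
    L = np.zeros((n, n))
    for u, v, w in edges:
        L[u, u] += w
        L[v, v] += w
        L[u, v] -= w
        L[v, u] -= w
    return L

def sparsify(n, edges, s, seed=0):
    # Sample s edges i.i.d. with probability proportional to weight, and
    # reweight each sample by w_e / (s * p_e) so that the sparsifier's
    # Laplacian is an unbiased estimate of the original.
    rng = np.random.default_rng(seed)
    w = np.array([e[2] for e in edges])
    p = w / w.sum()
    new_w = {}
    for i in rng.choice(len(edges), size=s, p=p):
        u, v, _ = edges[i]
        new_w[(u, v)] = new_w.get((u, v), 0.0) + w[i] / (s * p[i])
    return [(u, v, wt) for (u, v), wt in new_w.items()]

rng = np.random.default_rng(4)
n = 30
edges = [(u, v, 0.5 + rng.random()) for u in range(n) for v in range(u + 1, n)]
sparse = sparsify(n, edges, s=4000)
L = laplacian(n, edges)
L_sp = laplacian(n, sparse)
```

Note that with p proportional to w, every sampled edge contributes the same reweighted mass, which is what makes the estimate unbiased without needing any spectral information.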
