My PhD dissertation, Algorithmic Approaches to Statistical Questions, 2012. (ACM Doctoral Dissertation Award, Honorable Mention.) [pdf]
We describe two algorithms for multiplying n x n matrices using time and energy n^2 polylog(n) under basic models of classical physics. The first algorithm is for multiplying integer-valued matrices; the second, quite different, algorithm is for Boolean matrix multiplication. We hope this work inspires a deeper consideration of physically plausible/realizable models of computing that might allow for algorithms that improve upon the runtimes and energy usages suggested by the parallel RAM model, in which each operation requires one unit of time and one unit of energy.
@inproceedings{V24,
title = "Matrix Multiplication in Quadratic Time and Energy? Towards a Fine-Grained Energy-Centric Church-Turing Thesis",
author = "Gregory Valiant",
booktitle="Innovations in Theoretical Computer Science (ITCS)",
year = "2024",
}
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM). However, lexical symbol meanings can also be determined and even redefined by their structural role in a long context. In this paper, we ask: is it possible for a language model to be performant without any fixed token embeddings? Such a language model would have to rely entirely on the co-occurrence and repetition of tokens in the context rather than the a priori identity of any token. To answer this, we study "lexinvariant" language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice. First, we prove that a lexinvariant LM can converge to the true language model at a uniform rate that is polynomial in the context length, with a constant factor that is sublinear in the vocabulary size. Second, to build a lexinvariant LM, we simply encode tokens using random Gaussian vectors, such that each token maps to the same representation within each sequence but to different representations across sequences. Empirically, we demonstrate that such a model can indeed attain perplexity comparable to that of a standard language model, given a sufficiently long context. We then explore two properties of lexinvariant language models: First, given text generated from a substitution cipher of English, the model implicitly implements Bayesian in-context deciphering and infers the mapping to the underlying real tokens with high accuracy. Second, it attains on average 4x higher accuracy on synthetic in-context reasoning tasks. Finally, we discuss regularizing standard language models towards lexinvariance and potential practical applications.
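A minimal numpy sketch of the per-sequence random Gaussian encoding described above (vocabulary size and dimension are illustrative; this is not the paper's training code):

```python
import numpy as np

def lexinvariant_embed(token_ids, vocab_size, dim, rng):
    """Map each token id to a Gaussian vector drawn fresh for this sequence:
    the same id gets the same vector within a sequence, but a new draw is
    made per sequence, so there is no fixed embedding table."""
    table = rng.normal(scale=1.0 / np.sqrt(dim), size=(vocab_size, dim))
    return table[np.asarray(token_ids)]

rng = np.random.default_rng(0)
seq_a = lexinvariant_embed([5, 9, 5], vocab_size=100, dim=16, rng=rng)
seq_b = lexinvariant_embed([5, 9, 5], vocab_size=100, dim=16, rng=rng)
assert np.allclose(seq_a[0], seq_a[2])       # repeated token, same vector
assert not np.allclose(seq_a[0], seq_b[0])   # new sequence, new representation
```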
@inproceedings{HZCWVL23,
title = "Lexinvariant Language Models",
author = "Qian Huang and Eric Zelikman and Sarah Li Chen and Yuhuai Wu and Gregory Valiant and Percy Liang",
booktitle="NeurIPS",
year = "2023",
}
We show that any memory-constrained, first-order algorithm which minimizes d-dimensional, 1-Lipschitz convex functions over the unit ball to 1/poly(d) accuracy using at most d^(1.25 - delta) bits of memory must make at least d^(1+(4/3)delta) first-order queries (for any constant delta in [0, 1/4]). Consequently, the performance of such memory-constrained algorithms is a polynomial factor worse than the optimal d polylog(d) query bound for this problem obtained by cutting plane methods that use at least d^2 memory. This resolves one of the open problems posed in the COLT 2019 open problem publication of Woodworth and Srebro.
@inproceedings{MSSV22,
title = "Efficient Convex Optimization Requires Superlinear Memory",
author = "Annie Marsden and Vatsal Sharan and Aaron Sidford and Gregory Valiant",
booktitle="COLT",
year = "2022",
}
In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data. To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn "most" functions from this class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions -- that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the model and inference-time prompts, and (ii) between the in-context examples and the query input during inference. We also show that we can train Transformers to in-context learn more complex function classes -- namely sparse linear functions, two-layer neural networks, and decision trees -- with performance that matches or exceeds task-specific learning algorithms.
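A minimal numpy sketch of the setup (dimensions illustrative): the prompt is a sequence of (x_i, y_i) pairs followed by a query, and the trained model's prediction is compared against the least squares estimator fit to the in-context examples.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_examples = 8, 20
w = rng.normal(size=d)                  # the unseen linear function
xs = rng.normal(size=(n_examples, d))   # in-context inputs
ys = xs @ w                             # in-context outputs
x_query = rng.normal(size=d)

# Prompt fed to the trained model: (x_1, y_1, ..., x_n, y_n, x_query).
# The baseline it is measured against: least squares on the prompt's pairs.
w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
baseline_prediction = x_query @ w_hat
```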
@inproceedings{GTLV22,
title = "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes",
author = "Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant",
booktitle="NeurIPS",
year = "2022",
}
The famous "laurel/yanny'' phenomenon references an audio clip that elicits dramatically different responses from different listeners. For the original clip, roughly half the population hears the word "laurel," while the other half hears "yanny." How common are such "polyperceivable" audio clips? In this paper we apply ML techniques to study the prevalence of polyperceivability in spoken language. We devise a metric that correlates with polyperceivability of audio clips, use it to efficiently find new "laurel/yanny" type examples, and validate these results with human experiments. Our results suggest that polyperceivable examples are surprisingly prevalent, existing for at least 2% of English words
@inproceedings{CKV21,
title = "Beyond Laurel/Yanny: An Autoencoder-Enabled Search for Polyperceivable Audio",
author = "Kartik Chandra and Chuma Kabaghe and Gregory Valiant",
booktitle="ACL",
year = "2021",
}
We consider an online binary prediction setting where a forecaster observes a sequence of T bits one by one. Before each bit is revealed, the forecaster predicts the probability that the bit is 1. The forecaster is called well-calibrated if, for each p in [0,1], among the n_p bits for which the forecaster predicts probability p, the actual number of ones, m_p, is indeed equal to p*n_p. The calibration error, defined as \sum_p |m_p - p n_p|, quantifies the extent to which the forecaster deviates from being well-calibrated. It has long been known that an O(T^(2/3)) calibration error is achievable even when the bits are chosen adversarially, and possibly based on the previous predictions. However, little is known on the lower bound side, except a sqrt(T) bound that follows from the trivial example of independent fair coin flips. In this paper, we prove a T^(0.528) lower bound on the calibration error, the first bound above the trivial sqrt(T) lower bound for this setting. The technical contributions of our work include two lower bound techniques, early stopping and sidestepping, which circumvent the obstacles that have previously hindered strong calibration lower bounds. We also propose an abstraction of the prediction setting, termed the Sign-Preservation game, which may be of independent interest. This game has a much smaller state space than the full prediction setting and allows simpler analyses. The T^(0.528) lower bound follows from a general reduction theorem that translates lower bounds on the game value of Sign-Preservation into lower bounds on the calibration error.
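The calibration error is a direct tally over the transcript of predictions and outcomes; a small sketch transcribing the definition:

```python
from collections import defaultdict

def calibration_error(predictions, outcomes):
    """sum_p |m_p - p * n_p| over the distinct predicted probabilities p."""
    n = defaultdict(int)   # n_p: rounds on which probability p was predicted
    m = defaultdict(int)   # m_p: number of ones among those rounds
    for p, bit in zip(predictions, outcomes):
        n[p] += 1
        m[p] += bit
    return sum(abs(m[p] - p * n[p]) for p in n)

# Always predicting 0.5 on the sequence 1, 1, 1, 0 gives |3 - 0.5 * 4| = 1.
assert calibration_error([0.5] * 4, [1, 1, 1, 0]) == 1.0
```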
@inproceedings{QV21,
title = "Stronger Calibration Lower Bounds via Sidestepping",
author = "Mingda Qiao and Gregory Valiant",
booktitle="STOC",
year = "2021",
}
We study probabilistic prediction games when the underlying model is misspecified, investigating the consequences of predicting using an incorrect parametric model. We show that for a broad class of loss functions and parametric families of distributions, the regret of playing a ``proper'' predictor--one from the putative model class--relative to the best predictor in the same model class has a lower bound scaling as sqrt(c n), where c is a measure of the model misspecification relative to the true distribution in terms of total variation distance. In contrast, using an aggregation-based (improper) learner, one can obtain regret d log n for any underlying generating distribution, where d is the dimension of the parameter; we exhibit instances in which this is unimprovable even over the family of all learners that may play distributions in the convex hull of the parametric family. These results suggest that simple strategies for aggregating multiple learners together should be more robust, and several experiments are consistent with this hypothesis.
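To make the aggregation idea concrete, here is a minimal exponential-weights (Bayesian mixture) learner under log loss over a finite pool of predictors. This is the generic aggregation scheme rather than the paper's exact construction; its cumulative log loss exceeds the best expert's by at most log(number of experts), which yields the d log n flavor of regret when the pool discretizes a d-dimensional family.

```python
import numpy as np

def aggregate_predict(expert_probs, outcomes):
    """Online Bayesian mixture (exponential weights) under log loss.

    expert_probs[t, j]: expert j's probability that bit t is 1."""
    T, k = expert_probs.shape
    log_w = np.zeros(k)                    # log posterior weights
    preds = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        preds[t] = w @ expert_probs[t]     # improper: a mixture, not one expert
        p = np.clip(expert_probs[t], 1e-12, 1 - 1e-12)
        log_w += np.log(p) if outcomes[t] else np.log(1 - p)
    return preds
```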
@inproceedings{MDV21,
title = "Misspecification in Prediction Problems and Robustness via Improper Learning",
author = "Annie Marsden and John Duchi and Gregory Valiant",
booktitle="AISTATS",
year = "2021",
}
We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements {1,...,n} with corresponding values x1,...,xn. We observe the values for a sample set A \subset {1,...,n} and wish to estimate some statistic of the values for a target set B \subset {1,...,n}, where B could be the entire set. Crucially, we assume that the sets A and B are drawn according to some known joint distribution P over pairs of subsets of {1,...,n}. A given estimation algorithm is evaluated based on its worst-case, expected error, where the expectation is with respect to the distribution P from which the sample A and target set B are drawn, and the worst case is with respect to the data values x1,...,xn. Within this general framework we give an efficient algorithm to find an estimator for the target mean, as a weighted combination of the input sample, where the weights are a function of the distribution P and the identities of the elements in the sample and target sets A, B. We show that the worst-case expected error achieved by this estimator is at most a multiplicative pi/2 factor worse than the optimum for such estimators. A component of this algorithm can also be used to approximate the worst-case expected error of a given estimator. The algorithm and proof leverage a surprising connection to the Grothendieck problem. We extend these results to the setting of linear regression, where each datapoint is not a scalar but a labeled vector. Our framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process via the distribution P, is a significant departure from the typical statistical estimation framework, and it provides a uniform algorithmic analysis for the many natural settings where membership in a sample may be correlated with data values, such as when probabilities of sampling vary as in importance sampling, when individuals are recruited into a sample via a social network as in ``snowball sampling'' or ``respondent-driven sampling'', or when samples have chronological structure as in ``selective prediction''. We experimentally demonstrate the benefit of this framework and our algorithm in comparison to standard estimators, for several such settings.
@inproceedings{CVV20,
title = "Worst-Case Analysis for Randomly Collected Data",
author = "Justin Y. Chen, Gregory Valiant, and Paul Valiant",
booktitle="NeurIPS",
year = "2020",
}
Given data drawn from an unknown distribution, D, to what extent is it possible to "amplify" this dataset and output an even larger set of samples that appear to have been drawn from D? We formalize this question as follows: an (n,m) amplification procedure takes as input n independent draws from an unknown distribution D, and outputs a set of m > n "samples". An amplification procedure is valid if no algorithm can distinguish the set of m samples produced by the amplifier from a set of m independent draws from D with probability greater than 2/3. Perhaps surprisingly, in many settings, a valid amplification procedure exists, even when the size of the input dataset, n, is significantly less than what would be necessary to learn D to non-trivial accuracy. Specifically, we consider two fundamental settings: the case where D is an arbitrary discrete distribution supported on at most k elements, and the case where D is a d-dimensional Gaussian with unknown mean and fixed covariance. In the first case, we show that an (n, n+O(n/sqrt(k))) amplifier exists. In particular, given n=O(sqrt(k)) samples from D, one can output a set of m=n+1 datapoints whose total variation distance from the distribution of m i.i.d. draws from D is a small constant, despite the fact that one would need quadratically more data to learn D up to small constant total variation distance. In the Gaussian case, we show that an (n, n+O(n/sqrt(d))) amplifier exists, even though learning the distribution to small constant total variation distance requires O(d) samples. In both the discrete and Gaussian settings, we show that these results are tight, to constant factors. Beyond these results, we formalize a number of curious directions for future research along this vein.
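One natural amplifier in this spirit, for the discrete case, simply appends fresh draws from the empirical distribution of the input sample. The sketch below illustrates the amplification interface only; the paper's analysis governs when such a procedure is actually valid.

```python
import numpy as np

def amplify(samples, n_extra, rng):
    """Return the n input samples plus n_extra draws from their empirical
    distribution."""
    extra = rng.choice(samples, size=n_extra, replace=True)
    return np.concatenate([samples, extra])

rng = np.random.default_rng(0)
samples = rng.integers(0, 1000, size=100)        # n = 100 draws, support k <= 1000
amplified = amplify(samples, n_extra=3, rng=rng) # aiming for m = n + O(n/sqrt(k))
```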
@inproceedings{AGSV20,
title = "Sample Amplification: Increasing Dataset Size even when Learning is Impossible",
author = "Brian Axelrod, Shivam Garg, Vatsal Sharan, and Gregory Valiant",
booktitle="International Conference on Machine Learning (ICML)",
year = "2020",
}
We consider networks trained via stochastic gradient descent to minimize l2 loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum, over the datapoints, of the squared l2 norm of the gradient of the model with respect to the parameter vector, evaluated at each data point. This holds for networks of any connectivity, width, depth, and choice of activation function. We interpret this implicit regularization term for three simple settings: matrix sensing, two-layer ReLU networks trained on one-dimensional data, and two-layer networks with sigmoid activations trained on a single datapoint. For these settings, we show why this new and general implicit regularization effect drives the networks towards ``simple'' models.
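The regularization term is easy to compute explicitly for a small model; a sketch for a two-layer ReLU network f(x) = a . relu(Wx) (an illustrative architecture, with the gradients written out by hand):

```python
import numpy as np

def implicit_reg(W, a, X):
    """sum over datapoints x of || d f_theta(x) / d theta ||_2^2
    for the two-layer net f(x) = a . relu(W x), with theta = (W, a)."""
    total = 0.0
    for x in X:
        h = W @ x
        active = (h > 0).astype(float)
        grad_a = np.maximum(h, 0.0)        # df/da    = relu(Wx)
        grad_W = np.outer(a * active, x)   # df/dW_jk = a_j 1[(Wx)_j > 0] x_k
        total += grad_a @ grad_a + np.sum(grad_W ** 2)
    return total
```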
@inproceedings{BGVV20,
title = "Implicit regularization for deep neural networks driven by an {O}rnstein-{U}hlenbeck like process",
author = "Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant",
booktitle="Conference on Learning Theory (COLT)",
year = "2020",
}
Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used--the EU's Right To Be Forgotten regulation is an example of this effort. In this paper we initiate a framework studying what to do when it is no longer permissible to deploy models derived from specific user data. In particular, we formulate the problem of efficiently deleting individual data points from trained machine learning models. For many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of k-means clustering, we propose two provably efficient deletion algorithms which achieve an average of over 100x improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical k-means++ baseline.
@inproceedings{GGVZ19,
title = "Making AI Forget You: Data Deletion in Machine Learning",
author = "Antonio A. Ginart, Melody Y. Guan, Gregory Valiant, and James Zou, Making AI Forget You: Data Deletion in Machine Learning",
booktitle="NeurIPS",
year = "2019",
}
We consider the problem of performing linear regression over a stream of $d$-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples $(a_1,b_1), (a_2,b_2)\ldots,$ with $a_i$ drawn independently from a $d$-dimensional isotropic Gaussian, and where $b_i = \langle a_i, x\rangle + \eta_i,$ for a fixed $x \in \mathbb{R}^d$ with $\|x\|_2 = 1$ and with independent noise $\eta_i$ drawn uniformly from the interval $[-2^{-d/5},2^{-d/5}]$. We show that any algorithm with at most $d^2/4$ bits of memory requires at least $\Omega(d \log \log \frac{1}{\epsilon})$ samples to approximate $x$ to $\ell_2$ error $\epsilon$ with probability of success at least $2/3$, for $\epsilon$ sufficiently small as a function of $d$. In contrast, for such $\epsilon$, $x$ can be recovered to error $\epsilon$ with probability $1-o(1)$ with memory $O\left(d^2 \log(1/\epsilon)\right)$ using $d$ examples. This represents the first nontrivial lower bound for regression with super-linear memory, and may open the door to strong memory/sample tradeoffs for continuous optimization.
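For contrast, the memory-rich baseline in the last sentence amounts to storing d examples and solving a linear system; since the noise is exponentially small, the solution is extremely accurate. A minimal sketch (d chosen small for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
x = rng.normal(size=d)
x /= np.linalg.norm(x)
A = rng.normal(size=(d, d))      # d isotropic Gaussian examples a_i (as rows)
eta = rng.uniform(-2.0 ** (-d / 5), 2.0 ** (-d / 5), size=d)
b = A @ x + eta

# Store all d examples (O(d^2 log(1/eps)) bits) and solve; the noise is so
# small that the recovered x is accurate to roughly 2^(-d/5) * cond(A).
x_hat = np.linalg.solve(A, b)
print(np.linalg.norm(x_hat - x))
```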
@inproceedings{SSV19,
title = "Memory-Sample Tradeoffs for Linear Regression with Small Error",
author = "Vatsal Sharan and Aaron Sidford and Gregory Valiant",
booktitle="STOC",
year = "2019",
}
We consider a model of selective prediction, where the prediction algorithm is given a data sequence in an online fashion and asked to predict a pre-specified statistic of the upcoming data points. The algorithm is allowed to choose when to make the prediction as well as the length of the prediction window, possibly depending on the observations so far. We prove that, even without any distributional assumption on the input data stream, a large family of statistics can be estimated to non-trivial accuracy. To give one concrete example, suppose that we are given access to an arbitrary binary sequence x_1,...,x_n of length n. Our goal is to accurately predict the average observation, and we are allowed to choose the window over which the prediction is made: for some t < n and m < n - t, after seeing t observations we predict the average of x_{t+1},...,x_{t+m}. This particular problem was first studied in [Dru13] and referred to as the "density prediction game". We show that the expected squared error of our prediction can be bounded by O(1/log n), and we prove a matching lower bound, which resolves an open question raised in [Dru13]. This result holds for any sequence (that is not adaptive to when the prediction is made, or to the predicted value), and the expectation of the error is with respect to the randomness of the prediction algorithm. Our results apply to more general statistics of a sequence of observations, and we highlight several open directions for future work.
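A hedged sketch of the random-window idea underlying such predictors (a simplified caricature, not the paper's algorithm or its O(1/log n) analysis):

```python
import numpy as np

def selective_predict(x, rng):
    """Choose a random scale 2^k, average the first block of that length,
    and predict that average for the next block of the same length."""
    n = len(x)
    k = int(rng.integers(0, max(1, int(np.log2(n)))))
    t = 2 ** k
    prediction = float(np.mean(x[:t]))
    return prediction, t, t   # predict the mean of x[t : 2*t]
```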
@inproceedings{QV19,
title = "A Theory of Selective Prediction",
author = "Mingda Qiao and Gregory Valiant",
booktitle="COLT",
year = "2019",
}
We consider the problem of estimating how well a model class is capable of fitting a distribution of labeled data. We show that it is often possible to accurately estimate this "learnability" even when given an amount of data that is too small to reliably learn any accurate model. Our first result applies to the setting where the data is drawn from a d-dimensional distribution with isotropic covariance (or known covariance), and the label of each datapoint is an arbitrary noisy function of the datapoint. In this setting, we show that with O(sqrt(d)) samples, one can accurately estimate the fraction of the variance of the label that can be explained via the best linear function of the data. In contrast to this sublinear sample size, finding an approximation of the best-fit linear function requires on the order of d samples. Our sublinear sample results and approach also extend to the non-isotropic setting, where the data distribution has an (unknown) arbitrary covariance matrix: we show that, if the label y of point x is a linear function of x plus independent noise, the variance of the noise can be estimated to error eps with O(d^(1-1/log(1/eps))) samples if the covariance matrix has bounded condition number, or O(d^(1-sqrt(eps))) samples if there are no bounds on the condition number. We also establish that these sample complexities are optimal, to constant factors. Finally, we extend these techniques to the setting of binary classification, where we obtain analogous sample complexities for the problem of estimating the prediction error of the best linear classifier, in a natural model of binary labeled data. We demonstrate the practical viability of our approaches on several real and synthetic datasets.
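One estimator of this flavor is an off-diagonal U-statistic that estimates the signal strength without ever fitting a regressor; a minimal sketch under the isotropic assumption (illustrative, not the paper's full estimator for the general case):

```python
import numpy as np

def signal_strength(X, y):
    """Off-diagonal U-statistic (1/(m(m-1))) sum_{i != j} y_i y_j <x_i, x_j>,
    an unbiased estimate of ||E[xy]||^2, which equals ||beta||^2 when x is
    isotropic and y = <x, beta> + independent noise."""
    m = len(y)
    G = (X @ X.T) * np.outer(y, y)
    return (G.sum() - np.trace(G)) / (m * (m - 1))

# Explained-variance estimate: signal_strength(X, y) / np.mean(y ** 2),
# meaningful even when the number of samples m is far smaller than d.
```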
@inproceedings{KV18,
title = "Estimating Learnability in the Sublinear Data Regime",
author = "Weihao Kong and Gregory Valiant",
booktitle="NeurIPS",
year = "2018",
}
Given the apparent difficulty of learning models that are robust to adversarial perturbations, we propose tackling the simpler problem of developing adversarially robust features. Specifically, given a dataset and metric of interest, the goal is to return a function (or multiple functions) that 1) is robust to adversarial perturbations, and 2) has significant variation across the datapoints. We establish strong connections between adversarially robust features and a natural spectral property of the geometry of the dataset and metric of interest. This connection can be leveraged to provide both robust features, and a lower bound on the robustness of any function that has significant variance across the dataset. Finally, we provide empirical evidence that the adversarially robust features given by this spectral approach can be fruitfully leveraged to learn a robust (and accurate) model.
@inproceedings{GSZV18,
title = "A Spectral View of Adversarially Robust Features",
author = "Shivam Garg, Vatsal Sharan, Brian Hu Zhang, and Gregory Valiant",
booktitle="NeurIPS",
year = "2018",
}
We consider the problem of predicting the next observation given a sequence of past observations, and consider the extent to which accurate prediction requires complex algorithms that explicitly leverage long-range dependencies. Perhaps surprisingly, our positive results show that for a broad class of sequences, there is an algorithm that predicts well on average, and bases its predictions only on the most recent few observations, together with a set of simple summary statistics of the past observations. Specifically, we show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by $I$, then a simple Markov model over the most recent $I/\epsilon$ observations obtains expected KL error $\epsilon$---and hence $\ell_1$ error $\sqrt{\epsilon}$---with respect to the optimal predictor that has access to the entire past and knows the data generating distribution. For a Hidden Markov Model with $n$ hidden states, $I$ is bounded by $\log n$, a quantity that does not depend on the mixing time, and we show that the trivial prediction algorithm based on the empirical frequencies of length $O(\log n/\epsilon)$ windows of observations achieves this error, provided the length of the sequence is $d^{\Omega(\log n/\epsilon)}$, where $d$ is the size of the observation alphabet. We also establish that this result cannot be improved upon, even for the class of HMMs, in the following two senses: First, for HMMs with $n$ hidden states, a window length of $\log n/\epsilon$ is information-theoretically necessary to achieve expected KL error $\epsilon$, or $\ell_1$ error $\sqrt{\epsilon}$. Second, the $d^{\Theta(\log n/\epsilon)}$ samples required to accurately estimate the Markov model when observations are drawn from an alphabet of size $d$ is necessary for any computationally tractable learning/prediction algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
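The trivial prediction algorithm referenced above is just empirical frequencies of length-k windows; a minimal sketch:

```python
from collections import Counter, defaultdict

def window_predictor(seq, k):
    """Empirical-frequency Markov predictor over length-k windows."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        counts[tuple(seq[i:i + k])][seq[i + k]] += 1

    def predict(context):
        c = counts[tuple(context[-k:])]
        total = sum(c.values())
        return {s: v / total for s, v in c.items()} if total else {}

    return predict

predict = window_predictor("abcabcabdabc", k=2)
print(predict("ab"))   # {'c': 0.75, 'd': 0.25}
```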
@inproceedings{SKLV18,
title = "Prediction with a Short Memory",
author = "Vatsal Sharan, Sham Kakade, Percy Liang, and Gregory Valiant",
booktitle="STOC",
year = "2018",
}
We consider a simple model of unreliable or crowdsourced data where there is an underlying set of $n$ binary variables, each evaluator contributes a (possibly unreliable or adversarial) estimate of the values of some subset of $r$ of the variables, and the learner is given the true value of a constant number of variables. We show that, provided an $\alpha$-fraction of the evaluators are ``good'' (either correct, or with independent noise rate $p < 1/2$), then the true values of a $(1-\eps)$ fraction of the $n$ underlying variables can be deduced as long as $r > \log_{2-2p}(1/\alpha)$. For example, if the fraction of ``good'' evaluators is larger than $1/16$ and there is no noise in their responses, then accurate recovery is possible provided each worker evaluates a random set of $4$ items. This result is optimal in that if $r \leq \log_{2-2p}(1/\alpha)$, the large dataset can contain no information. This setting can be viewed as an instance of the "semi-verified" learning model introduced in [CSV17], which explores the tradeoff between the number of items evaluated by each worker and the fraction of "good" evaluators. In the standard adversarial setting, our algorithm requires $\tilde{O}\left(n^{\log_{2-2p}(1/\alpha)}\right)$ evaluators. However, the algorithm runs in near linear time, $\tilde{O}_{r,\eps}(n)$, and hence would require only a near-linear number of evaluations in the weaker model in which the adversary's responses to each $r$-tuple of items are independent of the set of evaluations collected. These settings and results can also be viewed as examining a general class of semi-adversarial CSPs with a planted assignment. This extreme parameter regime, where the fraction of reliable data is small (inverse exponential in the amount of data provided by each source), is relevant to a number of practical settings. For example, consider the setting where one collects a dataset on customer preferences, with each customer specifying preferences for a small (constant) number of items, and where the goal is to ascertain the preferences of a specific demographic of interest. Our results show that this large dataset (which lacks demographic information) can be leveraged, together with the preferences of the demographic of interest for a constant (polynomial in $1/\alpha$ but independent of $n$) number of randomly selected items, to recover an accurate estimate of the entire set of preferences, even if the fraction of the original dataset contributed by the demographic of interest is inverse exponential in the number of preferences supplied by each customer. In this sense, our results can be viewed as a "data prism" allowing one to extract the behavior of specific cohorts from a large, mixed dataset.
@inproceedings{MV18,
title = "A Data Prism: Semi-Verified Learning in the Small-alpha Regime",
author = "Michela Meister and Gregory Valiant",
booktitle="Conference on Learning Theory (COLT)",
year = "2018",
}
We consider the problem of approximating the set of eigenvalues of the covariance matrix of a multivariate distribution (equivalently, the problem of approximating the ``population spectrum''), given access to samples drawn from the distribution. We consider this recovery problem in the regime where the sample size is comparable to, or even sublinear in, the dimensionality of the distribution. First, we propose a theoretically optimal and computationally efficient algorithm for recovering the moments of the eigenvalues of the population covariance matrix. We then leverage this accurate moment recovery, via a Wasserstein distance argument, to accurately reconstruct the vector of eigenvalues. Together, this yields an eigenvalue reconstruction algorithm that is asymptotically consistent as the dimensionality of the distribution and sample size tend towards infinity, even in the sublinear sample regime where the ratio of the sample size to the dimensionality tends to zero. In addition to our theoretical results, we show that our approach performs well in practice for a broad range of distributions and sample sizes.
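The moment-recovery step can be illustrated for the first two spectral moments using mean-zero data: E||x||^2 = tr(Sigma) and, for independent samples, E<x_i, x_j>^2 = tr(Sigma^2). A sketch (the paper's estimator extends this to higher moments and attaches the Wasserstein argument):

```python
import numpy as np

def spectral_moments(X):
    """Unbiased estimates of tr(Sigma) and tr(Sigma^2) from mean-zero data X
    (rows are samples), using only off-diagonal cross terms for tr(Sigma^2)."""
    m = X.shape[0]
    G = X @ X.T
    tr1 = np.trace(G) / m                            # E ||x||^2    = tr(Sigma)
    G2 = G ** 2
    tr2 = (G2.sum() - np.trace(G2)) / (m * (m - 1))  # E <xi,xj>^2  = tr(Sigma^2)
    return tr1, tr2
```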
@article{KV16,
title = "Spectrum Estimation from Samples",
author = "W. Kong and G. Valiant",
journal="Annals of Statistics",
volume = "45",
number = "5",
pages="2218--2247",
year = "2017",
}
The vast majority of theoretical results in machine learning and statistics assume that the available training data is a reasonably reliable reflection of the phenomena to be learned or estimated. Similarly, the majority of machine learning and statistical techniques used in practice are brittle to the presence of large amounts of biased or malicious data. In this work we propose two novel frameworks in which to study estimation, learning, and optimization in the presence of significant fractions of arbitrary data. The first framework, which we term list-decodable learning, asks whether it is possible to return a list of answers, with the guarantee that at least one of them is accurate. For example, given a dataset of n points for which an unknown subset of \alpha n points are drawn from a distribution of interest, and no assumptions are made about the remaining (1 - \alpha)n points, is it possible to return a list of poly(1/\alpha) answers, one of which is correct? The second framework, which we term the semi-verified learning model, considers the extent to which a small dataset of trusted data (drawn from the distribution in question) can be leveraged to enable the accurate extraction of information from a much larger but untrusted dataset (of which only an \alpha-fraction is drawn from the distribution). We show strong positive results in both settings, and provide an algorithm for robust learning in a very general stochastic optimization setting. This general result has immediate implications for robust estimation in a number of settings, including for robustly estimating the mean of distributions with bounded second moments, robustly learning mixtures of such distributions, and robustly finding planted partitions in random graphs in which significant portions of the graph have been perturbed by an adversary.
@inproceedings{SVC16b,
title = "Learning from Untrusted Data",
author = "M. Charikar, J. Steinhardt, and G. Valiant",
booktitle="STOC",
year = "2017",
}
We consider a crowdsourcing model in which n workers are asked to rate the quality of n items previously generated by other workers. An unknown set of alpha*n workers generate reliable ratings, while the remaining workers may behave arbitrarily and possibly adversarially. The manager of the experiment can also manually evaluate the quality of a small number of items, and wishes to curate together almost all of the high-quality items with at most an eps fraction of low-quality items. Perhaps surprisingly, we show that this is possible with an amount of work required of the manager, and each worker, that does not scale with n: the dataset can be curated with O(1/(beta alpha^3 eps^4)) ratings per worker and O(1/(beta eps^2)) ratings by the manager, where beta is the fraction of high-quality items. Our results extend to the more general setting of peer prediction, including peer grading in online classrooms.
@inproceedings{SVC16,
title = "Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction",
author = "J. Steinhardt, G. Valiant, and M. Charikar",
booktitle="NIPS (to appear)",
year = "2016",
}
If a concept class can be represented with a certain amount of memory, can it be efficiently learned with the same amount of memory? What concepts can be efficiently learned by algorithms that extract only a few bits of information from each example? We introduce a formal framework for studying these questions, and investigate the relationship between the fundamental resources of memory or communication and the sample complexity of the learning task. We relate our memory-bounded and communication-bounded learning models to the well-studied statistical query model. This connection can be leveraged to obtain both upper and lower bounds: we show strong lower bounds on learning parity functions with bounded communication (for example, that any multi-party, multi-round communication protocol for learning a parity function in which each party is given at most n/4 examples, and each party can communicate at most n/16 bits over the course of the protocol, requires an exponential number of parties), as well as the first upper bounds on solving generic sparse linear regression problems with limited memory.
@inproceedings{SVW16,
title = "Memory, Communication, and Statistical Queries",
author = "J. Steinhardt, G. Valiant, and S. Wager",
booktitle="Conference on Learning Theory (COLT)",
year = "2016",
}
We consider the following basic learning task: given independent draws from an unknown distribution over a discrete support, output an approximation of the distribution that is as accurate as possible in L1 distance (equivalently, total variation distance, or ``statistical distance''). Perhaps surprisingly, it is often possible to ``de-noise'' the empirical distribution of the samples to return an approximation of the true distribution that is significantly more accurate than the empirical distribution, without relying on any prior assumptions on the distribution. We present an "instance optimal" learning algorithm which optimally performs this de-noising for every distribution for which such a de-noising is possible. More formally, given n independent draws from a distribution p, our algorithm returns a labelled vector whose expected distance from p is equal to the minimum possible expected error that could be obtained by any algorithm that knows the true unlabeled vector of probabilities of distribution p and simply needs to assign labels, up to an additive subconstant term that is independent of p and goes to zero as n gets large. This somewhat surprising result has several conceptual implications, including the fact that, for any large sample, Bayesian assumptions on the ``shape'' or bounds on the tail probabilities of a distribution over discrete support are not helpful for the task of learning the distribution. As a consequence of our techniques, we also show that given a set of n samples from an arbitrary distribution, one can accurately estimate the expected number of distinct elements that will be observed in a sample of any size up to n log n. This sort of extrapolation is practically relevant, particularly to domains such as genomics where it is important to understand how much more might be discovered given larger sample sizes, and we are optimistic that our approach is practically viable.
@inproceedings{VV15,
title = "Instance Optimal Learning of Discrete Distributions",
author = "Gregory Valiant and Paul Valiant",
booktitle="STOC, 2016 (to appear)",
year = "2016",
}
We consider the problem of closeness testing for two discrete distributions in the practically relevant setting of unequal sized samples drawn from each of them. Specifically, given a target error parameter eps > 0, m1 independent draws from an unknown distribution p, and m2 draws from an unknown distribution q, we describe a test for distinguishing the case that p=q from the case that ||p-q||_1 > eps. If p and q are supported on at most n elements, then our test is successful with high probability provided m1 > n^(2/3)/eps^(4/3) and m2 = Omega(max[n/(eps^2 sqrt(m1)), sqrt(n)/eps^2]); we show that this tradeoff is optimal throughout this range, to constant factors. These results extend the recent work of Chan et al., who established the sample complexity when the two samples have equal sizes, and tighten the results of Acharya et al. by polynomial factors in both n and eps. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on n states, up to a log n factor, that uses O(n^(3/2) t_mix) queries to a "next node" oracle, improving upon the O(n^(5/3) t_mix) query algorithm of Batu et al. Finally, we note that the core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic data and on natural language data.
@inproceedings{BV15,
title = "Instance Optimal Learning of Discrete Distributions",
author = "Bhaswar Bhattacharya and Gregory Valiant",
booktitle="NIPS",
year = "2015",
}
We consider the problem of verifying the identity of a distribution: Given the description of a distribution, p, over a discrete support, how many samples (independent draws) must one obtain from an unknown distribution, q, to distinguish, with high probability, the case that p=q from the case that the total variation distance (L1 distance) is at least eps? We resolve this question, up to constant factors, on an instance by instance basis: there exist universal constants c, c' and a function f(p,eps) on distributions and error parameters, such that our tester distinguishes the two cases using f(p,eps) samples with success probability 2/3, but no tester can distinguish the case that p=q from the case that the total variation distance is at least c*eps when given c'*f(p,eps) samples. The function f(p,eps) is upper bounded by the 2/3-norm of p divided by eps^2, but is more complicated. This result significantly generalizes and tightens previous results: since distributions of support at most n have 2/3-norm bounded by sqrt(n), this result immediately shows that for such distributions, O(sqrt(n)/eps^2) samples suffice, matching the (tight) results for the case that p is the uniform distribution over support n.
The analysis of our very simple testing algorithm involves several hairy inequalities. To facilitate this analysis, we give a complete characterization of a general class of inequalities---generalizing Cauchy-Schwarz, Holder's inequality, and the monotonicity of L_p norms. Our characterization is of a, perhaps, non-traditional nature in that it uses linear programming to compute a derivation that may otherwise have to be sought through trial and error, by hand. We do not believe such a characterization has appeared in the literature, and hope its computational nature will facilitate future analyses like the one in this work.
@inproceedings{VV14,
title = "An Automatic Inequality Prover and Instance Optimal Identity Testing",
author = "Gregory Valiant and Paul Valiant",
booktitle="FOCS",
year = "2014",
}
We study the question of closeness testing for two discrete distributions. More precisely, given samples from two distributions p and q over an n-element set, we wish to distinguish whether p=q versus p is at least eps-far from q in either L1 or L2 distance. Batu et al. gave the first sub-linear time algorithms for these problems, which matched the lower bounds of Valiant up to a logarithmic factor in n and a polynomial factor in eps. In this work, we present simple (and new) testers for both the L1 and L2 settings, with sample complexity that is information-theoretically optimal, to constant factors, both in the dependence on n and in the dependence on eps; for the L1 testing problem we establish that the sample complexity is Theta(max[n^(2/3)/eps^(4/3), n^(1/2)/eps^2]).
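The L1 tester is built around a simple statistic of the empirical counts; up to Poissonization and thresholding details (omitted here), it takes the following form:

```python
import numpy as np

def closeness_statistic(x_counts, y_counts):
    """sum_i ((X_i - Y_i)^2 - X_i - Y_i) / (X_i + Y_i), over symbols seen
    in at least one sample; small values are consistent with p = q."""
    x = np.asarray(x_counts, dtype=float)
    y = np.asarray(y_counts, dtype=float)
    s = x + y
    seen = s > 0
    return float(np.sum(((x - y) ** 2 - x - y)[seen] / s[seen]))
```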
@inproceedings{CDVV14,
title = "Optimal algorithms for testing closeness of discrete distributions",
author = "Siu-On Chan and Ilias Diakonikolas and Gregory Valiant and Paul Valiant ",
booktitle="SODA (to appear)",
year = "2014",
}
We study the question of learning a sparse multivariate polynomial over a real domain. In particular, for an unknown polynomial f(x) of degree d with k monomials, we show how to reconstruct f given only a set of examples drawn uniformly from the n-dimensional cube (or an n-dimensional Gaussian distribution), together with their evaluations. This result holds even in the noisy setting. The runtime of our algorithm is polynomial in n, k, and a function that depends only on d and the desired accuracy. Note that, in contrast, the Boolean version of this problem, where x is drawn from the hypercube, is at least as hard as learning parity with noise, for which we do not know how to break the n^O(d) barrier.
@inproceedings{APVZ14,
title = "Learning sparse polynomial functions",
author = "Alexandr Andoni and Rina Panigrahy and Gregory Valiant and Li Zhang",
booktitle="SODA",
year = "2014",
}
Recently, we showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n/log n). In this work, we propose a novel modification of the previous approach and show: 1) theoretically, the estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, and for a wide range of parameters. Perhaps unsurprisingly, the key step in this approach is to first use the sample to characterize the ``unseen'' portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the shape of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems.
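For contrast with the Good-Turing scheme mentioned above, which estimates only the total unseen mass from the sample's fingerprint (the paper's estimator recovers the shape of the unseen portion), a minimal sketch:

```python
from collections import Counter

def fingerprint(samples):
    """F[j] = number of distinct elements observed exactly j times."""
    return Counter(Counter(samples).values())

def good_turing_missing_mass(samples):
    """Good-Turing estimate of the total probability of unseen elements:
    (# elements seen exactly once) / (sample size)."""
    return fingerprint(samples)[1] / len(samples)
```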
@inproceedings{VV13a,
title = "Estimating the unseen: improved estimators for entropy and other properties",
author = "Gregory Valiant and Paul Valiant",
booktitle="NIPS",
year = "2013",
}
Given a set of n random d-dimensional Boolean vectors with the promise that they are chosen uniformly at random with the exception of two vectors that have Pearson correlation r>0 with each other, how quickly can one find the two correlated vectors? We present an algorithm which, for any constant r > 0, runs in (expected) time dn^((5-w)/(4-w)) < dn^(1.62), where w < 2.4 is the exponent of matrix multiplication. Provided the dimensionality, d, is sufficiently large, this exponent can be further reduced. These are the first subquadratic-time algorithms for this problem for which r does not appear in the exponent of n, improving upon the O(dn^(2 - O(r))) runtime given by Paturi et al., the `Locality Sensitive Hashing' approach of Indyk et al., and the `Bucketing Codes' approach of Dubiner.
Applications and extensions of this basic algorithm yield improved algorithms for several other problems, including the approximate closest pair problem (in both Hamming and Euclidean metrics), learning sparse parities with noise, and learning juntas both with and without noise.
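A simplified rendering of the aggregation idea (illustrative only: vectors summed within groups, one matrix product across groups; the paper's algorithm is more involved and achieves its exponent via fast matrix multiplication):

```python
import numpy as np

def find_correlated_groups(V, g):
    """Sum the +/-1 vectors within groups of size g (n must be divisible
    by g), take all pairwise inner products of the group sums with a single
    matrix product, and flag the outlier pair of groups; the correlated
    pair's contribution survives aggregation while sums of independent
    vectors concentrate."""
    n, d = V.shape
    sums = V.reshape(n // g, g, d).sum(axis=1)
    Z = sums @ sums.T              # one (fast) matrix multiplication
    np.fill_diagonal(Z, 0)
    i, j = np.unravel_index(np.abs(Z).argmax(), Z.shape)
    return i, j                    # then search within groups i and j
```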
@article{V2015,
author = {Valiant, Gregory},
title = {Finding Correlations in Subquadratic Time, with Applications to Learning Parities and the Closest Pair Problem},
journal = {J. ACM},
issue_date = {May 2015},
volume = {62},
number = {2},
month = may,
year = {2015},
pages = {13:1--13:45},
url = {http://doi.acm.org/10.1145/2728167},
doi = {10.1145/2728167},
acmid = {2728167},
}
For a broad class of practically relevant distribution properties, which includes entropy and support size, nearly all of the proposed estimators have an especially simple form. Given a set of independent samples from a discrete distribution, these estimators tally the vector of summary statistics---the number of domain elements seen once, twice, etc. in the sample---and output the dot product between these summary statistics and a fixed vector of coefficients. We term such estimators linear. This historical proclivity towards linear estimators is slightly perplexing, since, despite many efforts over nearly 60 years, all proposed such estimators have significantly suboptimal convergence compared to the bounds shown in [Valiant and Valiant, 2011].
Our main result, in some sense vindicating this insistence on linear estimators, is that for any property in this broad class, there exists a near-optimal linear estimator. Additionally, we give a practical and polynomial-time algorithm for constructing such estimators for any given parameters.
While this result does not yield explicit bounds on the sample complexities of these estimation tasks, we leverage the insights provided by this result to give explicit constructions of near-optimal linear estimators for three properties: entropy, L1 distance to uniformity, and, for pairs of distributions, L1 distance.
Our entropy estimator, when given O(n/c log n) independent samples from a distribution of support at most n, will estimate the entropy of the distribution to within accuracy c, with probability of failure o(1/poly(n)). From the recent lower bounds given in [Valiant and Valiant, 2011], this estimator is optimal, to constant factors, both in its dependence on n and in its dependence on c. In particular, the inverse-linear convergence rate of this estimator resolves the main open question of [Valiant and Valiant, 2011], which left open the possibility that the error decreased only with the square root of the number of samples.
Our distance to uniformity estimator, when given O(m/c^2 log m) independent samples from any distribution, returns a c-accurate estimate of the L1 distance to the uniform distribution of support m. This is the first sublinear-sample estimator for this problem, and is constant-factor optimal, for constant c.
Finally, our framework extends naturally to properties of pairs of distributions, including estimating the L1 distance and KL-divergence between pairs of distributions. We give an explicit linear estimator for estimating L1 distance to accuracy c using O(n/c^2 log n) samples from each distribution, which is constant-factor optimal, for constant c.
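The ``linear'' form defined in the first paragraph above is a one-liner over the fingerprint of the sample; in this sketch the coefficient vector is a placeholder argument, not the paper's near-optimal construction:

```python
from collections import Counter

def linear_estimate(samples, coeffs):
    """Dot product of the sample's fingerprint (number of elements seen
    once, twice, ...) with a fixed coefficient vector; coeffs maps the
    multiplicity j to its coefficient."""
    fp = Counter(Counter(samples).values())
    return sum(coeffs.get(j, 0.0) * fp[j] for j in fp)
```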
@inproceedings{linEstVV,
title = "The Power of Linear Estimators",
author = "Gregory Valiant and Paul Valiant",
booktitle="the 52nd Annual IEEE Symposium on the Foundations of Computer Science (FOCS)",
year = "2011",
}
We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample additive estimators for a class of properties that includes entropy and distribution support size. Together with the lower bounds proven in the companion paper [VV'10a], this settles the longstanding question of the sample complexities of these estimation problems (up to constant factors). Our algorithm estimates these properties up to an arbitrarily small additive constant, using O(n/log n) samples; [VV'10a] shows that no algorithm using o(n/log n) samples can achieve this (where n is a bound on the support size, or, in the case of estimating the support size, 1/n is a lower bound on the probability of any element of the domain). Previously, no explicit sublinear-sample algorithms for either of these problems were known. Additionally, our algorithm runs in time linear in the number of samples used.
@article{unseenVV10,
title = "Estimating the unseen: a sublinear-sample canonical estimator of distributions",
author = "Gregory Valiant and Paul Valiant",
year = "2010",
}
We prove two new multivariate central limit theorems; the first relates the sum of independent distributions to the multivariate Gaussian of corresponding mean and covariance, under the earthmover distance metric (also known as the Wasserstein metric). We leverage this central limit theorem to prove a stronger but more specific central limit theorem for ``generalized multinomial'' distributions, a large class of discrete distributions, parameterized by matrices, that generalize binomial and multinomial distributions and describe many distributions encountered in computer science. This central limit theorem relates a generalized multinomial distribution to a multivariate Gaussian distribution, discretized by rounding to the nearest lattice points. In contrast to the metric of our first central limit theorem, this bound is in terms of statistical distance, which immediately implies that any algorithm with input drawn from a generalized multinomial distribution behaves essentially as if the input were drawn from a discretized Gaussian with the same mean and covariance. Such tools in the multivariate setting are rare, and we hope this new tool will be of use to the community.
In the second part of the paper, we employ this central limit theorem to establish a lower bound of Omega(n/log n) on the sample complexity of additively estimating the entropy or support size of a distribution (where 1/n is a lower bound on the probability of any element in the domain). Together with the canonical estimator constructed in the companion paper [VV'10b], this settles the longstanding open question of the sample complexities of these estimation problems, up to constant factors. In particular, for any constant c, there is a pair of distributions D, D', each of whose elements occurs with probability at least 1/n, whose entropies satisfy H(D) - H(D') > c, and yet no algorithm on o(n/(c log n)) samples can distinguish D from D' with probability greater than 2/3; an analogous statement holds for the problem of estimating the support size. The previous lower bounds on these sample complexities were n/2^(Theta(sqrt(log n))), for constant c. We explicitly exhibit such a pair of distributions via a Laguerre polynomial construction that may be of independent interest.
@article{cltVV,
title = "A CLT and tight lower bounds for estimating entropy",
author = "Gregory Valiant and Paul Valiant",
year = "2010",
}
We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample estimators achieving arbitrarily small additive constant error for a class of properties that includes entropy and distribution support size. Additionally, we show new matching lower bounds. Together, this settles the longstanding question of the sample complexities of these estimation problems, up to constant factors.
@inproceedings{unseenVV,
title = "Estimating the Unseen: An n/log(n)-sample Estimator for Entropy and Support Size, Shown Optimal via New CLTs",
author = "Gregory Valiant and Paul Valiant",
booktitle="the 43rd ACM Symposium on Theory of Computing (STOC)",
year = "2011",
}
Given data drawn from a mixture of multivariate Gaussians, a basic problem is to accurately estimate the mixture parameters. We give an algorithm for this problem that has running time and data requirements polynomial in the dimension and the inverse of the desired accuracy, with provably minimal assumptions on the Gaussians. Our algorithm recovers parameters such that each recovered component is close in statistical distance to one of the original components--a condition that is stronger than simply additive approximations of the parameters. As simple consequences of our learning algorithm, we can perform near-optimal clustering of the sample points and density estimation for mixtures of k Gaussians, efficiently. The building blocks of our algorithm are based on the work of Kalai et al. [STOC 2010], which gives an efficient algorithm for learning mixtures of two Gaussians by considering a series of projections down to one dimension, and applying the method of moments to each univariate projection. A major technical hurdle in Kalai et al. is showing that one can efficiently learn univariate mixtures of two Gaussians. In contrast, because pathological scenarios can arise when considering univariate projections of mixtures of more than two Gaussians, the bulk of the work in this paper concerns how to leverage an algorithm for learning univariate mixtures (of many Gaussians) to yield an efficient algorithm for learning in high dimensions. Our algorithm employs hierarchical clustering and rescaling, together with delicate methods for backtracking and recovering from failures that can occur in our univariate algorithm. Finally, while the running time and data requirements of our algorithm depend exponentially on the number of Gaussians in the mixture, we prove that such a dependence is necessary.
@inproceedings{MV:many_gaussians,
title = "Settling the Polynomial Learnability of Mixtures of Gaussians",
author = "Ankur Moitra and Gregory Valiant",
booktitle = "the 51st Annual Symposium on the Foundations of Computer Science (FOCS)",
year = "2010",
}
Given data drawn from a mixture of multivariate Gaussians, a basic problem is to accurately estimate the mixture parameters. We provide a polynomial-time algorithm for this problem for the case of two Gaussians in n dimensions (even if they overlap), with provably minimal assumptions on the Gaussians, and polynomial data requirements. In statistical terms, our estimator converges at an inverse polynomial rate, and no such estimator (even exponential time) was known for this problem (even in one dimension). Our algorithm reduces the n-dimensional problem to the one-dimensional problem, where the method of moments is applied. One technical challenge is proving that noisy estimates of the first six moments of a univariate mixture suffice to recover accurate estimates of the mixture parameters, as conjectured by Pearson (1894), and in fact these estimates converge at an inverse polynomial rate. As a corollary, we can efficiently perform near-optimal clustering: in the case where the overlap between the Gaussians is small, one can accurately cluster the data, and when the Gaussians have partial overlap, one can still accurately cluster those data points which are not in the overlap region. A second consequence is a polynomial-time density estimation algorithm for arbitrary mixtures of two Gaussians, generalizing previous work on axis-aligned Gaussians (Feldman et al, 2006).
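For a runnable point of contrast, here is the standard EM heuristic for a univariate mixture of two Gaussians; this is plainly not the paper's method-of-moments algorithm and carries none of its provable guarantees:

```python
import numpy as np

def em_two_gaussians(x, iters=200, seed=0):
    """Standard EM for a univariate mixture of two Gaussians.  The shared
    1/sqrt(2*pi) factor cancels in the responsibilities, so it is omitted."""
    rng = np.random.default_rng(seed)
    w, mu = 0.5, rng.choice(x, 2)
    var = np.array([x.var(), x.var()])
    for _ in range(iters):
        # E-step: responsibility of component 0 for each point.
        p0 = w * np.exp(-(x - mu[0]) ** 2 / (2 * var[0])) / np.sqrt(var[0])
        p1 = (1 - w) * np.exp(-(x - mu[1]) ** 2 / (2 * var[1])) / np.sqrt(var[1])
        r = p0 / (p0 + p1 + 1e-300)
        # M-step: reweighted mixing weight, means, and variances.
        w = r.mean()
        mu = np.array([(r * x).sum() / r.sum(),
                       ((1 - r) * x).sum() / (1 - r).sum()])
        var = np.array([(r * (x - mu[0]) ** 2).sum() / r.sum(),
                        ((1 - r) * (x - mu[1]) ** 2).sum() / (1 - r).sum()])
    return w, mu, var
```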
@inproceedings{KMV:two_gaussians,
title = "Efficiently Learning Mixtures of Two Gaussians",
author = "Adam Tauman Kalai and Ankur Moitra and Gregory Valiant",
booktitle = "the 42nd ACM Symposium on Theory of Computing (STOC)",
year = "2010",
}
This paper provides new worst-case bounds for the size and treewidth of the result Q(D) of a conjunctive query Q applied to a database D. We derive bounds for the result size |Q(D)| in terms of structural properties of Q, both in the absence and in the presence of keys and functional dependencies. These bounds are based on a novel ``coloring'' of the query variables that associates a coloring number C(Q) to each query Q. Intuitively, each color used represents some possible entropy of that variable. Using this coloring number, we derive tight bounds for the size of Q(D) in case (i) no functional dependencies or keys are specified, and (ii) simple functional dependencies (keys) are given. These results generalize recent size bounds for join queries obtained by Atserias et al. [2008]. In the case of arbitrary (compound) functional dependencies, we use tools from information theory to provide lower and upper bounds, establishing a close connection between size bounds and a basic question in information theory. Our new coloring scheme also allows us to precisely characterize (both in the absence of keys and with simple keys) the treewidth-preserving queries---the queries for which the treewidth of the output relation is bounded by a function of the treewidth of the input database. Finally, we give some results on the computational complexity of determining the size bounds, and of deciding whether the treewidth is preserved.
This paper extends the work of Gottlob, Lee, and Valiant (PODS 2009)[GLV], and considers worst-case bounds for the size of the result Q(D) of a conjunctive query Q to a database D given an arbitrary set of functional dependencies. The bounds in [GLV] are based on a "coloring" of the query variables. In order to extend the previous bounds to the setting of arbitrary functional dependencies, we leverage tools from information theory to formalize the original intuition that each color used represents some possible entropy of that variable, and bound the maximum possible size increase via a linear program that seeks to maximize how much more entropy is in the result of the query than the input. This new view allows us to precisely characterize the entropy structure of worst-case instances for conjunctive queries with simple functional dependencies (keys), providing new insights into the results of [GLV]. We extend these results to the case of general functional dependencies, providing upper and lower bounds on the worst-case size increase. We identify the fundamental connection between the gap in these bounds and a central open question in information theory. Finally, we show that, while both the upper and lower bounds are given by exponentially large linear programs, one can distinguish in polynomial time whether the result of a query with an arbitrary set of functional dependencies can be any larger than the input database.
@misc{VV:size_bds,
title = "Size Bounds for Conjunctive Queries with General Functional Dependencies",
author = "Gregory Valiant and Paul Valiant",
year = "2010",
howpublished = "arXiv preprint, http://arxiv.org/abs/0909.2030v2",
}
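As an illustrative aside, the linear-programming view described above can be made concrete on a toy instance. The sketch below (our own code; the triangle-query instance and all names are ours, not the paper's) maximizes the entropy of the query result subject to Shannon-type inequalities and unit log-size constraints on the input relations, recovering the N^(3/2) worst-case bound for the triangle join.

from itertools import combinations
import numpy as np
from scipy.optimize import linprog

# Nonempty subsets of the attributes {A, B, C}; variable h(S) is the
# (normalized) entropy of attribute set S, with h(empty) = 0 left implicit.
subsets = [frozenset(c) for k in (1, 2, 3) for c in combinations("ABC", k)]
idx = {s: i for i, s in enumerate(subsets)}

A_ub, b_ub = [], []
def leq(lhs, rhs):
    # encode: sum of h over lhs  <=  sum of h over rhs
    row = np.zeros(len(subsets))
    for s in lhs: row[idx[s]] += 1.0
    for s in rhs: row[idx[s]] -= 1.0
    A_ub.append(row); b_ub.append(0.0)

for X in subsets:
    for Y in subsets:
        if X < Y:
            leq([X], [Y])  # monotonicity: h(X) <= h(Y) for X subset of Y
        if not (X <= Y or Y <= X):
            common = [X & Y] if X & Y else []
            leq([X | Y] + common, [X, Y])  # submodularity

for rel in (frozenset("AB"), frozenset("BC"), frozenset("AC")):
    row = np.zeros(len(subsets)); row[idx[rel]] = 1.0
    A_ub.append(row); b_ub.append(1.0)  # each input relation has log-size <= 1
    # (a key/FD X -> Y would additionally force h(X union Y) = h(X))

c = np.zeros(len(subsets)); c[idx[frozenset("ABC")]] = -1.0  # maximize h(ABC)
res = linprog(c, A_ub=np.vstack(A_ub), b_ub=b_ub,
              bounds=[(0, None)] * len(subsets))
print(-res.fun)  # 1.5: the worst-case triangle-join size is N^(3/2)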
This paper provides new worst-case bounds for the size and treewidth of the result Q(D) of a conjunctive query Q applied to a database D. We derive bounds for the result size |Q(D)| in terms of structural properties of Q, both in the absence and in the presence of keys and functional dependencies. These bounds are based on a novel "coloring" of the query variables that associates a coloring number C(Q) to each query Q. Using this coloring number, we derive tight bounds for the size of Q(D) in case (i) no functional dependencies or keys are specified, and (ii) simple (one-attribute) keys are given. These results generalize recent size bounds for join queries obtained by Atserias, Grohe, and Marx (FOCS 2008). An extension of our coloring technique also gives a lower bound for |Q(D)| in the general setting of a query with arbitrary functional dependencies. Our new coloring scheme also allows us to precisely characterize (both in the absence of keys and with simple keys) the treewidth-preserving queries: the queries for which the output treewidth is bounded by a function of the input treewidth. Finally, we characterize the queries that preserve the sparsity of the input in the general setting with arbitrary functional dependencies.
@inproceedings{GLV:size_bounds,
title = "Size and Treewidth Bounds for Conjunctive Queries",
author = "Georg Gottlob and Stephanie Tien Lee and Gregory Valiant",
year = "2009",
booktitle = "PODS 2009",
}
We present a new framework for auction design and analysis that we term "best-response auctions". We use this framework to show that the simple and myopic best-response dynamics converge to the VCG outcome and are incentive compatible in several well-studied auction environments (Generalized Second Price auctions, and auctions with unit-demand bidders). Thus, we establish that in these environments, given that all other bidders are repeatedly best-responding, the best course of action for a bidder is to also repeatedly best-respond. Our results generalize classical results in economics regarding convergence to equilibrium and incentive compatibility of ascending-price English auctions. In addition, our findings provide new game-theoretic justifications for some well-studied auction rules. Best-response auctions provide a way to bridge the gap between the full-information equilibrium concept and the usual private-information auction theory.
@inproceedings{NSVZ:BRA,
title = "Best-Response Auctions",
author = "Noam Nisan and Michael Schapira and Gregory Valiant and Aviv Zohar",
booktitle = "EC 2011.",
}
We revisit the price of anarchy in network routing, in a new model in which routing decisions are made by self-interested components of the network, as opposed to by the flows as in the work of Roughgarden and Tardos [RT02]. This significant departure from previous work on the problem seeks to model Internet routing more accurately. We propose two models: the latency model, in which the network edges seek to minimize the average latency of the flow through them on the basis of knowledge of latency conditions in the whole network, and the pricing model, in which network edges advertise pricing schemes to their neighbors and seek to maximize their profit. We show two rather surprising results: the price of stability in the latency model is unbounded, even with linear latencies (as compared with 4/3 in [RT02] for the case in which routing decisions are made by the flows themselves). However, in the pricing model, in which edges advertise pricing schemes (how the price varies as a function of the total amount of flow), we show that, under a condition ruling out monopolistic situations, all Nash equilibria have optimal flows; that is, the price of anarchy in this model is one, in the case of linear latencies with no constant term.
@inproceedings{PV:selfish_routers,
title = "A New Look at Selfish Routing",
author = "Christos Papadimitriou and Gregory Valiant",
year = "2010",
booktitle = "Innovations in Computer Science (ICS2010)",
}
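As an illustrative aside on the 4/3 benchmark from [RT02] cited above: the following self-contained sketch (our own code, not from the paper) computes the equilibrium and optimal flows in Pigou's two-link example with linear latencies, where the price of anarchy is exactly 4/3.

from scipy.optimize import minimize_scalar

# Pigou's network: one unit of flow from s to t over two parallel links.
# Link 1 has latency l1(x) = x; link 2 has constant latency l2(x) = 1.
total_cost = lambda x: x * x + (1 - x) * 1.0   # x = flow on link 1

# Wardrop equilibrium: every user takes the (weakly) cheaper link, so all
# flow piles onto link 1 (l1(1) = 1 = l2), giving total cost 1.
eq_cost = total_cost(1.0)

# Social optimum: minimize total cost over the split x in [0, 1].
opt = minimize_scalar(total_cost, bounds=(0.0, 1.0), method="bounded")
print("price of anarchy:", eq_cost / opt.fun)   # 1 / (3/4) = 4/3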
Under many protocols (in computerized settings and in economic settings) participants repeatedly "best respond" to each others' actions until the system "converges" to an equilibrium point. We ask when does such myopic "local rationality" imply "global rationality", i.e., when is it best for a player, given that the others are repeatedly best-responding, to also repeatedly best-respond? We exhibit a class of games where this is indeed the case. We identify several environments of interest that fall within our class: models of the Border Gateway Protocol (BGP), which handles routing on the Internet, and of the Transmission Control Protocol (TCP), as well as stable-roommates and cost-sharing games, which have been extensively studied in economic theory.
@inproceedings{NSVZ:best_reply,
title = "Best-Response Mechanisms",
author = "Noam Nisan and Michael Schapira and Gregory Valiant and Aviv Zohar",
booktitle = "Innovations in Computer Science (ICS 2011), 2011",
}
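As an illustrative aside (our own sketch; the example game is a prisoner's dilemma, not one of the paper's environments): repeated best-response dynamics in a finite two-player game, of the kind whose convergence and incentive properties are studied above.

import numpy as np

# Repeated best-response dynamics in a two-player matrix game.
# A[i, j] is the row player's payoff, B[i, j] the column player's.
A = np.array([[3, 0], [5, 1]])
B = np.array([[3, 5], [0, 1]])

i, j = 0, 0  # arbitrary initial action profile
for step in range(100):
    ni = int(np.argmax(A[:, j]))   # row player's best response to j
    nj = int(np.argmax(B[ni, :]))  # column player's best response to ni
    if (ni, nj) == (i, j):
        print(f"converged to pure equilibrium {(i, j)} after {step} rounds")
        break
    i, j = ni, nj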
Can learning algorithms find a Nash equilibrium? This is a natural question for several reasons. Learning algorithms resemble the behavior of players in many naturally arising games, and thus results on the convergence or non-convergence properties of such dynamics may inform our understanding of the applicability of Nash equilibria as a plausible solution concept in some settings. A second reason for asking this question is in the hope of being able to prove an impossibility result, not dependent on complexity assumptions, for computing Nash equilibria via a restricted class of reasonable algorithms. In this work, we begin to answer this question by considering the dynamics of the standard multiplicative weights update learning algorithm (which is known to converge to a Nash equilibrium for zero-sum games). We revisit the 3x3 game that Shapley defined in the 1950s to establish that fictitious play does not converge in general games. For this simple game, we show via a potential function argument that in a variety of settings the multiplicative weights algorithm impressively fails to find the unique Nash equilibrium, in that the cumulative distributions of play produced by the learning dynamics actually drift away from the equilibrium.
@inproceedings{DFPPV10,
title = "On Learning Algorithms for Nash Equilibria",
author = "C. Daskalakis and R. Frongillo and C. H. Papadimitriou and G. Pierrakos and G. Valiant",
year = "2010",
booktitle = "SAGT",
}
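As an illustrative aside, the drift away from equilibrium can be observed in simulation. The sketch below (our own code, using the standard bimatrix form of Shapley's game; it demonstrates rather than proves the phenomenon) runs multiplicative weights for both players and prints how far the time-averaged play is from the unique equilibrium (1/3, 1/3, 1/3).

import numpy as np

# Shapley's 3x3 bimatrix game; its unique Nash equilibrium is uniform play.
A = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])  # row payoffs
B = np.array([[0., 0., 1.], [1., 0., 0.], [0., 1., 0.]])  # column payoffs

eta = 0.1
wx, wy = np.ones(3), np.ones(3)          # multiplicative weights
cum_x, cum_y = np.zeros(3), np.zeros(3)  # cumulative play
for t in range(1, 200001):
    x, y = wx / wx.sum(), wy / wy.sum()
    cum_x += x; cum_y += y
    wx *= np.exp(eta * (A @ y))    # reweight by exponentiated expected payoff
    wy *= np.exp(eta * (B.T @ x))
    wx /= wx.max(); wy /= wy.max()  # rescale to avoid numerical overflow
    if t % 50000 == 0:
        # deviation of time-averaged play from (1/3, 1/3, 1/3); in our runs
        # this stays bounded away from zero rather than vanishing
        print(t, np.round(np.abs(cum_x / t - 1/3).max(), 3))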
In light of much recent interest in finding a model of multi-player multi-action games that allows for efficient computation of Nash equilibria yet remains as expressive as possible, we investigate the computational complexity of Nash equilibria in the recently proposed model of action-graph games (AGGs). AGGs, introduced by Bhat and Leyton-Brown, are succinct representations of games that encapsulate both local dependencies as in graphical games, and partial indifference to other agents' identities as in anonymous games, which occur in many natural settings such as financial markets. This is achieved by specifying a graph on the set of actions, so that the payoff of an agent for selecting a strategy depends only on the number of agents playing each of the neighboring strategies in the action graph. We present a simple Fully Polynomial Time Approximation Scheme for computing mixed Nash equilibria of AGGs with constant degree, constant treewidth and a constant number of agent types (but an arbitrary number of strategies), and extend this algorithm to a broader set of instances. However, the main results of this paper are negative, showing that when either of the latter conditions is relaxed the problem becomes intractable. In particular, we show that even if the action graph is a tree but the number of agent types is unconstrained, it is NP-complete to decide the existence of a pure-strategy Nash equilibrium and PPAD-complete to compute a mixed Nash equilibrium (even an approximate one); the same hardness results hold for AGGs with a constant number of agent types but unconstrained treewidth. These hardness results suggest that, in some sense, our FPTAS is as strong a positive result as one can expect. In the broader context of trying to pin down the boundary where the equilibria of multi-player games can be computed efficiently, these results complement recent hardness results for graphical games and algorithmic results for anonymous games.
@inproceedings{DSVV:actiongraph08,
title = "On the Complexity of Nash Equilibria of Action-Graph Games",
author = "Constantinos Daskalakis and Grant Schoenebeck and Gregory Valiant and Paul Valiant",
year = "2009",
booktitle = "SODA",
}
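As an illustrative aside on the representation itself (our own minimal encoding; all names and payoff numbers are ours): in an action-graph game an agent's payoff for an action depends only on how many agents play each neighboring action, so payoffs can be stored per action as a function of neighbor counts rather than of full strategy profiles.

from collections import Counter

# Action graph on actions {a, b, c}: the payoff of each action depends only
# on the counts of players on its neighbors (including itself here).
neighbors = {"a": ["a", "b"], "b": ["b", "c"], "c": ["c", "a"]}

def utility(action, counts):
    # Illustrative congestion-style payoff, decreasing in neighbor loads;
    # a real AGG would specify a full table per action.
    return 10.0 - sum(counts[n] for n in neighbors[action])

def payoff(action, profile):
    counts = Counter(profile)  # action -> number of agents playing it
    return utility(action, counts)

profile = ["a", "a", "b", "c", "c"]  # actions of 5 agents
print(payoff("a", profile))          # depends only on counts on {a, b}: 7.0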
Designing and deploying a network protocol determines the rules by which end users interact with each other and with the network. We consider the problem of designing a protocol to optimize the equilibrium behavior of a network with selfish users. Specifically, we consider network cost-sharing games, where the set of Nash equilibria depends fundamentally on the choice of an edge cost-sharing protocol. Previous research focused on the Shapley protocol, in which the cost of each edge is shared equally among its users. We systematically study the design of optimal cost-sharing protocols for undirected and directed graphs, single-sink and multicommodity networks, and different measures of the inefficiency of equilibria. Our primary technical tool is a precise characterization of the cost-sharing protocols that induce only network games with pure-strategy Nash equilibria. We use this characterization to prove, among other results, that the Shapley protocol is optimal in directed graphs and that simple priority protocols are essentially optimal in undirected graphs.
@article{CRV:good_equilibria,
author = "Ho-Lin Chen and Tim Roughgarden and Gregory Valiant",
title = "Designing Network Protocols for Good Equilibria",
journal="SIAM Journal on Computing",
volume="39",
number ="5",
pages="1799--1832",
year = "2010",
}
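As an illustrative aside (our own sketch; names are ours): the Shapley protocol referenced above splits each edge's cost equally among the players whose chosen paths use it.

from collections import Counter

# Shapley cost-sharing: each edge's cost is split equally among its users.
edge_cost = {"e1": 6.0, "e2": 2.0, "e3": 3.0}

def shapley_shares(paths):
    """paths: player -> list of edges; returns player -> total cost share."""
    users = Counter(e for path in paths.values() for e in path)
    return {p: sum(edge_cost[e] / users[e] for e in path)
            for p, path in paths.items()}

paths = {"p1": ["e1", "e2"], "p2": ["e1"], "p3": ["e3"]}
print(shapley_shares(paths))  # e1 is shared by p1 and p2: 3.0 each

Equal sharing makes the induced game a congestion game and hence a potential game, which is one route to the pure-strategy Nash equilibria that the paper's characterization concerns.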
Braess's Paradox is the counterintuitive but well-known fact that removing edges from a network with "selfish routing" can decrease the latency incurred by traffic in an equilibrium flow. Despite the large amount of research motivated by Braess's Paradox since its discovery in 1968, little is known about whether it is a common real-world phenomenon or a mere theoretical curiosity. In this paper, we show that Braess's Paradox is likely to occur in a natural random network model. More precisely, with high probability (as the number of vertices goes to infinity), there is a traffic rate and a set of edges whose removal improves the latency of traffic in an equilibrium flow by a constant factor. Our proof approach is robust and shows that the "global" behavior of an equilibrium flow in a large random network is similar to that in Braess's original four-node example.
@article{VR_2006,
author = "Gregory Valiant and Tim Roughgarden",
title = "Braess's Paradox in Large Random Graphs",
journal = "Random Structures and Algorithms",
volume="37",
number = "4",
pages ="495--515",
year = "2010",
}
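As an illustrative aside (our own code): Braess's original four-node example, computed explicitly. Adding a zero-latency shortcut raises the equilibrium latency from 3/2 to 2, so deleting that edge improves the equilibrium latency by a constant factor, exactly the phenomenon shown above to persist in large random networks.

# Braess's four-node example: one unit of flow from s to t.
# Path U: s->v (latency x) then v->t (latency 1).
# Path L: s->w (latency 1) then w->t (latency x).
# Optional shortcut v->w with latency 0 enables path Z: s->v->w->t.
def latencies(fU, fL, fZ):
    x_sv, x_wt = fU + fZ, fL + fZ  # loads on the variable-latency edges
    return x_sv + 1.0, 1.0 + x_wt, x_sv + x_wt  # latencies of U, L, Z

# Without the shortcut, the symmetric split is the equilibrium:
print(latencies(0.5, 0.5, 0.0)[:2])  # both used paths cost 1.5

# With the shortcut, routing everything on Z is the equilibrium: its latency
# (2.0) never exceeds that of U or L (also 2.0 at this profile).
print(latencies(0.0, 0.0, 1.0))      # (2.0, 2.0, 2.0)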
During undergrad I did some work in Planetary Science, studying Martian impact craters.
The geometry of simple impact craters reflects the properties of the target materials, and the diverse range of fluidized morphologies observed in Martian ejecta blankets are controlled by the near-surface composition and the climate at the time of impact. Using the Mars Orbiter Laser Altimeter (MOLA) data set, quantitative information about the strength of the upper crust and the dynamics of Martian ejecta blankets may be derived from crater geometry measurements. Here we present the results from geometrical measurements of fresh craters 3-50 km in rim diameter in selected highland (Lunae and Solis planitiae) and lowland (Acidalia, Isidis, and Utopia planitiae) terrains. We find large, resolved differences between the geometrical properties of the freshest highland and lowland craters. Simple lowland craters are 1.5-2.0 times deeper with >50% larger cavities compared to highland craters of the same diameter. Rim heights and the volume of material above the preimpact surface are slightly greater in the lowlands over most of the size range studied. The different shapes of simple highland and lowland craters indicate that the upper ~6.5 km of the lowland study regions are significantly stronger than the upper crust of the highland plateaus. Lowland craters collapse to final volumes of 45-70% of their transient cavity volumes, while highland craters preserve only 25-50%. The effective yield strength of the upper crust in the lowland regions falls in the range of competent rock, approximately 9-12 MPa, and the highland plateaus may be weaker by a factor of 2 or more, consistent with heavily fractured Noachian layered deposits. The measured volumes of continuous ejecta blankets and uplifted surface materials exceed the predictions from standard crater scaling relationships and Maxwell's Z model of crater excavation by a factor of 3. The excess volume of fluidized ejecta blankets on Mars cannot be explained by the concentration of ejecta through nonballistic emplacement processes and/or bulking. The observations require a modification of the scaling laws and are well fit using a scaling factor of ~1.4 between the transient crater surface diameter and the final crater rim diameter, and excavation flow originating from one projectile diameter depth with Z = 2.7. The refined excavation model provides the first observationally constrained set of initial parameters for study of the formation of fluidized ejecta blankets on Mars.
@article{Stewart_Valiant_2006,
author = "S. T. Stewart and G. J. Valiant",
title = "Martian subsurface properties and crater formation processes inferred from fresh impact crater geometries",
journal = "Meteoritics and Planetary Sciences",
year = 2006,
volume = 41,
number = 10,
pages = "1509-1537",
}
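As an illustrative aside (our own arithmetic from the numbers in the abstract; variable and function names are ours): the revised ~1.4 scaling factor converts a measured final rim diameter into an estimated transient-cavity surface diameter.

# Scaling factor of ~1.4 between transient crater surface diameter and
# final crater rim diameter, as reported in the abstract above.
SCALING = 1.4

def transient_surface_diameter(rim_diameter_km):
    """Estimate the transient cavity's surface diameter from the rim diameter."""
    return rim_diameter_km / SCALING

for d_rim in [3.0, 10.0, 50.0]:
    print(f"rim {d_rim:5.1f} km -> transient ~{transient_surface_diameter(d_rim):5.1f} km")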