go to SAMSI home page

19 T.W. Alexander Drive
P.O.Box 14006

Research Triangle Park, NC 27709-4006

Tel: 919.685.9350
Fax: 919.685.9360
info@samsi.info

 

Algebraic Statistics and Experimental Design Working Group
Meeting Summary Page

September
October
November
December

September 2008

September 29

I'll talk about some recent work with E. Allman and C. Matias on identifiability of statistical models with hidden variables. Many statistical models incorporate hidden (unobservable) variables with some sort of structure in which observed variables are independent when conditioned on the hidden ones. Examples include simple latent class models, including those with continuous observed variables, hidden Markov models, phylogenetic models, and a random graph model. Proving such models are identifiable is generally difficult. However, a theorem of J. Kruskal provides a good algebraic tool which can be used to address the question for all these models (and presumably more). After an introduction to why the identifiability question is difficult for such models, and a presentation of Kruskal's theorem, I'll explain the method and illustrate its use on HMMs. Though we mostly recover known theorems for HMMs, the general method is quite powerful and gives both stronger results and easier proofs for many other models.

 

 

October 2008

October 20,

We will provide a full semi-algebraic description of  graphical models for binary data in the case when the graph is a tree and all the inner nodes are hidden.  It occurs that the inequalities are just simple constraints on correlations between the manifest variables. Our work suggests that exploring the semi-algebraic structure of statistical models will probably involve using coordinate systems other than the raw probabilities.

October 27

Techniques from algebraic geometry were introduced in design of experiments by Pistone and Wynn (1996). The techniques they introduced identify models of a low degree and provide a natural extension to basic concepts such as confounding and fractions of designs.

 I intend to describe the basic ideas of the algebraic method, as well as presenting some of our recent work on aberration which leads to Nyquist-type lower bounds for models identified with algebra.

 

November 2008

November 10

I will describe a range of new finiteness results for statistical models, in particular, results which say that, up to symmetry, many models in random variables with state space "tending to infinity" have finite algebraic descriptions. The focus will be on applications of these ideas to Markov bases (that is, generating sets of toric ideals), but the techniques apply to many other statistical models. The results follow from studying polynomial rings in infinitely many indeterminates under the action of the infinite symmetric group, which leads to a theory that is of independent interest.  I will also mention a bunch of open problems and research directions.  This is joint work with Chris Hillar.

November 17

The p1 model describes dyadic interactions in a social network, summarized in the form of a directed graph.  It can be represented as a log-linear model.  The main goal of this project (joint work with S. Fienberg and A. Rinaldo) is to understand this model from the viewpoint of algebraic statistics: describe Markov bases, study the MLE etc.
In this talk, we will recall what Markov bases are and why we might want to compute them.  Then, we will reveal an algebraic-geometric description of the Markov bases.  Finally, to extract the statistically meaningful part of this formal description is still work in progress, and we might see a glimpse of combinatorics coming into play.

 

December 2008

December 8

 

I will aim to connect these three notions as a means to study properties of multidimensional contingency tables, with attention to statistical disclosure limitation (SDL). Mathematical models will be given for the principal SDL methods for tables: controlled rounding and perturbation, complementary cell suppression, and controlled tabular adjustment. Each method relies on the existence of sufficiently many alternative tables (solutions) to the original table, or existence of alternative tables sufficiently far from the original, or both. We are interested in moves between solutions and Markov bases, but instead of dealing directly with Markov or Groebner bases I will introduce an equivalent notion in linear programming. I will discuss a class of tables modeled as networks and, consequently, whose minimal Markov moves are square-free and easily computed. I will characterize Markov moves in general tables as linear programs restricted to a unit hypercube, and illustrate this notion for complete independence log linear models, two-way tables with subtotal constraints, and other non-network examples. Based on the De Loera-Onn characterization of rational polytopes as transportation polytopes, I will open discussion on the utility (or not) of focusing study of Markov moves in tables to thin (network) and slim (non-network) 3-dimensional planar transportation problems (aka no 3-way interaction log linear models). Small (3x3x3) examples will be used as illustrations.

 

February 2009

February 23

 

A partially specified contingency table is the set of all contingency tables satisfying fixed marginal totals (minimal sufficient statistics). We present theory and methods for drawing a random sample of tables from this set based on linear programming. Diaconis and Sturmfels (1998) presented the original method which is based on uniform random sampling from a Markov basis for the set of all feasible moves between tables. The Markov basis is obtained from a corresponding Groebner basis, which consequently must be available prior to sampling. Our approach is to construct random moves one at a time based on a linear programming formulation that assures irreducibility and reversibility of the Markov chain. Consequently, the set of all moves constructible by our procedure is a Markov basis. The method has the following features: it does not require a Groebner basis; it is not affected by complexities of table design or unduly limited by table size; and, by virtue of an additional linear constraint specific to the current solution, it does not create infeasible moves that must be immediately discarded. We illustrate related points through specific examples.

 

 

April 2009

 

April 6

The main focus of this talk is on partial information release strategy in which data owners due to confidentiality reasons may opt to release
relevant marginal and conditional tables along with sample size instead of a full contingency table. We will discuss some results on calculation of bounds on cell entries, and on problems of counting and sampling of the tables from a fiber defined by given observed conditional values. We will draw comparisons between marginal and conditional table releases, and discuss implications for disclosure risk assessment and data utility.

 

 

 

Working Group Home Page

Entire site © 2002-2008, Statistical and Applied Mathematical Sciences Institute. All Rights Reserved.

This page updated on September 19, 2008 10:57 AM