# Primer on Data Science 2019

# Trento, 9-11 September 2019

# Aim

**Primer on Data Science** is a series of summer schools organized by the curriculum **Mathematics and Statistics for Life and Social Sciences** of the Laurea Magistrale in Mathematics (Department of Mathematics, University of Trento), with the aim of introducing third-year bachelor students and bachelor graduates to the topics of this curriculum. Every year the school will have a different topic.

The 2019 edition will focus on a gentle introduction to some aspects of **Uncertainty quantification and applications**.

# Where

All activities take place in room A102 of Polo Scientifico e Tecnologico “Fabio Ferrari”, Povo 1.

# Admission

## Students

The school is open to 30 participants. No fees are required, but registration is mandatory. Everybody is welcome to apply; however, admission will be based on the following criteria, in order of importance:

- Bachelor graduates and third year bachelor students in Mathematics
- Transcript of Records and grades
- Students from University of Trento

Lectures will be delivered in English.

**Applications are closed**

## Professionals

PDS2019 is also open to professionals and companies. For further information (fee, registration form, etc.), please contact Francesca Stanca.

## Participation in the school includes

- Access to the material: notes, slides, videos of the courses
- Coffee breaks
- Access to the university canteen (lunch time)

# Lectures

**Lorenzo Tamellini** (CNR-IMATI, Pavia, Italy)

Lorenzo Tamellini is a researcher at CNR-IMATI, Pavia (Italy). His research topics are numerical methods for PDEs (in particular, Isogeometric Analysis) and Uncertainty Quantification (UQ). In detail, he works on polynomial approximation methods for UQ (mostly stochastic collocation on sparse grids and multi-index stochastic collocation), with applications to groundwater flows and geochemical compaction problems, for both direct and inverse UQ problems. He maintains the Matlab library Sparse Grids Matlab kit. More details can be found at arturo.imati.cnr.it/tamellini/

**Marco Broccardo** (Swiss Seismological Service, ETH Zürich, Switzerland)

Marco Broccardo is a Senior Researcher and Lecturer at the Swiss Seismological Service (SED), at ETH Zürich (Switzerland). His research interests focus on developing computational probabilistic and statistical tools for system reliability analysis, earthquake engineering, and stochastic dynamics. Currently, he is engaged in three main research projects: (i) probabilistic characterization of fluid-induced seismicity, (ii) Hamiltonian Monte Carlo for rare event estimation, (iii) probabilistic system resilience analysis. Dr. Marco Broccardo obtained his Ph.D. in the Structural Engineering, Mechanics, and Materials (SEMM) program with designated emphasis in Computational Science at the University of California, Berkeley, where he also completed minors in Statistics and Computational Mechanics. For more information, visit www.marco-broccardo.com

**Roberta Sirovich** (Dipartimento di Matematica “Giuseppe Peano”, University of Turin, Turin, Italy)

Roberta Sirovich is Assistant Professor at the University of Torino, Department of Mathematics. Her education is in Mathematics, and her research interests have shifted to applied probability and statistics. Her main topics are: (1) stochastic processes as models for observable phenomena (neuronal firing, cancer growth, supply chain, chemical reaction networks, population dynamics), (2) estimation for stochastic processes (regularization methods, asymptotic properties for estimation from diffusion processes), (3) applied statistics (fashion industry data, advanced sport analytics, DNA sequencing data). For more information, visit www.robertasirovich.it

# Program

## Day 1 (09/09/2019)

- [08:00- ] Registration Desk is open
- [08:15-8:30] Welcome

### Uncertainty Quantification of PDEs with random coefficients

Room A102 [8:30-10:00] [10:30-12:30] [14:00-16:30]

**Lorenzo Tamellini** (CNR-IMATI, Italy)

- Introduction to UQ, quick recap of statistics and probability, random fields
- Sparse grids (and possibly polynomial chaos) for forward UQ
- Inverse problems in UQ (maximum likelihood, Bayesian inversion)
- Multi-Level Monte Carlo / Multi-Index Stochastic Collocation for forward UQ

Uncertainty Quantification (UQ) is an increasingly popular subject in computational science and engineering that deals with assessing how uncertainty in the coefficients of a PDE affects its solution. “Coefficients” here must be understood in a broad sense as “specifics of the PDE”, i.e. boundary shape, initial/boundary conditions, forcing terms, diffusion/advection/reaction coefficients, etc. Uncertainty in the coefficients can be caused by measurement errors, lack of knowledge, or intrinsic aleatoric behavior of such coefficients, and is typically modeled by means of random variables or random fields.

Two major computational challenges arise in this setting. First, the solution of the PDE is only known to us through an expensive PDE solver, so we would like to minimize the number of times we query the solver. Second, the number of random variables needed to appropriately describe the randomness can be very high (from tens to hundreds, or even countable sequences), depending on the covariance structure of the randomness.

In this minicourse we tackle two relevant problems in this scenario, i.e.

- how do we compute the statistical descriptors (mean, variance, pdf) of the solution subject to the uncertainty of the coefficients (forward problem)?
- how do we improve the probabilistic description of the random coefficients given measurements of the solution (inverse problem)?

The forward problem can be recast as a high-dimensional quadrature/interpolation problem and will be solved by introducing Monte Carlo/Stochastic Collocation methods (and their advanced “Multi-Index” counterparts).
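As a rough illustration of the forward problem, the sketch below compares plain Monte Carlo with a three-node Gauss-Legendre collocation rule on a toy problem. Everything in it is an assumption made for illustration and is not taken from the course material: the "solver" is the closed-form midpoint value of the 1D problem -(a u')' = 1 on (0,1) with homogeneous Dirichlet conditions and a constant coefficient a = exp(y), and y is taken uniform on [-1, 1].

```python
import math
import random


def solver(y):
    """Stand-in for an expensive PDE solve (hypothetical toy model).

    For -(a u')' = 1 on (0,1), u(0)=u(1)=0, with constant a = exp(y),
    the exact solution is u(x) = x(1-x)/(2a), so u(0.5) = 1/(8a).
    """
    a = math.exp(y)
    return 1.0 / (8.0 * a)


def monte_carlo_mean(n_samples, seed=0):
    """Estimate E[u(0.5)] by averaging solver calls at random y ~ U(-1, 1)."""
    rng = random.Random(seed)
    total = sum(solver(rng.uniform(-1.0, 1.0)) for _ in range(n_samples))
    return total / n_samples


def collocation_mean():
    """Estimate E[u(0.5)] with a 3-node Gauss-Legendre rule (3 solver calls)."""
    node = math.sqrt(3.0 / 5.0)
    nodes = (-node, 0.0, node)
    weights = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)
    # The density of U(-1, 1) is 1/2, hence the factor 0.5.
    return 0.5 * sum(w * solver(y) for w, y in zip(weights, nodes))


exact_mean = math.sinh(1.0) / 8.0  # closed form of E[exp(-y)] / 8
mc_mean = monte_carlo_mean(2000)   # 2000 solver calls
sc_mean = collocation_mean()       # 3 solver calls

print(f"exact       {exact_mean:.6f}")
print(f"Monte Carlo {mc_mean:.6f}")
print(f"collocation {sc_mean:.6f}")
```

With a single smooth random parameter, the quadrature rule needs far fewer solver calls than sampling. In many dimensions a full tensor grid becomes too expensive, which is the motivation for sparse grids.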

The inverse problem instead is a classic high-dimensional sampling/inference problem and will be tackled by employing standard methods such as Maximum Likelihood or Bayesian Inversion, accelerated with Stochastic Collocation surrogate models.
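A minimal sketch of the inverse problem in one dimension follows, under assumptions that are ours and not the lecturer's: the same hypothetical toy model u(0.5) = exp(-y)/8, a flat prior for y on [-1, 1], Gaussian measurement noise with known standard deviation, and made-up noise values. The posterior is simply evaluated on a grid, which is only feasible for very few parameters.

```python
import math


def solver(y):
    """Hypothetical forward model: midpoint value of the toy PDE solution."""
    return math.exp(-y) / 8.0


def log_likelihood(y, data, sigma):
    """Gaussian log-likelihood (up to a constant) of the measurements."""
    return -0.5 * sum((d - solver(y)) ** 2 for d in data) / sigma ** 2


# Synthetic measurements: true parameter plus fixed, made-up noise values.
true_y = 0.3
sigma = 0.01
noise = (0.004, -0.007, 0.002, 0.001)
data = [solver(true_y) + e for e in noise]

# Flat prior on [-1, 1]: the log-posterior equals the log-likelihood there.
grid = [-1.0 + 2.0 * i / 400 for i in range(401)]
log_post = [log_likelihood(y, data, sigma) for y in grid]

# Normalize on the grid, subtracting the maximum for numerical stability.
peak = max(log_post)
weights = [math.exp(lp - peak) for lp in log_post]
total = sum(weights)
posterior = [w / total for w in weights]

y_map = grid[log_post.index(peak)]  # maximum a posteriori (= max likelihood here)
y_mean = sum(y * p for y, p in zip(grid, posterior))

print(f"MAP estimate   {y_map:.3f}")
print(f"posterior mean {y_mean:.3f}")
```

Each grid point costs one solver call per evaluation, so replacing `solver` with a cheap surrogate is what makes this kind of inference affordable when the model is a real PDE solver.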

### References

- L. Tamellini. Polynomial approximation of PDEs with stochastic coefficients. PhD Thesis, Politecnico di Milano, 2012
- J. Beck, L. Tamellini, and R. Tempone. IGA-based Multi-Index Stochastic Collocation for random PDEs on arbitrary domains. Computer Methods in Applied Mechanics and Engineering, 351:330-350, 2019
- G. Porta, L. Tamellini, V. Lever, and M. Riva. Inverse modeling of geochemical and mechanical compaction in sedimentary basins through polynomial chaos expansion. Water Resources Research, 50(12):9414-9431, 2014.

### Material

## Day 2 (10/09/2019)

### Markov Chain Monte Carlo methods for Bayesian Learning and Uncertainty Quantification

Room A102 [9:00-10:30] [11:00-12:30] [14:00-15:30] [16:00-17:30]

**Marco Broccardo** (ETH Zürich, Switzerland)

- Recap of Bayesian statistics and Monte Carlo Simulation methods (in the context of Bayesian Learning and rare event estimation)
- Markov Chain Monte Carlo methods
- Importance Sampling, introduction to Sequential Monte Carlo (aka Subset Simulation for engineering) and applications

Bayesian learning methods are critical tools for assessing the behavior of systems in the presence of uncertainties. A significant strength of the Bayesian approach is that it allows uncertainties and information to be encoded into a prior probability distribution. More importantly, it allows this prior to be updated into a posterior distribution as soon as observations become available. The posterior distribution is then used to formulate probabilistic predictions and, ultimately, to make decisions under the updated state of uncertainty. Although Bayesian methods are powerful, they are usually computationally prohibitive. In this mini-course, we first investigate the causes that make Bayesian learning a computational challenge, and then examine a series of Markov Chain Monte Carlo methods to overcome these hurdles. We then turn our attention to a set of algorithms for computing small probability events, which are essential computational tools for calculating the reliability of a system in the presence of uncertainty. We finally examine a real case example of Bayesian learning for anthropogenic seismicity related to deep fluid injections.
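To give a flavor of the prior-to-posterior update and of how MCMC sidesteps the normalizing constant, here is a minimal random-walk Metropolis sketch. It is an illustration under our own assumptions, not the course's algorithm: a Gaussian likelihood with known standard deviation, a Gaussian prior on the unknown mean, and a handful of made-up observations. This conjugate case has a closed-form posterior mean, which is used only as a sanity check.

```python
import math
import random


def log_posterior(mu, data, sigma, prior_std):
    """Unnormalized log-posterior: Gaussian prior N(0, prior_std^2) on mu
    plus Gaussian likelihood with known noise level sigma."""
    log_prior = -0.5 * (mu / prior_std) ** 2
    log_lik = -0.5 * sum((x - mu) ** 2 for x in data) / sigma ** 2
    return log_prior + log_lik


def metropolis(log_target, start, step, n_steps, seed=0):
    """Random-walk Metropolis with Gaussian proposals of std `step`."""
    rng = random.Random(seed)
    current, current_lp = start, log_target(start)
    chain, accepted = [], 0
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        chain.append(current)
    return chain, accepted / n_steps


# Made-up observations, for illustration only.
data = [1.2, 0.8, 1.5, 1.1]
sigma, prior_std = 0.5, 1.0

chain, acceptance_rate = metropolis(
    lambda mu: log_posterior(mu, data, sigma, prior_std),
    start=0.0, step=0.5, n_steps=20000,
)
samples = chain[2000:]  # discard burn-in
mcmc_mean = sum(samples) / len(samples)

# Closed-form posterior mean for the conjugate Gaussian case.
precision = 1.0 / prior_std ** 2 + len(data) / sigma ** 2
exact_mean = (sum(data) / sigma ** 2) / precision

print(f"MCMC mean  {mcmc_mean:.3f}")
print(f"exact mean {exact_mean:.3f}")
print(f"acceptance {acceptance_rate:.2f}")
```

Only ratios of the target density enter the acceptance step, so the posterior never has to be normalized. In realistic problems each evaluation of the target may involve an expensive model run.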

### References

- R.M. Neal. Probabilistic inference using Markov Chain Monte Carlo methods, Technical report, Department of Computer Science, University of Toronto, 1993.
- C.P. Robert and G. Casella. Introducing Monte Carlo Methods with R, Springer, New York, 2010.
- K.P. Murphy. Machine learning: a probabilistic perspective, MIT press, Cambridge, 2012.
- Z. Wang, M. Broccardo and J. Song. Hamiltonian Monte Carlo methods for Subset Simulation in reliability analysis, Structural Safety, 76:51-67, 2019.
- M. Broccardo, Z. Wang, S. Marelli, J. Song and B. Sudret. Hamiltonian Monte Carlo-based subset simulation using Gaussian process modelling, in Proc. 19th IFIP WG 7.5 Conference on Reliability and Optimization of Structural Systems, ETH Zurich (Switzerland), June 27-29, 2018.

### Material

## Day 3 (11/09/2019)

### Advanced Sports Analytics through Statistical Machine Learning

Room A102 [9:00-10:30] [11:00-12:30] [14:00-15:30] [16:00-17:30]

**Roberta Sirovich** (University of Turin, Italy)

- Estimation through regularization: how to predict shot outcome
- Point processes for spatial analysis of shooting efficiency
- Hidden Markov processes to process positional data: who’s guarding whom

Sports analytics refers to the use of data and advanced statistics to measure performance and make informed decisions, in order to gain a competitive sports advantage. It is a growing field of application for interesting modern statistical techniques. Retracing the development of this topic will be the opportunity to approach the following main subjects:

- estimation through regularization: we will introduce statistical modeling and regularization/penalization in order to predict shot efficiency in the NBA professional basketball league;
- spatial statistical analysis through Log-Gaussian Cox processes: we will process spatial data, i.e. positions of players on the court, in order to gain insight into shooting efficiency both in the NBA and in professional European football. To overcome the high correlation induced by the court’s spatial structure, we will introduce elliptical slice sampling to approximate the posterior of the discretized intensity function. Moreover, to better understand the spatial component of shooting efficiency, we will introduce a dimensionality reduction technique called Non-Negative Matrix Factorization, which produces a well-interpretable basis.
- hidden Markov processes and EM algorithms: we construct a model to identify which offensive player is guarded by each defender at every moment in time in the NBA. The positions are modeled as a hidden Markov process and estimation is performed through an Expectation Maximization algorithm.
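As a small illustration of the first subject, the sketch below fits a logistic model of shot outcome against shot distance, with and without an L2 (ridge) penalty. The choice of penalty, the gradient-descent fit, and the synthetic shot data are all assumptions made here for illustration. They do not reproduce the models or data used in the lectures.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_ridge(x, y, penalty, lr=0.5, n_iter=2000):
    """Gradient descent on mean log-loss + 0.5 * penalty * w**2.

    The intercept b is left unpenalized, as is common practice.
    """
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        errors = [sigmoid(w * xi + b) - yi for xi, yi in zip(x, y)]
        grad_w = sum(e * xi for e, xi in zip(errors, x)) / n + penalty * w
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic shots (not real NBA data): success probability decays with distance.
rng = random.Random(0)
distances = [rng.uniform(0.0, 30.0) for _ in range(200)]
made = [1 if rng.random() < sigmoid(1.0 - 0.15 * d) else 0 for d in distances]

# Standardize the feature so that the penalty acts on a comparable scale.
mean_d = sum(distances) / len(distances)
std_d = math.sqrt(sum((d - mean_d) ** 2 for d in distances) / len(distances))
x = [(d - mean_d) / std_d for d in distances]

w_free, b_free = fit_logistic_ridge(x, made, penalty=0.0)
w_ridge, b_ridge = fit_logistic_ridge(x, made, penalty=1.0)

print(f"unpenalized slope {w_free:.3f}")
print(f"ridge slope       {w_ridge:.3f}")
```

The penalized slope is shrunk toward zero. With a single feature this only trades variance for bias, but with many correlated predictors the same idea keeps the estimates stable.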

### References

- T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2008.
- A. Miller, L. Bornn, R. Adams, and K. Goldsberry. Factorized point process intensities: A spatial analysis of professional basketball. In International Conference on Machine Learning, 235-243, 2014.
- A. Franks, A. Miller, L. Bornn, and K. Goldsberry. Characterizing the spatial structure of defensive skill in professional basketball. The Annals of Applied Statistics, 9(1), 94-121, 2015.
- D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788, 1999.

### Material

# Organizers

- Claudio Agostinelli (claudio.agostinelli@unitn.it)
- Andrea Pugliese (andrea.pugliese@unitn.it)
- Alberto Valli (alberto.valli@unitn.it)

# Information

For more information, contact Alberto Valli (alberto.valli@unitn.it).