Simsets: Simflix viewing data

Simflix is an imaginary video-on-demand service that shows hundreds of movies, covering three genres: sci-fi, action and period costume drama.

We have simulated some viewing data for you. The data set, which can be downloaded from here, consists of a matrix with a row for every viewer and a column for every movie. A one indicates that the viewer watched the movie; a zero indicates that they did not.

Try refreshing the page to generate another data set!
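To make the layout of the viewing matrix concrete, here is a minimal sketch using a small hand-made matrix in place of a downloaded data set. The array `views` and its values are illustrative assumptions, not real Simflix output.

```python
import numpy as np

# Toy stand-in for a downloaded data set (assumed values):
# 4 viewers (rows) by 3 movies (columns), 1 = watched, 0 = not watched.
views = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
])

movies_per_viewer = views.sum(axis=1)  # number of movies each viewer watched
viewers_per_movie = views.sum(axis=0)  # number of viewers who watched each movie
watch_rate = views.mean()              # fraction of (viewer, movie) pairs watched
```

Row sums describe viewers and column sums describe movies, so the same two reductions apply to a real data set of any size.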

The data generating function

The probability of viewer $i$ watching movie $j$ is given by:

$$p_{i,j} = \text{logit}^{-1}(\mu_{i,j}) = \frac{1}{1 + e^{-\mu_{i,j}}}$$

where

$$\mu_{i,j} = \beta_0 + \sum_{k=1}^3 \beta_{i,k} g_{k,j} + \beta_{i,4} q_j$$

The betas are drawn from a multivariate normal distribution so that:

$$\boldsymbol{\beta} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$$

where $\boldsymbol{\mu} = [-6.34, 2.58, 1.59, 0.46, 0.97]^T$ is the mean vector and

$\Sigma = \begin{bmatrix} 0.87 & 0.35 & 0.21 & 0.27 & 0.31 \\ 0.35 & 0.92 & 0.31 & 0.25 & 0.10 \\ 0.21 & 0.31 & 0.84 & 0.25 & 0.33 \\ 0.27 & 0.25 & 0.25 & 0.69 & 0.32 \\ 0.31 & 0.10 & 0.33 & 0.32 & 0.81 \end{bmatrix}$ is the covariance matrix.

The corresponding correlation matrix is:

$P = \begin{bmatrix} 1.00 & 0.04 & -0.51 & -0.28 & -0.15 \\ 0.04 & 1.00 & -0.09 & -0.32 & -0.79 \\ -0.51 & -0.09 & 1.00 & -0.37 & -0.03 \\ -0.28 & -0.32 & -0.37 & 1.00 & 0.02 \\ -0.15 & -0.79 & -0.03 & 0.02 & 1.00 \end{bmatrix}$

The $q_j$ are drawn from a Beta distribution with parameters $\alpha=2$ and $\beta=5$.
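The generating process can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code behind the service: it assumes one coefficient vector per viewer, treats $g_{k,j}$ as a one-hot indicator that movie $j$ belongs to genre $k$ with genres assigned uniformly at random, and reads the link function as the inverse logit (logistic), which maps the linear predictor to a probability. The function name `simulate_views` and its arguments are hypothetical.

```python
import numpy as np

# Mean vector and covariance matrix copied from the text.
MU = np.array([-6.34, 2.58, 1.59, 0.46, 0.97])
SIGMA = np.array([
    [0.87, 0.35, 0.21, 0.27, 0.31],
    [0.35, 0.92, 0.31, 0.25, 0.10],
    [0.21, 0.31, 0.84, 0.25, 0.33],
    [0.27, 0.25, 0.25, 0.69, 0.32],
    [0.31, 0.10, 0.33, 0.32, 0.81],
])


def simulate_views(n_viewers, n_movies, seed=0):
    """Return (views, p): a 0/1 viewing matrix and the underlying probabilities."""
    rng = np.random.default_rng(seed)

    # Assumption: every viewer gets their own draw of all five coefficients.
    betas = rng.multivariate_normal(MU, SIGMA, size=n_viewers)  # (n_viewers, 5)

    # Assumption: each movie belongs to exactly one of the three genres,
    # chosen uniformly at random and encoded one-hot.
    genre = rng.integers(0, 3, size=n_movies)
    g = np.eye(3)[genre]                                        # (n_movies, 3)

    # q_j ~ Beta(2, 5), as stated in the text.
    q = rng.beta(2.0, 5.0, size=n_movies)

    # Per-movie design row: [1, g_1, g_2, g_3, q].
    x = np.column_stack([np.ones(n_movies), g, q])              # (n_movies, 5)

    mu = betas @ x.T                                            # (n_viewers, n_movies)
    p = 1.0 / (1.0 + np.exp(-mu))                               # inverse logit
    views = rng.binomial(1, p)                                  # Bernoulli draws
    return views, p


views, p = simulate_views(n_viewers=1000, n_movies=200)
```

Because the sketch returns `p` alongside `views`, a fitted model's predicted probabilities can be compared directly with the ones that generated the data.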

Use cases

Because the data are simulated, we know the process that generated them, which makes the data useful for tasks where we need to know the right answer. For simsets data these include

Note: if you include the parameter output_type=json in your call to the API, the endpoint returns the viewing data, the $\beta$, $\Sigma$ and $P$, and some LaTeX describing the generating model.