Simsets: Simflix viewing data
Simflix is an imaginary video-on-demand service that shows hundreds of movies covering three genres:
sci-fi, action and period costume drama.
We have simulated some viewing data for you. The data set, which can be downloaded from here, consists of a matrix with a row for every viewer and a column for
every movie. An entry of one indicates that the viewer watched the movie; a zero indicates that they did not.
Try refreshing the page to generate another data set!
The data generating function
The probability of viewer $i$ watching movie $j$ is given by:
$$p_{i, j} = \text{logit}^{-1}(\mu_{i,j}) = \frac{1}{1 + e^{-\mu_{i,j}}}$$
Where
$$ \mu_{i,j} = \beta_{0,i} + \sum_{k=1}^3 \beta_{k,i} g_{k,j} + \beta_{4,i} q_j$$
And
- $\beta_{0,i}$ is the baseline log odds of watching a movie for viewer $i$
- $g_{k, j}$ is an indicator variable which is one if movie $j$ is in the $k$th genre and zero
otherwise. The genres are coded as 1=sci-fi, 2=action and 3=period costume drama.
- $\beta_{k,i}$ (for $k = 1, 2, 3$) is the change in log odds, i.e. the log of the odds multiplier, that applies when viewer $i$ is offered a movie in genre $k$
- $q_j$ is a quality score for movie $j$
- $\beta_{4, i}$ controls how viewer $i$ reacts to movie quality.
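The definitions above can be sketched for a single viewer and a single movie as follows. This is a minimal illustration rather than the service's actual code: the function name, argument names and the example coefficient values are all made up, and it assumes the log odds are turned into a probability with the logistic (inverse logit) function.

```python
import numpy as np

def watch_probability(beta0, beta_genre, beta_quality, genre, quality):
    """Probability that one viewer watches one movie.

    beta0        -- the viewer's baseline log odds
    beta_genre   -- the viewer's three genre coefficients
    beta_quality -- the viewer's quality coefficient
    genre        -- genre code: 1=sci-fi, 2=action, 3=period costume drama
    quality      -- the movie's quality score q_j
    """
    g = np.zeros(3)
    g[genre - 1] = 1.0  # indicator vector: one for the movie's genre, zero elsewhere
    mu = beta0 + np.dot(beta_genre, g) + beta_quality * quality
    return 1.0 / (1.0 + np.exp(-mu))  # inverse logit maps log odds to a probability

# Illustrative coefficients only, not values taken from a real Simflix draw.
p = watch_probability(-6.0, [2.5, 1.5, 0.5], 1.0, genre=1, quality=0.3)
print(p)
```

Because the genre indicators are one-hot, only the coefficient for the movie's own genre contributes to the log odds.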
Each viewer's coefficient vector $\boldsymbol{\beta}_i = [\beta_{0,i}, \dots, \beta_{4,i}]^T$ is drawn from a multivariate normal distribution so that:
$$\boldsymbol{\beta}_i \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$$
where:
$\boldsymbol{\mu} = [-6.34, 2.58, 1.59, 0.46, 0.97]^T$ is the mean vector and
$\Sigma = \begin{bmatrix}
0.87 & 0.35 & 0.21 & 0.27 & 0.31 \\
0.35 & 0.92 & 0.31 & 0.25 & 0.10 \\
0.21 & 0.31 & 0.84 & 0.25 & 0.33 \\
0.27 & 0.25 & 0.25 & 0.69 & 0.32 \\
0.31 & 0.10 & 0.33 & 0.32 & 0.81
\end{bmatrix}$ is the covariance matrix.
The corresponding correlation matrix is:
$P = \begin{bmatrix}
1.00 & 0.04 & -0.51 & -0.28 & -0.15 \\
0.04 & 1.00 & -0.09 & -0.32 & -0.79 \\
-0.51 & -0.09 & 1.00 & -0.37 & -0.03 \\
-0.28 & -0.32 & -0.37 & 1.00 & 0.02 \\
-0.15 & -0.79 & -0.03 & 0.02 & 1.00
\end{bmatrix}$
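A correlation matrix is conventionally obtained from a covariance matrix by dividing each entry by the product of the two corresponding standard deviations. Whether Simflix derives $P$ exactly this way is an assumption; the sketch below shows the standard conversion on a small toy matrix (not Simflix values) and could be used to check a returned $\Sigma$ against its $P$.

```python
import numpy as np

def cov_to_corr(cov):
    """Rescale a covariance matrix to a correlation matrix."""
    cov = np.asarray(cov, dtype=float)
    sd = np.sqrt(np.diag(cov))      # standard deviations from the diagonal
    return cov / np.outer(sd, sd)   # divide entry (a, b) by sd[a] * sd[b]

# Toy 2x2 covariance matrix, purely illustrative.
toy_cov = [[4.0, 2.0],
           [2.0, 9.0]]
corr = cov_to_corr(toy_cov)
print(corr)
```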
The $q_j$ are drawn from a Beta distribution with parameters $\alpha=2$ and $\beta=5$.
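Putting the pieces together, a data set of this kind could be simulated roughly as below. This is a sketch under stated assumptions, not the code behind the service: the document does not say how movies are assigned to genres (uniform random assignment is assumed here), and the function name, seed and the numbers of viewers and movies are arbitrary choices.

```python
import numpy as np

# Mean vector and covariance matrix as quoted in the text.
MEAN = np.array([-6.34, 2.58, 1.59, 0.46, 0.97])
COV = np.array([
    [0.87, 0.35, 0.21, 0.27, 0.31],
    [0.35, 0.92, 0.31, 0.25, 0.10],
    [0.21, 0.31, 0.84, 0.25, 0.33],
    [0.27, 0.25, 0.25, 0.69, 0.32],
    [0.31, 0.10, 0.33, 0.32, 0.81],
])

def simulate_views(n_viewers, n_movies, rng):
    """Return (views, betas, genres, quality) for one simulated data set."""
    # One row of five coefficients per viewer.
    betas = rng.multivariate_normal(MEAN, COV, size=n_viewers)
    # Assumption: each movie gets one of three genres uniformly at random
    # (0-based codes here: 0=sci-fi, 1=action, 2=period costume drama).
    genres = rng.integers(0, 3, size=n_movies)
    indicators = np.eye(3)[genres]               # shape (n_movies, 3), one-hot rows
    quality = rng.beta(2.0, 5.0, size=n_movies)  # q_j ~ Beta(2, 5)
    # Log odds for every viewer/movie pair, shape (n_viewers, n_movies).
    mu = betas[:, [0]] + betas[:, 1:4] @ indicators.T + betas[:, [4]] * quality
    p = 1.0 / (1.0 + np.exp(-mu))                # inverse logit
    views = rng.binomial(1, p)                   # 1 = watched, 0 = not watched
    return views, betas, genres, quality

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
views, betas, genres, quality = simulate_views(200, 300, rng)
print(views.shape, views.mean())
```

With a baseline log odds centred near -6.34, most entries of the matrix are zero, so the simulated viewing data is sparse.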
Use cases
With simulated data we know the underlying process that generated the data. This makes it very useful
for tasks where we need to know the right answer. For Simsets data these include:
- Testing methods for recovering latent variables - in this case the genres
- Testing recommendation algorithms
- Testing multivariate data visualisation techniques
- Creating interview questions
- Creating examples for teaching
Note: if you include the parameter output_type=json in your call to the API, the
endpoint returns the viewing data, the
$\beta$ values, $\Sigma$, $P$ and some LaTeX describing the generating model.
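A request with this parameter might look like the sketch below. The endpoint URL is not given here, so BASE_URL is a hypothetical placeholder, and the helper names are invented for illustration. The structure of the JSON response is also not documented in this text, so the sketch only parses it and leaves inspection of its fields to the caller.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical placeholder: replace with the real Simflix endpoint.
BASE_URL = "https://example.com/simflix"

def build_url(base_url, output_type="json"):
    """Append the output_type query parameter to the endpoint URL."""
    return f"{base_url}?{urlencode({'output_type': output_type})}"

def fetch_simflix(base_url=BASE_URL):
    """Request the data set as JSON and return the parsed response."""
    with urlopen(build_url(base_url)) as response:
        return json.loads(response.read().decode("utf-8"))

print(build_url(BASE_URL))
```

Calling fetch_simflix with the real endpoint URL and printing the top-level keys of the result is a reasonable first step for seeing how the viewing data and parameters are laid out.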