Extract draws from a Bayesian model into tidy data frames of random variables

Extract draws from a Bayesian model for one or more variables (possibly with named dimensions) into one of two types of long-format data frames of posterior::rvar objects.

gather_rvars(model, ..., ndraws = NULL, seed = NULL)

spread_rvars(model, ..., ndraws = NULL, seed = NULL)

Arguments

model: A supported Bayesian model fit. Tidybayes supports a variety of model objects; for a full list of supported models, see tidybayes-models.
...: Expressions in the form of variable_name[dimension_1, dimension_2, ...]. See Details.
ndraws: The number of draws to return, or NULL to return all draws.
seed: A seed to use when subsampling draws (i.e. when ndraws is not NULL).
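
As a rough sketch of how ndraws and seed interact, the following uses the RankCorr example fit from ggdist (the same fit used in Examples). The number of draws (100) and the seed value (1234) are arbitrary choices made for illustration.

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# Keep a random subset of 100 draws; fixing the seed makes the subset reproducible
sub1 <- RankCorr %>% spread_rvars(tau[i], ndraws = 100, seed = 1234)
sub2 <- RankCorr %>% spread_rvars(tau[i], ndraws = 100, seed = 1234)

# Number of draws carried by the subsampled rvar column
posterior::ndraws(sub1$tau)

# With the same seed, both calls should select the same draws
all.equal(posterior::draws_of(sub1$tau), posterior::draws_of(sub2$tau))
```

Leaving ndraws as NULL skips the subsampling step, in which case seed has no effect.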

Value

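A data frame (tibble) containing a column for each named dimension, plus either one posterior::rvar column per variable (spread_rvars) or a ".variable" column of variable names and a ".value" column of rvars (gather_rvars).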

Details

Imagine a JAGS or Stan fit named model. The model may contain a variable named b[i,v] (in the JAGS or Stan language) with dimension i in 1:100 and dimension v in 1:3. However, the default format for draws returned from JAGS or Stan in R does not reflect this indexing structure; instead, the draws have one column per element, with names like "b[1,1]", "b[2,1]", etc.
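
To see this flat naming in practice, the sketch below inspects the RankCorr example fit from ggdist with tidy_draws(). RankCorr is used here only as a convenient stand-in for the hypothetical model described above; its b variable has a 3 by 4 index structure rather than 100 by 3.

```r
library(tidybayes)

data(RankCorr, package = "ggdist")

# tidy_draws() returns one column per scalar element, so the index
# structure of b survives only inside the column names
flat <- tidy_draws(RankCorr)

# Pick out the columns whose names start with "b["
b_cols <- grep("^b\\[", names(flat), value = TRUE)
head(b_cols)
```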

spread_rvars and gather_rvars provide a straightforward syntax to translate these columns back into properly-indexed rvars in two different tidy data frame formats, optionally recovering dimension types (e.g. factor levels) as they do so.

spread_rvars will spread names of variables in the model across the data frame as column names, whereas gather_rvars will gather variable names into a single column named ".variable" and place values of variables into a column named ".value". To use naming schemes from other packages (such as broom), consider passing results through functions like to_broom_names() or to_ggmcmc_names().

For example, spread_rvars(model, a[i], b[i,v]) might return a data frame with:

column "i": value in 1:100
column "v": value in 1:3
column "a": rvar containing draws from "a[i]"
column "b": rvar containing draws from "b[i,v]"

gather_rvars(model, a[i], b[i,v]) on the same model would return a data frame with:

column "i": value in 1:100
column "v": value in 1:3, or NA on rows where ".variable" is "a".
column ".variable": value in c("a", "b").
column ".value": rvar containing draws from "a[i]" (when ".variable" is "a") or "b[i,v]" (when ".variable" is "b")
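
The two layouts can be compared side by side on the RankCorr example fit from ggdist. This is a minimal sketch: the variables tau[i] and u_tau[i] are taken from the Examples section, and the last step assumes that to_broom_names() renames ".variable" to "term" and ".value" to "estimate".

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# Wide layout: one rvar column per variable
wide <- RankCorr %>% spread_rvars(tau[i], u_tau[i])
names(wide)

# Long layout: variable names in .variable, rvars in .value
long <- RankCorr %>% gather_rvars(tau[i], u_tau[i])
names(long)

# Optional: switch the long layout to broom-style column names
long_broom <- long %>% to_broom_names()
names(long_broom)
```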

spread_rvars and gather_rvars can use type information applied to the model object by recover_types() to convert columns back into their original types. This is particularly helpful if some of the dimensions in your model were originally factors. For example, if the v dimension in the original data frame data was a factor with levels c("a","b","c"), then we could use recover_types before spread_rvars:

model %>%
 recover_types(data) %>%
 spread_rvars(b[i,v])

Which would return the same data frame as above, except the "v" column would be a value in c("a","b","c") instead of 1:3.
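
A runnable variant of this pattern can be sketched with the RankCorr example fit from ggdist. The data frame original_data and its level labels are hypothetical: RankCorr does not ship with the data it was fitted to, so the labels below are made up purely to illustrate the mechanics.

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# Hypothetical stand-in for the original data: a factor whose three levels
# label the three values of index i (labels invented for illustration)
original_data <- data.frame(
  i = factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))
)

# Attach the type information, then extract tau[i]
typed <- RankCorr %>%
  recover_types(original_data) %>%
  spread_rvars(tau[i])

# Inspect the recovered index column
typed$i
```

Because recover_types() matches on column names, the dimension name used in the variable specification (here i) has to match the name of the column in the supplied data frame.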

For variables that do not share the same subscripts (or share some but not all subscripts), we can supply their specifications separately. For example, if we have a variable d[i] with the same i subscript as b[i,v], and a variable x with no subscripts, we could do this:

spread_rvars(model, x, d[i], b[i,v])

Which is roughly equivalent to this:

spread_rvars(model, x) %>%
 inner_join(spread_rvars(model, d[i])) %>%
 inner_join(spread_rvars(model, b[i,v]))
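
The same comparison can be tried on the RankCorr example fit from ggdist, using tau[i] and b[i, j] in place of d[i] and b[i,v]. This is only a sketch of the "roughly equivalent" claim; it compares the shape and column names of the two results, not every draw.

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# Both specifications in a single call
combined <- RankCorr %>% spread_rvars(tau[i], b[i, j])

# The same result built by joining two separate calls on the shared index i
joined <- spread_rvars(RankCorr, tau[i]) %>%
  inner_join(spread_rvars(RankCorr, b[i, j]), by = "i")

dim(combined)
dim(joined)
```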

Similarly, this:

gather_rvars(model, x, d[i], b[i,v])

Is roughly equivalent to this:

bind_rows(
 gather_rvars(model, x),
 gather_rvars(model, d[i]),
 gather_rvars(model, b[i,v])
)
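
As a sketch of this equivalence on the RankCorr example fit from ggdist, the variables b[i, j], tau[i] and typical_r (all taken from the Examples section) can be gathered in one call or stacked by hand:

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# All three specifications in a single call
gathered <- RankCorr %>% gather_rvars(b[i, j], tau[i], typical_r)

# The same variables gathered one at a time and stacked;
# bind_rows() fills index columns a variable does not have with NA
stacked <- bind_rows(
  gather_rvars(RankCorr, b[i, j]),
  gather_rvars(RankCorr, tau[i]),
  gather_rvars(RankCorr, typical_r)
)

table(gathered$.variable)
table(stacked$.variable)
```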

The c and cbind functions can be used to combine multiple variable names that have the same dimensions. For example, if we have several variables with the same subscripts i and v, we could do either of these:

spread_rvars(model, c(w, x, y, z)[i,v])

spread_rvars(model, cbind(w, x, y, z)[i,v])  # equivalent

Each of which is roughly equivalent to this:

spread_rvars(model, w[i,v], x[i,v], y[i,v], z[i,v])
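
For a concrete case, the RankCorr example fit from ggdist has two variables, tau and u_tau, that share the single index i (both appear in the Examples section). The sketch below checks that the compact and the spelled-out specifications yield the same columns:

```r
library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

# Compact form: both variables share the [i] specification
compact <- RankCorr %>% spread_rvars(c(tau, u_tau)[i])

# Spelled-out form: one specification per variable
verbose <- RankCorr %>% spread_rvars(tau[i], u_tau[i])

names(compact)
names(verbose)
```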

Besides being more compact, the c()-style syntax is currently also slightly faster (though that may change).

Dimensions can be left nested in the resulting rvar objects by leaving their names blank; e.g. spread_rvars(model, b[i,]) will place the first index (i) into rows of the data frame but leave the second index nested in the b column (see Examples below).

Author

Matthew Kay

Examples


library(dplyr)
library(tidybayes)

data(RankCorr, package = "ggdist")

RankCorr %>%
  spread_rvars(b[i, j])
#> # A tibble: 12 × 3
#>        i     j               b
#>    <int> <int>      <rvar[1d]>
#>  1     1     1  -1.076 ± 0.095
#>  2     2     1  -0.735 ± 0.146
#>  3     3     1  -0.341 ± 0.144
#>  4     1     2  -1.821 ± 0.140
#>  5     2     2   0.101 ± 0.254
#>  6     3     2  -0.733 ± 0.229
#>  7     1     3   0.176 ± 0.071
#>  8     2     3  -0.178 ± 0.120
#>  9     3     3   0.069 ± 0.121
#> 10     1     4  -0.101 ± 0.121
#> 11     2     4   0.268 ± 0.221
#> 12     3     4   0.084 ± 0.207

# leaving an index out nests the index in the column containing the rvar
RankCorr %>%
  spread_rvars(b[i, ])
#> # A tibble: 3 × 2
#>       i          b[,1]          [,2]            [,3]           [,4]
#>   <int>     <rvar[,1]>    <rvar[,1]>      <rvar[,1]>     <rvar[,1]>
#> 1     1  -1.08 ± 0.095  -1.82 ± 0.14   0.176 ± 0.071  -0.101 ± 0.12
#> 2     2  -0.74 ± 0.146   0.10 ± 0.25  -0.178 ± 0.120   0.268 ± 0.22
#> 3     3  -0.34 ± 0.144  -0.73 ± 0.23   0.069 ± 0.121   0.084 ± 0.21

RankCorr %>%
  spread_rvars(b[i, j], tau[i], u_tau[i])
#> # A tibble: 12 × 5
#>        i     j               b         tau      u_tau
#>    <int> <int>      <rvar[1d]>  <rvar[1d]> <rvar[1d]>
#>  1     1     1  -1.076 ± 0.095  6.0 ± 0.53  5.7 ± 1.0
#>  2     2     1  -0.735 ± 0.146  3.3 ± 0.53  5.7 ± 1.5
#>  3     3     1  -0.341 ± 0.144  3.7 ± 0.52  5.1 ± 1.3
#>  4     1     2  -1.821 ± 0.140  6.0 ± 0.53  5.7 ± 1.0
#>  5     2     2   0.101 ± 0.254  3.3 ± 0.53  5.7 ± 1.5
#>  6     3     2  -0.733 ± 0.229  3.7 ± 0.52  5.1 ± 1.3
#>  7     1     3   0.176 ± 0.071  6.0 ± 0.53  5.7 ± 1.0
#>  8     2     3  -0.178 ± 0.120  3.3 ± 0.53  5.7 ± 1.5
#>  9     3     3   0.069 ± 0.121  3.7 ± 0.52  5.1 ± 1.3
#> 10     1     4  -0.101 ± 0.121  6.0 ± 0.53  5.7 ± 1.0
#> 11     2     4   0.268 ± 0.221  3.3 ± 0.53  5.7 ± 1.5
#> 12     3     4   0.084 ± 0.207  3.7 ± 0.52  5.1 ± 1.3

# gather_rvars places variables and values in a longer format data frame
RankCorr %>%
  gather_rvars(b[i, j], tau[i], typical_r)
#> # A tibble: 16 × 4
#>        i     j .variable          .value
#>    <int> <int> <chr>          <rvar[1d]>
#>  1     1     1 b          -1.076 ± 0.095
#>  2     2     1 b          -0.735 ± 0.146
#>  3     3     1 b          -0.341 ± 0.144
#>  4     1     2 b          -1.821 ± 0.140
#>  5     2     2 b           0.101 ± 0.254
#>  6     3     2 b          -0.733 ± 0.229
#>  7     1     3 b           0.176 ± 0.071
#>  8     2     3 b          -0.178 ± 0.120
#>  9     3     3 b           0.069 ± 0.121
#> 10     1     4 b          -0.101 ± 0.121
#> 11     2     4 b           0.268 ± 0.221
#> 12     3     4 b           0.084 ± 0.207
#> 13     1    NA tau         6.041 ± 0.530
#> 14     2    NA tau         3.348 ± 0.534
#> 15     3    NA tau         3.663 ± 0.522
#> 16    NA    NA typical_r   0.544 ± 0.144