R/gather_draws.R, R/spread_draws.R
spread_draws.Rd
Extract draws from a Bayesian model for one or more variables (possibly with named dimensions) into one of two types of long-format data frames.
A supported Bayesian model fit. Tidybayes supports a variety of model objects; for a full list of supported models, see tidybayes-models.
Expressions in the form of variable_name[dimension_1, dimension_2, ...] | wide_dimension. See Details.
If TRUE, variable names are treated as regular expressions and all columns matching the regular expression and number of dimensions are included in the output. Default FALSE.
Separator used to separate dimensions in variable names, as a regular expression.
The number of draws to return, or NULL to return all draws.
A seed to use when subsampling draws (i.e. when ndraws is not NULL).
Character vector of column names that should be treated as indices of draws. Operations are done within combinations of these values. The default is c(".chain", ".iteration", ".draw"), which are the same names used for chain, iteration, and draw indices returned by tidy_draws(). Names in draw_indices that are not found in the data are ignored.
(Deprecated). Use ndraws.
A data frame.
Imagine a JAGS or Stan fit named model. The model may contain a variable named b[i,v] (in the JAGS or Stan language) with dimension i in 1:100 and dimension v in 1:3. However, the default format for draws returned from JAGS or Stan in R will not reflect this indexing structure; instead, the draws will have multiple columns with names like "b[1,1]", "b[2,1]", etc.
spread_draws and gather_draws provide a straightforward syntax to translate these columns back into properly-indexed variables in two different tidy data frame formats, optionally recovering dimension types (e.g. factor levels) as they do so.
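To make the column-name problem concrete, here is a minimal base R sketch that takes wide draws with columns like "b[1,1]" and recovers the i and v indices by hand. The data frame draws and its values are hypothetical, and this is not how tidybayes implements the translation; it only shows the kind of reshaping involved.

```r
# Hypothetical wide draws, shaped like what a sampler interface might return:
# one row per draw, one column per element of b.
draws <- data.frame(
  .draw = 1:2,
  `b[1,1]` = c(0.1, 0.2),
  `b[2,1]` = c(0.3, 0.4),
  `b[1,2]` = c(0.5, 0.6),
  `b[2,2]` = c(0.7, 0.8),
  check.names = FALSE
)

# Stack the value columns: one row per (draw, column name) pair.
value_cols <- setdiff(names(draws), ".draw")
long <- data.frame(
  .draw = rep(draws$.draw, times = length(value_cols)),
  name  = rep(value_cols, each = nrow(draws)),
  b     = unlist(draws[value_cols], use.names = FALSE)
)

# Parse "b[i,v]" into integer index columns i and v.
index_text  <- sub("^b\\[(.*)\\]$", "\\1", long$name)
index_parts <- strsplit(index_text, ",", fixed = TRUE)
long$i <- as.integer(vapply(index_parts, `[`, character(1), 1))
long$v <- as.integer(vapply(index_parts, `[`, character(1), 2))
long$name <- NULL

long
```

spread_draws(model, b[i,v]) yields a comparable long layout (plus the draw index columns) without any manual parsing.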
spread_draws and gather_draws return data frames already grouped by all dimensions used on the variables you specify.
The difference between the two is that spread_draws spreads the names of variables in the model across the data frame as column names, whereas gather_draws gathers variables into a single column named ".variable" and places their values into a column named ".value". To use naming schemes from other packages (such as broom), consider passing results through functions like to_broom_names() or to_ggmcmc_names().
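The two layouts can be sketched in base R with a hypothetical pair of unsubscripted variables x and y. The values are invented, and the reshaping below only mimics the resulting column layout, not the tidybayes internals.

```r
# Hypothetical spread-style draws: one column per model variable.
spread_df <- data.frame(
  .draw = 1:3,
  x = c(0.1, 0.2, 0.3),
  y = c(1.1, 1.2, 1.3)
)

# Gather-style layout: variable names go into ".variable",
# and their values go into ".value".
vars <- c("x", "y")
gather_df <- data.frame(
  .draw     = rep(spread_df$.draw, times = length(vars)),
  .variable = rep(vars, each = nrow(spread_df)),
  .value    = unlist(spread_df[vars], use.names = FALSE)
)

gather_df
```

The spread layout has one row per draw; the gather layout has one row per draw and variable, which is often more convenient for summarising or faceting by variable.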
For example, spread_draws(model, a[i], b[i,v]) might return a grouped data frame (grouped by i and v), with:
column ".chain": the chain number. NA if not applicable to the model type; this is typically only applicable to MCMC algorithms.
column ".iteration": the iteration number. Guaranteed to be unique within-chain only. NA if not applicable to the model type; this is typically only applicable to MCMC algorithms.
column ".draw": a unique number for each draw from the posterior. Order is not guaranteed to be meaningful.
column "i": value in 1:100
column "v": value in 1:3
column "a": value of "a[i]" for draw ".draw"
column "b": value of "b[i,v]" for draw ".draw"
gather_draws(model, a[i], b[i,v]) on the same model would return a grouped data frame (grouped by i, v, and ".variable"), with:
column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "v": value in 1:3, or NA if ".variable" is "a".
column ".variable": value in c("a", "b").
column ".value": value of "a[i]" (when ".variable" is "a") or "b[i,v]" (when ".variable" is "b") for draw ".draw"
spread_draws and gather_draws can use type information applied to the model object by recover_types() to convert columns back into their original types. This is particularly helpful if some of the dimensions in your model were originally factors. For example, if the v dimension in the original data frame data was a factor with levels c("a","b","c"), then we could use recover_types before spread_draws:
model %>%
  recover_types(data) %>%
  spread_draws(b[i,v])
Which would return the same data frame as above, except the "v" column would be a value in c("a","b","c") instead of 1:3.
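The effect of type recovery on a factor dimension can be pictured with a small base R sketch. The data frames data and draws below are hypothetical, and the mapping shown (integer index to the level at that position) is an illustration of the result rather than the recover_types() implementation.

```r
# Hypothetical original data: v was a factor with three levels.
data <- data.frame(v = factor(c("a", "b", "c", "a")))

# Hypothetical long-format draws, with v still an integer index.
draws <- data.frame(.draw = c(1, 1, 1), v = 1:3, b = c(0.2, -0.1, 0.4))

# Map each integer index to the factor level at that position.
v_levels <- levels(data$v)
draws$v <- factor(v_levels[draws$v], levels = v_levels)

draws
```

After the mapping, v holds "a", "b", and "c" rather than 1, 2, and 3, which makes downstream plots and summaries easier to read.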
For variables that do not share the same subscripts (or share some but not all subscripts), we can supply their specifications separately. For example, if we have a variable d[i] with the same i subscript as b[i,v], and a variable x with no subscripts, we could do this:
spread_draws(model, x, d[i], b[i,v])
Which is roughly equivalent to this:
spread_draws(model, x) %>%
  inner_join(spread_draws(model, d[i])) %>%
  inner_join(spread_draws(model, b[i,v])) %>%
  group_by(i,v)
Similarly, this:
gather_draws(model, x, d[i], b[i,v])
Is roughly equivalent to this:
bind_rows(
  gather_draws(model, x),
  gather_draws(model, d[i]),
  gather_draws(model, b[i,v])
)
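The join-based equivalence for spread_draws can be illustrated without a model, using base R's merge() as a stand-in for inner_join(). The three small data frames below are hypothetical stand-ins for the results of spread_draws(model, x), spread_draws(model, d[i]), and spread_draws(model, b[i,v]); the ".chain" and ".iteration" columns are omitted for brevity.

```r
# Hypothetical stand-ins for three separate spread_draws() results.
x_df <- data.frame(.draw = 1:2, x = c(10, 20))          # no subscripts

d_df <- expand.grid(.draw = 1:2, i = 1:2)               # indexed by i
d_df$d <- seq_len(nrow(d_df))

b_df <- expand.grid(.draw = 1:2, i = 1:2, v = 1:2)      # indexed by i and v
b_df$b <- seq_len(nrow(b_df)) / 10

# Join on whatever index columns the pieces share:
# x joins on the draw only; d and b also share i.
joined <- merge(
  merge(x_df, d_df, by = ".draw"),
  b_df,
  by = c(".draw", "i")
)

joined
```

The unsubscripted x is repeated across every i and v within a draw, and d is repeated across every v within a draw and i, so the result has one row per draw, i, and v.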
The c and cbind functions can be used to combine multiple variable names that have the same dimensions. For example, if we have several variables with the same subscripts i and v, we could do either of these:
spread_draws(model, c(w, x, y, z)[i,v])
spread_draws(model, cbind(w, x, y, z)[i,v]) # equivalent
Each of which is roughly equivalent to this:
spread_draws(model, w[i,v], x[i,v], y[i,v], z[i,v])
Besides being more compact, the c()-style syntax is currently also faster (though that may change).
Dimensions can be omitted from the resulting data frame by leaving their names blank; e.g. spread_draws(model, b[,v]) will omit the first dimension of b from the output. This is useful if a dimension is known to contain the same value throughout a given model.
The shorthand .. can be used to specify one column that should be put into a wide format and whose names will be the base variable name, plus a dot ("."), plus the value of the dimension at the position of the .. shorthand. For example, spread_draws(model, b[i,..]) would return a grouped data frame (grouped by i), with:
column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "b.1": value of "b[i,1]" for draw ".draw"
column "b.2": value of "b[i,2]" for draw ".draw"
column "b.3": value of "b[i,3]" for draw ".draw"
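The naming scheme for the wide columns (base name, a dot, then the dimension value) happens to match what base R's reshape() produces, so the layout can be sketched as follows. The long data frame here is hypothetical, and reshape() is only an analogy for the resulting shape, not what tidybayes calls internally.

```r
# Hypothetical long-format draws of b[i,v] with v in 1:3.
long <- expand.grid(.draw = 1:2, i = 1:2, v = 1:3)
long$b <- seq_len(nrow(long)) / 10

# Spread the v dimension into columns named b.1, b.2, b.3.
wide <- reshape(
  long,
  idvar     = c(".draw", "i"),
  timevar   = "v",
  direction = "wide"
)

names(wide)
```

Each row of wide corresponds to one combination of draw and i, with the three values of b along v side by side.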
An optional clause in the form | wide_dimension can also be used to put the data frame into a wide format based on wide_dimension. For example, this:
spread_draws(model, b[i,v] | v)
is roughly equivalent to this:
spread_draws(model, b[i,v]) %>% spread(v,b)
The main difference between the | syntax and the .. syntax is that the | syntax respects prototypes applied to dimensions with recover_types(), and thus can be used to get columns with nicer names. For example, if the v dimension has recovered the factor levels c("a","b","c") via recover_types(), then spread_draws(model, b[i,v] | v) would return a grouped data frame (grouped by i), with:
column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "a": value of "b[i,1]" for draw ".draw"
column "b": value of "b[i,2]" for draw ".draw"
column "c": value of "b[i,3]" for draw ".draw"
The shorthand . can be used to specify columns that should be nested into vectors, matrices, or n-dimensional arrays (depending on how many dimensions are specified with .).
For example, spread_draws(model, a[.], b[.,.]) might return a data frame, with:
column ".chain": the chain number.
column ".iteration": the iteration number.
column ".draw": a unique number for each draw from the posterior.
column "a": a list column of vectors.
column "b": a list column of matrices.
Ragged arrays are turned into non-ragged arrays with missing entries given the value NA.
Finally, variable names can be regular expressions by setting regex = TRUE; e.g.:
spread_draws(model, `b_.*`[i], regex = TRUE)
Would return a tidy data frame with variables starting with b_ and having one dimension.
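To see which raw column names such a specification would plausibly select, the matching can be approximated in base R. The column names and the pattern below are hypothetical; the exact pattern tidybayes constructs is not described here, so treat this as a sketch of the idea (name matches the regular expression and there is exactly one index).

```r
# Hypothetical raw column names from a fitted model.
col_names <- c("b_age[1]", "b_age[2]", "b_sex[1]",
               "b_int", "sigma[1]", "b_mat[1,1]")

# A name matching b_.* followed by exactly one index in square brackets.
pattern <- "^(b_.*)\\[([^,]+)\\]$"

matched <- grepl(pattern, col_names)
col_names[matched]
```

Here "b_int" is skipped because it has no index, "sigma[1]" because its name does not match, and "b_mat[1,1]" because it has two dimensions rather than one.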
library(dplyr)
library(ggplot2)
library(tidybayes)  # provides spread_draws(), gather_draws(), and median_qi()
data(RankCorr, package = "ggdist")
RankCorr %>%
spread_draws(b[i, j])
#> # A tibble: 12,000 × 6
#> # Groups: i, j [12]
#> i j b .chain .iteration .draw
#> <int> <int> <dbl> <int> <int> <int>
#> 1 1 1 -0.927 1 1 1
#> 2 1 1 -0.979 1 2 2
#> 3 1 1 -1.15 1 3 3
#> 4 1 1 -1.09 1 4 4
#> 5 1 1 -1.20 1 5 5
#> 6 1 1 -1.07 1 6 6
#> 7 1 1 -1.11 1 7 7
#> 8 1 1 -1.06 1 8 8
#> 9 1 1 -0.831 1 9 9
#> 10 1 1 -0.986 1 10 10
#> # ℹ 11,990 more rows
RankCorr %>%
spread_draws(b[i, j], tau[i], u_tau[i])
#> # A tibble: 12,000 × 8
#> # Groups: i, j [12]
#> i j b .chain .iteration .draw tau u_tau
#> <int> <int> <dbl> <int> <int> <int> <dbl> <dbl>
#> 1 1 1 -0.927 1 1 1 5.79 5.87
#> 2 1 1 -0.979 1 2 2 6.26 4.91
#> 3 1 1 -1.15 1 3 3 7.38 3.34
#> 4 1 1 -1.09 1 4 4 5.97 6.96
#> 5 1 1 -1.20 1 5 5 6.01 5.30
#> 6 1 1 -1.07 1 6 6 7.03 5.34
#> 7 1 1 -1.11 1 7 7 7.39 5.53
#> 8 1 1 -1.06 1 8 8 5.98 6.79
#> 9 1 1 -0.831 1 9 9 6.75 5.40
#> 10 1 1 -0.986 1 10 10 6.76 6.63
#> # ℹ 11,990 more rows
RankCorr %>%
gather_draws(b[i, j], tau[i], u_tau[i])
#> # A tibble: 18,000 × 7
#> # Groups: i, j, .variable [18]
#> i j .chain .iteration .draw .variable .value
#> <int> <int> <int> <int> <int> <chr> <dbl>
#> 1 1 1 1 1 1 b -0.927
#> 2 1 1 1 2 2 b -0.979
#> 3 1 1 1 3 3 b -1.15
#> 4 1 1 1 4 4 b -1.09
#> 5 1 1 1 5 5 b -1.20
#> 6 1 1 1 6 6 b -1.07
#> 7 1 1 1 7 7 b -1.11
#> 8 1 1 1 8 8 b -1.06
#> 9 1 1 1 9 9 b -0.831
#> 10 1 1 1 10 10 b -0.986
#> # ℹ 17,990 more rows
RankCorr %>%
gather_draws(tau[i], typical_r) %>%
median_qi()
#> # A tibble: 4 × 8
#> i .variable .value .lower .upper .width .point .interval
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 1 tau 6.03 5.03 7.11 0.95 median qi
#> 2 2 tau 3.30 2.41 4.46 0.95 median qi
#> 3 3 tau 3.65 2.73 4.72 0.95 median qi
#> 4 NA typical_r 0.548 0.309 0.778 0.95 median qi