R/gather_draws.R, R/spread_draws.R
spread_draws.Rd
Extract draws from a Bayesian model for one or more variables (possibly with named dimensions) into one of two types of long-format data frames.
gather_draws(
model,
...,
regex = FALSE,
sep = "[, ]",
ndraws = NULL,
seed = NULL,
n
)
spread_draws(
model,
...,
regex = FALSE,
sep = "[, ]",
ndraws = NULL,
seed = NULL,
n
)
model: A supported Bayesian model fit. Tidybayes supports a variety of model objects; for a full list of supported models, see tidybayes-models.
...: Expressions in the form of variable_name[dimension_1, dimension_2, ...] | wide_dimension. See Details.
regex: If TRUE, variable names are treated as regular expressions and all columns matching the regular expression and number of dimensions are included in the output. Default FALSE.
sep: Separator used to separate dimensions in variable names, as a regular expression.
ndraws: The number of draws to return, or NULL to return all draws.
seed: A seed to use when subsampling draws (i.e. when ndraws is not NULL).
n: (Deprecated). Use ndraws.
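As a minimal sketch of how ndraws and seed interact, the following uses the RankCorr example dataset from ggdist (the same one used in the examples on this page). The object names sub1 and sub2 are chosen here for illustration only.

```r
library(tidybayes)
data(RankCorr, package = "ggdist")

# Subsample 10 posterior draws; passing the same seed twice should
# select the same subset of draws both times.
sub1 <- spread_draws(RankCorr, tau[i], ndraws = 10, seed = 1234)
sub2 <- spread_draws(RankCorr, tau[i], ndraws = 10, seed = 1234)

# Number of distinct draws kept in the subsample
length(unique(sub1$.draw))

# Check that both calls selected the same draws
identical(sub1$.draw, sub2$.draw)
```

Leaving ndraws as NULL (the default) returns every draw, in which case seed has no effect.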
A data frame.
Imagine a JAGS or Stan fit named model. The model may contain a variable named b[i,v] (in the JAGS or Stan language) with dimension i in 1:100 and dimension v in 1:3. However, the default format for draws returned from JAGS or Stan in R will not reflect this indexing structure; instead, they will have multiple columns with names like "b[1,1]", "b[2,1]", etc.
spread_draws and gather_draws provide a straightforward syntax to translate these columns back into properly-indexed variables in two different tidy data frame formats, optionally recovering dimension types (e.g. factor levels) as they do so.
spread_draws and gather_draws return data frames already grouped by all dimensions used on the variables you specify.
The difference between them is that spread_draws spreads the names of variables in the model across the data frame as column names, whereas gather_draws gathers variables into a single column named ".variable" and places the values of variables into a column named ".value". To use naming schemes from other packages (such as broom), consider passing results through functions like to_broom_names() or to_ggmcmc_names().
For example, spread_draws(model, a[i], b[i,v]) might return a grouped data frame (grouped by i and v), with:
column ".chain": the chain number. NA if not applicable to the model type; this is typically only applicable to MCMC algorithms.
column ".iteration": the iteration number. Guaranteed to be unique within-chain only. NA if not applicable to the model type; this is typically only applicable to MCMC algorithms.
column ".draw": a unique number for each draw from the posterior. Order is not guaranteed to be meaningful.
column "i": value in 1:100
column "v": value in 1:3
column "a": value of "a[i]" for draw ".draw"
column "b": value of "b[i,v]" for draw ".draw"
gather_draws(model, a[i], b[i,v]) on the same model would return a grouped data frame (grouped by i, v, and .variable), with:
column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "v": value in 1:3, or NA if ".variable" is "a".
column ".variable": value in c("a", "b").
column ".value": value of "a[i]" (when ".variable" is "a") or "b[i,v]" (when ".variable" is "b") for draw ".draw"
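The long format described above can be inspected on the RankCorr example dataset from ggdist. In this sketch b has two dimensions (named i and j here) and tau has one, so the j column is expected to be NA on the rows that belong to tau. The object name long is illustrative.

```r
library(tidybayes)
data(RankCorr, package = "ggdist")

# Gather a two-dimensional and a one-dimensional variable together.
long <- gather_draws(RankCorr, b[i, j], tau[i])

# Variables present in the .variable column
unique(long$.variable)

# tau has no j dimension, so j should be missing on its rows
all(is.na(long$j[long$.variable == "tau"]))
```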
spread_draws and gather_draws can use type information applied to the model object by recover_types() to convert columns back into their original types. This is particularly helpful if some of the dimensions in your model were originally factors. For example, if the v dimension in the original data frame data was a factor with levels c("a","b","c"), then we could use recover_types before spread_draws:
model %>%
  recover_types(data) %>%
  spread_draws(b[i,v])
Which would return the same data frame as above, except the "v" column would be a value in c("a","b","c") instead of 1:3.
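A small sketch of type recovery, using the RankCorr example dataset from ggdist. The data frame original_data and its factor levels are hypothetical, invented here only to stand in for the data a model was fit to.

```r
library(tidybayes)
data(RankCorr, package = "ggdist")

# Hypothetical original data: dimension i was a factor with three levels.
original_data <- data.frame(
  i = factor(c("low", "mid", "high"), levels = c("low", "mid", "high"))
)

# Attach the type information, then extract draws indexed by i.
typed <- spread_draws(recover_types(RankCorr, original_data), tau[i])

# i should now be a factor carrying the original level labels
levels(typed$i)
```

Only the column names and types of the supplied data are used as prototypes, so the dimension name in the variable specification (i here) must match a column name in the data.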
For variables that do not share the same subscripts (or share some but not all subscripts), we can supply their specifications separately. For example, if we have a variable d[i] with the same i subscript as b[i,v], and a variable x with no subscripts, we could do this:
spread_draws(model, x, d[i], b[i,v])
Which is roughly equivalent to this:
spread_draws(model, x) %>%
  inner_join(spread_draws(model, d[i])) %>%
  inner_join(spread_draws(model, b[i,v])) %>%
  group_by(i,v)
Similarly, this:
gather_draws(model, x, d[i], b[i,v])
Is roughly equivalent to this:
bind_rows(
  gather_draws(model, x),
  gather_draws(model, d[i]),
  gather_draws(model, b[i,v])
)
The c and cbind functions can be used to combine multiple variable names that have the same dimensions. For example, if we have several variables with the same subscripts i and v, we could do either of these:
spread_draws(model, c(w, x, y, z)[i,v])
spread_draws(model, cbind(w, x, y, z)[i,v]) # equivalent
Each of which is roughly equivalent to this:
spread_draws(model, w[i,v], x[i,v], y[i,v], z[i,v])
Besides being more compact, the c()-style syntax is currently also faster (though that may change).
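The equivalence between the combined and separate specifications can be checked on the RankCorr example dataset from ggdist, where tau and u_tau share a single dimension (named i here). The object names combined and separate are illustrative.

```r
library(tidybayes)
data(RankCorr, package = "ggdist")

# Combined specification: both variables share the subscript i.
combined <- spread_draws(RankCorr, c(tau, u_tau)[i])

# Separate specifications for the same two variables.
separate <- spread_draws(RankCorr, tau[i], u_tau[i])

# Both forms should yield the same columns and the same number of rows
setequal(names(combined), names(separate))
nrow(combined) == nrow(separate)
```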
Dimensions can be omitted from the resulting data frame by leaving their names blank; e.g. spread_draws(model, b[,v]) will omit the first dimension of b from the output. This is useful if a dimension is known to contain all the same value in a given model.
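A brief sketch of the blank-dimension syntax on the RankCorr example dataset from ggdist. Note that in this dataset the first dimension of b does vary, so the example only illustrates that the column is dropped; it is not a case where omitting the dimension is analytically sensible.

```r
library(tidybayes)
data(RankCorr, package = "ggdist")

# Leave the first dimension unnamed so it is omitted from the output;
# the second dimension is kept under the name j.
partial <- spread_draws(RankCorr, b[, j])

# Column names of the result: j is present, the unnamed dimension is not
names(partial)
```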
The shorthand .. can be used to specify one column that should be put into a wide format and whose names will be the base variable name, plus a dot ("."), plus the value of the dimension at ... For example:
spread_draws(model, b[i,..])
would return a grouped data frame (grouped by i), with:
column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "b.1": value of "b[i,1]" for draw ".draw"
column "b.2": value of "b[i,2]" for draw ".draw"
column "b.3": value of "b[i,3]" for draw ".draw"
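The column naming produced by the .. shorthand can be seen on the RankCorr example dataset from ggdist, where the second dimension of b takes four values. The object name wide_dots is illustrative.

```r
library(tidybayes)
data(RankCorr, package = "ggdist")

# Spread the second dimension of b into wide columns b.1, b.2, ...
wide_dots <- spread_draws(RankCorr, b[i, ..])

# Inspect the resulting column names
names(wide_dots)
```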
An optional clause in the form | wide_dimension can also be used to put the data frame into a wide format based on wide_dimension. For example, this:
spread_draws(model, b[i,v] | v)
is roughly equivalent to this:
spread_draws(model, b[i,v]) %>% spread(v,b)
The main difference between the | syntax and the .. syntax is that the | syntax respects prototypes applied to dimensions with recover_types(), and thus can be used to get columns with nicer names. For example:
model %>% recover_types(data) %>% spread_draws(b[i,v] | v)
would return a grouped data frame (grouped by i), with:
column ".chain": the chain number
column ".iteration": the iteration number
column ".draw": the draw number
column "i": value in 1:100
column "a": value of "b[i,1]" for draw ".draw"
column "b": value of "b[i,2]" for draw ".draw"
column "c": value of "b[i,3]" for draw ".draw"
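The following sketch contrasts the | syntax with and without type information, using the RankCorr example dataset from ggdist. The labels in j_labels are hypothetical and are not part of the dataset; they only stand in for factor levels that recover_types() would normally take from the original data.

```r
library(tidybayes)
data(RankCorr, package = "ggdist")

# Without type information, wide columns are named after the index values.
wide_plain <- spread_draws(RankCorr, b[i, j] | j)

# Hypothetical labels for the four values of j (an assumption for illustration).
j_labels <- list(
  j = factor(c("w", "x", "y", "z"), levels = c("w", "x", "y", "z"))
)

# With prototypes applied, wide columns take the factor level names.
wide_named <- spread_draws(recover_types(RankCorr, j_labels), b[i, j] | j)

names(wide_plain)
names(wide_named)
```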
The shorthand . can be used to specify columns that should be nested into vectors, matrices, or n-dimensional arrays (depending on how many dimensions are specified with .).
For example, spread_draws(model, a[.], b[.,.]) might return a data frame, with:
column ".chain": the chain number.
column ".iteration": the iteration number.
column ".draw": a unique number for each draw from the posterior.
column "a": a list column of vectors.
column "b": a list column of matrices.
Ragged arrays are turned into non-ragged arrays with missing entries given the value NA.
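A sketch of the nested format on the RankCorr example dataset from ggdist, where tau has one dimension of length 3 and b has two dimensions (3 by 4, as the 12 groups in the examples on this page suggest). The object name nested is illustrative.

```r
library(tidybayes)
data(RankCorr, package = "ggdist")

# Nest tau into one vector per draw and b into one matrix per draw.
nested <- spread_draws(RankCorr, tau[.], b[., .])

# Each element of the list columns holds all values for a single draw
length(nested$tau[[1]])
dim(nested$b[[1]])
```

This form keeps one row per draw, which can be convenient when a downstream computation needs the whole vector or matrix at once.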
Finally, variable names can be regular expressions by setting regex = TRUE; e.g.:
spread_draws(model, `b_.*`[i], regex = TRUE)
Would return a tidy data frame with variables starting with b_ and having one dimension.
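A runnable variant of the regular expression form, using the RankCorr example dataset from ggdist. The pattern `.*tau` is chosen here for illustration; it is expected to match at least the one-dimensional variables tau and u_tau in that dataset.

```r
library(tidybayes)
data(RankCorr, package = "ggdist")

# Match every one-dimensional variable whose name ends in "tau".
# Backticks are needed because the pattern is not a syntactic R name.
matched <- spread_draws(RankCorr, `.*tau`[i], regex = TRUE)

# Inspect which variables were matched
names(matched)
```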
library(dplyr)
library(ggplot2)
library(tidybayes)
data(RankCorr, package = "ggdist")
RankCorr %>%
  spread_draws(b[i, j])
#> # A tibble: 12,000 × 6
#> # Groups:   i, j [12]
#>        i     j      b .chain .iteration .draw
#>    <int> <int>  <dbl>  <int>      <int> <int>
#>  1     1     1 -0.927      1          1     1
#>  2     1     1 -0.979      1          2     2
#>  3     1     1 -1.15       1          3     3
#>  4     1     1 -1.09       1          4     4
#>  5     1     1 -1.20       1          5     5
#>  6     1     1 -1.07       1          6     6
#>  7     1     1 -1.11       1          7     7
#>  8     1     1 -1.06       1          8     8
#>  9     1     1 -0.831      1          9     9
#> 10     1     1 -0.986      1         10    10
#> # … with 11,990 more rows
RankCorr %>%
  spread_draws(b[i, j], tau[i], u_tau[i])
#> # A tibble: 12,000 × 8
#> # Groups:   i, j [12]
#>        i     j      b .chain .iteration .draw   tau u_tau
#>    <int> <int>  <dbl>  <int>      <int> <int> <dbl> <dbl>
#>  1     1     1 -0.927      1          1     1  5.79  5.87
#>  2     1     1 -0.979      1          2     2  6.26  4.91
#>  3     1     1 -1.15       1          3     3  7.38  3.34
#>  4     1     1 -1.09       1          4     4  5.97  6.96
#>  5     1     1 -1.20       1          5     5  6.01  5.30
#>  6     1     1 -1.07       1          6     6  7.03  5.34
#>  7     1     1 -1.11       1          7     7  7.39  5.53
#>  8     1     1 -1.06       1          8     8  5.98  6.79
#>  9     1     1 -0.831      1          9     9  6.75  5.40
#> 10     1     1 -0.986      1         10    10  6.76  6.63
#> # … with 11,990 more rows
RankCorr %>%
  gather_draws(b[i, j], tau[i], u_tau[i])
#> # A tibble: 18,000 × 7
#> # Groups:   i, j, .variable [18]
#>        i     j .chain .iteration .draw .variable .value
#>    <int> <int>  <int>      <int> <int> <chr>      <dbl>
#>  1     1     1      1          1     1 b         -0.927
#>  2     1     1      1          2     2 b         -0.979
#>  3     1     1      1          3     3 b         -1.15
#>  4     1     1      1          4     4 b         -1.09
#>  5     1     1      1          5     5 b         -1.20
#>  6     1     1      1          6     6 b         -1.07
#>  7     1     1      1          7     7 b         -1.11
#>  8     1     1      1          8     8 b         -1.06
#>  9     1     1      1          9     9 b         -0.831
#> 10     1     1      1         10    10 b         -0.986
#> # … with 17,990 more rows
RankCorr %>%
  gather_draws(tau[i], typical_r) %>%
  median_qi()
#> # A tibble: 4 × 8
#>       i .variable .value .lower .upper .width .point .interval
#>   <int> <chr>      <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>
#> 1     1 tau        6.03   5.03   7.11    0.95 median qi
#> 2     2 tau        3.30   2.41   4.46    0.95 median qi
#> 3     3 tau        3.65   2.73   4.72    0.95 median qi
#> 4    NA typical_r  0.548  0.309  0.778   0.95 median qi