Saving Tools
This page discusses numerous tools that can significantly improve the process of saving and loading files, always in a scientific context.
These tools are also used in the examples demonstrated in the Real World Examples page. After reading the proper documentation here it might be worth it to have a look there as well!
In DrWatson we save and load files with the functions wsave(filename, data)
and wload(filename)
. These functions are further used in the tools below, e.g. tagsave
and can be overloaded for your own specific datatype.
In addition, wsave
ensures that mkpath
is always called on the path you are trying to save your file at. We all know how unpleasant it is to run a 2-hour simulation and save no data because FileIO.save
complains that the path you are trying to save at does not exist...
To overload the saving part, add a new method to DrWatson._wsave(filename, ::YourType, args...; kwargs...)
(notice the _
!). By overloading _wsave
you get all the extra functionality of tagsave
, safesave
, etc., for free for your own types (tagsave
requires that you save your data as a dictionary, or extend tag!
for your own type).
By default we fall back to FileIO.save
and FileIO.load
for all types. This means that you have to install whatever saving backend you want to use yourself. FileIO
by itself does not install a package that saves data, it only provides the interface!
The suffix of the file name determines which package will be used for actually saving the file. It is your responsibility to know how the saving package works and what input it expects!
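The relationship between the public function and the underscore method can be illustrated with a minimal, stdlib-only sketch. The names `sketch_wsave`, `_sketch_wsave`, and `Trajectory` are hypothetical stand-ins, not DrWatson code; the sketch assumes only what is stated above: the public function creates the directory and then delegates, and users overload the underscore method for their own type.

```julia
# Hypothetical stand-ins for the `wsave` / `_wsave` pair, using only Base.

# Generic fallback. DrWatson falls back to FileIO.save here; this sketch
# simply prints each piece of data on its own line.
_sketch_wsave(filename, data...; kwargs...) =
    open(io -> foreach(x -> println(io, x), data), filename, "w")

# Public entry point: make sure the directory exists, then delegate.
function sketch_wsave(filename, data...; kwargs...)
    mkpath(dirname(filename))
    return _sketch_wsave(filename, data...; kwargs...)
end

# A user-defined type with its own on-disk format (illustrative only).
struct Trajectory
    t::Vector{Float64}
    x::Vector{Float64}
end

# Overload only the underscore method; the public function picks it up
# through dispatch, so the directory creation comes for free.
function _sketch_wsave(filename, traj::Trajectory, args...; kwargs...)
    open(filename, "w") do io
        for (t, x) in zip(traj.t, traj.x)
            println(io, t, ",", x)
        end
    end
end

dir = mktempdir()
file = joinpath(dir, "runs", "a", "traj.csv")   # nested folders do not exist yet
sketch_wsave(file, Trajectory([0.0, 1.0], [1.0, 2.0]))
```

With DrWatson itself, the equivalent step would be adding a method to `DrWatson._wsave` for your type, as described above.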
Safely saving data
Almost all packages that save data by default overwrite existing files (if given a save name of an existing file). This is the default behavior because often it is desired.
Sometimes it is not though! And the consequences of overwritten data can range from irrelevant to catastrophic. To avoid such an event we provide an alternative way to save data that will never overwrite existing files:
DrWatson.safesave
— Function
safesave(filename, data...; kwargs...)
Safely save data
in filename
by ensuring that no existing files are overwritten. Do this by renaming already existing data with a backup-number ending like #1, #2, ...
. For example if filename = test.jld2
, the first time you safesave
it, the file is saved normally. The second time the existing save is renamed to test_#1.jld2
and a new file test.jld2
is then saved.
If a backup file already exists then its backup-number is incremented (e.g. going from #2
to #3
). For example safesaving test.jld2
a third time will rename the old test_#1.jld2
to test_#2.jld2
, rename the old test.jld2
to test_#1.jld2
and then save a new test.jld2
with the latest data
.
Any additional keyword arguments are passed through to wsave (to e.g. enable compression).
See also tagsave
.
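The renaming scheme described above can be sketched with Base file operations only. This is an illustrative re-implementation under stated assumptions, not DrWatson's source: `sketch_safesave` and `backupname` are hypothetical names, and writing a plain string stands in for the actual saving backend.

```julia
# Name of the n-th backup, e.g. "test.txt" -> "test_#1.txt".
function backupname(filename, n)
    base, ext = splitext(filename)
    return string(base, "_#", n, ext)
end

# Hypothetical sketch: never overwrite, rotate existing files instead.
function sketch_safesave(filename, contents::AbstractString)
    if isfile(filename)
        # Find the highest backup number already in use.
        n = 0
        while isfile(backupname(filename, n + 1))
            n += 1
        end
        # Shift backups upward, highest first, so no rename hits an existing file.
        for k in n:-1:1
            mv(backupname(filename, k), backupname(filename, k + 1))
        end
        # The current file becomes backup number 1.
        mv(filename, backupname(filename, 1))
    end
    mkpath(dirname(filename))
    write(filename, contents)   # placeholder for the real saving call
    return filename
end

dir = mktempdir()
file = joinpath(dir, "test.txt")
sketch_safesave(file, "first")
sketch_safesave(file, "second")
sketch_safesave(file, "third")
```

After the three calls, `test.txt` holds the latest data, `test_#1.txt` the previous save, and `test_#2.txt` the oldest one.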
Tagging a run using Git
For reproducibility reasons (and also to not go insane when asking "HOW DID I GET THOSE RESUUUULTS") it is useful to "tag" any simulation/result/process using the Git status of the repository.
To this end we have some functions that can be used to ensure reproducibility:
DrWatson.tagsave
— Function
tagsave(file::String, d::AbstractDict; kwargs...)
First tag!
dictionary d
and then save d
in file
.
"Tagging" means that when saving the dictionary, an extra field :gitcommit
is added to establish reproducibility of results using Git. If the Git repository is dirty and storepatch=true
, one more field :gitpatch
is added that stores the difference string. If a dictionary already contains a key :gitcommit
, it is not overwritten, unless force=true
. For more details, see tag!
.
Keywords gitpath, storepatch, force
are propagated to tag!
. Any additional keyword arguments are propagated to wsave
, to e.g. enable compression.
The keyword safe = DrWatson.readenv("DRWATSON_SAFESAVE", false)
decides whether to save the file using safesave
.
DrWatson.@tagsave
— Macro
@tagsave(file::String, d::AbstractDict; kwargs...)
Same as tagsave
but one more field :script
is added that records the local path of the script and line number that called @tagsave
, see @tag!
.
The functions also incorporate safesave
if need be.
Low level functions
@tagsave
internally uses the following low level functions:
DrWatson.tag!
— Function
tag!(d::AbstractDict; kwargs...) -> d
Tag d
by adding an extra field gitcommit
which will have as value the gitdescribe
of the repository at gitpath
(by default the project's gitpath). Do nothing if a key gitcommit
already exists (unless force=true
then replace with the new value) or if the Git repository is not found. If the Git repository is dirty, i.e. there are uncommitted changes, and storepatch
is true, then the output of git diff HEAD
is stored in the field gitpatch
. Note that patches for binary files are not stored. You can use isdirty
to check if a repo is dirty. If the keyword commit_message
is set to true
, then the dictionary d
will include an additional field "gitmessage"
containing the Git message associated with the commit.
Notice that the key-type of the dictionary must be String
or Symbol
. If String
is a subtype of the value type of the dictionary, this operation is in-place. Otherwise a new dictionary is created and returned.
To restore a repository to the state of a particular git commit do:
- checkout the relevant commit with git checkout xyz, where xyz is the value stored
- (optional) apply the patch with git apply patch, where the string stored in the gitpatch field needs to be written to the file patch.
Keywords
- gitpath = projectdir()
- force = false
- storepatch = DrWatson.readenv("DRWATSON_STOREPATCH", false): Whether to collect and store the output of gitpatch as well. By default it is false.
- kw...: extra keywords are propagated to gitdescribe.
Examples
julia> d = Dict(:x => 3, :y => 4)
Dict{Symbol,Int64} with 2 entries:
:y => 4
:x => 3
julia> tag!(d; commit_message=true)
Dict{Symbol,Any} with 4 entries:
:y => 4
:gitmessage => "File set up by DrWatson"
:gitcommit => "96df587e45b29e7a46348a3d780db1f85f41de04"
:x => 3
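The rule for when tagging is in-place can be made concrete with a small sketch. The helper `sketch_tag!` is hypothetical and takes the commit string as an argument instead of querying Git; it only illustrates the documented behavior that a dictionary whose value type cannot hold a `String` is replaced by a widened copy.

```julia
# Hypothetical helper mirroring the in-place rule of `tag!`.
function sketch_tag!(d::AbstractDict{K,V}, commit::String;
                     force = false) where {K<:Union{String,Symbol},V}
    # Use a key that matches the dictionary's key type.
    key = K === Symbol ? :gitcommit : "gitcommit"
    # An existing entry is kept unless `force` is set.
    haskey(d, key) && !force && return d
    # A String value only fits if the value type admits it; otherwise widen.
    out = String <: V ? d : Dict{K,Any}(d)
    out[key] = commit
    return out
end

d = Dict(:x => 3, :y => 4)                      # Dict{Symbol,Int64}
t = sketch_tag!(d, "96df587e45b29e7a46348a3d780db1f85f41de04")

a = Dict{Symbol,Any}(:x => 3)                   # value type already admits String
b = sketch_tag!(a, "96df587e45b29e7a46348a3d780db1f85f41de04")
```

Here `t` is a new `Dict{Symbol,Any}` while `d` is left untouched, whereas `b` is the very same object as `a`. This is why the return value should always be used rather than relying on mutation.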
DrWatson.@tag!
— Macro
@tag!(d, gitpath = projectdir(), storepatch = true, force = false) -> d
Do the same as tag!
but also add another field script
that has the path of the script that called @tag!
, relative with respect to gitpath
. The saved string ends with #line_number
, which indicates the line number within the script that @tag!
was called at.
Examples
julia> d = Dict(:x => 3)
Dict{Symbol,Int64} with 1 entry:
:x => 3
julia> @tag!(d) # running from a script or inline evaluation
Dict{Symbol,Any} with 3 entries:
:gitcommit => "618b72bc0936404ab6a4dd8d15385868b8299d68"
:script => "test\\stools_tests.jl#10"
:x => 3
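How a macro can know the calling script and line is sketched below. This is an assumption about the general technique (the implicit `__source__` argument available inside every Julia macro), not a copy of DrWatson's implementation; `@sketch_callsite` is a hypothetical name.

```julia
# Hypothetical macro: returns "relative/path/to/script#line" of the call site.
macro sketch_callsite(base)
    file = string(__source__.file)   # file containing the macro call
    line = __source__.line           # line of the macro call
    # Build the string at run time, relative to the given base directory.
    return :(string(relpath($file, $(esc(base))), "#", $line))
end

here = @sketch_callsite(pwd())
```

A function could not do this, because it only sees its arguments; a macro is expanded at the call site and therefore has access to its location.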
DrWatson.gitdescribe
— Function
gitdescribe(gitpath = projectdir(); dirty_suffix = "-dirty", warn = true) -> gitstr
Return a string gitstr
with the output of git describe
if an annotated git tag exists, otherwise the current active commit id of the Git repository present in gitpath
, which by default is the currently active project.
If the repository is dirty when this function is called the string will end with the value of the keyword dirty_suffix
. When this happens, the keyword warn = DrWatson.readenv("DRWATSON_WARN_DIRTY", true)
will trigger a warning to be printed if it is true
(the default).
Return nothing
if gitpath
is not a Git repository, i.e. a directory within a git repository.
The format of the git describe
output in general is
`"TAGNAME-[NUMBER_OF_COMMITS_AHEAD-]gLATEST_COMMIT_HASH[-dirty]"`
If the latest tag is v1.2.3
and there are 5 additional commits while the latest commit hash is 334a0f225d9fba86161ab4c8892d4f023688159c, the output will be v1.2.3-5-g334a0f
. Notice that git will shorten the hash if there are no ambiguous commits.
More information about the git describe
output can be found at https://git-scm.com/docs/git-describe.
See also tag!
.
Examples
julia> gitdescribe() # a tag exists
"v1.2.3-g7364ab"
julia> gitdescribe() # a tag doesn't exist
"96df587e45b29e7a46348a3d780db1f85f41de04"
julia> gitdescribe(path_to_a_dirty_repo)
"3bf684c6a115e3dce484b7f200b66d3ced8b0832-dirty"
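When reading tagged data back, it can be handy to split a stored string of the `git describe` form into its parts. The following is a minimal sketch under the assumptions that the default `dirty_suffix = "-dirty"` was used and that the string follows the format quoted above; `parse_describe` and `DESCRIBE_RE` are hypothetical names.

```julia
# TAGNAME-[NUMBER_OF_COMMITS_AHEAD-]gLATEST_COMMIT_HASH[-dirty]
const DESCRIBE_RE =
    r"^(?<tag>.+?)(?:-(?<ahead>\d+))?-g(?<hash>[0-9a-f]+)(?<dirty>-dirty)?$"

# Hypothetical helper: returns a NamedTuple, or `nothing` if the string
# is not in the tag-based format (e.g. a bare commit id).
function parse_describe(s::AbstractString)
    m = match(DESCRIBE_RE, s)
    m === nothing && return nothing
    return (tag   = m[:tag],
            ahead = m[:ahead] === nothing ? 0 : parse(Int, m[:ahead]),
            hash  = m[:hash],
            dirty = m[:dirty] !== nothing)
end

info = parse_describe("v1.2.3-5-g334a0f")
bare = parse_describe("96df587e45b29e7a46348a3d780db1f85f41de04")
```

For the first string the sketch yields the tag `v1.2.3`, five commits ahead, and the abbreviated hash `334a0f`; the bare commit id does not match and gives `nothing`.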
DrWatson.gitpatch
— Function
gitpatch(gitpath = projectdir())
Generates a patch describing the changes of a dirty repository compared to its last commit; i.e. what git diff HEAD
produces. The gitpath
needs to point to a directory within a git repository, otherwise nothing
is returned.
Be aware that gitpatch
needs a working installation of Git, that can be found in the current PATH.
DrWatson.isdirty
— Function
isdirty(gitpath = projectdir()) -> Bool
Return true
if gitpath
is the path to a dirty Git repository, false
otherwise.
Note that unlike tag!
, isdirty
can error (for example, if the path passed to it doesn't exist, or isn't a Git repository). The purpose of isdirty
is to be used as a check before running simulations, for users that do not wish to tag data while having a dirty git repo.
Please notice that tag!
will operate in place only when possible. If not possible then a new dictionary is returned. Also (importantly) these functions will never error as they are most commonly used when saving simulations and this could risk data not being saved!
Produce or Load
produce_or_load
is a function that very conveniently integrates with savename
to either load a file if it exists, or if it doesn't to produce it, save it and then return it!
This saves you the effort of checking if a file exists and then loading, or then running some code and saving, or writing a bunch of if
clauses in your code. In addition, it attempts to minimize computing energy spent on getting a result.
DrWatson.produce_or_load
— Function
produce_or_load(f::Function, config, path = ""; kwargs...) -> data, file
The goal of produce_or_load
is to avoid running some data-producing code that has already been run with a given configuration container config
. If the output of some function f(config)
exists on disk, produce_or_load
will load it and return it, and if not, it will produce it, save it, and then return it.
Here is how it works:
- The output data are saved in a file named name = filename(config). I.e., the output file's name is created from the configuration container config. By default, this is name = savename(config), but can be configured differently, using e.g. hash, see keyword filename below. See also produce_or_load with hash codes for an example where config would be hard to put into name with savename, and hash is used instead.
- Now, let file = joinpath(path, name).
- If file exists, load it and return the contained data, along with the global path that it is saved at (file).
- If the file does not exist then call data = f(config), with f your function that produces your data from the configuration container.
- Then save the data as file and then return data, file.
The function f
should return a string-keyed dictionary if the data are saved in the default format of JLD2.jl; the macro @strdict
can help with that.
You can use a do-block instead of defining a function to pass in. For example,
produce_or_load(config, path) do config
# code using `config` runs here
# and then returns a dictionary to be saved
end
Keywords
Name deciding
- filename::Union{Function, String} = savename: Configures the name of the file to produce or load given the configuration container. It may be a one-argument function of config, savename by default, so that name = filename(config). A useful alternative to savename is hash. The keyword filename could also be a String directly, possibly extracted from config before calling produce_or_load, in which case name = filename.
- suffix = "jld2", prefix = default_prefix(config): If not empty, added to name as name = prefix*'_'*name*'.'*suffix (i.e., like in savename).
Saving
- tag::Bool = DrWatson.readenv("DRWATSON_TAG", istaggable(suffix)): Save the file using tagsave if true (which is the default).
- gitpath, storepatch: Given to tagsave if tag is true.
- force = false: If true then don't check if file exists and produce it and save it anyway.
- loadfile = true: If false, this function does not actually load the file, but only checks if it exists. The return value in this case is always nothing, file, regardless of whether the file exists or not. If it doesn't exist it is still produced and saved.
- verbose = true: print info about the process, if the file doesn't exist.
- wsave_kwargs = (;): Keywords to pass to wsave (e.g. to enable compression). Defaults to an empty named tuple.
- wload_kwargs = (;): Keywords to pass to wload (e.g. to pass a typemap). Defaults to an empty named tuple.
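The control flow of the steps listed under "Here is how it works" can be sketched without any saving backend. Everything below is a hypothetical illustration: `sketch_produce_or_load` is not the DrWatson function, the `Serialization` standard library stands in for JLD2, and the `name_of` keyword plays the role that the `filename` keyword plays above (here defaulting to a hash-based name).

```julia
using Serialization

# Hypothetical stand-in for `produce_or_load`, stdlib only.
function sketch_produce_or_load(f, config, path = "";
                                name_of = c -> string(hash(c), ".bin"),
                                force = false)
    file = joinpath(path, name_of(config))
    if !force && isfile(file)
        return deserialize(file), file   # already on disk: just load it
    end
    data = f(config)                     # not on disk (or forced): produce ...
    mkpath(dirname(file))
    serialize(file, data)                # ... save ...
    return data, file                    # ... and return
end

# Count how often the "expensive" function really runs.
calls = Ref(0)
simulate(config) = (calls[] += 1; Dict("result" => config.a * config.b))

dir = mktempdir()
config = (a = 2, b = 3)
data1, file1 = sketch_produce_or_load(simulate, config, dir)   # runs `simulate`
data2, file2 = sketch_produce_or_load(simulate, config, dir)   # loads from disk
```

The second call returns the same data and file path without invoking `simulate` again, which is the whole point of the pattern. Because `f` is the first argument, the do-block form works with this sketch as well.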
DrWatson.@produce_or_load
— Macro
@produce_or_load(f, config, path; kwargs...)
Same as produce_or_load
but one more field :script
is added that records the local path of the script and line number that called @produce_or_load
, see @tag!
.
Notice that path
here is mandatory in contrast to produce_or_load
.
DrWatson.istaggable
— Function
istaggable(file::AbstractString) → Bool
Return true
if the file save format (file ending) is "taggable", i.e. allows adding additional data fields as strings. Currently endings that can do this are:
("bson", "jld", "jld2")
istaggable(x) = x isa AbstractDict
For non-string input the function just checks if input is dictionary.
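As a rough illustration of the two cases described above, the check could look as follows. `sketch_istaggable` is a hypothetical re-implementation, not DrWatson's code; it assumes that both a full file name and a bare suffix such as `"jld2"` should be accepted, since the `tag` keyword default above calls `istaggable(suffix)`.

```julia
const TAGGABLE_ENDINGS = ("bson", "jld", "jld2")

# String input: decide by the file ending.
function sketch_istaggable(file::AbstractString)
    ext = last(splitext(file))
    # Accept both "data.jld2" and a bare suffix like "jld2".
    ending = isempty(ext) ? file : lstrip(ext, '.')
    return ending in TAGGABLE_ENDINGS
end

# Any other input: taggable only if it is a dictionary.
sketch_istaggable(x) = x isa AbstractDict
```

With these definitions, `sketch_istaggable("data.jld2")` and `sketch_istaggable(Dict("a" => 1))` are `true`, while `sketch_istaggable("results.csv")` is `false`.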
See Stopping "Did I run this?" for an example usage of produce_or_load
. While produce_or_load
will by default try to tag your data if possible, you can also use it with other formats. An example is when your simulation function f
returns a DataFrame
and the file suffix is "csv"
. In this case tagging will not happen, but produce_or_load
will work as expected.
Converting a struct to a dictionary
savename
gives great support for getting a name out of any Julia composite type. To save something though, one needs a dictionary. So the following function can be conveniently used to directly save a struct using any saving function:
DrWatson.struct2dict
— Function
struct2dict([type = Dict,] s) -> d
Convert a Julia composite type s
to a dictionary d
with key type Symbol
that maps each field of s
to its value. Simply passing s
will return a regular dictionary. This can be useful in e.g. saving:
tagsave(savename(s), struct2dict(s))
DrWatson.struct2ntuple
— Function
struct2ntuple(s) -> n
Convert a Julia composite type s
to a NamedTuple n
.
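Both conversions rest on Julia's reflection functions `fieldnames` and `getfield`. The sketch below shows the idea with hypothetical names (`sketch_struct2dict`, `sketch_struct2ntuple`, `Params`); the actual DrWatson functions may differ in details such as the value type of the returned dictionary.

```julia
# Illustrative composite type.
struct Params
    a::Int
    b::Float64
    label::String
end

# Map every field name (a Symbol) to its value.
sketch_struct2dict(s) =
    Dict{Symbol,Any}(f => getfield(s, f) for f in fieldnames(typeof(s)))

# Same information, but as an immutable NamedTuple.
function sketch_struct2ntuple(s)
    names = fieldnames(typeof(s))
    return NamedTuple{names}(map(f -> getfield(s, f), names))
end

p = Params(1, 2.5, "run")
d = sketch_struct2dict(p)
n = sketch_struct2ntuple(p)
```

The dictionary form is what dictionary-based saving functions such as tagsave expect, while the NamedTuple keeps the field order and concrete types.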