Real World Examples

Easy local directories

I setup all my science projects using DrWatson's suggested setup, using initialize_project. Then, every file in every project has a start that looks like this:

using DrWatson
@quickactivate "MagneticBilliardsLyapunovs"
using DynamicalBilliards, GLMakie, LinearAlgebra

include(srcdir("plot_perturbationgrowth.jl"))
include(srcdir("unitcells.jl"))

In all projects I save data/plots using datadir/plotsdir:

@tagsave(datadir("mushrooms", "Λ_N=$N.jld2"), (@strdict(Λ, Λσ, ws, hs, description)))

The advantage of this approach is that it will always work regardless of if I move the specific file to a different subfolder (which is very often necessary) or whether I move the entire project folder somewhere else! Please be sure you have understood the caveat of using quickactivate!

Here is an example from another project. You will notice that another advantage is that I can use identical syntax to access the data or source folders even though I have different projects!

using DrWatson
@quickactivate "EmbeddingResearch"
using Parameters
using TimeseriesPrediction, LinearAlgebra, Statistics

include(srcdir("systems", "barkley.jl"))
include(srcdir("nrmse.jl"))

# stuff...

save(datadir("sim", "barkley", "astonishing_results.jld2"), data)

Making your project a usable module

For some projects, it is often the case that some packages and files from the source folder are loaded at the beginning of every file of the project. For example, I have a project that I know that for any script I will write, the first five lines will be:

using DrWatson
@quickactivate "AlbedoProperties"
using Dates, Statistics, NCDatasets
include(srcdir("core.jl"))
include(srcdir("style.jl"))

It would be quite convenient to group all of these commands into one file and instead load that file, for example do include(srcdir("everything.jl")) and all commands are in there.

We can do even better though! Because of the way Julia handles project and module paths, it is in fact possible to transform the currently active project into a usable module. If one defines inside the src folder a file AlbedoProperties.jl and in that file define a module AlbedoProperties (notice that these names must match exactly the project name), then upon doing using AlbedoProperties Julia will in fact just bring this module into scope.

So what I end up doing (for some projects where this makes sense) is creating the aforementioned file and putting inside things like

module AlbedoProperties

using Reexport
@reexport using Dates, Statistics
using NCDatasets: NCDataset, dimnames, NCDatasets
export NCDataset, dimnames
include("core.jl") # this file now also has export statements
include("style.jl")

end

and then the header of all my files is transformed to

using DrWatson
@quickactivate :AlbedoProperties

which takes advantage of @quickactivate's feature to essentially combine the commands @quickactivate "AlbedoProperties" and using AlbedoProperties into one.

Please note: In this section, it's assumed that you created the project using initialize_project which will put the project's name into the Project.toml file. If you created your project in any other way, you need to ensure that the name is set at the top of Project.toml, i.e. for the example above there is a line

name = "AlbedoProperties"

If that line is absent, you will get an error like:

ERROR: ArgumentError: Package AlbedoProperties not found in current path.

savename and tagging

The combination of using savename and tagsave makes it easy and fast to save output in a way that is consistent, robust and reproducible. Here is an example from a project:

using DrWatson
quickactivate(@__DIR__, "EmbeddingResearch")
using TimeseriesPrediction, LinearAlgebra, Statistics
include(srcdir("systems", "barkley.jl"))

ΔTs = [1.0, 0.5, 0.1] # resolution of the saved data
Ns = [50, 150] # spatial extent
for N ∈ Ns, ΔT ∈ ΔTs
    T = 10050 # we can offset up to 1000 units
    every = round(Int, ΔT/barkley_Δt)
    seed = 1111

    simulation = @ntuple T N ΔT seed
    U, V = barkley(T, N, every; seed = seed)

    @tagsave(
        datadir("sim", "bk", savename(simulation, "jld2")),
        @strdict U V simulation
    )
end

This saves files that look like:

path/to/project/data/sim/bk_N=50_T=10050_seed=1111_ΔT=1.jld2

and each file is a dictionary that has my data fields: :U, :V, :simulation, but also :gitcommit, :script. When I read this file I know exactly what was the source code that produced it (provided that I am not sloppy and commit code changes regularly :P).

Customizing savename

Here is a simple example for customizing savename. We are using a common struct Experiment across different experiments with cats and mice.

We first define the relevant types.

using DrWatson, Dates
using Base: @kwdef # for defining structs with keyword values

# Define a type hierarchy we use at experiments
abstract type Species end
struct Mouse <: Species end
struct Cat <: Species end

# @with_kw comes from Parameters.jl
@kwdef struct Experiment{S<:Species}
    n::Int = 50
    c::Float64 = 10.0
    x::Float64 = 0.2
    date::Date = Date(Dates.now())
    species::S = Mouse()
    scientist::String = "George"
end

e1 = Experiment()
e2 = Experiment(species = Cat())
Main.Experiment{Main.Cat}(50, 10.0, 0.2, Dates.Date("2024-12-01"), Main.Cat(), "George")

For analyzing our experiments we need information about the species used, and to use multiple dispatch later on we decided to make this information associated with a Type. This is why we defined Species.

Now, we want to customize savename. We start by extending DrWatson.default_prefix:

DrWatson.default_prefix(e::Experiment) = "Experiment_"*string(e.date)

savename(e1)
"Experiment_2024-12-01_c=10.0_date=2024-12-01_n=50_scientist=George_x=0.2"

However this is not good enough for us, as the information about the species is not contained in savename and also the date information is duplicated. We have to extend DrWatson.default_allowed to specify which data types should be extended in savename:

DrWatson.default_allowed(::Experiment) = (Real, String, Species)

savename(e1)
"Experiment_2024-12-01_c=10.0_n=50_scientist=George_species=Main.Mouse()_x=0.2"

To make printing of Species better we can extend Base.string, which is what DrWatson uses internally in savename to display values.

Base.string(::Mouse) = "mouse"
Base.string(::Cat) = "cat"
savename(e1)
"Experiment_2024-12-01_c=10.0_n=50_scientist=George_species=mouse_x=0.2"

Lastly, let's say that the information of which scientist performed the experiment is not really relevant for savename. We can extend the last method, DrWatson.allaccess:

DrWatson.allaccess(::Experiment) = (:n, :c, :x, :species)

so that only those four fields will be used (notice that the date field is already used in default_prefix). We finally have:

println( savename(e1) )
println( savename(e2) )
Experiment_2024-12-01_c=10.0_n=50_species=mouse_x=0.2
Experiment_2024-12-01_c=10.0_n=50_species=cat_x=0.2

savename and nested containers

In the case of user-defined structs and projects of significant complexity, it is often necessary that your "main" container has other containers as subfields. savename can adapt to these situations as well. Consider the following example, where I need a core struct that represents a spatiotemporal system, and its simulation:

struct SpatioTemporalSystem
    model::String # system codeword
    N        # Integer or Tuple of integers: spatial extent
    Δt::Real # sampling time in real time units
    p        # parameters. nothing or Dict{Symbol}
end
const STS = SpatioTemporalSystem

struct SpatioTemporalTimeseries
    sts::STS
    T::Int       # total frame amount
    ic           # initial condition (matrix, string, seed)
    fields::Dict # resulting timeseries, dictionary of string to vector
end
const STT = SpatioTemporalTimeseries
Main.SpatioTemporalTimeseries

For my use case, p can be nothing or it can be a dictionary itself, containing the possible parameters the spatiotemporal systems can have. To adapt savename to situations like this, we use the functionality surrounding DrWatson.default_expand.

Expanding the necessary methods allows me to do:

DrWatson.allaccess(c::STS) = (:N, :Δt, :p)
DrWatson.default_prefix(c::STS) = c.model
DrWatson.default_allowed(c::STS) = (Real, Tuple, Dict, String)
DrWatson.default_expand(c::STS) = ["p"]

bk = STS("barkley", 60, 0.1, nothing)
savename(bk)
"barkley_N=60_Δt=0.1"

and when I do want to use different parameters than the default:

a = 0.3; b = 0.5
bk = STS("barkley", 60, 0.1, @dict a b)
savename(bk)
"barkley_N=60_p=(a=0.3,b=0.5)_Δt=0.1"

Expanding to the second struct is also fine:

DrWatson.default_prefix(c::STT) = savename(c.sts)
stt = STT(bk, 1000, nothing, Dict("U"=>rand(100), "V"=>rand(100)))
savename(stt)
"barkley_N=60_p=(a=0.3,b=0.5)_Δt=0.1_T=1000"

Stopping "Did I run this?"

It can become very tedious to have a piece of code that you may or may not have run and may or may not have saved the produced data. You then constantly ask yourself "Did I run this?". Depending on how costly running the code is, having a good framework to answer this question can become very important!

This is the role of produce_or_load. You can wrap your code in a function and then produce_or_load will take care of the rest for you! I found it especially useful in scripts that generate figures for a publication.

Here is an example; originally I had this piece of code:

HTEST = 0.1:0.1:2.0
WS = [0.5, 1.0, 1.5]
N = 10000; T = 10000.0

toypar_h = [[] for l in WS]
for (wi, w) in enumerate(WS)
    println("w = $w")
    for h in HTEST
        toyp = toyparameters(h, w, N, T)
        push!(toypar_h[wi], toyp)
    end
end

that was taking some minutes to run. To use the function produce_or_load I first have to wrap this code in a high level function like so:

function simulation(config)
    HTEST = 0.1:0.1:2.0
    WS = [0.5, 1.0, 1.5]
    @unpack N, T = config
    toypar_h = [[] for _ in WS]

    for (wi, w) in enumerate(WS)
        println("w = $w")
        for h in HTEST
            toyp = toyparameters(h, w, N, T)
            push!(toypar_h[wi], toyp)
        end
    end
    return @strdict toypar_h
end

N = 2000; T = 2000.0
data, file = produce_or_load(
    simulation, # function
    @dict(N, T), # container
    datadir("mushrooms", "toy"), # path
    prefix = "fig5_toyparams" # prefix for savename
)
@unpack toypar_h = data

Now, every time I run this code block the function tests automatically whether the file exists. Only if it does not, then the code is run while the new result is saved to ensure I won't have to run it again.

The extra step is that I have to extract the useful data I need from the container file. Thankfully the @unpack macro, or if your are using Julia v1.5 or later, the named decomposition syntax, (; a, b) = config, makes unpacking super easy.

produce_or_load with hash codes

As displayed above, the default setting of produce_or_load uses savename to extract the filename from the configuration input. This file name is used to check whether the program has run and its output has been saved or not. However, in some situations you may too many parameters, or complicated nested structs, and encoding these simply using savename is not possible or simply inconvenient.

Thankfully, instead of savename we can use base Julia's hash function as we will illustrate in the following example.

using DrWatson
using Random

function sim_large_c(config)
    @unpack x, f = config
    r = sum(x)*f.a + f.t.b + f.t.c
    return @strdict(r)
end

## Some nested structs
f1 = (a = 1, t = (b = 2, c = 3))
f2 = (a = 2, t = (b = 4, c = 5))
## some containers with too many parameters
rng = Random.MersenneTwister(1234)
x1 = rand(Random.MersenneTwister(1234), 1000)
x2 = randn(Random.MersenneTwister(1234), 20)

preconfigs = Dict("x" => [x1, x2], "f" => [f1, f2])
configs = dict_list(preconfigs)

path = mktempdir()
pol_kwargs = (prefix = "sim_large_c", verbose = false, tag = false)

for config in configs
    produce_or_load(sim_large_c, config, path; pol_kwargs...)
end

readdir(path)
1-element Vector{String}:
 "sim_large_c.jld2"

as you can see this is obviously useless :D savename didn't return anything from the given config containers so all data had the same name. Let's use hash instead:

rm(joinpath(path, "sim_large_c.jld2"))
for config in configs
    produce_or_load(sim_large_c, config, path; filename = hash, pol_kwargs...)
end
readdir(path)
4-element Vector{String}:
 "sim_large_c_2481395954728946486.jld2"
 "sim_large_c_5275202029616168210.jld2"
 "sim_large_c_8887498325681316289.jld2"
 "sim_large_c_9332398010852946612.jld2"

Lovely. But, just to be on the safe side, if we use a different input x but of same type and size would we get a different file name (as desired)?

config = Dict("x" => rand(Random.MersenneTwister(4321)), "f" => f1)
produce_or_load(sim_large_c, config, path; filename = hash, pol_kwargs...)
readdir(path)
5-element Vector{String}:
 "sim_large_c_2481395954728946486.jld2"
 "sim_large_c_5275202029616168210.jld2"
 "sim_large_c_8059201649622751010.jld2"
 "sim_large_c_8887498325681316289.jld2"
 "sim_large_c_9332398010852946612.jld2"

yes. But, if we used exactly the same numbers and function, would it yield exactly the same hash code, and hence, not rerun the simulation (as desired)?

config = Dict("x" => rand(Random.MersenneTwister(1234), 1000), "f" => f1)
produce_or_load(sim_large_c, config, path; filename = hash, pol_kwargs...)
readdir(path)
5-element Vector{String}:
 "sim_large_c_2481395954728946486.jld2"
 "sim_large_c_5275202029616168210.jld2"
 "sim_large_c_8059201649622751010.jld2"
 "sim_large_c_8887498325681316289.jld2"
 "sim_large_c_9332398010852946612.jld2"

Perfect!

Be careful of using `hash`.

The limitations of the hash function apply here. For example, custom types should implement == to ensure hash will work as intended. In general using functions with hash should be avoided. Hashing of functions happens on the function name, and hence it doesn't capture information about the actual code of the function or its methods. So this should only be used if the functions are well-established names coming from e.g. Base Julia such as sin, cos, .... You also cannot use anonymous functions at all, as they do not have the same hash even when defined in the the same way but in different Julia sessions.

Preparing & running jobs

Preparing the dictionaries

Here is a shortened script from a project that uses dict_list:

using DrWatson

general_args = Dict(
    "model" => ["barkley", "kuramoto"],
    "noise" => 0.075,
    "noisy_training" => [true, false],
    "N" => [100],
    "embedding" => [ #(γ, τ, r, c)
    (4, 5, 1, 0.34), (4, 6, 1, 0.28)]
)
Dict{String, Any} with 5 entries:
  "embedding"      => [(4, 5, 1, 0.34), (4, 6, 1, 0.28)]
  "model"          => ["barkley", "kuramoto"]
  "N"              => [100]
  "noise"          => 0.075
  "noisy_training" => Bool[1, 0]
dicts = dict_list(general_args)
println("Total dictionaries made: ", length(dicts))
dicts[1]
Dict{String, Any} with 5 entries:
  "embedding"      => (4, 5, 1, 0.34)
  "model"          => "barkley"
  "N"              => 100
  "noise"          => 0.075
  "noisy_training" => true

Also, using the type Derived, we can have parameters that are computed depending on the value of other parameters:

using DrWatson

general_args2 = Dict(
    "model" => "barkley",
    "noise" => [0.075, 0.050, 0.025],
    "noise2" => [1.0, Derived(["noise", "N"], (x,y) -> 2x + y)],
    "noisy_training" => true,
    "N" => 100,
)
Dict{String, Any} with 5 entries:
  "noise2"         => Any[1.0, Derived{String}(["noise", "N"], #3)]
  "model"          => "barkley"
  "N"              => 100
  "noise"          => [0.075, 0.05, 0.025]
  "noisy_training" => true
dicts2 = dict_list(general_args2)
println("Total dictionaries made: ", length(dicts2))
dicts2[1]
Dict{String, Any} with 5 entries:
  "noise2"         => 1.0
  "model"          => "barkley"
  "N"              => 100
  "noise"          => 0.075
  "noisy_training" => true

Now, how you use these dictionaries is up to you. Typically each dictionary is given to a main-like Julia function which extracts the necessary data and calls the necessary functions.

Let's say I have written a function that takes in one of these dictionaries and saves the file somewhere locally:

function cross_estimation(data)
    γ, τ, r, c = data["embedding"]
    N = data["N"]
    # add fake results:
    data["x"] = rand()
    data["error"] = rand(10)
    # Save data:
    prefix = datadir("results", data["model"])
    get(data, "noisy_training", false) && (prefix *= "_noisy")
    get(data, "symmetric_training", false) && (prefix *= "_symmetric")
    sname = savename((@dict γ τ r c N), "jld2")
    mkpath(datadir("results", data["model"]))
    save(datadir("results", data["model"], sname), data)
    return true
end
cross_estimation (generic function with 1 method)

Using map and pmap

One way to run many simulations is with map (identical process for using pmap). To run all my simulations I just do:

dicts = dict_list(general_args)
map(cross_estimation, dicts) # or pmap

# load one of the files to be sure everything is ok:
filename = readdir(datadir("results", "barkley"))[1]
file = load(datadir("results", "barkley", filename))
Dict{String, Any} with 7 entries:
  "embedding"      => (4, 6, 1, 0.28)
  "model"          => "barkley"
  "N"              => 100
  "x"              => 0.950867
  "error"          => [0.987668, 0.528195, 0.804356, 0.027385, 0.656238, 0.3454…
  "noise"          => 0.075
  "noisy_training" => false

Using a Serial Cluster

In case that I can't store the results of dict_list in memory, I have to change my approach and load them from disk. This is easy with the function tmpsave.

Instead of using Julia to run all jobs from one process with map/pmap one can use Julia to submit many jobs to a cluster que. For our example above, the Julia program that does this would look like this:

dicts = dict_list(general_args)
res = tmpsave(dicts)
for r in res
    submit = `qsub -q queuename julia runjob.jl $r`
    run(submit)
end

Now the file runjob.jl would have contents that look like:

f = ARGS[1]
dict = load(projectdir("_research", "tmp", f), "params")
cross_estimation(dict)

i.e. it just loads the dict and straightforwardly uses the "main" function cross_estimation. Remember to routinely clear the tmp directory! You could do that by e.g. adding a line rm(projectdir("_research", "tmp", f) at the end of the runjob.jl script.

Listing completed runs

Continuing from the Preparing & running jobs section, we now want to collect the results of all these simulations into a single DataFrame. We will do that with the function collect_results!.

It is quite simple actually! But because we don't want to include the error, we have to black-list it:

using DataFrames # this is necessary to access collect_results!
bl = ["error"]
res = collect_results!(datadir("results"); black_list = bl, subfolders = true)
4×7 DataFrame
RowembeddingmodelNxnoisenoisy_trainingpath
Tuple…?String?Int64?Float64?Float64?Bool?String?
1(4, 6, 1, 0.28)barkley1000.9508670.075false/home/runner/work/DrWatson.jl/DrWatson.jl/docs/data/results/barkley/N=100_c=0.28_r=1_γ=4_τ=6.jld2
2(4, 5, 1, 0.34)barkley1000.3480840.075false/home/runner/work/DrWatson.jl/DrWatson.jl/docs/data/results/barkley/N=100_c=0.34_r=1_γ=4_τ=5.jld2
3(4, 6, 1, 0.28)kuramoto1000.1048270.075false/home/runner/work/DrWatson.jl/DrWatson.jl/docs/data/results/kuramoto/N=100_c=0.28_r=1_γ=4_τ=6.jld2
4(4, 5, 1, 0.34)kuramoto1000.0465060.075false/home/runner/work/DrWatson.jl/DrWatson.jl/docs/data/results/kuramoto/N=100_c=0.34_r=1_γ=4_τ=5.jld2

We can take also advantage of the basic processing functionality of collect_results! to use the excluded "error" column, replacing it with its average value:

using Statistics: mean
special_list = [:avrg_error => data -> mean(data["error"])]
res = collect_results(
      datadir("results"),
      black_list = bl,
      special_list = special_list,
      subfolders = true
)

select!(res, Not(:path)) # don't show path this time
4×7 DataFrame
RowembeddingmodelNxnoisenoisy_trainingavrg_error
Tuple…?String?Int64?Float64?Float64?Bool?Float64?
1(4, 6, 1, 0.28)barkley1000.9508670.075false0.695379
2(4, 5, 1, 0.34)barkley1000.3480840.075false0.465687
3(4, 6, 1, 0.28)kuramoto1000.1048270.075false0.619848
4(4, 5, 1, 0.34)kuramoto1000.0465060.075false0.543786

As you see here we used collect_results instead of the in-place version, since there already exists a DataFrame with all results processed (and thus everything would be skipped).

Adapting to new data/parameters

We once again continue from the above example. But we now need to run some new simulations with some new parameters that do not exist in the old simulations... Well, DrWatson says "no problem!" :)

Let's save these new parameters in a different subfolder, to have a neatly organized project:

general_args_new = Dict(
    "model" => ["bocf"],
    "symmetry" => "radial",
    "symmetric_training" => [true, false],
    "N" => [100],
    "embedding" => [ #(γ, τ, r, c)
    (4, 5, 1, 0.34), (4, 6, 1, 0.28)]
)
Dict{String, Any} with 5 entries:
  "symmetry"           => "radial"
  "model"              => ["bocf"]
  "symmetric_training" => Bool[1, 0]
  "N"                  => [100]
  "embedding"          => [(4, 5, 1, 0.34), (4, 6, 1, 0.28)]

As you can see, there here there are two parameters not existing in previous simulations, namely "symmetry", "symmetric_training". In addition, the parameters "noise", "noisy_training" that existed in the previous simulations do not exist in the current one.

No problem though, let's run the new simulations:

dicts = dict_list(general_args_new)
map(cross_estimation, dicts)

# load one of the files to be sure everything is ok:
filename = readdir(datadir("results", "bocf"))[1]
file = load(datadir("results", "bocf", filename))
Dict{String, Any} with 7 entries:
  "symmetric_training" => false
  "model"              => "bocf"
  "N"                  => 100
  "embedding"          => (4, 6, 1, 0.28)
  "symmetry"           => "radial"
  "x"                  => 0.998601
  "error"              => [0.719395, 0.118252, 0.88885, 0.344028, 0.0396596, 0.…

Alright, now we want to add these new runs to our existing dataframe that has collected all previous results. This is straight-forward:

res = collect_results!(datadir("results"); black_list = bl, subfolders = true)

select!(res, Not(:path)) # don't show path this time
6×8 DataFrame
RowembeddingmodelNxnoisenoisy_trainingsymmetric_trainingsymmetry
Tuple…?String?Int64?Float64?Float64?Bool?Bool?String?
1(4, 6, 1, 0.28)barkley1000.9508670.075falsemissingmissing
2(4, 5, 1, 0.34)barkley1000.3480840.075falsemissingmissing
3(4, 6, 1, 0.28)kuramoto1000.1048270.075falsemissingmissing
4(4, 5, 1, 0.34)kuramoto1000.0465060.075falsemissingmissing
5(4, 6, 1, 0.28)bocf1000.998601missingmissingfalseradial
6(4, 5, 1, 0.34)bocf1000.0520732missingmissingfalseradial

All missing entries were adjusted automatically :)

Defining parameter sets with restrictions

As already demonstrated in the examples above, for functions where the set of input parameters is the same for each simulation run, a basic dictionary can be used to define these parameters. However, often some of the parameters or values should only be considered if another parameter is also included in the set or has a specific value. The macro @onlyif allows to place such restrictions on values and parameters. The following dictionary defines values and parameters for a genetic algorithm:

ga_parameters = Dict(
    :population_size => [20,50,100],
    :selection => ["roulette-selection", "SUS", "tournament-selection", "linear ranking"],
    :fitness_scaling => @onlyif(:selection in ("SUS", "roulette-selection"), collect(1.0:20.0)),
    :tournamet_size => @onlyif(:selection == "tournament-selection", collect(2:10)),
    :chromosome => [:A, @onlyif(begin
        size_constr = (:population_size <= 50)
        select_constr = (:selection != "SUS")
        size_constr && select_constr
    end, :B)])
Dict{Symbol, Vector} with 5 entries:
  :selection       => ["roulette-selection", "SUS", "tournament-selection", "li…
  :population_size => [20, 50, 100]
  :chromosome      => Any[:A, DependentParameter{Symbol}(:B, #9)]
  :fitness_scaling => DependentParameter{Float64}[DependentParameter{Float64}(1…
  :tournamet_size  => DependentParameter{Int64}[DependentParameter{Int64}(2, #8…
dicts = dict_list(ga_parameters)
length(dicts)
210
dicts[1]
Dict{Symbol, Any} with 4 entries:
  :selection       => "roulette-selection"
  :population_size => 20
  :chromosome      => :A
  :fitness_scaling => 1.0

The parameter restriction for the chromosome type shows that one can use arbitrary Julia expressions that return true or false. In this case, first the conditions for the population size and for the selection method are evaluated and stored. The expression then only returns true, if both conditions are met, thus restricting the usage of chromosome type :B.

As @onlyif is meant to be used with dict_list, it supports the vector notation used for defining possible parameter values. This is achieved by automatically broadcasting every @onlyif call over Vector arguments, which allows chaining those calls to combine conditions. So in terms of the result, @onlyif( :a == 2, [5, @onlyif(:b == 4, 6)]) is equivalent to [@onlyif( :a == 2, 5), @onlyif(:a == 2 && :b == 4, 6)].

Filtering by name with collect_results

Using collect_results on a folder with many (e.g. 1,000) files in it can be noticeably slow. To speed this up, you can use the rinclude and rexclude keyword arguments, both of which are vectors of Regex expressions. The results returned will have a filename which matches any of the Regex expressions in rinclude and does not match any of the Regex expressions in rexclude.

df = collect_results(datadir("results"); rinclude=[r"a=1"])
# Only include results whose filename contains "a=1"

df = collect_results(datadir("results"); rexclude=[r"a=3"])
# Exclude any results whose filename contains "a=3"

df = collect_results(datadir("results"); rinclude=[r"a=1", r"b=5"], rexclude=[r"a=3"])
# Only include results whose filename contains "a=1" OR "b=5" and exclude any which contain "a=3"

Advanced usage of collect_results

At some point in your work you may want to run a single function that returns multiple fields that you want to include in your results DataFrame. Depending on the problem you are trying to solve it may just make more sense to use a single function that extracts most or all of the meta-data. For this case DrWatson has another syntax available. Let us, for the sake of simplicity, assume that your data files contain a very long array of numbers called "manynumbers" and the information that you care about are the three largest values.

One way to implement this would be to write

special_list = [
    :first  => data -> sort(data["manynumbers"])[1],
    :second => data -> sort(data["manynumbers"])[2],
    :third  => data -> sort(data["manynumbers"])[3],
    ]

which makes very obvious that there should be a better way to do this. There is no point in sorting the very long vector three times. A better thing to do is the following

function largestthree(data)
    sorted = sort(data["manynumbers"])
    return [:first  => sorted[1],
            :second => sorted[2],
            :third  => sorted[3]]
end

special_list = [largestthree,]

Using savename to produce logfiles

When your code runs for a long time or even runs on different machines such as a cluster environment it becomes important to produce logfiles. Logfiles allow you to view the progress of your program while it is still running, or check later on if everything went according to plan.

using Dates

function logmessage(n, error)
    # current time
    time = Dates.format(now(UTC), dateformat"yyyy-mm-dd HH:MM:SS")

    # memory the process is using
    maxrss = "$(round(Sys.maxrss()/1048576, digits=2)) MiB"

    logdata = (;
        n, # iteration n
        error, # some super important progress update
        maxrss) # lastly the amount of memory being used

    println(savename(time, logdata; connector=" | ", equals=" = ", sort=false, digits=2))
end

function expensive_computation(N)

    for n = 1:N
        sleep(1) # heavy computation
        error = rand()/n # some super import progress update
        logmessage(n, error)
    end

end

This yields output that is both easy to read and machine parseable. If you ever end up with too many logfiles to read, there is still parse_savename to help you.

julia> expensive_computation(5)
2021-05-19 19:20:25 | n = 1 | error = 0.65 | maxrss = 326.27 MiB
2021-05-19 19:20:26 | n = 2 | error = 0.48 | maxrss = 326.27 MiB
2021-05-19 19:20:27 | n = 3 | error = 0.08 | maxrss = 326.27 MiB
2021-05-19 19:20:28 | n = 4 | error = 0.11 | maxrss = 326.27 MiB
2021-05-19 19:20:29 | n = 5 | error = 0.15 | maxrss = 326.27 MiB

Taking project input-output automation to 11

The point of this section is to show how far one can take the interplay between savename and produce_or_load to automate project input-to-output and eliminate as many duplicate lines of code as possible. Read Customizing savename first, as knowledge of that section is used here.

The key ingredient is that produce_or_load was made to work well with savename. You can use this to automate the input-to-output pipeline of your project by following these steps:

  1. Define a custom struct that represents the input configuration for an experiment or a simulation.
  2. Extend savename appropriately for it.
  3. Define a "main" function that takes as input an instance of this configuration type, and returns the output of the experiment or simulation as dictionary (We're not changing here the "default" way to save files in Julia as .jld2 files. To save files this way you need your data to be in a dictionary with String as keys).
  4. All your input-output scripts are simply put together by first defining the input configuration type, and then calling produce_or_load with your pre-defined "main" function (Alternatively, this function can internally call produce_or_load and return something else that is of special interest to your specific case).

An example of where this approach is used in the "real world" is e.g. in our paper Effortless estimation of basins of attraction. Its codebase is here: https://github.com/Datseris/EffortlessBasinsOfAttraction. Don't worry, you need to know nothing about the topic to follow the rest. The point is that we needed to run some kind of simulations for many different dynamical systems, which have different parameters, different dimensionality, etc. But they did have one thing in common: our output was always coming from the same function, basins_of_attraction, which allowed using the pipeline we discuss here using produce_or_load.

So we defined a struct called BasinConfig that stored configuration options and system parameters. Then we extended savename for it. We defined some function produce_basins that takes this configuration file, initializes a dynamical system accordingly, and then makes the output using produce_or_load. This ensures that we're not running simulations twice if they exist. And keep in mind when you have so many parameters and different possible systems, it is quite easy to unintentionally run the same simulation twice because you "forgot about it". All of this can be found in this file: https://github.com/Datseris/EffortlessBasinsOfAttraction/blob/master/src/produce_basins.jl

The benefit? All of our scripts that actually produce what we care about are this short:

using DrWatson
@quickactivate :EffortlessBasinsOfAttraction

a, b = 1.4, 0.3
p = @ntuple a b
system = :henon

basin_kwargs = (horizon_limit=100.0, mx_chk_fnd_att=30, mx_chk_lost=2)
Z = 201
xg = range(-1.5, 1.5; length = Z)
yg = range(-0.5, 0.5; length = Z)
grid = (xg, yg)

config = BasinConfig(; system, p, basin_kwargs, grid)
basins, attractors = produce_basins(config)

and more importantly, the only lines that are genuinely "copy-pasted" from script to script are the last two. All other lines are unique for each script. This minimization of copy-pasting duplicate information makes the workflow robust and makes bugs easier to find.