Real World Examples
Easy local directories
I set up all my science projects using DrWatson's suggested setup, using initialize_project. Then, every file in every project has a start that looks like this:
using DrWatson
quickactivate(@__DIR__, "MagneticBilliardsLyapunovs")
using DynamicalBilliards, PyPlot, LinearAlgebra
include(srcdir()*"plot_perturbationgrowth.jl")
include(srcdir()*"unitcells.jl")
In all projects I save data and plots using datadir/plotsdir:
tagsave(datadir()*"mushrooms/Λ_N=$N.bson", (@dict Λ Λσ ws hs description))
The advantage of this approach is that it will always work regardless of whether I move the specific file to a different subfolder (which is very often necessary) or move the entire project folder somewhere else! Please be sure you have understood the caveat of using quickactivate!
Here is an example from another project. Notice another advantage: I can use identical syntax to access the data or source folders even though this is a different project!
using DrWatson
quickactivate(@__DIR__, "EmbeddingResearch")
using Parameters
using TimeseriesPrediction, LinearAlgebra, Statistics
include(srcdir()*"systems/barkley.jl")
include(srcdir()*"nrmse.jl")
that ends with
tagsave(
savename(datadir()*"sim/bk", simulation, "jld2"),
@strdict U V simulation
)
savename and tagging
The combination of savename and tagsave makes it easy and fast to save output in a way that is consistent, robust and reproducible. Here is an example from a project:
using DrWatson
quickactivate(@__DIR__, "EmbeddingResearch")
using TimeseriesPrediction, LinearAlgebra, Statistics
include(srcdir()*"systems/barkley.jl")
ΔTs = [1.0, 0.5, 0.1] # resolution of the saved data
Ns = [50, 150] # spatial extent
for N ∈ Ns, ΔT ∈ ΔTs
T = 10050 # we can offset up to 1000 units
every = round(Int, ΔT/barkley_Δt)
seed = 1111
simulation = @ntuple T N ΔT seed
U, V = barkley(T, N, every; seed = seed)
tagsave(
savename(datadir()*"sim/bk", simulation, "bson"),
@dict U V simulation
)
end
This saves files that look like:
path/to/project/data/sim/bk_N=50_T=10050_seed=1111_ΔT=1.bson
and each file is a dictionary with four fields: :U, :V, :simulation, :commit. When I read this file I know exactly which source code produced it (provided that I am not sloppy and commit code changes regularly :P).
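To illustrate the naming scheme, here is a hypothetical plain-Julia sketch of the core idea behind savename: sorted key=value pairs joined by underscores. This mini_savename is only an illustration, not DrWatson's actual implementation (which additionally supports rounding, type filtering, custom prefixes and more):

```julia
# Hypothetical sketch of the savename idea: build a filename from
# parameter key=value pairs, sorted alphabetically for consistency.
function mini_savename(prefix, params::NamedTuple, suffix)
    ks = sort(collect(keys(params)))
    body = join(["$(k)=$(params[k])" for k in ks], "_")
    return string(prefix, "_", body, ".", suffix)
end

sim = (T = 10050, N = 50, seed = 1111)
name = mini_savename("data/sim/bk", sim, "bson")
# → "data/sim/bk_N=50_T=10050_seed=1111.bson"
```

Sorting the keys is what makes the scheme robust: the same parameter container always maps to the same filename, regardless of the order in which the parameters were defined.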
Customizing savename
Here is a simple (but not from a real project) example of customizing savename. We are using a common struct Experiment across different experiments with cats and mice. In this example we are also using Parameters.jl for a convenient default constructor.
We first define the relevant types.
using DrWatson, Parameters, Dates
# Define a type hierarchy we use at experiments
abstract type Species end
struct Mouse <: Species end
struct Cat <: Species end
@with_kw struct Experiment{S<:Species}
n::Int = 50
c::Float64 = 10.0
x::Float64 = 0.2
date::Date = Date(Dates.now())
species::S = Mouse()
scientist::String = "George"
end
e1 = Experiment()
e2 = Experiment(species = Cat())
Main.ex-customizing.Experiment{Main.ex-customizing.Cat}
n: Int64 50
c: Float64 10.0
x: Float64 0.2
date: Date
species: Main.ex-customizing.Cat Main.ex-customizing.Cat()
scientist: String "George"
For analyzing our experiments we need information about the species used, and to use multiple dispatch later on we decided to associate this information with a type.
Now, we want to customize savename. We start by extending DrWatson.default_prefix:
DrWatson.default_prefix(e::Experiment) = "Experiment_"*string(e.date)
savename(e1)
"Experiment_2019-05-03_c=10_n=50_scientist=George_x=0.2"
However this is not good enough for us, as the information about the species is not contained in savename. We have to extend DrWatson.default_allowed like so:
DrWatson.default_allowed(::Experiment) = (Real, String, Species)
savename(e1)
"Experiment_2019-05-03_c=10_n=50_scientist=George_species=Main.ex-customizing.Mouse()_x=0.2"
To make the printing nicer we can extend Base.string, which is what DrWatson uses internally in savename to display values.
Base.string(::Mouse) = "mouse"
Base.string(::Cat) = "cat"
Lastly, let's say that the information of which scientist performed the experiment is not really relevant for savename. We can extend the last method, DrWatson.allaccess:
DrWatson.allaccess(::Experiment) = (:n, :c, :x, :species)
so that only those four fields will be used (notice that the date field is used in default_prefix anyway). We finally have:
println( savename(e1) )
println( savename(e2) )
Experiment_2019-05-03_c=10_n=50_species=mouse_x=0.2
Experiment_2019-05-03_c=10_n=50_species=cat_x=0.2
Stopping "Did I run this?"
It can become very tedious to have a piece of code that you may or may not have run and may or may not have saved the produced data. You then constantly ask yourself "Did I run this?". Typically one uses isfile and an if clause to either load a file or run some code. Especially in the cases where the code takes only a couple of minutes to finish, you are left in a dilemma: "Is it even worth it to save?".
This is the dilemma that produce_or_load resolves. You can wrap your code in a function and produce_or_load will take care of the rest for you! I found it especially useful in scripts that generate figures for a publication.
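The pattern itself is simple; here is a hypothetical plain-Julia sketch of it. This mini_produce_or_load is not DrWatson's actual function, and for self-containedness it uses the standard Serialization library rather than BSON/JLD2:

```julia
using Serialization

# Hypothetical sketch of the produce-or-load pattern: run the
# (possibly expensive) function `f` only if its output file is
# missing; otherwise just load the previously saved result.
function mini_produce_or_load(path, f)
    if isfile(path)
        return deserialize(path)   # "I did run this" → load
    else
        data = f()                 # "I did not" → produce...
        serialize(path, data)      # ...and save for next time
        return data
    end
end

path = joinpath(mktempdir(), "toy.jls")
r1 = mini_produce_or_load(path, () -> (calls = 1,))
r2 = mini_produce_or_load(path, () -> (calls = 2,))  # loads, does not rerun
# r2 == (calls = 1,): the second function was never executed
```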
Here is an example; originally I had this piece of code:
HTEST = 0.1:0.1:2.0
WS = [0.5, 1.0, 1.5]
N = 10000; T = 10000.0
toypar_h = [[] for h in HTEST]
for (wi, w) in enumerate(WS)
println("w = $w")
for h in HTEST
toyp = toyparameters(h, w, N, T)
push!(toypar_h[wi], toyp)
end
end
that was taking some minutes to run. To use produce_or_load I first have to wrap this code in a high-level function, like so:
HTEST = 0.1:0.1:2.0
WS = [0.5, 1.0, 1.5]
function g(d)
@unpack N, T = d
toypar_h = [[] for h in HTEST]
for (wi, w) in enumerate(WS)
println("w = $w")
for h in HTEST
toyp = toyparameters(h, w, N, T)
push!(toypar_h[wi], toyp)
end
end
return @dict toypar_h
end
N = 2000; T = 2000.0
file = produce_or_load(
datadir()*"mushrooms/toy", # prefix
@dict(N, T), # container
g # function
)
@unpack toypar_h = file
Now, every time I run this code block the function automatically tests whether the file exists; the code is run only if it does not. The extra step is that I have to extract the useful data I need from the container file. Thankfully the @unpack macro from Parameters.jl makes this super easy.
Preparing & running jobs
Preparing the dictionaries
Here is a shortened script from a project that uses dict_list:
using DrWatson
general_args = Dict(
"model" => ["barkley", "kuramoto"],
"noise" => 0.075,
"noisy_training" => [true, false],
"N" => [100],
"embedding" => [ #(γ, τ, r, c)
(4, 5, 1, 0.34), (4, 6, 1, 0.28)]
)
Dict{String,Any} with 5 entries:
"embedding" => Tuple{Int64,Int64,Int64,Float64}[(4, 5, 1, 0.34), (4, 6, …
"model" => ["barkley", "kuramoto"]
"N" => [100]
"noise" => 0.075
"noisy_training" => Bool[true, false]
dicts = dict_list(general_args)
println("Total dictionaries made: ", length(dicts))
dicts[1]
Dict{String,Any} with 5 entries:
"embedding" => (4, 5, 1, 0.34)
"model" => "barkley"
"N" => 100
"noise" => 0.075
"noisy_training" => true
Now, how you use these dictionaries is up to you. Typically each dictionary is given to a main-like Julia function, which extracts the necessary data and calls the necessary functions.
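As a rough illustration of what dict_list computes, here is a hypothetical plain-Julia sketch. This mini_dict_list only handles the basic case of expanding Vector values into all combinations; the real dict_list additionally supports things like dependent parameters:

```julia
# Hypothetical sketch of the dict_list idea: every Vector value is
# expanded into all its elements, scalar values stay fixed, and one
# dictionary is produced per combination.
function mini_dict_list(d::Dict)
    ks = collect(keys(d))
    vs = [v isa Vector ? v : [v] for v in (d[k] for k in ks)]
    return vec([Dict(zip(ks, combo)) for combo in Iterators.product(vs...)])
end

args = Dict("model" => ["barkley", "kuramoto"], "noise" => 0.075, "N" => [100])
dicts = mini_dict_list(args)
# 2 × 1 × 1 = 2 combinations, each a full parameter dictionary
```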
Let's say I have written a function that takes in one of these dictionaries and saves the file somewhere locally:
function cross_estimation(data)
γ, τ, r, c = data["embedding"]
N = data["N"]
# add fake results:
data["x"] = rand()
data["error"] = rand(10)
# Save data:
prefix = datadir()*"results/"*data["model"]
get(data, "noisy_training", false) && (prefix *= "_noisy")
save(
savename(
prefix,
(@dict γ τ r c N),
"bson"
),
data
)
return true
end
cross_estimation (generic function with 1 method)
Using map and pmap
One way to run many simulations is with map (the process is identical for pmap). To run all my simulations I just do:
mkpath(datadir()*"results")
dicts = dict_list(general_args)
map(cross_estimation, dicts) # or pmap
# load one of the files to be sure everything is ok:
filename = readdir(datadir()*"results")[1]
file = load(datadir()*"results/"*filename)
Dict{String,Any} with 7 entries:
"embedding" => (4, 6, 1, 0.28)
"model" => "barkley"
"N" => 100
"x" => 0.809
"error" => [0.882604, 0.520829, 0.26743, 0.861884, 0.0256479, 0.0888…
"noise" => 0.075
"noisy_training" => false
Using a Serial Cluster
In case I can't store the results of dict_list in memory, I have to change my approach and load them from disk. This is easy with the function tmpsave.
Instead of using Julia to run all jobs from one process with map/pmap, one can use Julia to submit many jobs to a cluster queue. For our example above, the Julia program that does this would look like:
dicts = dict_list(general_args)
res = tmpsave(dicts)
for r in res
submit = `qsub -q queuename julia runjob.jl $r`
run(submit)
end
Now the file runjob.jl would have contents that look like:
f = ARGS[1]
dict = load(projectdir("_research/tmp/")*f)
cross_estimation(dict)
i.e. it just loads the dict and straightforwardly calls the "main" function cross_estimation. Remember to routinely clear the tmp directory! You could do that by e.g. adding a line rm(projectdir("_research/tmp/")*f) at the end of the runjob.jl script.
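The idea behind tmpsave can be sketched in plain Julia. This hypothetical mini_tmpsave is only an illustration (DrWatson's tmpsave uses randomized filenames and saves into _research/tmp by default; here we use the standard Serialization library and predictable names for self-containedness):

```julia
using Serialization

# Hypothetical sketch of the tmpsave idea: write each parameter
# dictionary to its own file and return the filenames, so that
# separate cluster jobs can each load exactly one of them.
function mini_tmpsave(dicts, dir)
    mkpath(dir)
    names = String[]
    for (i, d) in enumerate(dicts)
        name = "params_$i.jls"
        serialize(joinpath(dir, name), d)
        push!(names, name)
    end
    return names
end

dir = mktempdir()
res = mini_tmpsave([Dict("N" => 100), Dict("N" => 200)], dir)
# each submitted job would then do:
#   dict = deserialize(joinpath(dir, ARGS[1]))
```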
Listing completed runs
Continuing from the Preparing & running jobs section, we now want to collect the results of all these simulations into a single DataFrame. We will do that with the function collect_results!.
It is quite simple actually! But because we don't want to include the error, we have to black-list it:
using DataFrames # this is necessary to access collect_results!
black_list = ["error"]
res = collect_results!(datadir()*"results"; black_list = black_list)
 | embedding | model | N | x | noise | noisy_training | path
---|---|---|---|---|---|---|---
 | Tuple…⍰ | String⍰ | Int64⍰ | Float64⍰ | Float64⍰ | Bool⍰ | String⍰
1 | (4, 6, 1, 0.28) | barkley | 100 | 0.809 | 0.075 | false | /home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/barkley_N=100_c=0.28_r=1_γ=4_τ=6.bson |
2 | (4, 5, 1, 0.34) | barkley | 100 | 0.803821 | 0.075 | false | /home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/barkley_N=100_c=0.34_r=1_γ=4_τ=5.bson |
3 | (4, 6, 1, 0.28) | barkley | 100 | 0.898505 | 0.075 | true | /home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/barkley_noisy_N=100_c=0.28_r=1_γ=4_τ=6.bson |
4 | (4, 5, 1, 0.34) | barkley | 100 | 0.441528 | 0.075 | true | /home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/barkley_noisy_N=100_c=0.34_r=1_γ=4_τ=5.bson |
5 | (4, 6, 1, 0.28) | kuramoto | 100 | 0.787875 | 0.075 | false | /home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/kuramoto_N=100_c=0.28_r=1_γ=4_τ=6.bson |
6 | (4, 5, 1, 0.34) | kuramoto | 100 | 0.65177 | 0.075 | false | /home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/kuramoto_N=100_c=0.34_r=1_γ=4_τ=5.bson |
7 | (4, 6, 1, 0.28) | kuramoto | 100 | 0.5121 | 0.075 | true | /home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/kuramoto_noisy_N=100_c=0.28_r=1_γ=4_τ=6.bson |
8 | (4, 5, 1, 0.34) | kuramoto | 100 | 0.408979 | 0.075 | true | /home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/kuramoto_noisy_N=100_c=0.34_r=1_γ=4_τ=5.bson |
We can also take advantage of the basic processing functionality of collect_results! to use the excluded "error" column, replacing it with its average value:
using Statistics: mean
special_list = [:avrg_error => data -> mean(data["error"])]
res = collect_results(
datadir()*"results",
black_list = black_list,
special_list = special_list
)
deletecols!(res, :path) # don't show path this time
 | embedding | model | N | x | noise | noisy_training | avrg_error
---|---|---|---|---|---|---|---
 | Tuple…⍰ | String⍰ | Int64⍰ | Float64⍰ | Float64⍰ | Bool⍰ | Float64⍰
1 | (4, 6, 1, 0.28) | barkley | 100 | 0.809 | 0.075 | false | 0.36865 |
2 | (4, 5, 1, 0.34) | barkley | 100 | 0.803821 | 0.075 | false | 0.438349 |
3 | (4, 6, 1, 0.28) | barkley | 100 | 0.898505 | 0.075 | true | 0.685996 |
4 | (4, 5, 1, 0.34) | barkley | 100 | 0.441528 | 0.075 | true | 0.316476 |
5 | (4, 6, 1, 0.28) | kuramoto | 100 | 0.787875 | 0.075 | false | 0.591969 |
6 | (4, 5, 1, 0.34) | kuramoto | 100 | 0.65177 | 0.075 | false | 0.631568 |
7 | (4, 6, 1, 0.28) | kuramoto | 100 | 0.5121 | 0.075 | true | 0.709027 |
8 | (4, 5, 1, 0.34) | kuramoto | 100 | 0.408979 | 0.075 | true | 0.574161 |
As you see, here we used collect_results instead of the in-place version, since a DataFrame with all results already exists (and thus everything would be skipped).
Adapting to new data/parameters
We once again continue from the above example. But we suddenly realize that we need to run some new simulations with some new parameters that do not exist in the old simulations... Well, DrWatson says "no problem!" :)
Let's save these new parameters in a different subfolder, to have a neatly organized project:
general_args_new = Dict(
"model" => ["bocf"],
"symmetry" => "radial",
"symmetric_training" => [true, false],
"N" => [100],
"embedding" => [ #(γ, τ, r, c)
(4, 5, 1, 0.34), (4, 6, 1, 0.28)]
)
Dict{String,Any} with 5 entries:
"symmetry" => "radial"
"model" => ["bocf"]
"symmetric_training" => Bool[true, false]
"N" => [100]
"embedding" => Tuple{Int64,Int64,Int64,Float64}[(4, 5, 1, 0.34), (4,…
As you can see, there are two parameters here that did not exist in the previous simulations, namely "symmetry" and "symmetric_training". In addition, the parameters "noise" and "noisy_training" that existed in the previous simulations do not exist in the current one.
No problem though, let's run the new simulations:
mkpath(datadir()*"results/sym")
function cross_estimation_new(data)
γ, τ, r, c = data["embedding"]
N = data["N"]
# add fake results:
data["x"] = rand()
data["error"] = rand(10)
# Save data:
prefix = datadir()*"results/sym/"*data["model"]
get(data, "symmetric_training", false) && (prefix *= "_symmetric")
save(
savename(
prefix,
(@dict γ τ r c N),
"bson"
),
data
)
return true
end
dicts = dict_list(general_args_new)
map(cross_estimation_new, dicts)
# load one of the files to be sure everything is ok:
filename = readdir(datadir()*"results/sym")[1]
file = load(datadir()*"results/sym/"*filename)
Dict{String,Any} with 7 entries:
"symmetry" => "radial"
"symmetric_training" => false
"model" => "bocf"
"N" => 100
"embedding" => (4, 6, 1, 0.28)
"x" => 0.176957
"error" => [0.498813, 0.148692, 0.323359, 0.835796, 0.692457, 0.…
Alright, now we want to add these new runs to our existing dataframe that has collected all previous results. This is straightforward:
res = collect_results!(datadir()*"results";
black_list = black_list, subfolders = true)
deletecols!(res, :path) # don't show path
 | embedding | model | N | x | noise | noisy_training | symmetry | symmetric_training
---|---|---|---|---|---|---|---|---
 | Any | Any | Any | Any | Any | Any | String⍰ | Bool⍰
1 | (4, 6, 1, 0.28) | barkley | 100 | 0.809 | 0.075 | false | missing | missing |
2 | (4, 5, 1, 0.34) | barkley | 100 | 0.803821 | 0.075 | false | missing | missing |
3 | (4, 6, 1, 0.28) | barkley | 100 | 0.898505 | 0.075 | true | missing | missing |
4 | (4, 5, 1, 0.34) | barkley | 100 | 0.441528 | 0.075 | true | missing | missing |
5 | (4, 6, 1, 0.28) | kuramoto | 100 | 0.787875 | 0.075 | false | missing | missing |
6 | (4, 5, 1, 0.34) | kuramoto | 100 | 0.65177 | 0.075 | false | missing | missing |
7 | (4, 6, 1, 0.28) | kuramoto | 100 | 0.5121 | 0.075 | true | missing | missing |
8 | (4, 5, 1, 0.34) | kuramoto | 100 | 0.408979 | 0.075 | true | missing | missing |
9 | (4, 6, 1, 0.28) | bocf | 100 | 0.176957 | missing | missing | radial | false |
10 | (4, 5, 1, 0.34) | bocf | 100 | 0.00366527 | missing | missing | radial | false |
11 | (4, 6, 1, 0.28) | bocf | 100 | 0.0268539 | missing | missing | radial | true |
12 | (4, 5, 1, 0.34) | bocf | 100 | 0.850146 | missing | missing | radial | true |
(subfolders = true ensures that we scan the new data.)
All missing entries were adjusted automatically :)