Real World Examples

Real World Examples

Easy local directories

I setup all my science projects using DrWatson's suggested setup, using initialize_project. Then, every file in every project has a start that looks like this:

using DrWatson
quickactivate(@__DIR__, "MagneticBilliardsLyapunovs")
using DynamicalBilliards, PyPlot, LinearAlgebra

include(srcdir()*"plot_perturbationgrowth.jl")
include(srcdir()*"unitcells.jl")

In all projects I save data/plots using datadir/plotdir:

tagsave(datadir()*"mushrooms/Λ_N=$N.bson", (@dict Λ Λσ ws hs description))

The advantage of this approach is that it will always work regardless of if I move the specific file to a different subfolder (which is very often necessary) or whether I move the entire project folder somewhere else! Please be sure you have understood the caveat of using quickactivate!

Here is an example from another project. You will notice that another advantage is that I can use identical syntax to access the data or source folders even though I have different projects!

using DrWatson
quickactivate(@__DIR__, "EmbeddingResearch")
using Parameters
using TimeseriesPrediction, LinearAlgebra, Statistics

include(srcdir()*"systems/barkley.jl")
include(srcdir()*"nrmse.jl")

that ends with

tagsave(
    savename(datadir()*"sim/bk", simulation, "jld2"),
    @strdict U V simulation
)

savename and tagging

The combination of using savename and tagsave makes it easy and fast to save output in a way that is consistent, robust and reproducible. Here is an example from a project:

using DrWatson
quickactivate(@__DIR__, "EmbeddingResearch")
using TimeseriesPrediction, LinearAlgebra, Statistics
include(srcdir()*"systems/barkley.jl")

ΔTs = [1.0, 0.5, 0.1] # resolution of the saved data
Ns = [50, 150] # spatial extent
for N ∈ Ns, ΔT ∈ ΔTs
    T = 10050 # we can offset up to 1000 units
    every = round(Int, ΔT/barkley_Δt)
    seed = 1111

    simulation = @ntuple T N ΔT seed
    U, V = barkley(T, N, every; seed = seed)

    tagsave(
        savename(datadir()*"sim/bk", simulation, "bson"),
        @dict U V simulation
    )
end

This saves files that look like:

path/to/project/data/sim/bk_N=50_T=10050_seed=1111_ΔT=1.bson

and each file is a dictionary with four fields: :U, :V, :simulation, :commit. When I read this file I know exactly what was the source code that produced it (provided that I am not sloppy and commit code changes regularly :P).

Customizing savename

Here is a simple (but not from a real project) example for customizing savename. We are using a common struct Experiment across different experiments with cats and mice. In this example we are also using Parameters.jl for a convenient default constructor.

We first define the relevant types.

using DrWatson, Parameters, Dates

# Define a type hierarchy we use at experiments
abstract type Species end
struct Mouse <: Species end
struct Cat <: Species end

@with_kw struct Experiment{S<:Species}
    n::Int = 50
    c::Float64 = 10.0
    x::Float64 = 0.2
    date::Date = Date(Dates.now())
    species::S = Mouse()
    scientist::String = "George"
end

e1 = Experiment()
e2 = Experiment(species = Cat())
Main.ex-customizing.Experiment{Main.ex-customizing.Cat}
  n: Int64 50
  c: Float64 10.0
  x: Float64 0.2
  date: Date
  species: Main.ex-customizing.Cat Main.ex-customizing.Cat()
  scientist: String "George"

For analyzing our experiments we need information about the species used, and to use multiple dispatch latter on we decided to make this information associated with a Type.

Now, we want to customize savename. We start by extending DrWatson.default_prefix:

DrWatson.default_prefix(e::Experiment) = "Experiment_"*string(e.date)

savename(e1)
"Experiment_2019-05-03_c=10_n=50_scientist=George_x=0.2"

However this is not good enough for us, as the information about the species is not contained in savename. We have to extend DrWatson.default_allowed like so:

DrWatson.default_allowed(::Experiment) = (Real, String, Species)

savename(e1)
"Experiment_2019-05-03_c=10_n=50_scientist=George_species=Main.ex-customizing.Mouse()_x=0.2"

To make printing better we can extend Base.string, which is what DrWatson uses internally in savename to display values.

Base.string(::Mouse) = "mouse"
Base.string(::Cat) = "cat"

Lastly, let's say that the information of which scientist performed the experiment is not really relevant for savename. We can extend the last method, DrWatson.allaccess:

DrWatson.allaccess(::Experiment) = (:n, :c, :x, :species)

so that only those four fields will be used (notice that the date field is anyway used in default_prefix). We finally have:

println( savename(e1) )
println( savename(e2) )
Experiment_2019-05-03_c=10_n=50_species=mouse_x=0.2
Experiment_2019-05-03_c=10_n=50_species=cat_x=0.2

Stopping "Did I run this?"

It can become very tedious to have a piece of code that you may or may not have run and may or may not have saved the produced data. You then constantly ask yourself "Did I run this?". Typically one uses isfile and an if clause to either load a file or run some code. Especially in the cases where the code takes only a couple of minutes to finish you are left in a dilemma "Is it even worth it to save?".

This is the dilemma that produce_or_load resolves. You can wrap your code in a function and then produce_or_load will take care of the rest for you! I found it especially useful in scripts that generate figures for a publication.

Here is an example; originally I had this piece of code:

HTEST = 0.1:0.1:2.0
WS = [0.5, 1.0, 1.5]
N = 10000; T = 10000.0

toypar_h = [[] for l in HS]
for (wi, w) in enumerate(WS)
    println("w = $w")
    for h in HTEST
        toyp = toyparameters(h, w, N, T)
        push!(toypar_h[wi], toyp)
    end
end

that was taking some minutes to run. To use the function produce_or_load I first have to wrap this code in a high level function like so:

HTEST = 0.1:0.1:2.0
WS = [0.5, 1.0, 1.5]

function g(d)
    @unpack N, T = d

    toypar_h = [[] for l in HS]
    for (wi, w) in enumerate(WS)
        println("w = $w")
        for h in HTEST
            toyp = toyparameters(h, w, N, T)
            push!(toypar_h[wi], toyp)
        end
    end

    return @dict toypar_h
end

N = 2000; T = 2000.0
file = produce_or_load(
    datadir()*"mushrooms/toy", # prefix
    @dict(N, T), # container
    g # function
)
@unpack toypar_h = file

Now, every time I run this code block the function tests automatically whether the file exists. Only if it does not then the code is run.

The extra step is that I have to extract the useful data I need from the container file. Thankfully the @unpack macro from Parameters.jl makes this super easy.

Preparing & running jobs

Preparing the dictionaries

Here is a shortened script from a project that uses dict_list:

using DrWatson

general_args = Dict(
    "model" => ["barkley", "kuramoto"],
    "noise" => 0.075,
    "noisy_training" => [true, false],
    "N" => [100],
    "embedding" => [ #(γ, τ, r, c)
    (4, 5, 1, 0.34), (4, 6, 1, 0.28)]
)
Dict{String,Any} with 5 entries:
  "embedding"      => Tuple{Int64,Int64,Int64,Float64}[(4, 5, 1, 0.34), (4, 6, …
  "model"          => ["barkley", "kuramoto"]
  "N"              => [100]
  "noise"          => 0.075
  "noisy_training" => Bool[true, false]
dicts = dict_list(general_args)
println("Total dictionaries made: ", length(dicts))
dicts[1]
Dict{String,Any} with 5 entries:
  "embedding"      => (4, 5, 1, 0.34)
  "model"          => "barkley"
  "N"              => 100
  "noise"          => 0.075
  "noisy_training" => true

Now how you use these dictionaries is up to you. Typically each dictionary is given to a main-like Julia function which extracts the necessary data and calls the necessary functions.

Let's say I have written a function that takes in one of these dictionaries and saves the file somewhere locally:

function cross_estimation(data)
    γ, τ, r, c = data["embedding"]
    N = data["N"]
    # add fake results:
    data["x"] = rand()
    data["error"] = rand(10)
    # Save data:
    prefix = datadir()*"results/"*data["model"]
    get(data, "noisy_training", false) && (prefix *= "_noisy")
    save(
        savename(
            prefix,
            (@dict γ τ r c N),
            "bson"
            ),
        data
    )
    return true
end
cross_estimation (generic function with 1 method)

Using map and pmap

One way to run many simulations is with map (identical process for using pmap). To run all my simulations I just do:

mkpath(datadir()*"results")
dicts = dict_list(general_args)
map(cross_estimation, dicts) # or pmap

# load one of the files to be sure everything is ok:
filename = readdir(datadir()*"results")[1]
file = load(datadir()*"results/"*filename)
Dict{String,Any} with 7 entries:
  "embedding"      => (4, 6, 1, 0.28)
  "model"          => "barkley"
  "N"              => 100
  "x"              => 0.809
  "error"          => [0.882604, 0.520829, 0.26743, 0.861884, 0.0256479, 0.0888…
  "noise"          => 0.075
  "noisy_training" => false

Using a Serial Cluster

In case that I can't store the results of dict_list in memory, I have to change my approach and load them from disk. This is easy with the function tmpsave.

Instead of using Julia to run all jobs from one process with map/pmap one can use Julia to submit many jobs to a cluster que. For our example above, the Julia program that does this would look like this:

dicts = dict_list(general_args)
res = tmpsave(dicts)
for r in res
    submit = `qsub -q queuename julia runjob.jl $r`
    run(submit)
end

Now the file runjob.jl would have contents that look like:

f = ARGS[1]
dict = load(projectdir("_research/tmp/")*f)
cross_estimation(dict)

i.e. it just loads the dict and straightforwardly uses the "main" function cross_estimation. Remember to routinely clear the tmp directory! You could do that by e.g. adding a line rm(projectdir("_research/tmp/")*f) at the end of the runjob.jl script.

Listing completed runs

Continuing from the Preparing & running jobs section, we now want to collect the results of all these simulations into a single DataFrame. We will do that with the function collect_results!.

It is quite simple actually! But because we don't want to include the error, we have to black-list it:

using DataFrames # this is necessary to access collect_results!
black_list = ["error"]
res = collect_results!(datadir()*"results"; black_list = black_list)

8 rows × 7 columns

embeddingmodelNxnoisenoisy_trainingpath
Tuple…⍰String⍰Int64⍰Float64⍰Float64⍰Bool⍰String⍰
1(4, 6, 1, 0.28)barkley1000.8090.075false/home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/barkley_N=100_c=0.28_r=1_γ=4_τ=6.bson
2(4, 5, 1, 0.34)barkley1000.8038210.075false/home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/barkley_N=100_c=0.34_r=1_γ=4_τ=5.bson
3(4, 6, 1, 0.28)barkley1000.8985050.075true/home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/barkley_noisy_N=100_c=0.28_r=1_γ=4_τ=6.bson
4(4, 5, 1, 0.34)barkley1000.4415280.075true/home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/barkley_noisy_N=100_c=0.34_r=1_γ=4_τ=5.bson
5(4, 6, 1, 0.28)kuramoto1000.7878750.075false/home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/kuramoto_N=100_c=0.28_r=1_γ=4_τ=6.bson
6(4, 5, 1, 0.34)kuramoto1000.651770.075false/home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/kuramoto_N=100_c=0.34_r=1_γ=4_τ=5.bson
7(4, 6, 1, 0.28)kuramoto1000.51210.075true/home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/kuramoto_noisy_N=100_c=0.28_r=1_γ=4_τ=6.bson
8(4, 5, 1, 0.34)kuramoto1000.4089790.075true/home/travis/build/JuliaDynamics/DrWatson.jl/docs/data/results/kuramoto_noisy_N=100_c=0.34_r=1_γ=4_τ=5.bson

We can take also advantage of the basic processing functionality of collect_results! to use the excluded "error" column, replacing it with its average value:

using Statistics: mean
special_list = [:avrg_error => data -> mean(data["error"])]
res = collect_results(
      datadir()*"results",
      black_list = black_list,
      special_list = special_list
)

deletecols!(res, :path) # don't show path this time

8 rows × 7 columns

embeddingmodelNxnoisenoisy_trainingavrg_error
Tuple…⍰String⍰Int64⍰Float64⍰Float64⍰Bool⍰Float64⍰
1(4, 6, 1, 0.28)barkley1000.8090.075false0.36865
2(4, 5, 1, 0.34)barkley1000.8038210.075false0.438349
3(4, 6, 1, 0.28)barkley1000.8985050.075true0.685996
4(4, 5, 1, 0.34)barkley1000.4415280.075true0.316476
5(4, 6, 1, 0.28)kuramoto1000.7878750.075false0.591969
6(4, 5, 1, 0.34)kuramoto1000.651770.075false0.631568
7(4, 6, 1, 0.28)kuramoto1000.51210.075true0.709027
8(4, 5, 1, 0.34)kuramoto1000.4089790.075true0.574161

As you see here we used collect_results instead of the in-place version, since there already exists a DataFrame with all results processed (and thus everything would be skipped).

Adapting to new data/parameters

We once again continue from the above example. But we suddenly realize that we need to run some new simulations with some new parameters that do not exist in the old simulations... Well, DrWatson says "no problem!" :)

Let's save these new parameters in a different subfolder, to have a neatly organized project:

general_args_new = Dict(
    "model" => ["bocf"],
    "symmetry" => "radial",
    "symmetric_training" => [true, false],
    "N" => [100],
    "embedding" => [ #(γ, τ, r, c)
    (4, 5, 1, 0.34), (4, 6, 1, 0.28)]
)
Dict{String,Any} with 5 entries:
  "symmetry"           => "radial"
  "model"              => ["bocf"]
  "symmetric_training" => Bool[true, false]
  "N"                  => [100]
  "embedding"          => Tuple{Int64,Int64,Int64,Float64}[(4, 5, 1, 0.34), (4,…

As you can see, there here there are two parameters not existing in previous simulations, namely "symmetry", "symmetric_training". In addition, the parameters "noise", "noisy_training" that existed in the previous simulations do not exist in the current one.

No problem though, let's run the new simulations:

mkpath(datadir()*"results/sym")

function cross_estimation_new(data)
    γ, τ, r, c = data["embedding"]
    N = data["N"]
    # add fake results:
    data["x"] = rand()
    data["error"] = rand(10)
    # Save data:
    prefix = datadir()*"results/sym/"*data["model"]
    get(data, "symmetric_training", false) && (prefix *= "_symmetric")
    save(
        savename(
            prefix,
            (@dict γ τ r c N),
            "bson"
            ),
        data
    )
    return true
end

dicts = dict_list(general_args_new)
map(cross_estimation_new, dicts)

# load one of the files to be sure everything is ok:
filename = readdir(datadir()*"results/sym")[1]
file = load(datadir()*"results/sym/"*filename)
Dict{String,Any} with 7 entries:
  "symmetry"           => "radial"
  "symmetric_training" => false
  "model"              => "bocf"
  "N"                  => 100
  "embedding"          => (4, 6, 1, 0.28)
  "x"                  => 0.176957
  "error"              => [0.498813, 0.148692, 0.323359, 0.835796, 0.692457, 0.…

Alright, now we want to add these new runs to our existing dataframe that has collected all previous results. This is straight-forward:

res = collect_results!(datadir()*"results";
      black_list = black_list, subfolders = true)

deletecols!(res, :path) # don't show path

12 rows × 8 columns

embeddingmodelNxnoisenoisy_trainingsymmetrysymmetric_training
AnyAnyAnyAnyAnyAnyString⍰Bool⍰
1(4, 6, 1, 0.28)barkley1000.8090.075falsemissingmissing
2(4, 5, 1, 0.34)barkley1000.8038210.075falsemissingmissing
3(4, 6, 1, 0.28)barkley1000.8985050.075truemissingmissing
4(4, 5, 1, 0.34)barkley1000.4415280.075truemissingmissing
5(4, 6, 1, 0.28)kuramoto1000.7878750.075falsemissingmissing
6(4, 5, 1, 0.34)kuramoto1000.651770.075falsemissingmissing
7(4, 6, 1, 0.28)kuramoto1000.51210.075truemissingmissing
8(4, 5, 1, 0.34)kuramoto1000.4089790.075truemissingmissing
9(4, 6, 1, 0.28)bocf1000.176957missingmissingradialfalse
10(4, 5, 1, 0.34)bocf1000.00366527missingmissingradialfalse
11(4, 6, 1, 0.28)bocf1000.0268539missingmissingradialtrue
12(4, 5, 1, 0.34)bocf1000.850146missingmissingradialtrue

(subfolders = true ensures that we scan the new data)

All missing entries were adjusted automatically :)