-
arXiv Vanity works amazingly well, with some small warts.
Running our 80-page survey paper on flow-based network cluster improvement through the arXiv Vanity processor shows some impressive results. www.arxiv-vanity.com/papers/20… It's not perfect, but it's pretty amazing considering that some of the tables and embedded “aside” paragraphs translate.
The biggest bummer for me is that our macros aren’t translated at all. E.g., www.arxiv-vanity.com/papers/20…
-
Visualizing COVID Cases and Tests - Purdue vs. Vermont
As a fun little exercise with my students, we sought to improve the Purdue COVID tracking dashboard. Our quick take was that this display focused too heavily on tests instead of the more relevant positive cases.
After a few iterations, we wanted a design that had the same bits of information, but presented in a way that made them easier to understand. For instance, we ditched the two-axis design with tests and cases mixed and instead scaled the testing volume down to get a ‘positive rate’ line that acts as a means of evaluating the case counts. (After all, that’s what you’d use the test count data for!)
Here’s the design.
The solid vertical bars are the daily case counts. The thick black line is the running 7-day average. The internal color of this line gives the positivity rate, from 1% (green) to 9% (red). The dashed colored lines show the case counts implied by a given positivity rate at the running 7-day testing volume; e.g., the 5% line is 0.05 times the 7-day average test count. (There are strong day-of-week effects, so the raw daily data isn’t as meaningful, but 7-day trends are reliable.)
Here, you can see that testing is basically constant through Nov 10 because the dashed positivity-rate lines are flat. You can also quickly see that cases are growing recently. (That’s also true everywhere! Maybe we need another plot that compares these counts to successively larger regions!)
For fun, I tried the same plot for the state of Vermont (I have a relative at UVM!). Vermont has ~623k people, about 12x Purdue’s 50k.
In the Vermont data, we can see a few testing spikes as well as the recent uptick in cases. (The last few days are actually quite a bit worse in Vermont, but I wanted to keep the same date range.)
I think the Vermont data shows the visualization isn’t quite as successful in general as we might have hoped, because testing there seems less consistent than at Purdue.
Caveats
- These were meant to illustrate a visualization (hah, but the title implies a comparison). A good visualization facilitates comparison, in which case we might ask more refined questions.
- Critically evaluate the data yourself before drawing any conclusions from these pictures (e.g. maybe I used the wrong field from the COVID Tracking Project data for Vermont!). There may also be a typo in my Purdue case counts, which we transcribed by hand.
Here’s the code
## Download the Vermont history using CSV
using CSV, DataFrames
data = CSV.read(download("https://covidtracking.com/data/download/vermont-history.csv"), DataFrame)
## align dates with the Purdue data
ds = data[6:6+101, :]
## the fields we use are:
##   ds.totalTestsPeopleViralIncrease
##   ds.positiveIncrease

using Plots, Statistics
gr()

function covidplot(ntests, npositive)
  A = [ntests npositive]
  avgs = []    # 7-day average case counts
  ratios = []  # 7-day positivity rates
  tvol = []    # 7-day average testing volume
  for i = 1:size(A,1)
    si = max(1, i-6)
    ei = min(size(A,1), i)
    push!(avgs, mean(A[si:ei, 2]))
    push!(ratios, sum(A[si:ei, 2])/sum(A[si:ei, 1]))
    push!(tvol, sum(A[si:ei, 1])/length(si:ei))
  end
  cmap = cgrad([:limegreen, :goldenrod1, :firebrick1], [0.2, 0.5, 0.8])
  xx = -size(A,1)+1:0
  # daily case counts as bars
  Plots.bar(xx, A[:,2], alpha=0.4, label="daily cases", linewidth=0,
    legend=:topleft, legendfontvalign=:top)
  # dashed lines: the case counts implied by each positivity rate
  for p in [1, 3, 5, 7, 9]
    plot!(xx, (p/100)*tvol, linestyle=:dash, line_z=p, label="",
      color=cmap, clims=(1,9))
    annotate!(1, (p/100)*tvol[end], Plots.text("$p%", 10, get(cmap, p/10), :left))
  end
  # thick black line: 7-day average cases, colored inside by positivity
  plot!(xx, avgs, color=:black, linewidth=3.5,
    label="avg. cases over 7 days", legendfontvalign=:top)
  plot!(xx, avgs, line_z=100*ratios, color=cmap, linewidth=1.5, label="",
    clims=(1,9), colorbar=false,
    foreground_color_legend=nothing, background_color_legend=nothing,
    legendtitle="Colors show positivity rate", legendtitlefontsize=10,
    legendtitlefontvalign=:bottom, legendtitlefonthalign=:right,
    size=(500,300), legendfontvalign=:top)
  plot!(xlabel="days since November 10, 2020")
  plot!(ylabel="PCR positive SARS-nCOV-2 tests")
end

covidplot(reverse(ds.totalTestsPeopleViralIncrease), reverse(ds.positiveIncrease))
ylims!(0, 121)
plot!(legend=:topright)
-
Project and Forget Metric Optimization - Explained
In a new video, I do a deep dive with Rishi Sonthalia and Anna Gilbert into their paper on optimization with metric constraints and their associated code, joined by my former student Nate Veldt, who worked on similar problems with me. Their stuff is super cool, and we go into why it’s awesome!
The video - <youtu.be/OBC4FmUui…>
Their paper - <arxiv.org/abs/2005….>
Their code - <github.com/rsonthal/…>
More context – a month or so ago (who can keep track of time – maybe this was September, actually) I sat down with Rishi Sonthalia, my former student Nate Veldt, and Anna Gilbert to understand Rishi and Anna’s latest paper on “Project and Forget Metric Optimization” <arxiv.org/abs/2005….>. For fun, we decided to record it so that I’d be able to look at it again when I came back to these ideas. But it turned out so great we figured we’d share it! (After spending too much time editing it, of course…!)
I care about this because it gives a faster way to solve a quadratic or linear program based on a relaxation of the correlation clustering problem. I think it also gives a great place to build future HPC-style work on graphs where there is a problem that takes considerable time and computing effort. The particular bottleneck in the current code is an all-pairs shortest-path computation, an area where there has been a tremendous amount of work that could be used to make this go even faster!
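To make the scaling issue concrete, here’s a minimal toy sketch of the kind of metric-constrained LP they solve – the standard LP relaxation of correlation clustering – written in JuMP with an off-the-shelf solver, not their method. The signed weights W and the GLPK solver are stand-ins of my own choosing. Enumerating all O(n^3) triangle-inequality constraints up front, like this does, is exactly what doesn’t scale, and that’s the problem their algorithm addresses.

using JuMP, GLPK, LinearAlgebra

n = 6
W = randn(n, n); W = (W + W')/2   # hypothetical signed similarities: >0 similar, <0 dissimilar

model = Model(GLPK.Optimizer)
@variable(model, 0 <= x[1:n, 1:n] <= 1)   # x[i,j] acts like a distance: 0 means same cluster
@constraint(model, [i=1:n], x[i,i] == 0)
@constraint(model, [i=1:n, j=i+1:n], x[i,j] == x[j,i])
# the metric constraints: O(n^3) triangle inequalities -- the scaling bottleneck
@constraint(model, [i=1:n, j=1:n, k=1:n], x[i,j] <= x[i,k] + x[k,j])
# pay w*x for separating similar pairs and |w|*(1-x) for joining dissimilar pairs
obj = sum(W[i,j] > 0 ? W[i,j]*x[i,j] : -W[i,j]*(1 - x[i,j]) for i in 1:n for j in i+1:n)
@objective(model, Min, obj)
optimize!(model)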
-
Fixing the NYTimes’ Horrible Exit Poll Page
I have so many problems with this display of data in the attached screenshot from the NYTimes https://www.nytimes.com/interactive/2020/11/03/us/elections/exit-polls-president.html. It’s so easy to misread this in very subtle ways. E.g., it makes it seem like most Trump voters don’t trust that their votes will be counted accurately. But that’s not what the data show at all.
The second one (without the pictures) is an improvement I managed to make in Excel. This wouldn’t pass muster (yet) if the PhD students I work with tried to include it in a paper, but it’s close enough, and I did it in Excel mainly to demonstrate that it’s possible. (In retrospect, Inkscape would have been easier…)
This shows almost no difference among the groups of voters! Which is, of course, because we are all Americans: we largely trust our voting and mostly make up our minds early.
-
Designing for Julia is hard
As a longtime Matlab user who enjoys Python syntax and loves crafting C code for maximum efficiency, I often struggle with thinking about code interfaces in Julia.
Why? Because the interface possibilities in Julia are extremely powerful. This power can be difficult to deploy, however. Simply put, other languages have conventions or requirements that are essentially constraints on what interfaces ought to look like. Once understood, these constraints restrict the design space and simplify design. In contrast, Julia offers more possibilities.
Here are a few relevant thoughts.
Matlab interfaces tended to be holistic functional interfaces that ‘do everything’. Although one might think these are hard to do, packing all the relevant information into a single function call is a useful design constraint that forces or suggests certain choices.
Why isn’t C harder than Julia? Because C interfaces essentially need to fix a small set of types to be reasonable.
Aside: C++ templated interfaces raise essentially the same design challenges as Julia’s; getting either right is hard in similar ways.
Python promotes the class as a general structure to organize interfaces.
Consider what linear programming might look like in these different languages.
Matlab
% solve min c'*x s.t. A*x = b, x >= 0
[optval, optarg] = linear_program(c, A, b);
% ... and get dual variables too
[optval, optarg, dualvars] = linear_program(c, A, b);
% ... with options
[optval, optarg, dualvars] = linear_program(c, A, b, 'method', 'ip', 'tol', 1e-3);
C (using C99 for single-line comments)
p = alloc_problem(c, A, b);
s = alloc_solution(p);
solve_lp(p, &s);                    // s has fields optval, optarg, dualvars
o = interior_point_options();
o.tol = 1e-3;
solve_lp_interior_point(p, &s, o);  // s has fields optval, optarg, dualvars
free(p); free(s);                   // very important in C
Python
p = opt.LinearProgram(c, A, b)
M = opt.algs.LinearProgramDefault()
soln = M.solve(p)   # soln has optval, optarg
# get optval, optarg, dualvars from soln
M2 = opt.algs.LinearProgramInteriorPoint({'tol': 1e-3})
soln = M2.solve(p)
Notice that the Matlab interface is probably the easiest and simplest, if far more limited.
There are a number of common challenges that these interfaces have to deal with.
- How to specify the problem
- How to specify a solver
- How to parametrize the solver
- How to get solution pieces
Matlab’s do-everything solver makes this interface easy to write. It also enforces a design funnel that channels alternatives into this uber-function. (Matlab loves the uber-function; it’s a great and timeless design!) The other two interfaces are largely similar, except that Python raises the question of where the solve function should live: with the problem or with the solver (P.solve(M) or M.solve(P))? (Probably M.solve(P), since then the solver can extract the relevant information it needs from the problem, whereas otherwise the problem has to fix an interface to a method.)
In C, it’s just a function-fest everywhere, and don’t forget to free.
Julia with Convex.jl
x = Variable(length(c))
problem = minimize(c'*x, [A*x == b, x >= 0])
solve!(problem, SCS.Optimizer)
problem.optval, evaluate(x)
Julia with JuMP
model = Model(SCS.Optimizer)
@variable(model, x[1:length(c)] >= 0)
@objective(model, Min, c'*x)
@constraint(model, A*x .== b)
optimize!(model)
Both of these interfaces exploit Julia’s powerful syntax transformations for all sorts of clever features that are largely impossible in C or Matlab.
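To see what I mean by syntax transformations, here’s a tiny standalone sketch of my own (not how Convex.jl or JuMP actually work internally): a Julia macro receives the unevaluated expression tree, so a package can inspect and rewrite something like c'*x before it ever runs. Nothing comparable is available to a C or Matlab function.

# a macro sees the expression itself at expansion time...
macro peek(e)
    println("macro received the expression: ", e)
    return esc(e)    # ...and here just evaluates it unchanged
end

c = [1.0, 2.0]; x = [3.0, 4.0]
@peek c' * x    # prints "macro received the expression: c' * x", then returns 11.0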
But back to Matlab’s advantages: I don’t want a super-general optimization tool, I just want to solve a simple LP. Maybe I’m writing a package where I expect people to play around with various LP solvers, or I want to give them that option. How would I do this? This is where Julia design paralysis sets in. It’s definitely possible, but there are many possible ways to treat these issues and few conventions that would simplify or suggest the design.
So what’s my recommendation or advice? The simplest thing is to treat Julia like C and fix a small set of types that will work with the code. Then try to grow this into more general interfaces over time. This is essentially a restatement of Knuth’s “premature optimization” dictum: https://en.wikiquote.org/wiki/Donald_Knuth
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
Don’t be afraid to hardcode Float64, Int, etc., and don’t feel obliged to be fully type-general. (Type-general functions are very hard to get right and are best written after a few tries with simpler hardcoded-type functions.) This also has the side effect of invoking the Julia compiler less often when you call functions.
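For instance, here’s a minimal sketch of what that advice might look like for the LP example above. The name solve_lp and its keyword argument are hypothetical, and GLPK is just a stand-in solver: the point is concrete types, one boring function, and a small surface that still lets users swap solvers.

using JuMP, GLPK

# concrete types everywhere; no attempt at type generality yet
function solve_lp(c::Vector{Float64}, A::Matrix{Float64}, b::Vector{Float64};
                  optimizer = GLPK.Optimizer)
    model = Model(optimizer)
    @variable(model, x[1:length(c)] >= 0)
    @objective(model, Min, c'*x)
    @constraint(model, A*x .== b)
    optimize!(model)
    return objective_value(model), value.(x)   # optval, optarg
end

# swapping solvers later needs no interface change:
# optval, optarg = solve_lp(c, A, b; optimizer = SCS.Optimizer)

The user-facing surface stays about as small as the Matlab version while leaving room to grow.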
Avoid trying to use clever Julia tricks in initial designs. Keep it simple and straightforward like C.
Don’t provide flexibility where it isn’t (yet) useful.
Don’t be afraid to make horrible breaking changes in future code revisions. The Julia devs drove some of us nuts with constant breaking syntax changes through versions 0.2–0.7 as they evolved and revisited decisions. Designing without contact with reality is impossible.
Back to Knuth, though, with a variant of the quote that carries yet more relevant context.
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
The goal of these initial design ideas is to find that 3%. That can only be done once an initial framework is established.
And to close, let’s go back to Knuth.
Science is knowledge which we understand so well that we can teach it to a computer; and if we don’t fully understand something, it is an art to deal with it.
This exemplifies good Julia design and why it is so hard. Good Julia design flows from a deep understanding of the topic and an appreciation of where the similarities and differences lie, and that understanding is often a contribution in itself.