Are you saying that people aren't storing data on github? How large of datasets ...

Fomite · on May 8, 2015

It should be noted that there are reasons beyond file size not to store data on GitHub, though I do regularly generate simulation data that's way larger than 100 MB CSV files.

tmarthal · on May 8, 2015

Can you list the reasons you have in mind for not storing data on GitHub? I would assume that you have a (paid) user or organization account and keep the repo private, unless it was for a published paper.

I was thinking more along the lines of data that you'd want to graph or present for analysis, as in a notebook or a paper. I would (hopefully) assume that your simulation data would be generated and used in such a way that it wouldn't need to be stored permanently. At least, when I was doing simulation-based analysis, I wasn't necessarily concerned about any individual run, but rather a combination of a bunch of simulations (all of which were ephemeral).

Fomite · on May 8, 2015

Because there are data use agreements for health data with personal identifiers, and the vast majority of them aren't going to accept "We're keeping a copy on a private GitHub repo".

Nor, to be blunt, should they.

My particular simulation work is interested in both individual runs (and indeed, individuals within those runs) and summarization.

Beyond that, what use is there in putting just the "summary" data online when the underlying data that produced it still exists only as "you're just going to have to trust me"? Being able to replicate my figure code doesn't get people very far.

sciencerobot · on May 8, 2015

Just curious: why not just provide the code that generates the simulation data?

Fomite · on May 8, 2015

Because it's an absolutely massive code base that requires access to some serious HPC resources, which basically makes it difficult if not impossible to reproduce for most people. Putting something out there like that also implies something of an obligation to maintain and support it, which isn't what we do.

There's also a weird data rabbit hole. Is the code that generates the simulation data enough, without the underlying data that code uses? Some of that is either protected or proprietary, so even with the code, it's utterly useless.

vhffm · on May 8, 2015

Simulations may run for weeks or months on large computing clusters. People wanting the resulting data may not have suitable access and/or resource allocations to repeat the runs.