Things to make sure of when choosing your distributed storage:
1) are you _really_ sure you need it distributed, or can you shard it yourself? (hint: distributed anything sucks up at least one, if not two, innovation tokens; if you're spending other innovation tokens as well, you're going to have a very bad time)
2) do you need to modify blobs, or can you get away with read/modify/replace? (S3 doesn't support partial writes; a one-bit change requires the whole object to be re-written)
3) what's your ratio of reads to writes? (do you need local caches, or local pools in GPFS parlance?)
4) how much are you going to change the metadata? (if there's POSIX somewhere, it'll be a lot)
5) are you going to try to write to the same object at the same time in two different locations? (how do you manage locking and concurrency?)
6) do you care about availability, consistency or speed? (pick one, maybe one and a half)
7) how are you going to recover from the distributed storage shitting itself all at the same time?
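To make point 2 concrete, here's a toy sketch of read/modify/replace against an object store that only supports whole-object get/put. An in-memory dict stands in for S3 here; the names are made up for illustration:

```python
# In-memory dict standing in for an S3-style blob store: the only
# operations the store gives you are get-whole-object and put-whole-object.
store = {}

def put(key, data):
    store[key] = bytes(data)

def get(key):
    return store[key]

def patch(key, offset, new_bytes):
    # Read/modify/replace: even a one-byte change means reading the
    # whole blob, patching it in memory, and re-writing the whole blob.
    body = get(key)
    patched = body[:offset] + new_bytes + body[offset + len(new_bytes):]
    put(key, patched)
```

Against real S3 the same pattern is GetObject, patch locally, PutObject; the cost of that full round trip per tiny edit is exactly why you want to know your modify pattern up front.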
Nowadays, a single 2u server can realistically drive 2x 100Gb NICs at full bore, so the biggest barrier is density. You can probably get 1pb in a rack now, and linking a bunch of JBODs (well, NVMe shelves) is probably easy to do now.
sorry, I should have added a caveat: 1pb _at decent performance_
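Rough numbers behind the "2x 100Gb at full bore" claim (line rate only, ignoring protocol overhead):

```python
# Back-of-the-envelope: what 2x 100GbE can move per second and per day.
nic_gbit = 100                        # Gbit/s per NIC
nics = 2
gbytes_per_s = nic_gbit * nics / 8    # Gbit -> GByte: 25 GB/s line rate
tb_per_day = gbytes_per_s * 86_400 / 1000
print(gbytes_per_s)                   # 25.0
print(tb_per_day)                     # 2160.0 TB/day
```

So a single such box could in principle stream an entire 1pb rack in around half a day; the drives and the software stack, not the network, are what stop you.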
That Seagate array will be fine for streaming (so long as you spread the data properly); as soon as you start mixing read/write loads on it, it'll start to chug. You can expect 70-150 IOPS out of each drive, and that's a 60-drive array (at a guess: you can get 72 drives in a 4u, but they're less maintainable, or at least used to be; things might have improved recently).
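The back-of-the-envelope for why it chugs under random mixed load, using the per-drive figures above:

```python
# Aggregate random IOPS for a spinning-disk array is roughly
# drives * per-drive IOPS; seeks, not bandwidth, are the ceiling.
drives = 60
iops_low, iops_high = 70, 150      # typical random IOPS per spinning drive
print(drives * iops_low)           # 4200
print(drives * iops_high)          # 9000
```

4200-9000 random IOPS for a whole 4u of spinning disk; a single decent NVMe drive does hundreds of thousands on its own, which is why the mixed-workload picture changes so much with flash.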
When I was using Lustre with Ultra SCSI (yes, that long ago) we needed a good 10-20 racks to get to 100tb that could sustain 1 gigabyte a second.
Agreed, it depends on the use case. For some "more storage" is all that matters, for others you don't want to be bottlenecked on getting it into / out of the machine or through processing.
Oh, and one more for the list: 8) how are you going to control access?