Informal RFC - Internet-efficient ZIP file format

What strangeness is this you talk of Master?


We won’t mention the dark times of 8" floppies or 9 track tape storage and certainly not threaded core storage. :smile:

Having understood your problem better through the responses, I think the SAFE system already provides all the functionality you need, if you use a “zip” program and mount your SAFE storage.

The zip program will request the index portion of the zipped file, and the client will retrieve only the chunks needed to cover it. That is normal operation for the zip program; it will not even know the zipped file is stored on the network.

Your zipped file could be a few megabytes or many gigabytes long; assuming the index is about the same size in either case, there will be no difference in the number of chunks retrieved to get the index portion of the zipped file.
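As a rough illustration, here is how a byte-range read maps onto chunk fetches. This is a sketch only: the chunk size is illustrative and may not match SAFE’s actual self-encryption parameters.

```python
CHUNK_SIZE = 1024 * 1024  # illustrative only; SAFE's real chunk size may differ

def chunks_for_range(offset, length):
    """Return the chunk indices a client must fetch to serve a byte-range read."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))

# A 10 GiB archive whose index is the last ~100 KiB:
archive_size = 10 * 1024 ** 3
index_size = 100 * 1024
print(chunks_for_range(archive_size - index_size, index_size))
# -> [10239]: only the final chunk, no matter how large the archive is
```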

I don’t see a need for anything more than using SAFE as-is, with your zip program as you normally would.


Nope, because each file would be spread over more chunks than necessary.
Tar ensures the right behavior; so do uncompressed zip and rar, which are effectively tar, i.e. packaging without compression.

The OP wanted an RFC based on the zip example.

Yes, RAR, gzip, tar and a host of other file compactors are out there, some better than others for access.

The point was that the needed functionality is not the zipping or compaction itself, but the ability to fetch exactly what the program asked for, with no need to retrieve the whole file just to get its last part.

Thus there is no need to write an RFC for functionality that available programs + SAFE already provide.

I would like to mention that chunks already get compressed by an algorithm before they’re PUT to the network.


My thoughts are not at all focused on compression. I’d rather use 7z or RAR myself. But ZIP is a widely used format, which I suspect may operate differently under internet-based file-access methods than users are accustomed to with locally-attached file storage.

I’m anticipating that average users will find their favorite ZIP program doesn’t take the efficient action of seeking backward from the end of the file to find the index, but instead reads from the front of the file until it finds the start of the variable-length index at the end. For a file on a local disk, the wasted read won’t be significant, but on an internet disk, the experience could be terrible. And the user may perceive that as a flaw in the network, rather than as lazy program logic.
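For reference, a well-behaved reader only needs to touch the tail of the archive: the End of Central Directory record is 22 bytes plus at most 64 KiB of comment. A minimal sketch of that lookup (not any particular program’s implementation):

```python
import struct

EOCD_SIG = b"PK\x05\x06"
MAX_EOCD = 22 + 65535  # fixed EOCD record plus the maximum comment length

def zip_index_location(f):
    """Find the End of Central Directory record by seeking from the end.

    Returns (entry_count, cd_size, cd_offset). Reads at most ~64 KiB,
    no matter how large the archive is.
    """
    f.seek(0, 2)                    # jump to end of file
    file_size = f.tell()
    tail_len = min(MAX_EOCD, file_size)
    f.seek(file_size - tail_len)
    tail = f.read(tail_len)
    pos = tail.rfind(EOCD_SIG)      # scan backward for the signature
    if pos < 0:
        raise ValueError("no EOCD record found; not a ZIP file?")
    # entry count (2 bytes), central directory size (4), central directory offset (4)
    entries, cd_size, cd_offset = struct.unpack_from("<HII", tail, pos + 10)
    return entries, cd_size, cd_offset
```

A lazy reader that scans forward instead will pull the whole archive over the wire before it ever sees this record.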

I think I’ll test different zip programs on a large file in a WebDAV location (probably zipping a few videos, to get a large archive with a small index). That may give some basis for judging whether the concern about read-seek patterns is warranted.
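One cheap way to observe the access pattern, at least for libraries: wrap the file object and log every seek and read before handing it to a zip reader. A sketch (the archive name is hypothetical):

```python
import zipfile

class TracingFile:
    """Wraps a file and logs every seek/read, revealing the reader's access pattern."""
    def __init__(self, path):
        self._f = open(path, "rb")
    def seekable(self):
        return True
    def tell(self):
        return self._f.tell()
    def seek(self, offset, whence=0):
        pos = self._f.seek(offset, whence)
        print(f"seek -> offset {pos}")
        return pos
    def read(self, size=-1):
        data = self._f.read(size)
        print(f"read <- {len(data)} bytes")
        return data
    def close(self):
        self._f.close()

# An efficient reader should jump straight to the tail of the archive:
with zipfile.ZipFile(TracingFile("videos.zip")) as zf:
    print(zf.namelist()[:5])
```

For GUI tools that only take a path, watching the traffic to the WebDAV share would show the same thing.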

That’s fine. But for the network, some lightweight, fast compression could help download speeds go up. That’s another advantage over projects like BitTorrent, IMO. Every % of extra download speed is welcome.


Also to @polporene: for the 3rd time, tar does not compact files, it only packages them.

Zip, RAR, gzip, xz… they are all widely used. Supporting them all would lead to a flood of work for the core devs. In the future an abstraction API could be devised to interface with all such tools, but in the meantime people:

  • should not use compression if they want cheap reads from Safe archives, and
  • should consider using tar (see the sketch below)
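To make the tar suggestion concrete, a minimal sketch using Python’s tarfile module (the filenames are hypothetical, and offset_data is a CPython implementation detail, so treat it as illustrative):

```python
import tarfile

# Mode "w" writes an *uncompressed* tar: members are stored verbatim and
# contiguously, so one contiguous range read fetches exactly one file.
with tarfile.open("bundle.tar", "w") as tar:
    tar.add("video1.mp4")
    tar.add("video2.mp4")

with tarfile.open("bundle.tar", "r") as tar:
    for member in tar.getmembers():
        # offset_data: where this member's bytes begin inside the archive
        print(member.name, member.offset_data, member.size)
```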

Interestingly, the latest release of Camlistore does exactly what you’re talking about:

Making Camlistore run well on cloud providers required two other major features: the blobpacked storage backend (for latency and cost reasons) and HTTP/2 (for latency reasons).

The blobpacked storage backend allows faster reading & serving of files because it stores related blobs contiguously within larger container blobs (which are also valid zip files) instead of many small randomly dispersed blobs.

Camlistore is itself vaguely similar to this project, though with a different scope (“your personal storage system for life”), so it doesn’t have all the routing awesomeness and the like. It uses an opaque blob store and addresses mutable data in a different (but interesting) way.

My take on the ZIP thing is that it would make a useful addition to a standard library. Even though the core itself isn’t concerned with blobs being related to each other, many applications may still want to optimize based on that relationship.
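As a sketch of what such a library helper could look like: a read-only file object backed by ranged requests, so a stock zip reader only pulls the byte ranges it seeks to. The URL is made up, and this assumes the server (or a SAFE client shim) honours Range-style partial reads:

```python
import urllib.request
import zipfile

class RangeFile:
    """Read-only file over HTTP Range requests; each read fetches only
    the bytes the caller asks for (assumes the server supports Range)."""
    def __init__(self, url):
        self.url = url
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            self.size = int(resp.headers["Content-Length"])
        self.pos = 0
    def seekable(self):
        return True
    def tell(self):
        return self.pos
    def seek(self, offset, whence=0):
        if whence == 0:
            self.pos = offset
        elif whence == 1:
            self.pos += offset
        else:
            self.pos = self.size + offset
        return self.pos
    def read(self, size=-1):
        if size < 0:
            size = self.size - self.pos
        end = min(self.pos + size, self.size) - 1
        if end < self.pos:
            return b""
        req = urllib.request.Request(
            self.url, headers={"Range": f"bytes={self.pos}-{end}"})
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        self.pos += len(data)
        return data

# Listing the archive fetches only the central directory, not the payload:
with zipfile.ZipFile(RangeFile("https://example.com/big.zip")) as zf:
    print(len(zf.infolist()), "entries")
```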

Side note: is there a place to collect ideas for a standard library?