Ensuring data integrity on Safe

Of course we could simply not deliver any chunks currently being tested, and so on. Lots of wee edge cases, but this is where I reckon our testing will be invaluable. We have not implemented the checks yet as they are not a priority right now; when we do, there are a few rules to be finalised. The bottom line, though, is to check via a random nonce and make the Adult deliver the answer.
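As a rough sketch of what that check could look like (assuming SHA-256 and a hypothetical message layout; how the checker obtains the expected digest, e.g. from another holder or during a Get, is left open here, and the real rules are still to be finalised as noted above):

```rust
use sha2::{Digest, Sha256};

/// What the Adult returns: H(nonce || chunk). Because the nonce is fresh and
/// random, the Adult cannot precompute or cache the answer without the chunk.
fn prove_storage(nonce: &[u8; 32], chunk: &[u8]) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update(nonce);
    h.update(chunk);
    h.finalize().into()
}

/// The checking side: compare the Adult's answer against a digest computed
/// over a copy the checker trusts.
fn check(nonce: &[u8; 32], trusted_copy: &[u8], answer: [u8; 32]) -> bool {
    prove_storage(nonce, trusted_copy) == answer
}
```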

There is a chance he has it elsewhere and that is OK, as long as he can get it quickly. There is a possibility he could take a chance on that, regardless of where it is, perhaps even on Safe, but is it worth the risk to him? If we did an update to catch more edge cases, then he could be in just as much trouble again.

[we are both assuming old, rarely used data here, so Get rewards etc. don’t count and that’s cool]

6 Likes

I think there are many ways we could do this. Just shooting the breeze here, but latency or similar is a huge issue. Gonna deviate here a bit, so forgive me.

So failing to reply to a heartbeat (at the network layer) can mean the node is offline. However, that’s not enough.

A node must be online and responsive (i.e. behave), and that is where latency fights with consistency (even eventual consistency). A node has to respond in a time deemed reasonable and has to respond to each request. A node that cannot do this needs to be penalised. So there is a very tricky balance to find there. I go for consistency mostly, so allowing nodes to be out of sync by a bit is fine, but too much is dangerous to the network in terms of user experience.
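For illustration only, the "respond within a reasonable time or be penalised" part could be as simple as a deadline check like the one below; the timeout value and the penalty mechanism are not specified in the post, and the types are hypothetical:

```rust
use std::time::{Duration, Instant};

/// A request we sent to a node and are waiting on.
struct PendingRequest {
    sent_at: Instant,
    deadline: Duration, // what the section deems a "reasonable" response time
}

impl PendingRequest {
    /// True if the node has taken longer than the agreed deadline and should
    /// therefore attract some penalty.
    fn overdue(&self) -> bool {
        self.sent_at.elapsed() > self.deadline
    }
}
```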

Back to the issue.

I love to look at these problems as “so what” issues. Let’s say the node has not stored the data but tries to get it some other way and thinks: this is great, I get to ignore storing all that old stuff (let’s assume it’s never requested in the node’s life in this section). Is it a problem? What if the others think the same, etc.?

Then looking at fixes: let’s say there is a check happening on that old, rarely requested data, and during the check we delay any Get until after the check completes. Then we prevent this attack of using Safe as tertiary storage, and it’s probably OK as we will still be much faster than, say, Amazon tertiary storage? So on Safe, old data might in some rare circumstances be slightly slower to retrieve. It could be a nice solution that’s not too complex and one we can live with? What are your thoughts on that @Antifragile
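A minimal sketch of the "hold Gets while the chunk is under check" idea, with hypothetical types (the real store and request handling would of course look different):

```rust
use std::collections::HashMap;

enum ChunkState {
    Stored,
    UnderAudit, // a nonce check is in flight; park client Gets until it completes
}

struct ChunkStore {
    state: HashMap<[u8; 32], ChunkState>,      // chunk name -> state
    queued_gets: HashMap<[u8; 32], Vec<u64>>,  // chunk name -> waiting request ids
}

impl ChunkStore {
    /// Serve a Get immediately unless the chunk is being audited, in which
    /// case queue the request and answer it once the audit finishes.
    fn handle_get(&mut self, name: [u8; 32], req_id: u64) {
        match self.state.get(&name) {
            Some(ChunkState::UnderAudit) => {
                self.queued_gets.entry(name).or_default().push(req_id);
            }
            _ => { /* reply with the chunk straight away */ }
        }
    }
}
```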

4 Likes

In fact, during a check, if there is a Get request we get all the data anyway, so we can check the node gave it to us, even without going through the nonce part of the process. By then we have the actual self-validating data (name == hash of content).
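Since the name is the hash of the content, that verification is just a recomputation. A minimal sketch, assuming SHA-256 as the hash for illustration:

```rust
use sha2::{Digest, Sha256};

/// Immutable chunks are self-validating: recompute the hash of the returned
/// content and compare it with the chunk's name.
fn is_valid_chunk(name: &[u8; 32], content: &[u8]) -> bool {
    let digest: [u8; 32] = Sha256::digest(content).into();
    &digest == name
}
```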

1 Like

I believe disk space is also much cheaper than bandwidth. Economically, it wouldn’t make much sense to not store it and call out for it instead.

Of course, if you are on an unmetered bandwidth connection, maybe you don’t care. Even then, if you have to wait for another node to do a retrieval of its own before you can reply to the original request, it isn’t going to look good.

2 Likes

What - if anything - do we lose by the Adults in a section being unaware of the other Adults in that section?

1 Like

I am not sure we lose anything just yet. We will investigate more, though. Only Elders need to know, and they should be the fountain of all knowledge there.

2 Likes

Hmmm, I’m not sure about that. We do lose confidence in the real degree of redundancy. We don’t know if redundancy for that chunk is 8 or 7 (or less). It’s “only one difference” but I wonder how slippery that slope is. Not to say this is a deal breaker, but it seems to me there is some (maybe small) loss when that node sneakily fetches from another node.

In this case there would be some sort of ‘nested’ or overlapping request for that chunk which would not normally happen. Elders can detect this, even when the node is using a different client to request with. Detecting nested requests may introduce some false positives but if there’s statistical variation for one particular node then it’s a red flag.
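Purely as an illustration, Elders could count such overlaps per node and look for the statistical deviation described above; the structures and thresholds here are hypothetical:

```rust
use std::collections::{HashMap, HashSet};

#[derive(Default)]
struct AuditStats {
    in_flight: HashSet<[u8; 32]>,          // chunks currently being served
    overlaps_per_node: HashMap<u64, u32>,  // node id -> suspicious overlap count
}

impl AuditStats {
    /// A Get that arrives while the same chunk is already being delivered is
    /// the "nested" request: count it against the node suspected of issuing it.
    fn on_get(&mut self, chunk: [u8; 32], suspected_node: Option<u64>) {
        if self.in_flight.contains(&chunk) {
            if let Some(node) = suspected_node {
                // One overlap can be a false positive; a node whose count keeps
                // deviating from its peers is the red flag.
                *self.overlaps_per_node.entry(node).or_insert(0) += 1;
            }
        }
        self.in_flight.insert(chunk);
    }

    fn on_delivered(&mut self, chunk: &[u8; 32]) {
        self.in_flight.remove(chunk);
    }
}
```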

Immutable data is easy to audit.

Old mutable data is easy to audit because it should be consistent across all nodes.

Frequently / recently updated mutable data seems harder to audit but maybe it doesn’t need to be audited since the frequent access acts like an audit anyhow.

Will be very interesting to see how this audit mechanism goes in real life.

Is this only for immutable data? Can mutable data be checked by hashing the content?

6 Likes

Yes, only immutable. Adults hold all immutable data at the moment. Elders hold all transactional data. For client data, when it’s all CRDT, there are pros and cons. A client can have copies of the latest state and be looking for an update, etc.

The fan-in / fan-out paradigm here will be tweaked in testnets. So when a client asks for mutable data ABC, he can ask >majority of the nodes for it (fan in) or just one or a few nodes. On write, he can write to >majority (fan out) to ensure they have the latest state. However, to merge (sending multiple ops in an op-based CRDT is basically a merge), the client on write should write to all nodes.
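The ">majority" acceptance on the write path is just a counting rule; a trivial sketch for illustration:

```rust
/// Illustrative only: the client fans the op out to every holder (needed for
/// the merge to reach all replicas), but treats the write as accepted once
/// strictly more than half of the holders have acknowledged it.
fn write_accepted(ack_count: usize, holder_count: usize) -> bool {
    ack_count > holder_count / 2
}
```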

As Elders merge data, CRDT allows them all to get to the latest state, assuming no writes are in progress; if there are, the client will get close to the latest, as we use causal order. So the most current data, even if not yet merged, will be delivered: kind of the best of both worlds. The majority in CRDTs that are signed is not that important, as all a node can do to cheat is omit some of the latest ops, and another node can give us those if we ask. I.e. our client data CRDTs will be fraud proof (which just means we will sign the Dot (order) plus the operation). An interesting thing currently under review in house. Seems like a big win, never mind the offline working.
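A rough shape of "sign the Dot plus the operation"; the Dot fields, op encoding and signature scheme below are placeholders, not the actual types under review:

```rust
/// The Dot gives each op its place in the causal order.
struct Dot {
    actor: u64,   // who produced the op
    counter: u64, // that actor's causal counter
}

/// An op together with a signature over (dot, op) made with the writer's key.
struct SignedOp<Op> {
    dot: Dot,
    op: Op,
    sig: Vec<u8>,
}

impl<Op: AsRef<[u8]>> SignedOp<Op> {
    /// A node can withhold ops, but it cannot forge or reorder them: any
    /// holder can re-verify the signature over the dot and the op bytes.
    fn verify(&self, verify_sig: impl Fn(&[u8], &[u8]) -> bool) -> bool {
        let mut msg = Vec::new();
        msg.extend_from_slice(&self.dot.actor.to_le_bytes());
        msg.extend_from_slice(&self.dot.counter.to_le_bytes());
        msg.extend_from_slice(self.op.as_ref());
        verify_sig(&msg, &self.sig)
    }
}
```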

CRDTs in a dynamic permissionless network are interesting, as we look for consistency (CRDTs give strong eventual consistency) and availability (the network keeps running). But the big win is partitioning as far as data is concerned, as we can handle multiple partitions naturally. The network-agreed parts are more tricky, but then that’s only a bonus for client data.

5 Likes

This isn’t really possible since dirvine’s nonce based audit will be checking all the chunks in the section at a routine interval.

A few random thoughts on this topic:

  1. It would appear that chunk auditing/checking by Elders could be made inexpensive/efficient if, instead of asking the target vault to hash the entire chunk with a nonce, it was told to hash a nonce with some random subset of the 4 KiB sectors within the chunk at a (randomly) chosen XOR address. The size of the subset could also be random, from 1 up to a max of 256 sectors for 1 MiB chunks. In fact, the nonce+sector hashes for many XOR addresses within a single vault could be sequentially binned together to form a standard 1 MB or 1 MiB “audit chunk” that would be returned from the younger vaults to the Elders (see the sketch after this list).

  2. A standard framework that checks 1 MB chunks for client GET requests could also seamlessly check 1 MB audit chunks during Elder audit requests. Case A: Elders handle a client GET request. They ask for the chunk from the close group, check the returned chunks for any bad data, pass the known-good chunk back to the client, and assign penalties to the nodes that acted poorly or gave bad data. Case B: Elders generate an audit request. A standard 1 MB audit chunk is generated by the targeted vaults based on a set of XOR addresses and a subset of random 4 KiB sectors within each XOR address, provided by the Elder. The audit chunks from the close group are returned to the Elders, who check them for consistency and pass the audit chunk to /dev/null. Any faulty nodes are then penalized.

  3. The chunk audit/checking process itself serves as a kind of network heartbeat. Not only does it serve as a heartbeat check for a vault, but it also checks that the vault is maintaining all chunks and can serve them.

  4. The rate at which data is requested from vaults could be made constant while ensuring QoS. For example, a 20 Mbit internet connection can support about 2 chunks per second. If clients are only requesting about 1 chunk per second, the balance could be made up with 1 “audit chunk” per second from the Elders, and vice versa. When client usage is at its maximum, the GET itself is used to check integrity, whereas when there is no client activity the full vault bandwidth is used for filling audit-chunk requests from the Elders. No matter what, the vault would always supply 2 chunks per second.

  5. In the example above, a constant 2 chunks per second of upload, 24/7/52, might make your ISP angry. Audit rates could be randomized to maintain a lower average rate according to bandwidth limits reported by the vault.
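The sector-subset audit from point 1 could look roughly like this; SHA-256, the sector indexing and the audit-chunk packing are all assumptions for illustration:

```rust
use sha2::{Digest, Sha256};

const SECTOR: usize = 4096; // 4 KiB sectors within a chunk

/// Digest over the nonce and the requested sectors (by index) of one chunk.
fn sector_audit_digest(nonce: &[u8; 32], chunk: &[u8], sector_indices: &[usize]) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update(nonce);
    for &i in sector_indices {
        let start = i * SECTOR;
        let end = chunk.len().min(start + SECTOR);
        if start < end {
            h.update(&chunk[start..end]);
        }
    }
    h.finalize().into()
}

/// Bin the digests for many audited chunks into one buffer ("audit chunk")
/// to return to the Elders.
fn build_audit_chunk(digests: &[[u8; 32]]) -> Vec<u8> {
    digests.iter().flat_map(|d| d.iter().copied()).collect()
}
```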

6 Likes

I don’t really understand any of this, but just to get the picture:

When a client asks for data, what does it ask for, specifically? Chunks? If yes, then I have nothing to add.

If it sends the data map to the Elders as a list: “May I have these, please?”, and the Elders then check “OK, where are these chunks?”, then the Adults (or clients) don’t need the permission/capability to ask for individual chunks. And they don’t need to know which data map the chunks belong to, so they cannot pass on the request? So the request for a chunk is valid only if it comes from an Elder.

2 Likes

The size of the chunk can range from 1KB to 1MB. Any verification will have to consider this factor.

1 Like

Yes, for immutable data the client asks individually for each chunk.

3 Likes

Hard disk sectors are normally 4096 bytes. Even if a chunk is 1 KB it will still occupy 4 KiB on disk unless the filesystem used has an efficient fragment-packing method. Also consider that the metadata associated with a 1 KB payload will take up space too.

IMO an ideal scenario would be to have the chunk payload plus metadata aligned to 4KiB sector sizes.

This means that one has 96 bytes available for metadata given a chunk payload of 4000 bytes (4 KB + 96 B = 4 KiB). If the chunk payload is 1 KB, you could have up to 3096 B for metadata.
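That arithmetic as a small helper, for illustration:

```rust
const SECTOR: usize = 4096; // bytes per disk sector

/// Bytes left for metadata if the whole record is padded to sector boundaries.
fn metadata_budget(payload_len: usize) -> usize {
    let sectors = (payload_len + SECTOR - 1) / SECTOR; // round up to next sector
    sectors * SECTOR - payload_len
}

// metadata_budget(4000) == 96, metadata_budget(1000) == 3096
```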

Does anyone know how much space is currently occupied by a chunk’s metadata?

3 Likes

Yes, but the Elders must know each chunk’s size to make a correct verification. Not sure that’s right.

1 Like

The chunk’s metadata is split up: there is the owner’s metadata, the data map. Then there is the network’s metadata, or admin data. That data is stored on Elders so they know where the chunk is stored, how many copies there are, etc.
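Very roughly, and with guessed field names purely for illustration, the split looks like this:

```rust
/// Owner's metadata: an entry in the data map the client keeps to reassemble a file.
struct DataMapEntry {
    chunk_name: [u8; 32], // name == hash of the chunk content
}

/// Network's admin metadata, held by the Elders of the section.
struct NetworkChunkMetadata {
    holders: Vec<u64>, // which Adults currently hold a copy
    copies: u8,        // how many copies the section maintains
}
```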

4 Likes