Safe Web Crawler

Continuing the discussion from Good list of test sites to browse:

I thought it was impossible to crawl the web on SAFE because of encryption and chunks? And that, if one wants to share something, it needs to be saved through an index-type website. Am I wrong?

4 Likes

Yes, I would be keen to read about this as well. Might be a good tutorial topic! Being able to publish your own sites is great, but sharing is even better.

I think I have yet to figure out even how to see someone else’s public website. :slight_smile:

2 Likes

Crawling SAFE is no different to the clear web.

7 Likes

If the public can read it then a crawler can read it.

The crawler starts off with one safesite, finds any links in it, and then crawls those safesites. Rinse and repeat. (A minimal sketch of that loop is below.)

And as on the clear web today, “dark” sites rarely ever have a crawler touch them, because they are not linked anywhere OR they use ports/security that crawlers obviously cannot get past. The same holds for SAFE: unknown SAFE sites cannot be found by a crawler, OR they are encrypted to the general public.
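
A minimal, network-agnostic sketch of that loop in Ruby, where fetch_page and extract_links are placeholders for whatever SAFE client API you use (working examples follow further down the thread):

def crawl(seed_url)
  visited  = []
  frontier = [seed_url]

  while url = frontier.shift
    next if visited.include?(url)              # don't fetch the same safesite twice
    visited << url

    html = fetch_page(url)                     # placeholder: read the safesite
    frontier += extract_links(html) - visited  # placeholder: pull safe:// links, rinse and repeat
  end

  visited
end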

9 Likes

In Ruby (with the ruby-safenet gem) you can read a website like this:

html = safe.dns.get_file_unauth('test1', 'www', 'index.html')['body']
# => "<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset=\"utf-8\">\n    <title>Test Safenet</title>\n  </head>\n  <body>\n    <a href=\"safe://www.test2\">Link A</a>\n    <a href=\"safe://www.test3\">Link B</a>\n  </body>\n</html>"

Then, you can use Nokogiri or URI.extract to extract the URLs:

URI.extract(html)
# => ["safe://www.test2", "safe://www.test3"]

And then you can break these URLs into protocol/host/path with URI.parse and repeat the process recursively.
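
For example, the URI.parse step breaks each extracted URL down like this (using the safe:// URLs from the snippet above; a Nokogiri alternative is shown too):

require 'uri'

uri = URI.parse('safe://www.test2')
uri.scheme # => "safe"
uri.host   # => "www.test2"
uri.path   # => ""

# or, with Nokogiri, pull the hrefs straight out of the anchor tags instead of URI.extract
# (html here is the page body read in the snippet above)
require 'nokogiri'
Nokogiri::HTML(html).css('a').map { |a| a['href'] }
# => ["safe://www.test2", "safe://www.test3"]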

4 Likes

Here is a more detailed scraper:

require 'safenet'
require 'json'
require 'uri'

# extract safe:// links from a safesite's index page
def get_links(safe, url)
  uri = URI.parse(url)                  # parse url
  service, domain = uri.host.split('.') # www.test1 -> service = 'www', domain = 'test1'

  html = safe.dns.get_file_unauth(domain, service, 'index.html')['body'] # read safe://www.test1/
  html ? URI.extract(html) : []         # extract links if page exists
end

# client
safe = safenet_quick

# load lists of urls (created if they don't exist)
urls_parsed   = JSON.parse(safe.sd.read_or_create('list_urls_parsed', [].to_json))
urls_unparsed = JSON.parse(safe.sd.read_or_create('list_urls_unparsed', ['safe://www.test1'].to_json))

# parses "safe://www.test1" recursively
while url = urls_unparsed.pop
  urls_parsed   << url
  urls_unparsed += get_links(safe, url) - urls_parsed - urls_unparsed # don't re-queue known urls

  # save on the network
  safe.sd.update('list_urls_parsed', urls_parsed.to_json)
  safe.sd.update('list_urls_unparsed', urls_unparsed.to_json)
end

Then you can put this script on cron and develop a website that reads “list_urls_parsed” and displays the scraped pages. Also, you could open up the unparsed list so that everyone can collaborate, using an Appendable Data.

8 Likes

I suppose we could also parse the words in each page, store an index of some form in an appendable/mutable, then an app can query the index… (rough sketch below)
EDIT We need to write this one really well before some pain in the neck comes with tailored ads…
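
Something like this, reusing the ruby-safenet calls from the scraper above (the tag stripping, the tokenising and the 'word_index' structured-data name are all just assumptions for the sketch):

require 'safenet'
require 'json'
require 'uri'

safe = safenet_quick

# naive inverted index: word -> list of safe:// urls whose index page contains it
index       = {}
urls_parsed = JSON.parse(safe.sd.read_or_create('list_urls_parsed', [].to_json))

urls_parsed.each do |url|
  service, domain = URI.parse(url).host.split('.')
  html = safe.dns.get_file_unauth(domain, service, 'index.html')['body']
  next unless html

  words = html.gsub(/<[^>]+>/, ' ').downcase.scan(/[a-z0-9]+/).uniq # crude tag stripping + tokenising
  words.each { |word| (index[word] ||= []) << url }
end

# store the index so a search app can look words up ('word_index' is a made-up name)
safe.sd.read_or_create('word_index', {}.to_json) # make sure the SD exists, as with the url lists above
safe.sd.update('word_index', index.to_json)

A search app would then just read 'word_index' and return index[term] for whatever term the user types.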

3 Likes

This ad is delivered 2 you, through eddyjohn’s mischief network :kissing_heart:

On a serious note, let’s not have one website doing the crawling; it would be nice if end users’ SAFE clients did the crawling in a decentralized way. If at all possible…

3 Likes

haha good one thanks :slight_smile:

the Safe Browser could have a tick box to enable/disable contributing to the crawling effort. It would indeed help to prevent non-objective selection of what is indexed or not, what results are displayed or not, and in what order… and it would be much more efficient for pages with few or no links from outside.
You would need to be very careful not to forget to disable it while you browse your super top secret agent forum, though.
I didn’t take time to verify but I’m sure there is a topic about this somewhere.

1 Like

It is slightly different from the clear web, in that there are no ISPs or other servers in the middle of the network; so there’s no option of a top-1-million Alexa list or similar traffic analysis… it’s all then from the client’s perspective and not from the network’s… at least as far as I understand it.

The only change to that would perhaps be some future Google Analytics-like data from sites that chose to use such a thing, but those sites would already be known, and that would just be an attempt at traffic ranking.

So, the only crawl on SAFE I’ve seen is the one I’ve done, which simply makes sensible guesses at URLs and notes the responses.
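
For what it’s worth, such a guessed-URL probe could look roughly like this, using the ruby-safenet calls shown earlier in the thread (the wordlist and the assumption of a 'www' service are just illustrative):

require 'safenet'

safe = safenet_quick

# candidate public names to probe: a dictionary, forum usernames, popular clear-web domains, ...
wordlist = %w[test1 blog wiki forum maidsafe]

found = wordlist.select do |domain|
  begin
    # try the conventional 'www' service and note whether anything answers
    !safe.dns.get_file_unauth(domain, 'www', 'index.html')['body'].nil?
  rescue StandardError
    false # no such public name, or no readable index page
  end
end

puts found # domains that responded with an index page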

3 Likes

I can see a system where SAFE sites would submit their root page to a “crawler” DB and then the crawler would crawl the site.

Also, on the clear web crawlers can receive notifications of new domain name registrations and include those in their crawl schedule.

2 Likes

So like, the networks crawls itself to create a search engine? :wink:

2 Likes

@davidpbrown: It is slightly different from clear web

@happybeing: Crawling SAFE is no different to the clear web

I think so too: either other pages already known to crawlers link to the new page, or a scan of Gmail messages stumbles over links to it. You can also submit it yourself via their web form for this purpose. All other means of initial indexing derive from this?

About submitting pages yourself, according to @Tim87’s proposal it could be done like this for pages on the Safe-network:

In the thread David seems to like this approach, leading to “answer engines instead of search engines”.

Yup! When Google decides to point its crawlers at safe net, the search problem pretty much goes away. Of course, we may want alternatives, but they have won popularity on the clear web for good search results.

1 Like

@davidpbrown what was your guessing strategy? Did you guess based on a dictionary, on the forum usernames, on popular websites domains?

Hum… At what price…

2 Likes

Sure, but they will crawl safe net regardless if it suits them. They may even provide proxy access, similar to cached access on clear net too.

Posted but thought to look into the following afterwards :no_mouth:

Does Google Sniff Your Gmail to Discover URLs

So here is the question: Does Google scan Gmails to see URLs shared within them, and then does it use these to discover new content? There are many who adamantly maintain that they do.

These people, “Digital Marketing Success By Design”, “decided to put that to the test”: whether or not Google is reading (G)mail in order to see what it might be linking to.

Their surmise: the scanning is for other purposes, just not for pointing the crawler at links in the content.

Still, the murmur of noticed leaks continues, even at the end of the above experiment’s account. Why people would think it’s possible anyway is maybe because of this sort of attention to detail:

Google just dodged a privacy lawsuit by scanning your emails a tiny bit slower

The company won’t do ad scans until after a message hits your inbox on behalf of non-Gmail users, who haven’t agreed to have their emails scanned under Google’s Terms of Service. Because Gmail’s ad-targeting system draws on every email a Gmail user receives, it inevitably catches some messages from non-Gmail addresses. Scans that take place before emails are available to the user are particularly sensitive, since they’re not yet part of Gmail’s inbox. In real terms, that gap lasts only a few milliseconds.

So the data can be used in any other way, as stated (in 2014):

Google admits it’s reading your emails

GOOGLE HAS UPDATED its privacy terms and conditions, eroding a little more of its users’ privacy.

Our automated systems analyse your content (including emails) to provide you personally relevant product features, such as customised search results, tailored advertising, and spam and malware detection. This analysis occurs as the content is sent, received, and when it is stored.

2 Likes

Google does no evil… by redefining good. The small evil for the greater good fallacy is just another symptom of conservative thought that leeches into every area, tempting those who can with more power and wealth.
Reasons we need SAFE: to help avoid those who ‘know’ best what is good for others.

All of the above, plus more… it’s not hard to do. Those who put up sites tend not to be trying to hide them. Naturally, I doubt that I guessed them all, and I know of no sure-fire way to catch every site that exists.

1 Like

What I meant is mostly that if Google indexes Safe (and we can expect they will), then their issue with searching Safe is resolved.
The results they serve are by design oriented towards their profit, and do not necessarily serve the common benefit (some results can purposely be omitted, or buried deep in the ranking).
So even if they solve Safe searching, we will still need to create a non-profit-oriented, decentralized search (just like we still need it for the clear web, btw).

3 Likes