The Corroboree

DARPA's Building a New Search Engine to Crawl the Deep Web


http://motherboard.vice.com/en_uk/blog/darpas-building-a-new-search-engine-to-crawl-the-deep-web?trk_source=recommended

February 10, 2014 // 08:46 PM CET

Image: Memex

The massive internet brain is some 500 times bigger than what we web users can actually see. Search engines only index a fraction of the web pages online, and the rest of the internet remains hidden from view—thousands of terabytes of invisible information. Unsurprisingly, the Defense Department wants to gain access to the internet's hidden data, and it has a plan to create an entirely new search paradigm for the military, law enforcement, and intelligence agencies to use to shine a light on the deep web.

Yesterday DARPA called for proposals to create a next-gen search engine to "revolutionize the discovery, organization and presentation of search results." The project's name, Memex, a portmanteau of "memory" and "index," comes from a way-ahead-of-its-time concept for indexing the world's information that was floated in 1945 by scientist Vannevar Bush, and eventually led to the invention of hypertext, the World Wide Web, and personal computers.

More on that later; first, here's how DARPA plans to access the invisible web. The agency laid out what it sees as the shortcomings of search today: it ignores content shared across web pages, doesn't save browsing sessions, and doesn't let results be shared with collaborators. It doesn't crawl sites that aren't already indexed, organizes results only as a list of links, and requires typing exactly the right text to get the results you're looking for.

Most importantly, it's centralized: search today is a one-size-fits-all product. Instead, DARPA wants a system that can tailor searches to a specific topic or realm of the internet. It would automate the process, continuously crawling the web for a mission-specific subject, and would leverage image recognition and natural-language technology to find content beyond what plugging in keywords can reach.
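The solicitation doesn't describe how such a topical crawler would work, but the basic pattern behind "continuously crawling the web for a mission-specific subject" is well known: a focused crawler that only follows links out of pages it judges relevant. Below is a minimal, hypothetical Python sketch; the keyword list, the `is_on_topic` test, and every other name in it are illustrative assumptions, not anything from the Memex program.

```python
# Minimal sketch of a focused ("mission-specific") crawler.
# Purely illustrative; Memex's actual design is not public here.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

# Assumed topic terms; a real system would use NLP, not a keyword list.
TOPIC_KEYWORDS = {"escort", "recruitment", "trafficking"}

def is_on_topic(text: str) -> bool:
    """Crude relevance test: keep pages mentioning any topic keyword."""
    lowered = text.lower()
    return any(kw in lowered for kw in TOPIC_KEYWORDS)

def focused_crawl(seed_urls, max_pages=100):
    seen = set(seed_urls)
    queue = deque(seed_urls)
    on_topic = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        fetched += 1
        soup = BeautifulSoup(resp.text, "html.parser")
        if not is_on_topic(soup.get_text(" ", strip=True)):
            continue  # prune: don't follow links out of off-topic pages
        on_topic.append(url)
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
    return on_topic
```

That pruning step, refusing to expand off-topic pages, is what separates a focused crawler from the exhaustive crawls the big engines run.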

It would also drastically expand the scope of what is indexed, to include "link discovery and inference of obfuscated links, discovery of deep content such as source code and comments, discovery of dark web content, hidden services, etc.," according to the project report.
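To get a feel for what "deep content such as source code and comments" might mean in practice, here is a rough sketch, again purely illustrative rather than anything from the project report, of harvesting the HTML comments and inline scripts that a conventional keyword indexer throws away:

```python
# Illustrative only: extract page content a keyword indexer normally skips.
from bs4 import BeautifulSoup, Comment

def deep_content(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # HTML comments are invisible to readers but can carry links and notes.
    comments = [c.strip() for c in
                soup.find_all(string=lambda s: isinstance(s, Comment))]
    # Inline scripts can build links dynamically, hiding them from crawlers.
    scripts = [s.get_text() for s in soup.find_all("script")
               if s.get_text().strip()]
    return {"comments": comments, "scripts": scripts}
```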

The idea is eventually to use that personalized indexing to comb through the hoards of information that are in the public domain but currently not indexed. First, though, the military would focus on hunting down human traffickers and the modern-day slave trade, which lives largely on the web in forums, chats, advertisements, job postings, and hidden services. It's also eyeing the realms of counterfeit goods, missing people, and found data.

Naturally, the government trying to pry into every nook and cranny of the internet is a loaded topic right now. But the defense agency claimed, for what it's worth, that while it's sniffing around the deep web it's not trying to out any anonymous users or spy on anyone. It states it's "specifically not interested in proposals for the following: attributing anonymous services, deanonymizing or attributing identity to servers or IP addresses, or gaining access to information which is not intended to be publicly available." But as for exactly how the DoD plans to bust sex traffickers on the hidden web without deanonymizing users or identifying IP addresses, you've got me.

That mystery aside, the mid-century memex contraption that inspired DARPA's latest project is fascinating in retrospect. The agency is drawing on an idea first conceived during World War II and described by Bush in an Atlantic article called "As We May Think."

Bush wrote that when the war was over, scientists should get to work on the "massive task of making more accessible our bewildering store of knowledge." Decades before the personal computer came along, Bush imagined a "device," which he named the memex, that would be used as a mechanism for finding and organizing the world's information, basically acting as a mechanical backup for the human brain.

He imagined a desk with a keyboard, buttons, levers, and two slanted translucent screens for reading. It could store troves of information: books, articles, and scientific work, all stored on microfilm. Users would consult the record by inputting a code to pull up a certain book, then pulling a lever to scan through the pages backward and forward. They could also use a stylus to take notes on the second screen.

But where Bush's proto-hypertext vision deviates from modern-day search is that he envisioned being able to save and build on "trails" of information gathering: like going down a series of Wikipedia rabbit holes, then being able to save that adventure, recall it later, and share it with other researchers.

Per As We May Think:

Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified. The lawyer has at his touch the associated opinions and decisions of his whole experience, and of the experience of friends and authorities. The patent attorney has on call the millions of issued patents, with familiar trails to every point of his client's interest. The physician, puzzled by a patient's reactions, strikes the trail established in studying an earlier similar case, and runs rapidly through analogous case histories, with side references to the classics for the pertinent anatomy and histology.
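Bush's "trail" maps neatly onto a simple data structure: a named, ordered sequence of documents carrying the reader's annotations, which can be saved, replayed, and shared. Here is a minimal Python sketch; the class and field names are my own, not Bush's or DARPA's.

```python
# A memex-style "trail": a named, shareable sequence of documents
# plus the reader's annotations. All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class TrailStop:
    url: str
    note: str = ""  # stylus-style marginalia

@dataclass
class Trail:
    name: str
    stops: list[TrailStop] = field(default_factory=list)

    def extend(self, url: str, note: str = "") -> None:
        """Append the next document visited on this research path."""
        self.stops.append(TrailStop(url, note))

# Usage: record a research session, then hand it to a collaborator.
trail = Trail("memex-history")
trail.extend("https://en.wikipedia.org/wiki/Memex", "Bush's 1945 concept")
trail.extend("https://en.wikipedia.org/wiki/Hypertext", "what it inspired")
```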

In a nutshell, Bush wanted to mimic how the human brain thinks, learns, and remembers information, which is exactly what artificial intelligence researchers at the DoD and in Silicon Valley are trying to do now, to glean better insights from the unruly mass of big data being collected by web giants and the military alike.

Now DARPA plans to extend that next-gen capability to the deep web, or at least try to, a rather unsettling prospect regardless of the agency's no-spying disclaimer. While I'm all for improving search and unveiling the internet's untapped information, what are the implications for people with good reason to stay in the digital dark: users trying to evade censorship, whistleblowers, journalists, and activists? Exactly how much light does the military want to shine on the hidden web?


If you think they don't have it already (you're kidding yourself!)


If the search capabilities of this project are made public, I don't have a problem with this. Easy access to information is important for everyone, not just secret intelligence agencies; ethnobotanists, even.


If you think they don't have it already (you're kidding yourself!)

A 2½-month turnaround from the first proposal; that's quite an achievement.


Reading this... I'm not sure they mean Tor/.onions? Only a few percent of the web is actively indexed by the big search engines. There's plenty of shit out there that isn't indexed or is on private networks. I could be wrong though...

