Can I index the internet & replace Google


  • Can I index the internet & replace Google

    Several times every week we get asked the same question. It comes in several variations; here are some quotes from our support e-mails:

    Can Zoom scale to be the size of Google?

    Can I index all of the internet using Zoom?

    I have a list of 30,000,000 web sites, can Zoom index them?

    I want to build a search engine like Yahoo, can you give me step by step instructions?

    Can I index an infinite number of pages?

    Can your search script index the entire web?


    Maybe these people have seen the current share-market valuation of Google and want some of the action. Or maybe they think anyone with an old 386 PC and a dial-up modem can make a new Google. Or maybe these people are doing more dreaming than thinking.

    So let me state this up front: the entire internet will not fit on your hard disk. No, really, I mean it. Not even on that new 400GB hard drive that you have.

    So let me do some quick calculations for you:

    Google indexes about 3,000,000,000 web sites.

    Let's say there is an average of 100 pages per site and an average of 40KB per page (PDF & HTML files).

    This equals storage requirements of 12,000,000,000,000,000 bytes, which is 12,000 terabytes of data (or 12 petabytes).

    Another way to look at it is that you are going to need about 20,000 to 50,000 PC-style computers linked together with smart software. Which, unsurprisingly, is about what Google is using.
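
    For anyone who wants to check the numbers, here is the same back-of-envelope calculation as a quick Python sketch (the site count, pages-per-site, and page-size figures are the rough assumptions above, not measured data):

    ```python
    # Back-of-envelope web storage estimate, using the assumptions above.
    SITES = 3_000_000_000      # approximate sites in Google's index at the time
    PAGES_PER_SITE = 100       # assumed average pages per site
    BYTES_PER_PAGE = 40_000    # assumed 40KB average page size (HTML & PDF)

    total_bytes = SITES * PAGES_PER_SITE * BYTES_PER_PAGE
    print(f"{total_bytes:,} bytes")                   # 12,000,000,000,000,000
    print(f"= {total_bytes / 1e12:,.0f} terabytes")   # 12,000 TB
    print(f"= {total_bytes / 1e15:,.0f} petabytes")   # 12 PB

    # Spread over commodity PCs with 250-600GB of usable disk each:
    for disk_gb in (600, 250):
        print(f"~{total_bytes / (disk_gb * 1e9):,.0f} machines at {disk_gb}GB each")
    ```

    At 600GB per machine that is about 20,000 machines; at 250GB it is about 48,000, which is where the 20,000 to 50,000 range comes from.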

    Now this is a lot of data! We can reduce it by being smart with compression, etc., but whichever way you look at it, it is still a lot of data.

    Storage requirements are also only the tip of the iceberg: you need a warehouse to install your 50,000 PCs, plus some very serious power and cooling infrastructure. Not to mention a massive connection to the internet backbone to index the entire internet.
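
    To put "massive connection" in perspective, here is a rough bandwidth estimate (the 30-day refresh interval is an assumption for illustration, not a figure from this post):

    ```python
    # How big a pipe does a full crawl need?
    total_bytes = 12e15    # ~12PB, from the estimate above
    refresh_days = 30      # assumed interval for a full re-crawl of the index
    seconds = refresh_days * 86_400
    bits_per_second = total_bytes * 8 / seconds
    print(f"~{bits_per_second / 1e9:.0f} Gbit/s sustained")  # ~37 Gbit/s
    ```

    That is around 37 Gbit/s of sustained inbound traffic, around the clock, just to keep the index a month fresh.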

    Now we would LOVE to build a solution like Google for someone. For this reason we have spent a fair amount of time over the last few years talking to people about what would be involved.

    But unfortunately most people are living in a total fantasy land, which means we waste a lot of time trying to talk some sense into them.

    So we are happy to talk to people about indexing the entire internet, but it would be a very, very serious undertaking.

    So if you have a correspondingly serious budget and at least a tentative grasp of reality, please get in contact. Otherwise, keep on dreaming.

    -------
    David
    Wrensoft Web Development

  • #2
    Do people really ask such questions?

    *Chortle* Do people *really* ask this question? If ANYONE thinks that they will be able to index the entire internet, they are absolutely nuts.

    Sure, Google does it, but you have to remember that Google was created in 1996, at the beginning of the internet boom, and was fortunate enough to grow with the internet. During the last ten years the internet has become so vast that it would now be almost impossible to break into the search engine market.

    The only possible defence that I can offer for these "people" is that they are unsure of how Zoom works. The professional version of Zoom is just about the best £60 I've spent on software; it's really, really good. And it is fantastic if you own up to 20 or so websites. But it has to index the sites before it can perform a search.

    And when the guys at Wrensoft say "serious" budget, they mean SERIOUS. Like millions of £s/$s serious.

    Do yourself a favour and please don't ask questions like this...ever.

    • #3
      Yes, people really do ask these questions. Almost always they are people with Hotmail or AOL e-mail addresses.

      I think part of the problem is that Google looks so simple on the outside. It appears to be a single web page, with a single form and a Search button. When I tell people that developing a page like this is going to cost them ~$25,000,000 just for the hardware, they start to ask what I've been smoking.

      The second problem is that there are still a lot of people thinking they can put up any old web site and be filthy stinking rich a few months later. The recent 'million dollar homepage' idea and Google's share market float encouraged this somewhat.

      The third problem is that the world is now a complex place and the education system hasn't really kept up. Very, very few people know how even simple devices like a telephone, a fridge, or an electric motor work. The consequence is that people have no idea how much work is required to build something complex. You see a similar effect on the rentacoder.com web site: people asking for software development that would take several man-years to complete, but offering only $100 for the job.

      The fourth problem is scale. It is hard for people to imagine how big the internet is and to appreciate the size of the problem.

      The final problem is that people are very loose with their words. It is not unusual for people to scale back their request: "When I said I wanted to index the internet like Google, I didn't really mean all of the internet. Just my website will be enough."

      -----
      David

      • #4
        I just came across this nice quote from Jim Lanzone, Ask.com's chief executive, which I thought was relevant to this topic, as he clearly sees a lot of the same type of people.

        "Just like a lot of people who watch movies think they can be scriptwriters, there are a lot of people who use search engines who think they can build a search engine [for the entire internet]"

        "Until you are on the other side of the wall, you will never understand just how difficult it is."

        • #5
          Do you think the software is capable of going into a big market with the right investment and hardware? For example, if someone decided to invest $250,000 in the software to index the data of a whole city, and also rented 100 servers in a datacenter, could that be done with Zoom? This is all just an example, but with the right investment and money, do you think Zoom could handle it, or be able to with some further coding? Would you still be selling the software as a standalone for $500, or would you sell it for $10,000?
          Some years ago I remember Inktomi was selling their software for $50,000, and the software from AltaVista was around $7,000. What happened next was that Inktomi was bought, and the product is no longer available to the public. If someone invests in your software, would you still be selling it to your customers, or would you cut off the person-to-person market right away? If this happened someday I would be sad, since Zoom is great; but on the other hand, if Zoom is so great, I would give you the money so my competition can't use it against me.

          • #6
            We have not tested the software on such a large cluster of machines, so I would imagine there would be scaling problems that would require some R&D and additional software development to solve. We would need a very detailed set of requirements before coming up with a price.

            If someone were to buy us out, I can't speculate on what they might do with the product. But I certainly can't see this happening at the moment. It doesn't make much sense from a business point of view to buy a company only to discontinue its main product.

            • #7
              While It Is Not Easy, Yes You Can!!!

              I understand that this post is very old, and this reply will likely get moderated or not read at all, but I felt that it needed to be here anyway, so I thought I would give it a shot.

              The information in this thread is not entirely correct. I am a developer who works on several open source projects, and things work a bit differently in the open source arena. People freely share code and resources. This opens up many doors that would otherwise be closed.

              With that in mind, I thought anyone interested in this topic might be interested to know that there are open source projects out there that are doing just what this thread claims to be impossible: indexing the internet. These indexes are publicly available for people to access, use, or even download (assuming that you have the disk space) as you see fit.

              I turn your attention to http://www.dotnetdotcom.org, one such project, which at the time of this writing has indexed 44,047,083,451-ish pages.

              I am not saying that it is practical for everyone to index the internet for themselves, because it is not. What I am saying is that if, as developers, we can begin to take a more community-oriented approach to this type of complex problem, then the walls begin to come down.

              Just because something seems daunting does not mean that we should not try.

              And oh yeah, it does NOT have to cost millions. If it did, these open source "free" options would not exist!

              Just something to chew on.

              • #8
                The above is just self-promotion and misleading, or just hopelessly naive. Their public "index" at the time of writing is a tiny 600,000 pages. And their index isn't indexed: it is just a flat-file dump of the pages they downloaded, not an index at all. They claim to have crawled 44B pages, but crawled doesn't mean anything. You can't look at those pages, can't search them, can't even get a copy of them. You also can't get a copy of the tools being used to build your own "index". For anyone wanting to create an internet search function this isn't going to help at all, as there is no search function being offered.

                Furthermore, all the fluff about "free" and "community" is also misleading. There is no community and no open source available for download. They want to sell you a big dump of HTML pages from the web in a text file.

                They claim the text file is a massive 14GB for 600,000 pages, which is about 23KB per page; their dump is hopelessly inefficient. So if they really did crawl 44B pages, storing them the same way would require over 1,000 terabytes, around a petabyte of storage. That is still roughly 500 2TB drives' worth of hardware at today's prices!
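
                As a sanity check, here is the same extrapolation in a few lines of Python (all input figures are their claims, not mine):

                ```python
                # Extrapolating dotnetdotcom's dump size to their claimed crawl.
                dump_bytes = 14e9                # their 14GB sample dump
                dump_pages = 600_000             # pages in that dump
                crawled_pages = 44_047_083_451   # pages they claim to have crawled

                per_page = dump_bytes / dump_pages
                total_bytes = per_page * crawled_pages
                print(f"~{per_page / 1e3:.0f}KB per page")                  # ~23KB
                print(f"~{total_bytes / 1e12:,.0f} TB for the full crawl")  # ~1,000 TB
                print(f"~{total_bytes / 2e12:,.0f} x 2TB drives needed")    # ~500 drives
                ```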

                Even if you did purchase those 500-odd 2TB drives, where do you store them, how do you power them, and how do you process the data on them? Whatever you do with this much data is going to be slow and very expensive.
