
If we want to answer questions such as how many websites exist in a given language, how many webshops there are, or which content management systems are the most popular, we don’t necessarily have to start from scratch: we can call on the work of organizations that archive information on the web, scan the internet for security research, collect data about websites for marketers, or organize information encyclopedically or in the form of a searchable database.

Data from the archivers of the web

There are two important organizations that archive a significant part of the web. One is the well-known Internet Archive, and the other is Common Crawl, which is more of interest to a professional audience.

The Internet Archive continuously crawls and saves web pages found on the web, and also makes earlier versions of individual web pages available on an easy-to-use web interface. Its database therefore contains a large number of defunct domain names and web pages. Nevertheless, thanks to its large coverage it is a very useful tool, e.g. for those who aim to collect data from a certain segment of the web.
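
As an illustration, the Internet Archive’s CDX index can be queried over plain HTTP. Below is a minimal sketch using only the Python standard library; the parameters shown are a small subset of what the CDX API supports, and example.hu is just a placeholder domain:

    import json
    import urllib.request

    # Ask the Wayback Machine CDX API for captures under a placeholder domain.
    # matchType=domain also returns captures of subdomains; output=json makes
    # the response a JSON array whose first row is the list of field names.
    CDX_URL = (
        "https://web.archive.org/cdx/search/cdx"
        "?url=example.hu&matchType=domain"
        "&output=json&fl=original,timestamp,statuscode&limit=50"
    )

    with urllib.request.urlopen(CDX_URL, timeout=30) as resp:
        rows = json.loads(resp.read().decode("utf-8"))

    if rows:
        for original, timestamp, status in rows[1:]:  # skip the header row
            print(timestamp, status, original)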

Common Crawl crawls the web roughly every two months and publishes the collected web pages as a separate dataset. For this reason, a very large proportion of the collected data consists of web pages that are still operating, although the overlap between the individual crawls is not necessarily 100%, so it is worth working with results originating from multiple crawls. The raw data is also further processed in many places and forms; see, for example, Web Data Commons.

Both organizations allow you to retrieve an index of the web pages they collect in CDX format, which is a good starting point for further investigation, and Common Crawl also publishes domain-level databases.
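
The Common Crawl index can be queried in much the same way. The sketch below, again standard library only, first reads the list of available crawls from collinfo.json and then asks the newest crawl’s CDX-style index about a placeholder domain; for bulk work, the downloadable index files and domain-level datasets are the better option:

    import json
    import urllib.request

    # The list of crawls, each entry containing the URL of its CDX-style index API.
    COLLINFO = "https://index.commoncrawl.org/collinfo.json"

    with urllib.request.urlopen(COLLINFO, timeout=30) as resp:
        crawls = json.loads(resp.read().decode("utf-8"))
    newest_index = crawls[0]["cdx-api"]  # e.g. https://index.commoncrawl.org/CC-MAIN-...-index

    # One JSON object per line; limit keeps the example small. A domain with no
    # captures in the given crawl comes back as an HTTP error, which a real
    # script should handle.
    query = newest_index + "?url=example.hu&matchType=domain&output=json&limit=20"
    with urllib.request.urlopen(query, timeout=60) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            record = json.loads(line)
            print(record.get("timestamp"), record.get("status"), record.get("url"))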

A typical example of a local web archiving initiative is the Hungarian National Széchényi Library’s web archive, which aims to archive the Hungarian web. Unfortunately, unlike the above-mentioned projects, its harvested data is not available for further use due to legal reasons.

Databases for internet security research

When you want to know more about the web at scale, it is useful to get acquainted with cybersecurity professionals and their work. For instance, Project Sonar aims to scan all IP addresses and the services available on them, and publishes its results on the Rapid7 Open Data website. By scanning the servers and ports that serve websites, a lot of domain names can be obtained as a byproduct. The data created by Project Sonar used to be publicly available, but nowadays it is only accessible for specific purposes and audiences.
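
To illustrate how domain names fall out of such scans as a byproduct: connecting to port 443 of an address and reading the subject alternative names from the certificate that is presented already yields host names. A minimal sketch, assuming the third-party cryptography package for parsing and a placeholder IP address:

    import socket
    import ssl

    from cryptography import x509  # third-party: pip install cryptography

    def domains_from_ip(ip: str, port: int = 443, timeout: float = 5.0) -> list[str]:
        """Connect to an IP over TLS and return the DNS names in its certificate."""
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
        ctx.check_hostname = False       # we connect to a bare IP, not a host name
        ctx.verify_mode = ssl.CERT_NONE  # we only want the cert, not to trust it
        with socket.create_connection((ip, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock) as tls:
                der = tls.getpeercert(binary_form=True)
        cert = x509.load_der_x509_certificate(der)
        try:
            san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
            return san.value.get_values_for_type(x509.DNSName)
        except x509.ExtensionNotFound:
            return []

    # Placeholder IP: replace with addresses coming from your own scan results.
    print(domains_from_ip("93.184.216.34"))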

Primarily for security research purposes, Censys also offers data obtained with a similar “we scan everything” approach. This project not only gives researchers access to its databases, but also makes them searchable by many parameters through an online search interface, so even if we cannot access all of the data, we can still gain insight into what information is collected there.

Another player in this league is netlas.io. On its intuitive search interface, you can search for any IP address or domain name, including all .hu domains, of which 645,724 were registered at the time of writing. You can get limited access to the data with a free tier or different paid subscriptions.

And if you want to scan everything yourself, it might be useful to familiarize yourself with the ZMap project.

SEO link databases

Search engine optimization professionals also need lots and lots of data about different websites. The most useful information for our purposes is the data about who links to whom. Since Hungarian-language sites most often link to other Hungarian-language sites, and the majority of external links pointing to a specific Hungarian site will also come from Hungarian pages, this can help a lot in discovering new websites — especially if, for example, we want to collect domains by language rather than by domain name ending. Larger link databases are offered by, for example, Ahrefs, Majestic SEO or SEO SpyGlass.

Collectors of encyclopedic data

Due to their nature, Wikipedia and OpenStreetMap do not store as many domain names or websites as the sources mentioned above. Their big advantage, however, is that these URLs appear in a much more structured data environment. Last but not least, both projects allow you to download their raw database, so you don’t have to scrape their websites to get the data.
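
For example, host names can be mined from a Hungarian Wikipedia dump with little more than a regular expression. A rough sketch, assuming the pages-articles dump downloaded from dumps.wikimedia.org (the file name follows the usual naming convention; check the current dump listing):

    import bz2
    import re
    from urllib.parse import urlparse

    # Path to a dump downloaded from https://dumps.wikimedia.org/huwiki/latest/
    # (standard file name assumed; check the dump listing for the current one).
    DUMP = "huwiki-latest-pages-articles.xml.bz2"

    URL_RE = re.compile(r"https?://[^\s<>\[\]|\"']+")

    hosts = set()
    with bz2.open(DUMP, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            for url in URL_RE.findall(line):
                host = urlparse(url).hostname
                if host:
                    hosts.add(host.lower())

    print(len(hosts), "distinct host names found in the dump")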

Search engine databases

The most up-to-date and largest amount of data about the web is collected and made available by search engines such as Google, Bing or Yandex. The only problem is that their service, unlike the above-mentioned search interface of netlas.io, was not designed so that querying their database could directly benefit our research, so data can only be extracted from them by detours: figuring out which typical search queries turn up the most new, unknown domains, and then performing the searches slowly, or distributing the queries across several places, so as not to get banned in the process.
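
The pacing itself is easy to sketch. In the snippet below, search() is a purely hypothetical placeholder for whichever engine or API is being queried, and the example queries and delays are arbitrary:

    import random
    import time
    from urllib.parse import urlparse

    def search(query: str) -> list[str]:
        """Hypothetical placeholder: run one query against some search engine or
        API and return result URLs. Replace with a real implementation that
        respects the terms of service of whatever you query."""
        return []

    # Example query ideas for surfacing Hungarian sites; tune them to your goal.
    queries = ['"webáruház"', '"impresszum"', 'site:.hu "kapcsolat"']

    seen_domains = set()
    for q in queries:
        for url in search(q):
            host = urlparse(url).hostname
            if host:
                seen_domains.add(host.lower())
        # Pause between queries; spreading them over time (and across several
        # machines or IP addresses) lowers the chance of getting banned.
        time.sleep(random.uniform(20, 60))

    print(len(seen_domains), "domains collected")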

Social media platforms

Also, don’t forget that more and more information is moving from the open web to closed, walled gardens such as social media platforms. These platforms often link to the open web, that is, to traditional websites hosted on their own domain names, so they could also be an ideal basis for our discovery. The catch is that scraping these platforms is even more problematic, both from a technical and a legal point of view.

More options

Like the real world, the web is extremely diverse, so it is impossible to list all the ways in which we can explore it, or a specific segment of it. Above, I have only scratched the surface by mentioning some of the larger databases according to their main types, without trying to list every available service provider within these categories.

In addition to all of this, there are also many data sources of general or local interest that can help us, such as:

  • It might not work for every domain name ending, but the .hu registrar, for example, regularly publishes newly registered domain names on its website, so it is worth checking the relevant registrar’s website too.
  • Although link directories are out of fashion these days, in Hungary, there are resources such as the lap.hu pages, from which we can easily extract relatively structured data.
  • Last but not least, it is possible to crawl the web without using any of the above data sources. Applying this method can also help us assess how much of the web we could discover compared to starting from data originating in others’ databases.

Discovering the web from scratch

In the first step, we enumerate the characters that can be used in domain names ([a-z0-9-], bearing in mind that a name cannot start or end with a hyphen) and combine them to generate all technically possible variations: 36 + 36*36 + 36*37*36 + 36*37*37*36 + 36*37*37*37*36 + …
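
A quick calculation reproduces the total mentioned in the next paragraph (36 characters may stand at the first and last position, all 37 anywhere in between):

    # Count all syntactically possible labels of length 1 to 5:
    # [a-z0-9] (36 characters) at the first and last position,
    # [a-z0-9-] (37 characters) at any inner position.
    def label_count(length: int) -> int:
        if length == 1:
            return 36
        return 36 * 37 ** (length - 2) * 36

    total = sum(label_count(n) for n in range(1, 6))
    print(f"{total:,}")  # 67,469,796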

If we only want to comb through every domain name consisting of at most five characters, we already exceed 67 million possible combinations. At first this seems like a huge number, but since most of these domains are not in use, the answer that the name does not exist comes back quickly, so the list can be processed relatively fast.
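
A minimal sketch of this brute-force pass, generating candidate labels and checking them under .hu via DNS (registry-specific naming rules, query rate limits, and proper error handling are all glossed over here):

    import itertools
    import socket
    from concurrent.futures import ThreadPoolExecutor

    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
    INNER = ALPHABET + "-"  # the hyphen is allowed only in inner positions

    def candidates(max_len: int):
        """Yield every syntactically possible label up to max_len characters."""
        for length in range(1, max_len + 1):
            if length == 1:
                yield from ALPHABET
                continue
            for first in ALPHABET:
                for middle in itertools.product(INNER, repeat=length - 2):
                    for last in ALPHABET:
                        yield first + "".join(middle) + last

    def resolves(label: str, tld: str = "hu") -> str | None:
        """Return the domain if it resolves in DNS, None otherwise."""
        name = f"{label}.{tld}"
        try:
            socket.getaddrinfo(name, None)
            return name
        except socket.gaierror:
            return None

    # Probe the two-character names as a small demonstration; longer lengths
    # need far more queries and a polite query rate.
    with ThreadPoolExecutor(max_workers=20) as pool:
        for hit in pool.map(resolves, candidates(2)):
            if hit:
                print(hit)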

Once we are done with this step, as a by-product we will also discover all kinds of other domain names to which some of these short domain names redirect. However, most new domains longer than five characters will be discovered by parsing the web pages we have just downloaded and checking where these sites link to.
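
Extracting the outbound links needs nothing beyond the standard library; a sketch using html.parser (the URLs in the usage example are made-up placeholders, and real pages of course also require fetching and error handling):

    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkCollector(HTMLParser):
        """Collect the host names of all links found in an HTML page."""

        def __init__(self, base_url: str):
            super().__init__()
            self.base_url = base_url
            self.hosts = set()

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            for name, value in attrs:
                if name == "href" and value:
                    host = urlparse(urljoin(self.base_url, value)).hostname
                    if host:
                        self.hosts.add(host.lower())

    # Usage: feed in HTML fetched from a known site and look at the hosts found.
    html = '<a href="https://masikoldal.hu/cikk">cikk</a> <a href="/belso">belső</a>'
    collector = LinkCollector("https://ismert-oldal.hu/")
    collector.feed(html)
    print(sorted(collector.hosts))  # ['ismert-oldal.hu', 'masikoldal.hu']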

If we already have a substantial number of working domain names, including longer ones, we can continue scanning the permutations resulting in longer domain names by excluding combinations that do not resemble potentially meaningful domain names at all, such as 4h6u3a.hu — for example, based on the probability of one character following another in the already known domain names, checking occurrences of character pairs, triplets, and so on.
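
One simple way to implement such a filter is a character-bigram model trained on the already known domain names; a minimal sketch (the toy training list, the smoothing constant and any cut-off threshold are arbitrary choices):

    import math
    from collections import Counter

    def train_bigrams(known_labels):
        """Count character pairs in already known domain labels."""
        counts = Counter()
        for label in known_labels:
            padded = f"^{label}$"  # mark the start and end of the label
            counts.update(zip(padded, padded[1:]))
        return counts

    def log_score(label, counts, alpha=1.0):
        """Average log of smoothed bigram frequency; higher means more name-like.
        alpha is an arbitrary smoothing constant."""
        total = sum(counts.values())
        vocab = len(counts) + 1
        padded = f"^{label}$"
        pairs = list(zip(padded, padded[1:]))
        return sum(
            math.log((counts[p] + alpha) / (total + alpha * vocab)) for p in pairs
        ) / len(pairs)

    # Toy training set standing in for a real list of known .hu domain labels.
    known = ["index", "telex", "hirado", "szallas", "vatera", "jofogas"]
    model = train_bigrams(known)

    for candidate in ["szalado", "4h6u3a"]:
        print(candidate, round(log_score(candidate, model), 2))
    # A plausible-looking label should score noticeably higher than 4h6u3a.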

Conclusion

Depending on our financial possibilities and determination, we can work with many different data sources if we want to collect a large number of websites for some purpose, e.g. to answer a basic question like how many Hungarian websites exist. Some sources already provide the necessary information without further transformation; from others we have to extract the data while taking into account the peculiarities of the collection process and the data format; and in the most difficult case we have to write our own script to extract the publicly available information we are interested in from many web pages.