
A place for everyone who wants to know more about the Hungarian web – either for business purposes or scientific interests.

The site features interesting articles and offers data about the Hungarian-language web’s past, present and future.

The initiative is currently backed by József Jároli, but everyone is welcome to join by guest-posting or by sharing their knowledge and data in other ways.

If you are interested in the data or expertise behind the articles on the site, call at +36-70-512-9874 or write to the email address: j at jaroli.hu

The articles and data found on this website can be freely re-used as long as you adhere to the Creative Commons Attribution 4.0 International License.

The website’s design is defined by the sixpack WordPress theme.

There is surprisingly little information about the size of the Hungarian web. Perhaps this is because, for those who deal with websites on a daily basis, the question is like asking why the sun shines. We never really think about it; the answer may seem simple, but the more we immerse ourselves in the subject, the less simple it becomes.

To answer this question, we must first clarify what we consider to be Hungarian and a website. In my opinion: a web presence that has substantial content in Hungarian, published on a unique domain name.

860 thousand Hungarian domain names?

I think that we can safely consider the existence of a unique domain name as a minimum requirement, since how can you take content seriously if its owner has not even invested a couple of euros in publishing it?

According to the statistics of domain.hu (the official .hu registry), approx. 860,000 domain names were registered under the .hu ccTLD in February 2023. However, this does not at all mean that there are that many active Hungarian websites. Many people keep a domain name in the hope of selling it for a lot of money one day, or simply to prevent someone else from owning it. In addition, many owners have not yet completed their website, or do not even want to publish one on the domain, e.g. because it is only used for email addresses.

My database contains 730,000 .hu domain names that have worked at some point, so I think I have sufficient data to determine the final numbers.

482 thousand working websites?

If we try to look up these hundreds of thousands of domain names, a little more than half of them will give us a sign of life — that is, if we entered these addresses into the browser, we would get some kind of answer from the web server in this case. Of course, a list of domain names of this magnitude can only be managed in an automated way, with the help of scripts.
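As an illustration of what such a script might look like, here is a minimal Python sketch that probes a list of domains in parallel and keeps those that give any HTTP answer. It is not the script used for the article; the function names, the User-Agent string, the timeout and the worker count are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPException
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def probe_http(domain, timeout=5.0):
    """Return the HTTP status code for http://<domain>/, or None if nothing answers."""
    # The User-Agent value is a made-up placeholder for this sketch.
    request = Request(f"http://{domain}/", headers={"User-Agent": "liveness-sketch/0.1"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # The server answered, just not with a success code: still a sign of life.
        return error.code
    except (URLError, HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, broken response, malformed name.
        return None


def find_live_domains(domains, probe=probe_http, workers=32):
    """Probe all domains in parallel and return {domain: status} for those that answered."""
    domains = list(domains)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statuses = list(pool.map(probe, domains))
    return {domain: status for domain, status in zip(domains, statuses) if status is not None}
```

The probe is passed in as a parameter so that the parallel loop can be tried out with a stand-in function before any real requests are sent.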

290 thousand websites with significant content?

Often, the web page returned will not contain significant information, only an error message or some default home page. Within this group, a few distinct cases can be singled out where there is no substantive content and therefore no independent website to speak of:

- 21,000 default CMS home pages: the owners have set up some kind of content management system for their domain but have not yet started to fill it with their own content, so the page is practically empty even though the engine behind it is ready to go (“Welcome to WordPress! This is the first entry” and the like).

- 11,000 parked domains: in such cases, there is no meaningful content on the site; usually, we only learn that the given domain name is for sale or rent. In addition, sometimes the same content is available on several domain names: I don’t think these should be counted as separate websites either.

- 6,000 domains using outdated technologies: e.g. websites built with frames, or old pages made with Macromedia/Adobe Flash, where the navigation relied exclusively on a technology that today’s desktop and mobile browsers no longer handle, so we can rightly consider them abandoned pages.
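A rough idea of how pages could be sorted into these groups automatically is sketched below in Python. The marker phrases and the category names are hypothetical examples invented for this sketch, not the rules behind the numbers above; a real classifier would need a much larger, tested set of markers.

```python
import re

# Hypothetical marker phrases; a real list would be tuned on crawled samples.
DEFAULT_CMS_MARKERS = (
    "welcome to wordpress",
    "just another wordpress site",
    "hello world!",
)
PARKED_MARKERS = (
    "domain is for sale",
    "this domain is parked",
    "domain eladó",
)
# Frames and Flash movies as signs of outdated technology.
OUTDATED_PATTERN = re.compile(r"<frameset\b|\.swf\b", re.IGNORECASE)


def classify_page(html):
    """Return a coarse label for the downloaded home page of a domain."""
    text = html.lower()
    if not text.strip():
        return "empty"
    if any(marker in text for marker in DEFAULT_CMS_MARKERS):
        return "default_cms"
    if any(marker in text for marker in PARKED_MARKERS):
        return "parked"
    if OUTDATED_PATTERN.search(html):
        return "outdated"
    return "content"
```

Simple substring checks like these are cheap enough to run over hundreds of thousands of pages, but they will misclassify pages that merely mention one of the phrases.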

250,000 Hungarian-language websites under the .hu domain?

Based on the method I used to compile the database that backs the pedia.hu website, roughly this is the number of Hungarian-language websites that can be found under the .hu domain.

To this we can add a few tens of thousands of domains under top-level domains other than .hu, primarily general endings such as .com or .eu, or the domain endings of neighboring countries. It is much more difficult to find these sites, so it is also more difficult to estimate their number. While we can safely assume that a website hosted under the .hu ccTLD is probably written in Hungarian or related to Hungary, it is highly questionable whether we should count certain websites as unique Hungarian sites, for example all Google domains, which are registered under many top-level domains and have a Hungarian interface.

So with this, we have reached a grey zone, where on one side there are non-Hungarian websites of Hungarian people, companies, and organizations, and on the other side foreign websites that are also available in Hungarian, often through low-quality automatic translations. And what if someone has Hungarian ancestors and therefore a typical family name, but nothing more, or if a site deals with a dog breed of Hungarian origin: should we consider these sites part of the Hungarian web? Here again, everything depends on how exactly we define what is Hungarian and what is a website.

Three hundred thousand Hungarian websites — is that all?

It is not possible to determine exactly how many Hungarian websites there are, since even discovering an active domain runs into many technical difficulties. Among the nearly half a million domains that are in use in one way or another, there will always be one that started yesterday, one that shut down yesterday, or one that was simply unavailable yesterday due to an error, and so is not included in the statistics. And of course, if we wanted to be very strict, we could label quite a few websites that have not been touched for years as inactive and therefore too obsolete to make it into the list.

However, it can be stated with great certainty that the number of active Hungarian websites is not in the millions, or even half a million. Taking the above into account, stating that there are three hundred thousand active Hungarian websites in total is a good approximation.

Of course, these three hundred thousand websites also differ greatly in size: many sites consist of only one web page, while others, e.g. the Hungarian-language Wikipedia, have more than half a million web pages, and this makes the estimation quite difficult.

Of course, if you like, you could also add to the total more web presences that do not meet the criteria I outlined above: for instance, the sites hosted on blog farms or even further sites with unique content that are hosted on subdomains, and ultimately you could think about counting Facebook pages too since many companies and organizations have solely a web presence on social media sites.

How accurate is this estimate?

As I mentioned, the database serving as a starting point is comparable in size to what the domain.hu statistics show. However, there is another way to determine what proportion of existing pages have been discovered, namely by systematically querying all technically possible domain names. For example, if we examine domain names with a length of 4 characters, then taking into account the 26 letters of the English alphabet, the 10 digits and the hyphen (which cannot be at the beginning or end), we get 36*37*37*36 = 1,774,224 variations.

Of these domain names, 8,600 were working domains in the database before this double check, and crawling through these more than one and a half million variations added only 500 more. So it is highly likely that if I had systematically checked the longer domain names as well, I would have found a similar proportion of undiscovered domains, yielding about 6% more websites. This would raise the total from 250,000 to 265,000, which would not significantly alter the final estimate of roughly 300,000.
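The arithmetic behind this check can be reproduced in a few lines of Python. The variable names are my own, and the extrapolation rests on the same assumption as the text: that longer domain names are missing from the database at about the same rate as the four-character ones.

```python
ALPHANUMERIC = 36  # a-z and 0-9, allowed in every position
WITH_HYPHEN = 37   # the same plus "-", allowed only in inner positions

# Four-character names: first and last character cannot be a hyphen.
four_char_names = ALPHANUMERIC * WITH_HYPHEN * WITH_HYPHEN * ALPHANUMERIC

known_before = 8_600  # working four-character domains already in the database
newly_found = 500     # additional ones found by the exhaustive check

undiscovered_share = newly_found / known_before  # about 0.058
rounded_share = round(undiscovered_share, 2)     # 0.06, i.e. roughly 6%

# Assumption: the same share is missing among longer names too.
corrected_total = round(250_000 * (1 + rounded_share))

print(four_char_names, rounded_share, corrected_total)
```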

If we want to get answers to questions such as how many websites exist in a given language, how many webshops there are, or even which are the most popular content management systems, we don’t necessarily have to start from scratch: we can call on the work of organizations that archive information on the web, scan the internet for security research, collect data for marketers about websites, or organize information either encyclopedically, or in the form of a searchable database.

Data from the archivers of the web

There are two important organizations that archive a significant part of the web. One is the well-known Internet Archive, and the other is Common Crawl, which is more of interest to a professional audience.

The Internet Archive continuously crawls and saves web pages found on the web, and also makes earlier versions of individual web pages available on an easy-to-use web interface. Its database therefore contains a large number of defunct domain names and web pages. Nevertheless, due to its large coverage it is a very useful tool, e.g. for those who aim to collect data from a certain segment of the web.

Common Crawl crawls the web roughly every two months, and publishes the collected web pages in a separate data collection. For this reason, a very large proportion of the collected data consists of web pages that are still operating, even though the overlap between the individual collections is not necessarily 100%, so it is worth working with results originating from multiple crawls. The raw data is also further processed in many places and forms, e.g. see Web Data Commons.

Both organizations allow you to retrieve an index of the web pages they collect in CDX format, which is a good starting point for further investigation, and Common Crawl also publishes domain-level databases.
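To give an idea of how such an index can be turned into a list of domains, here is a minimal Python sketch. It assumes the index is queried through a CDX server that accepts `url` and `output=json` parameters and answers with one JSON object per line containing a `url` field, which is how the Common Crawl index server behaves to my knowledge. The helper names are hypothetical, and the endpoint in the usage note is only an example that should be checked against the list of currently published crawls.

```python
import json
from urllib.parse import urlencode, urlsplit


def build_cdx_query(index_endpoint, url_pattern):
    """Build the query URL for a CDX index endpoint (assumed parameter names)."""
    return f"{index_endpoint}?{urlencode({'url': url_pattern, 'output': 'json'})}"


def hosts_from_cdx_lines(lines):
    """Collect distinct host names from CDX records given as JSON lines."""
    hosts = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        host = urlsplit(record["url"]).hostname
        if host:
            # Treat www.example.hu and example.hu as the same site.
            hosts.add(host[4:] if host.startswith("www.") else host)
    return hosts
```

A call such as `build_cdx_query("https://index.commoncrawl.org/CC-MAIN-2023-06-index", "*.hu")` would produce the query URL; the response body can then be fed line by line into `hosts_from_cdx_lines`. Large result sets are paginated by the server, which this sketch does not handle.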

A typical example of a local web archiving initiative is the Hungarian National Széchényi Library’s web archive aiming to archive the Hungarian web. Unfortunately, the harvested data is not available for further use due to legal reasons, unlike the above-mentioned web archiving projects.

Databases for Internet Security Research

When you want to know more about the web at scale, it is useful to get acquainted with cybersecurity professionals and their work. For instance, Project Sonar aims to scan all IP addresses and the services available on them, and to publish its results on the Rapid7 Open Data website. By scanning the servers and their ports that serve websites, we can get a lot of domain names as a byproduct. The data created by Project Sonar used to be publicly available, but nowadays it is only accessible for specific purposes and audiences.

Primarily for security research purposes, Censys also offers its data obtained with a similar “we scan everything” approach. This project not only gives researchers access to its databases, but makes its database searchable according to many parameters through an online search interface, so even if we cannot access all of their data, we can still gain insight into what information is collected there.

Another player in this league is netlas.io. On its intuitive search interface, you can search for any IP address or domain name, including all .hu domains, of which 645,724 were registered at the time of writing. You can get limited access to the data with a free tier or different paid subscriptions.

And if you want to scan everything yourself, it might be useful to familiarize yourself with the ZMap project.

SEO link databases

Search engine optimization professionals also need lots and lots of data about different websites. The most useful information for our purposes is the data about who is linking to whom. For instance, since Hungarian-language sites most often link to Hungarian-language sites, and the majority of external links to a specific Hungarian site will also be in Hungarian, this can also help a lot in discovering new websites — especially if, for example, we want to collect domains by language and not by domain name ending. Bigger databases are offered by, for example, Ahrefs, Majestic SEO or SEO Spyglass.

Collectors of encyclopedic data

By their nature, Wikipedia and OpenStreetMap do not store as many domain names or websites as the sources mentioned above. Their big advantage, however, is that we can find these URLs in a much more structured data environment. Last but not least, they allow you to download their raw database, so you don’t have to scrape their websites to get the data.

Search engine databases

The most up-to-date and largest amount of data is collected and made available by web search engines such as Google, Bing or Yandex. The only problem is that their services, unlike the above-mentioned search interface of netlas.io, were not designed so that querying their databases could directly benefit our research. Data can therefore only be extracted in roundabout ways: by figuring out which typical search queries find the most new and unknown domains, and then performing the searches slowly or distributing the queries over several places in order not to get banned in the process.

Social media platforms

Also, don’t forget that more and more information on the web is moving from the open web to the closed, walled gardens such as social media platforms. These platforms often refer to the open web, that is, to traditional websites hosted on their own domain names too, so these platforms could be an ideal basis for our discovery too. The problem in this case is that scraping these platforms is even more problematic, both from a technical and legal point of view.

More options

Like the real world, the web is also extremely diverse, so it is impossible to list all the possibilities through which we can explore the web, or a specific segment of it. In the above article, I have just scratched the surface by mentioning some of the larger databases according to their main types, definitely not trying to list all available service providers within these categories.

In addition to all of this, there are also many data sources of general or local interest that can help us, such as:

  • It might not help with all domain name endings, but e.g. the .hu registry regularly publishes the newly registered domain names on its website, so it is perhaps worth checking the websites of domain registries too.
  • Although link directories are out of fashion these days, in Hungary there are resources such as the lap.hu pages, from which we can easily extract relatively structured data.
  • Last but not least, it is possible to crawl the web without using any of the above data sources. Applying this method could also help us assess how much of the web we could discover when starting from data that originates from others’ databases.

Discovering the web from scratch

In the first step, we enumerate the characters that can be used in domain names [a-z0-9-] and combine them to generate all technically possible variations (the hyphen cannot be the first or last character): 36*36 + 36*37*36 + 36*37*37*36 + 36*37*37*37*36 + …

Even if we only comb through every domain name of at most five characters, we are already past 67 million possible combinations. At first this seems like a huge number, but since most of these domains are not in use, the answer that the domain does not exist comes back quickly, so we can go through them relatively fast.
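The enumeration can be sketched in Python as follows. The sketch assumes that names are at least two characters long and that the hyphen is forbidden only in the first and last position, so two-character names contribute 36*36 variations; further registry rules are ignored. The function names are my own.

```python
import string
from itertools import product

EDGE_CHARS = string.ascii_lowercase + string.digits  # 36 characters for first and last position
INNER_CHARS = EDGE_CHARS + "-"                       # 37 characters for inner positions


def count_names(length):
    """Number of possible names of the given length (assumed minimum: 2)."""
    if length < 2:
        raise ValueError("this sketch only handles names of at least 2 characters")
    return len(EDGE_CHARS) ** 2 * len(INNER_CHARS) ** (length - 2)


def generate_names(length):
    """Yield every possible name of the given length, one at a time."""
    if length < 2:
        raise ValueError("this sketch only handles names of at least 2 characters")
    for first in EDGE_CHARS:
        for middle in product(INNER_CHARS, repeat=length - 2):
            for last in EDGE_CHARS:
                yield first + "".join(middle) + last


total_up_to_five = sum(count_names(n) for n in range(2, 6))
print(total_up_to_five)
```

Because `generate_names` is a generator, the candidates never have to be held in memory all at once; each one can be handed to a lookup routine as it is produced.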

Once we are done with this step, we will, as a by-product, also have discovered all kinds of other domain names to which some of these short domain names redirect. However, most new domains longer than five characters will be discovered by parsing the web pages we just downloaded and checking where these sites link to.
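Extracting the linked domains from a downloaded page could look roughly like the following Python sketch, which uses only the standard library. The class and function names are hypothetical, and only plain `<a href>` links are considered; links built by scripts would be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the raw href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def host_of(url):
    """Return the host name of a URL, or None if it has none or is malformed."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        return None


def linked_domains(html, base_url, suffix=".hu"):
    """Return the external hosts with the given suffix that the page links to."""
    collector = LinkCollector()
    collector.feed(html)
    own_host = host_of(base_url)
    found = set()
    for href in collector.hrefs:
        try:
            absolute = urljoin(base_url, href)  # resolve relative links
        except ValueError:
            continue
        host = host_of(absolute)
        if host and host != own_host and host.endswith(suffix):
            found.add(host)
    return found
```

Every host returned this way can be added to the queue of domains to visit, which is how the crawl grows beyond the short names generated in the first step.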

If we already have a substantial amount of working domain names, including longer ones, we can continue scanning the permutations resulting in longer domain names by excluding from the combinations those that do not resemble potentially meaningful domain names at all, such as 4h6u3a.hu — for example based on the probability of two characters following each other in already known domain names, or checking the occurrences by character pairs, triplets, etc.
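One possible way to implement such a filter is a character-pair (bigram) model with add-one smoothing, sketched below in Python. This is only an illustration of the idea; the function names, the smoothing and the start and end markers are my own choices.

```python
import math
from collections import Counter

# 37 allowed characters plus the end marker "$" as possible next symbols.
NEXT_SYMBOLS = 38


def train_bigram_model(known_names):
    """Count how often each character follows another in known domain names."""
    pair_counts = Counter()
    first_counts = Counter()
    for name in known_names:
        padded = f"^{name}$"  # "^" marks the start, "$" the end of a name
        for a, b in zip(padded, padded[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts


def plausibility(name, model):
    """Average log-probability of the name's character pairs (higher is more plausible)."""
    pair_counts, first_counts = model
    padded = f"^{name}$"
    pairs = list(zip(padded, padded[1:]))
    total = 0.0
    for a, b in pairs:
        # Add-one smoothing so that unseen pairs get a small, non-zero probability.
        probability = (pair_counts[(a, b)] + 1) / (first_counts[a] + NEXT_SYMBOLS)
        total += math.log(probability)
    return total / len(pairs)
```

Candidates scoring below some threshold can then be skipped; where to put that threshold is a trade-off between scanning time and the risk of missing unusual but real domain names.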

Conclusion

Depending on our financial means and determination, we can work with many different data sources if we want to collect a large number of websites for some purpose, e.g. to answer a basic question like how many Hungarian websites exist. Some of them provide the necessary information without any further transformation; from other sources we have to extract the data, taking into consideration the peculiarities of the data collection process and the data format; and the most difficult case is when we have to write our own script to extract the publicly available information we are interested in from many web pages.

The choice of domain name determines the future of a website in the long term, so it is telling what words website owners consider so important that they include them in their domain name, which is the basis of their web presence.

If we examine the active .hu domain names, the order of popularity of the words used is as follows (top 200):

Hungarian | English
auto | car
shop | shop
buda | Buda
tech | tech
design | design
kert | garden
bolt | store
budapest | Budapest
iskola | school
studio | studio
hely | place
pest | Pest
ingatlan | property
szerviz | service
butor | furniture
sport | sport
foto | photo
klima | air conditioning
nagy | large
apartman | apartment
konyv | book
dent | dent
vendeghaz | guesthouse
magyar | Hungarian
online | online
pont | point
gyor | Győr
epit | build
hotel | hotel
park | park
info | info
home | home
market | market
hungary | Hungary
szent | holy
munka | work
media | media
pecs | Pécs
teto | roof
villa | villa
iroda | office
mester | master
ablak | window
sziget | island
balaton | Balaton
centrum | center
trans | trans
szeged | Szeged
vilag | world
gyogy | heal
debrecen | Debrecen
keres | search
mobil | mobile
virag | flower
profi | professional
euro | euro
otthon | home
fest | paint
gyar | factory
dekor | decor
alapitvany | foundation
center | center
land | land
team | team
szerel | repair
szep | beautiful
szalon | salon
panzio | bed and breakfast
jatek | game
klub | club
muhely | workshop
allas | job
duna | Danube
gumi | tire
masszazs | massage
marketing | marketing
tars | partner
inter | inter
mind | all
zold | green
digi | digi
aruhaz | warehouse
szallas | accommodation
eskuvo | wedding
arany | gold
ugyved | lawyer
feny | light
photo | photo
feher | white
suli | school
ovoda | kindergarten
kozmetika | cosmetics
etterem | restaurant
fold | Earth
star | star
group | group
motor | motor
consult | consult
webshop | webshop
konyvel | account
plus | plus
hang | voice
natur | natural
terv | plan
egyesulet | association
eger | mouse
kozpont | center
beauty | beauty
szabo | tailor
trade | trade
miskolc | Miskolc
club | club
baba | baby
coach | coach
kutya | dog
work | work
ipar | industry
allat | animal
ruha | dress
uzlet | business
green | green
ekszer | jewellery
orvos | doctor
trend | trend
kata | Kata
varos | city
pince | cellar
elektro | electro
kapu | gate
konyha | kitchen
technika | technique
egeszseg | health
service | service
zala | Zala
autosiskola | driving school
ajto | door
hegy | mountain
blog | blog
smart | smart
lakas | flat
film | movie
agro | agro
plan | plan
clean | clean
system | system
solar | solar
optika | optics
zene | music
uveg | glass
store | store
porta | porta
pizza | pizza
berles | rent
print | print
magazin | magazine
kovacs | blacksmith
gold | gold
digital | digital
garden | garden
pannon | pannon
tisza | Tisza
penz | money
joga | yoga
life | life
sopron | Sopron
patika | pharmacy
mese | tale
tamas | Thomas
csalad | family
pszicho | psycho
fitness | fitness
well | well
best | best
travel | travel
csilla | Csilla
city | city
alma | apple
mentes | rescue
kiraly | king
kerek | round
metal | metal
akademia | academy
reklam | advertising
peter | Peter
aqua | aqua
villany | electricity
okos | clever
partner | partner
farm | farm
oktat | educate
therm | therm
varazs | magic
video | video
olcso | cheap
anna | Anna
gyerek | child
beton | concrete
house | house
gazda | farmer
tanc | dance

An interesting result is that the word auto became the most popular, even overshadowing webshops. Most frequently, car dealerships, car service centres, car washes, and driving schools are behind such domains, but automation and similar topics also contribute to the word’s high occurrence.

Online ranking of cities

Larger cities and other geographical names also appear in the list. The order that shows the importance of these places, in other words, which places or areas are covered the most on the Hungarian web is the following: Buda, Budapest, Pest, Győr, Pécs, Balaton, Szeged, Debrecen, Danube, Eger, Miskolc, Zala, Pannon, Tisza, Sopron.

Naming trends

Browsing the list, we can also identify some words indicating trends in company naming and product/brand name selection, such as tech, studio, dent, centrum, trans, profi, center, land, team, star, consult, plus.

Own brand or legible domain name?

Caveat: the results do not clearly show the popularity of individual topics and business segments, since domain names that have their own brand, a fancy name or an abbreviation behind them did not, or could not influence the above result in the right direction.

Therefore, to state clearly what kind of popularity this list measures: here we see the choices of website owners who either could not or did not want to invest in launching a brand whose name is not understandable at first glance, or who chose names containing their most important keywords in order to make their domain name expressive, findable, and even search engine optimized.

Longest Hungarian domain names

In the “competition” of who has more keywords in the domain name, or more precisely in the “who has the longest domain name” competition, the following domain names (or the operating websites behind them) won:

nagy-balint-villanyszereles-zala-megye-es-vonzaskorzete.hu

(Bálint Nagy electrician Zala county and agglomeration)

egeszseg-studio-termeszetes-modszerek-az-egeszsegert.hu

(health studio natural methods for the health)

csoda-mester-eletero-boltja-az-orok-fiatalsaghoz.hu

(wonder master vitality shop for the eternal youth)

ikkk-itavaccs-kozosegi-kozlekedesi-kozpont.hu

(abbreviation abbreviation communal traffic centre)

tothvillanyszerelesesbiztonsagtechnikaikft.hu

(Tóth electrical installation safety technology ltd.)

orszagoskismotorfecskendoszerelobajnoksag.hu

(national small motorcycle injector assembly championship)

Outside the competition, but also worth mentioning, is the following domain, which currently displays content from another domain and so does not have its own unique content:

ezittaleghosszabbertelmesdomainmagyarorszagonmertmiertnelenneaz.hu.hu

(this is the longest meaningful domain in Hungary because why would not that be)

How did I get the results?

From the pedia.hu database, I queried the roughly 250,000 domain names that have significant, original, porn-free content.

Since it is not clear where the word boundaries are in most multi-word domain names, I wrote an algorithm that first sorts domain names (and the words separated out of hyphenated domain names) in descending order by length. In the next step, starting with the longest words, it searches for these words at the beginning or end of the domain names, and finally it repeats the process with the remainder left after removing the words identified.
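The following Python sketch shows one way the described procedure could be read. It is my own reconstruction from the description above, not the original script: the minimum word length of three characters, the function names and the greedy longest-match rule are all assumptions.

```python
from collections import Counter


def build_vocabulary(labels, min_length=3):
    """Known words: whole names and hyphen-separated parts, longest first."""
    words = set()
    for label in labels:
        for part in label.split("-"):
            if len(part) >= min_length:
                words.add(part)
    return sorted(words, key=lambda word: (-len(word), word))


def segment(label, vocabulary):
    """Greedily strip the longest known word from the start or end, then repeat."""
    words = []
    for part in label.split("-"):
        remainder = part
        while remainder:
            # The vocabulary is sorted by length, so the first hit is the longest one.
            match = next(
                (w for w in vocabulary
                 if len(w) < len(remainder)
                 and (remainder.startswith(w) or remainder.endswith(w))),
                None,
            )
            if match is None:
                words.append(remainder)  # nothing left to split off
                break
            words.append(match)
            if remainder.startswith(match):
                remainder = remainder[len(match):]
            else:
                remainder = remainder[:-len(match)]
    return words


def count_words(labels):
    """Count how many times each word occurs across all domain names."""
    vocabulary = build_vocabulary(labels)
    counts = Counter()
    for label in labels:
        counts.update(segment(label, vocabulary))
    return counts
```

For example, given the names auto, bolt, autobolt and kert-bolt, the name autobolt is split into auto and bolt because both occur as names of their own. The linear scan over the vocabulary is slow for a quarter of a million names; indexing the words by their first and last characters would be an obvious speed-up.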