A lot more to go, but this is the core. Need to think about how to test the queue handlers.
domain64
domain64 is a BIGINT (or 64-bit) type that can be used to encode all domains we are likely to encounter. It represents well as JSonnet/JSON, and can be used in partitioning database tables easily.
what is it
To encode all of the TLDs, domains, and subdomains we will encounter, we'll use a domain64 encoding. It maps the entire URL space into a single, 64-bit number (or, BIGSERIAL in Postgres).
packet-beta
0-7: "FF | TLD"
8-31: "FFFFFF | Domain"
32-39: "FF | Subdomain"
40-63: "FFFFFF | Path"
FF:FFFFFF:FF:FFFFFF
or
tld:domain:subdomain:path
or
com:jadud:www:teaching:berea
can be indexed/partitioned uniquely.
This lets us track
- 255 (#FF) TLDs
- 16,777,216 (#FFFFFF) domains under each TLD
- 255 (#FF) subdomains under each domain
- 16,777,216 (#FFFFFF) paths on a given domain
what that means
There are only around 10 TLDs that make up the majority of all sites on the internet. The search engine maxes out at tracking 256 unique TLDs (#00-#FF).
Each TLD can hold up to 16M unique sites. There are 302M .com domains, meaning , 36M .cn, and 20M .org. Again, this is for a "personal" search engine, and it is not intended to scale to handling all of the internet. Handling ~ 5% of .com (or 75% of .org) is just fine.
Under a domain, it is possible to uniquely partition off 255 subdomains (where 00 is "no subdomain").
Paths can be indexed uniquely, up to 16M per subdomain.
example
01:000001:00:000000 com.jadud
01:000001:01:000000 gov.jadud.research
01:000001:02:000000 gov.jadud.teaching
01:000001:02:000001 gov.jadud.teaching/olin
01:000001:02:000002 gov.jadud.teaching/berea
| tld | domain | sub | path | hex | dec |
|---|---|---|---|---|---|
| com | jadud | _ | _ | #x0100000100000000 | 72057598332895232 |
| com | jadud | research | _ | #x0100000101000000 | 72057598332895488 |
| com | jadud | teaching | _ | #x0100000102000000 | 72057598366449664 |
| com | jadud | teaching | olin | #x0100000102000001 | 72057598366449665 |
| com | jadud | teaching | berea | #x0100000102000002 | 72057598366449666 |
for partitioning
On a table that contains a domain64 value, we can partition based on numeric ranges very efficiently.
CREATE TABLE comjadud PARTITION OF com
FOR VALUES FROM (0x0100000100000000) TO (0x01000001FFFFFFFF);
Or
CREATE TABLE comjadudresearch PARTITION OF com
FOR VALUES FROM (0x0100000101000000) TO (0xx0100000101FFFFFF);
As Jsonnet/JSON
Jsonnet will naturally sort by the hex key values.
{
"01": {
"name": "com",
"children": {
"00000001": {
"name": "jadud",
"children": {
"01": "research",
"02": "teaching",
}
}
}
},
"02": {
"name": "org",
"children": {
...
}
}
}