Core structure

This commit is contained in:
Matt Jadud
2025-11-29 17:05:21 -05:00
parent b80e2421f1
commit 0fbf88101f
4 changed files with 158 additions and 3 deletions

114
internal/domain64/README.md Normal file
View File

@@ -0,0 +1,114 @@
# domain64
`domain64` is a BIGINT (or 64-bit) type that can be used to encode all domains we are likely to encounter. It represents well as JSonnet/JSON, and can be used in partitioning database tables easily.
## what is it
To encode all of the TLDs, domains, and subdomains we will encounter, we'll use a `domain64` encoding. It maps the entire URL space into a single, 64-bit number (or, `BIGSERIAL` in Postgres).
```mermaid
packet-beta
0-7: "FF | TLD"
8-31: "FFFFFF | Domain"
32-39: "FF | Subdomain"
40-63: "FFFFFF | Path"
```
```
FF:FFFFFF:FF:FFFFFF
```
or
```
tld:domain:subdomain:path
```
or
```
com:jadud:www:teaching:berea
```
can be indexed/partitioned uniquely.
This lets us track
* 255 (#FF) TLDs
* 16,777,216 (#FFFFFF) domains under each TLD
* 255 (#FF) subdomains under each domain
* 16,777,216 (#FFFFFF) paths on a given domain
## what that means
There are only around 10 TLDs that make up the majority of all sites on the internet. The search engine maxes out at tracking 256 unique TLDs (#00-#FF).
Each TLD can hold up to 16M unique sites. There are 302M `.com` domains, meaning , 36M `.cn`, and 20M `.org`. Again, this is for a "personal" search engine, and it is not intended to scale to handling all of the internet. Handling ~ 5% of `.com` (or 75% of `.org`) is *just fine*.
Under a domain, it is possible to uniquely partition off 255 subdomains (where `00` is "no subdomain").
Paths can be indexed uniquely, up to 16M per subdomain.
## example
```
01:000001:00:000000 com.jadud
01:000001:01:000000 gov.jadud.research
01:000001:02:000000 gov.jadud.teaching
01:000001:02:000001 gov.jadud.teaching/olin
01:000001:02:000002 gov.jadud.teaching/berea
```
| tld | domain | sub | path | hex | dec |
| --- | --- | --- | --- | --- | --- |
| com | jadud | _ | _ | #x0100000100000000 | 72057598332895232 |
| com | jadud | research | _ | #x0100000101000000 | 72057598332895488 |
| com | jadud | teaching | _ | #x0100000102000000 | 72057598366449664 |
| com | jadud | teaching | olin | #x0100000102000001 | 72057598366449665 |
| com | jadud | teaching | berea | #x0100000102000002 | 72057598366449666 |
## for partitioning
On a table that contains a `domain64` value, we can partition based on numeric ranges very efficiently.
```sql
CREATE TABLE comjadud PARTITION OF com
FOR VALUES FROM (0x0100000100000000) TO (0x01000001FFFFFFFF);
```
Or
```sql
CREATE TABLE comjadudresearch PARTITION OF com
FOR VALUES FROM (0x0100000101000000) TO (0xx0100000101FFFFFF);
```
## As Jsonnet/JSON
Jsonnet will naturally sort by the hex key values.
```
{
"01": {
"name": "com",
"children": {
"00000001": {
"name": "jadud",
"children": {
"01": "research",
"02": "teaching",
}
}
}
},
"02": {
"name": "org",
"children": {
...
}
}
}
```