I’ve been quiet a minute, but never not working. Lately, I want a homelab replacement for Pocket allowing me to easily bookmark content from desktop or mobile, benefit from AI summarization and auto-tagging, and sync to my mobile device for offline (airplane mode) reading later. No single tool I’ve found does this automatically today, so I glued some pieces together.
This setup uses Karakeep as the bookmark capture tool and Wallabag as the offline reading tool. Both run on a homelab in proxmox LXCs, accessible away from home only via Tailscale.
The goal is not just to copy URLs from Karakeep to Wallabag. That works for many sites, but fails when Wallabag tries to fetch a site that blocks server-side readers with Cloudflare or another bot check.
Instead, this method sends Wallabag the content Karakeep already archived.
Desired flow
1. Bookmark saved in Karakeep
2. Karakeep crawls and archives the page
3. Sync script reads Karakeep's archived HTML asset
4. Sync script posts URL + title + archived HTML to Wallabag
5. Wallabag stores the article for offline reading
Why this exists
Wallabag is good for offline reading, but its normal API flow is URL-based:
1. POST URL to Wallabag
2. Wallabag fetches the page itself
That can fail on sites that present bot checks, JavaScript challenges, or Cloudflare interstitials.
Karakeep, especially when saving through the browser extension, may already have a usable archived copy of the page. Karakeep exposes that saved content as an asset. The sync script uses that asset instead of asking Wallabag to fetch the page again.
Assumptions
This example assumes:
Karakeep URL: https://karakeep.example.internal
Wallabag URL: http://wallabag.internal:8000
Script path: /opt/karakeep-wallabag-sync/sync.py
Env file: /etc/karakeep-wallabag-sync.env
State file: /var/lib/karakeep-wallabag-sync/synced.json
Replace these with your own values.
For internal homelab use, it may be simpler to use direct internal HTTP URLs between containers instead of routing through a reverse proxy with internal TLS certificates.
1. Create API credentials
Karakeep
Create a Karakeep API key from the Karakeep UI.
Example placeholder:
KARAKEEP_API_KEY=PASTE_KARAKEEP_API_KEY_HERE
Wallabag
Create a Wallabag API client from Wallabag’s API client management page.
You need:
WALLABAG_CLIENT_ID=PASTE_WALLABAG_CLIENT_ID_HERE
WALLABAG_CLIENT_SECRET=PASTE_WALLABAG_CLIENT_SECRET_HERE
WALLABAG_USERNAME=PASTE_WALLABAG_USERNAME_HERE
WALLABAG_PASSWORD=PASTE_WALLABAG_PASSWORD_HERE
Wallabag’s API uses OAuth password grant in this setup, so the script needs both the API client credentials and the Wallabag user credentials.
2. Install dependencies
Run this on the machine or container where the sync script will live.
In this example, the script runs inside the Karakeep LXC/container.
mkdir -p /opt/karakeep-wallabag-sync
mkdir -p /var/lib/karakeep-wallabag-sync
cd /opt/karakeep-wallabag-sync
python3 -m venv .venv
. .venv/bin/activate
pip install requests
3.Create the environment file
Create:
nano /etc/karakeep-wallabag-sync.env
Example:
KARAKEEP_BASE_URL=https://karakeep.example.internal
KARAKEEP_API_KEY=PASTE_KARAKEEP_API_KEY_HERE
WALLABAG_BASE_URL=http://wallabag.internal:8000
WALLABAG_CLIENT_ID=PASTE_WALLABAG_CLIENT_ID_HERE
WALLABAG_CLIENT_SECRET=PASTE_WALLABAG_CLIENT_SECRET_HERE
WALLABAG_USERNAME=PASTE_WALLABAG_USERNAME_HERE
WALLABAG_PASSWORD=PASTE_WALLABAG_PASSWORD_HERE
STATE_FILE=/var/lib/karakeep-wallabag-sync/synced.json
# Wait at least this long after bookmark creation before syncing.
# This avoids racing Karakeep before its archive assets are ready.
MIN_BOOKMARK_AGE_SECONDS=120
# Ignore tiny archive assets.
MIN_ARCHIVE_CHARS=500
Lock it down:
chmod 600 /etc/karakeep-wallabag-sync.env
4. Create the sync script
Create:
nano /opt/karakeep-wallabag-sync/sync.py
Paste:
#!/usr/bin/env python3
import json
import os
import sys
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import requests
# ----------------------------
# Config from environment file
# ----------------------------
KARAKEEP_BASE_URL = os.environ["KARAKEEP_BASE_URL"].rstrip("/")
KARAKEEP_API_KEY = os.environ["KARAKEEP_API_KEY"]
WALLABAG_BASE_URL = os.environ["WALLABAG_BASE_URL"].rstrip("/")
WALLABAG_CLIENT_ID = os.environ["WALLABAG_CLIENT_ID"]
WALLABAG_CLIENT_SECRET = os.environ["WALLABAG_CLIENT_SECRET"]
WALLABAG_USERNAME = os.environ["WALLABAG_USERNAME"]
WALLABAG_PASSWORD = os.environ["WALLABAG_PASSWORD"]
STATE_FILE = Path(
os.environ.get(
"STATE_FILE",
"/var/lib/karakeep-wallabag-sync/synced.json",
)
)
MIN_BOOKMARK_AGE_SECONDS = int(os.environ.get("MIN_BOOKMARK_AGE_SECONDS", "120"))
MIN_ARCHIVE_CHARS = int(os.environ.get("MIN_ARCHIVE_CHARS", "500"))
KARAKEEP_HEADERS = {
"Authorization": f"Bearer {KARAKEEP_API_KEY}",
"Accept": "application/json",
}
# ----------------------------
# State file
# ----------------------------
def load_state() -> set[str]:
if not STATE_FILE.exists():
return set()
try:
data = json.loads(STATE_FILE.read_text())
except json.JSONDecodeError:
raise RuntimeError(f"State file is not valid JSON: {STATE_FILE}")
if not isinstance(data, list):
raise RuntimeError(f"State file should contain a JSON list: {STATE_FILE}")
return set(str(x) for x in data)
def save_state(synced_urls: set[str]) -> None:
STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
tmp = STATE_FILE.with_suffix(".tmp")
tmp.write_text(json.dumps(sorted(synced_urls), indent=2))
tmp.replace(STATE_FILE)
# ----------------------------
# Date helpers
# ----------------------------
def parse_karakeep_datetime(value: Any) -> datetime | None:
if not isinstance(value, str) or not value.strip():
return None
raw = value.strip()
if raw.endswith("Z"):
raw = raw[:-1] + "+00:00"
try:
return datetime.fromisoformat(raw)
except ValueError:
return None
def bookmark_age_seconds(bookmark: dict[str, Any]) -> float | None:
created_at = parse_karakeep_datetime(bookmark.get("createdAt"))
if not created_at:
return None
if created_at.tzinfo is None:
created_at = created_at.replace(tzinfo=timezone.utc)
return (datetime.now(timezone.utc) - created_at.astimezone(timezone.utc)).total_seconds()
# ----------------------------
# Encoding helpers
# ----------------------------
def mojibake_score(text: str) -> int:
"""
Higher score means text looks more likely to contain mojibake.
Example of mojibake:
let’s -> letâ??s
"""
bad_markers = [
"â", "Â", "Ã", "?",
"\x80", "\x81", "\x82", "\x83", "\x84", "\x85", "\x86", "\x87",
"\x88", "\x89", "\x8a", "\x8b", "\x8c", "\x8d", "\x8e", "\x8f",
"\x90", "\x91", "\x92", "\x93", "\x94", "\x95", "\x96", "\x97",
"\x98", "\x99", "\x9a", "\x9b", "\x9c", "\x9d", "\x9e", "\x9f",
]
return sum(text.count(marker) for marker in bad_markers)
def repair_mojibake(text: str) -> str:
"""
Repair common UTF-8-as-Latin1/Windows-1252 mojibake.
"""
original_score = mojibake_score(text)
if original_score == 0:
return text
candidates = [text]
for encoding in ("latin1", "cp1252"):
try:
candidates.append(text.encode(encoding).decode("utf-8"))
except UnicodeError:
pass
best = min(candidates, key=mojibake_score)
if mojibake_score(best) < original_score:
print(f" Repaired mojibake: score {original_score} -> {mojibake_score(best)}")
return best
return text
def decode_response_text(response: requests.Response) -> str:
"""
Decode Karakeep archive asset content.
requests may guess ISO-8859-1 for text/html without charset.
Karakeep HTML assets are usually UTF-8, so try UTF-8 first.
"""
raw = response.content
for encoding in ("utf-8-sig", "utf-8"):
try:
return repair_mojibake(raw.decode(encoding).strip())
except UnicodeDecodeError:
pass
guessed = response.encoding or response.apparent_encoding or "utf-8"
return repair_mojibake(raw.decode(guessed, errors="replace").strip())
# ----------------------------
# Karakeep API
# ----------------------------
def karakeep_get(path: str, params: dict[str, Any] | None = None) -> Any:
response = requests.get(
f"{KARAKEEP_BASE_URL}/api/v1{path}",
headers=KARAKEEP_HEADERS,
params=params,
timeout=60,
)
response.raise_for_status()
return response.json()
def karakeep_get_raw(path: str) -> requests.Response:
response = requests.get(
f"{KARAKEEP_BASE_URL}/api/v1{path}",
headers=KARAKEEP_HEADERS,
timeout=120,
)
response.raise_for_status()
return response
def extract_items(data: Any) -> list[dict[str, Any]]:
if isinstance(data, list):
return [x for x in data if isinstance(x, dict)]
if not isinstance(data, dict):
return []
for key in ("bookmarks", "items", "data", "results"):
value = data.get(key)
if isinstance(value, list):
return [x for x in value if isinstance(x, dict)]
return []
def get_next_cursor(data: Any) -> str | None:
if not isinstance(data, dict):
return None
value = (
data.get("nextCursor")
or data.get("next_cursor")
or data.get("cursor")
)
return value if isinstance(value, str) and value.strip() else None
def get_all_karakeep_bookmarks() -> list[dict[str, Any]]:
bookmarks: list[dict[str, Any]] = []
cursor = None
while True:
params: dict[str, Any] = {"limit": 100}
if cursor:
params["cursor"] = cursor
data = karakeep_get("/bookmarks", params=params)
bookmarks.extend(extract_items(data))
cursor = get_next_cursor(data)
if not cursor:
break
return bookmarks
def get_full_karakeep_bookmark(bookmark_id: str) -> dict[str, Any]:
data = karakeep_get(f"/bookmarks/{bookmark_id}")
if not isinstance(data, dict):
raise RuntimeError(f"Unexpected bookmark response for {bookmark_id}")
return data
def extract_url(bookmark: dict[str, Any]) -> str | None:
candidates = [
bookmark.get("url"),
bookmark.get("sourceUrl"),
bookmark.get("canonicalUrl"),
]
content = bookmark.get("content")
if isinstance(content, dict):
candidates.extend(
[
content.get("url"),
content.get("sourceUrl"),
content.get("canonicalUrl"),
]
)
link = bookmark.get("link")
if isinstance(link, dict):
candidates.extend(
[
link.get("url"),
link.get("sourceUrl"),
link.get("canonicalUrl"),
]
)
for url in candidates:
if isinstance(url, str) and url.startswith(("http://", "https://")):
return url.strip()
return None
def extract_title(bookmark: dict[str, Any]) -> str | None:
candidates = [
bookmark.get("title"),
bookmark.get("name"),
]
content = bookmark.get("content")
if isinstance(content, dict):
candidates.extend(
[
content.get("title"),
content.get("name"),
]
)
link = bookmark.get("link")
if isinstance(link, dict):
candidates.extend(
[
link.get("title"),
link.get("name"),
]
)
for value in candidates:
if isinstance(value, str) and value.strip():
return repair_mojibake(value.strip())
return None
def crawl_status(bookmark: dict[str, Any]) -> str | None:
content = bookmark.get("content")
if not isinstance(content, dict):
return None
value = content.get("crawlStatus")
return value if isinstance(value, str) else None
def find_archive_asset_id(bookmark: dict[str, Any]) -> tuple[str, str] | tuple[None, None]:
"""
Prefer Karakeep's extracted readable HTML asset first.
Priority:
1. linkHtmlContent
2. fullPageArchive
"""
assets = bookmark.get("assets")
if isinstance(assets, list):
for asset in assets:
if not isinstance(asset, dict):
continue
if asset.get("assetType") == "linkHtmlContent":
asset_id = asset.get("id")
if isinstance(asset_id, str) and asset_id.strip():
return asset_id.strip(), "linkHtmlContent"
content = bookmark.get("content")
if isinstance(content, dict):
asset_id = content.get("fullPageArchiveAssetId")
if isinstance(asset_id, str) and asset_id.strip():
return asset_id.strip(), "fullPageArchive"
if isinstance(assets, list):
for asset in assets:
if not isinstance(asset, dict):
continue
if asset.get("assetType") == "fullPageArchive":
asset_id = asset.get("id")
if isinstance(asset_id, str) and asset_id.strip():
return asset_id.strip(), "fullPageArchive"
return None, None
def download_karakeep_archive_content(asset_id: str, asset_type: str) -> str | None:
response = karakeep_get_raw(f"/assets/{asset_id}")
content_type = response.headers.get("Content-Type", "unknown")
text = decode_response_text(response)
if len(text) < MIN_ARCHIVE_CHARS:
print(
f" Archive asset too small; skipping for now: "
f"type={asset_type}, id={asset_id}, chars={len(text)}, minimum={MIN_ARCHIVE_CHARS}"
)
return None
if "Just a moment" in text[:3000]:
print(
f" WARNING: archive asset appears to contain a bot-check page: "
f"type={asset_type}, id={asset_id}"
)
print(f" Using Karakeep asset type={asset_type}, id={asset_id}")
print(f" Downloaded archive asset Content-Type={content_type}, chars={len(text)}")
print(f" Archive preview: {text[:250].replace(chr(10), ' ').replace(chr(13), ' ')}")
return text
def extract_inline_archived_content(bookmark: dict[str, Any]) -> str | None:
"""
Only use true inline HTML/text fields.
Do not use content.description here; that is metadata/snippet, not archive content.
"""
candidates = [
bookmark.get("html"),
bookmark.get("contentHtml"),
bookmark.get("text"),
bookmark.get("contentText"),
bookmark.get("fullText"),
]
content = bookmark.get("content")
if isinstance(content, dict):
candidates.extend(
[
content.get("html"),
content.get("htmlContent"),
content.get("contentHtml"),
content.get("text"),
content.get("contentText"),
content.get("fullText"),
]
)
for value in candidates:
if isinstance(value, str):
cleaned = repair_mojibake(value.strip())
if len(cleaned) >= MIN_ARCHIVE_CHARS:
return cleaned
return None
def extract_archived_content(bookmark: dict[str, Any]) -> str | None:
inline = extract_inline_archived_content(bookmark)
if inline:
print(f" Using inline Karakeep archived content, chars={len(inline)}")
print(f" Inline preview: {inline[:250].replace(chr(10), ' ').replace(chr(13), ' ')}")
return inline
asset_id, asset_type = find_archive_asset_id(bookmark)
if asset_id and asset_type:
return download_karakeep_archive_content(asset_id, asset_type)
return None
def should_wait_for_karakeep(bookmark: dict[str, Any]) -> tuple[bool, str]:
age = bookmark_age_seconds(bookmark)
if age is not None and age < MIN_BOOKMARK_AGE_SECONDS:
return True, f"bookmark is too new: age={age:.1f}s, minimum={MIN_BOOKMARK_AGE_SECONDS}s"
status = crawl_status(bookmark)
if status and status.lower() not in {"success", "failed"}:
return True, f"crawlStatus is not finished: {status}"
return False, ""
# ----------------------------
# Wallabag API
# ----------------------------
def get_wallabag_token() -> str:
response = requests.post(
f"{WALLABAG_BASE_URL}/oauth/v2/token",
data={
"grant_type": "password",
"client_id": WALLABAG_CLIENT_ID,
"client_secret": WALLABAG_CLIENT_SECRET,
"username": WALLABAG_USERNAME,
"password": WALLABAG_PASSWORD,
},
timeout=60,
)
response.raise_for_status()
data = response.json()
token = data.get("access_token")
if not isinstance(token, str) or not token:
raise RuntimeError(f"Wallabag token response did not include access_token: {data}")
return token
def wallabag_add_entry(
token: str,
url: str,
title: str | None,
content: str,
) -> None:
content = repair_mojibake(content)
data = {
"url": url,
"content": content,
"tags": "karakeep,auto-import,karakeep-archive",
}
if title:
data["title"] = repair_mojibake(title)
print(f" Sending to Wallabag: url={url}")
print(f" Sending title: {title or '(none)'}")
print(f" Sending content chars: {len(content)}")
response = requests.post(
f"{WALLABAG_BASE_URL}/api/entries.json",
headers={
"Authorization": f"Bearer {token}",
"Accept": "application/json",
},
data=data,
timeout=120,
)
if response.status_code not in (200, 201):
raise RuntimeError(
f"Wallabag failed for {url}: {response.status_code} {response.text}"
)
try:
result = response.json()
wallabag_id = result.get("id")
stored_title = result.get("title")
stored_content = result.get("content")
print(f" Wallabag response status: {response.status_code}")
print(f" Wallabag entry id: {wallabag_id}")
print(f" Wallabag stored title: {stored_title}")
if isinstance(stored_content, str):
stored_content = repair_mojibake(stored_content)
print(f" Wallabag response content chars: {len(stored_content)}")
print(f" Wallabag response preview: {stored_content[:250].replace(chr(10), ' ').replace(chr(13), ' ')}")
else:
print(" Wallabag response content chars: unavailable")
except Exception:
print(f" Wallabag response status: {response.status_code}")
print(" Wallabag response was not parsed as JSON.")
# ----------------------------
# Main
# ----------------------------
def main() -> int:
synced = load_state()
print("Loading Karakeep bookmarks...")
bookmarks = get_all_karakeep_bookmarks()
print(f"Found {len(bookmarks)} Karakeep bookmarks.")
print("Getting Wallabag token...")
token = get_wallabag_token()
added = 0
skipped_synced = 0
skipped_too_new = 0
skipped_not_ready = 0
skipped_no_url = 0
skipped_no_archive = 0
failed = 0
for bookmark in bookmarks:
bookmark_id = bookmark.get("id")
url = extract_url(bookmark)
if not isinstance(bookmark_id, str) or not bookmark_id.strip():
continue
if not url:
skipped_no_url += 1
continue
if url in synced:
skipped_synced += 1
continue
print()
print(f"Processing bookmark id={bookmark_id}")
print(f"URL: {url}")
try:
full_bookmark = get_full_karakeep_bookmark(bookmark_id)
should_wait, reason = should_wait_for_karakeep(full_bookmark)
if should_wait:
if "too new" in reason:
skipped_too_new += 1
else:
skipped_not_ready += 1
print(f" Result: skipping for now; {reason}")
continue
title = extract_title(full_bookmark) or extract_title(bookmark)
content = extract_archived_content(full_bookmark)
if not content:
skipped_no_archive += 1
print(" Result: skipping for now; no usable Karakeep archive content found")
continue
print(" Result: adding with Karakeep archive content")
wallabag_add_entry(
token=token,
url=url,
title=title,
content=content,
)
synced.add(url)
added += 1
except Exception as exc:
failed += 1
print(f"FAILED: {url}: {exc}", file=sys.stderr)
save_state(synced)
print()
print(
"Done. "
f"bookmarks={len(bookmarks)}, "
f"added={added}, "
f"skipped_synced={skipped_synced}, "
f"skipped_too_new={skipped_too_new}, "
f"skipped_not_ready={skipped_not_ready}, "
f"skipped_no_url={skipped_no_url}, "
f"skipped_no_archive={skipped_no_archive}, "
f"failed={failed}"
)
return 1 if failed else 0
if __name__ == "__main__":
raise SystemExit(main())
Make it executable:
chmod +x /opt/karakeep-wallabag-sync/sync.py
5. Test manually
Load the environment file and run the script:
cd /opt/karakeep-wallabag-sync
set -a
. /etc/karakeep-wallabag-sync.env
set +a
. .venv/bin/activate
python -m py_compile sync.py
./sync.py
Expected successful output includes lines like:
Using Karakeep asset type=linkHtmlContent
Downloaded archive asset Content-Type=text/html, chars=...
Sending content chars: ...
Wallabag response status: 200
If a bookmark was created too recently, the script should skip it:
Result: skipping for now; bookmark is too new
If Karakeep has not created an archive asset yet, the script should skip it:
Result: skipping for now; no usable Karakeep archive content found
The script should not send URL-only entries to Wallabag.
6. Automate with systemd
Create the service:
cat >/etc/systemd/system/karakeep-wallabag-sync.service <<'EOF'
[Unit]
Description=Sync Karakeep bookmarks to Wallabag using Karakeep archived content
Wants=network-online.target
After=network-online.target
[Service]
Type=oneshot
EnvironmentFile=/etc/karakeep-wallabag-sync.env
WorkingDirectory=/opt/karakeep-wallabag-sync
ExecStart=/opt/karakeep-wallabag-sync/.venv/bin/python /opt/karakeep-wallabag-sync/sync.py
StandardOutput=journal
StandardError=journal
EOF
Create the timer:
cat >/etc/systemd/system/karakeep-wallabag-sync.timer <<'EOF'
[Unit]
Description=Run Karakeep to Wallabag sync every 10 minutes
[Timer]
OnBootSec=2min
OnUnitActiveSec=10min
Unit=karakeep-wallabag-sync.service
Persistent=true
[Install]
WantedBy=timers.target
EOF
Enable it:
systemctl daemon-reload
systemctl enable --now karakeep-wallabag-sync.timer
Check the timer:
systemctl status karakeep-wallabag-sync.timer
systemctl list-timers | grep karakeep
Force a run:
systemctl start karakeep-wallabag-sync.service
View logs:
journalctl -u karakeep-wallabag-sync.service -n 150 --no-pager
Follow logs live:
journalctl -u karakeep-wallabag-sync.service -f
7. How duplicate prevention works
The script keeps a local JSON file of URLs already synced to Wallabag:
/var/lib/karakeep-wallabag-sync/synced.json
A URL is added to that file only after Wallabag accepts the entry.
This matters because skipped bookmarks should be retried later. For example:
Bookmark too new - skipped, not marked synced
No archive yet - skipped, not marked synced
Wallabag API failure - failed, not marked synced
Wallabag success - marked synced
To force a re-test of a URL:
- Delete the entry from Wallabag.
- Remove the URL from
synced.json. - Run the script again.
8. Race-condition protection
A bookmark can appear in Karakeep before its archive assets are ready.
This script avoids that by using two guards.
First, it waits for a minimum bookmark age:
MIN_BOOKMARK_AGE_SECONDS=120
Second, it checks Karakeep’s crawl status and skips unfinished bookmarks.
The goal is to avoid this bad sequence:
1. Bookmark created
2. Sync runs immediately
3. No Karakeep archive exists yet
4. Script sends URL-only to Wallabag
5. Wallabag fetches blocked page
6. Bad article is stored
Instead, the script does this:
1. Bookmark created
2. Sync runs too early
3. Script skips it
4. Next timer run retries
5. Karakeep archive exists
6. Script sends archived content to Wallabag
9. Encoding repair
Some pages may contain mojibake such as:
letâ??s
instead of:
let’s
The script attempts to repair common UTF-8 decoded as Latin-1 or Windows-1252 before sending content to Wallabag.
This is handled by:
repair_mojibake(...)
It is not a universal character-encoding solution, but it handles the common apostrophe/quote corruption case.
10. Summary
The finished setup uses:
Karakeep as the capture/archive source
Wallabag as the offline reader
Python as the bridge
systemd timer as the scheduler
The important behavior is that the script sends Karakeep’s archived HTML content to Wallabag, not just the URL.
That lets Wallabag store readable articles even when Wallabag’s own fetcher would be blocked by the site.