-
The topic title says it all.
Replies: 3 comments 1 reply
-
Are you adding a URL to a page that contains links pointing to .zip files and seeing them download? Or are you adding URLs that point to zip files directly? Also, are you using [...]? Please post the full verbatim output of [...].
-
My approach is to use the ArchiveBox Exporter browser extensions, which work again with the dev version. I have configured the extension with the "blocklist" approach, which means it archives everything as long as the domain is not ignored. Here is my configuration (0.7.1):
ArchiveBox v0.7.1+editable BUILD_TIME=2023-12-18 06:57:51 1702882671
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-5.16.0-0.bpo.4-amd64-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=0:0 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=ripgrep LDAP=False
[i] Dependency versions:
√ PYTHON_BINARY v3.11.7 valid /usr/local/bin/python3.11
√ SQLITE_BINARY v2.6.0 valid /usr/local/lib/python3.11/sqlite3/dbapi2.py
√ DJANGO_BINARY v3.1.14 valid /usr/local/lib/python3.11/site-packages/django/__init__.py
√ ARCHIVEBOX_BINARY v0.7.1 valid /usr/local/bin/archivebox
√ CURL_BINARY v8.4.0 valid /usr/bin/curl
√ WGET_BINARY v1.21.3 valid /usr/bin/wget
- NODE_BINARY - disabled /usr/bin/node
- SINGLEFILE_BINARY - disabled /app/node_modules/single-file-cli/single-file
- READABILITY_BINARY - disabled /app/node_modules/readability-extractor/readability-extractor
- MERCURY_BINARY - disabled /app/node_modules/@postlight/parser/cli.js
- GIT_BINARY - disabled /usr/bin/git
- YOUTUBEDL_BINARY - disabled /usr/local/bin/yt-dlp
- CHROME_BINARY - disabled /usr/bin/chromium-browser
√ RIPGREP_BINARY v13.0.0 valid /usr/bin/rg
[i] Source-code locations:
√ PACKAGE_DIR 24 files valid /app/archivebox
√ TEMPLATES_DIR 4 files valid /app/archivebox/templates
- CUSTOM_TEMPLATES_DIR - disabled None
[i] Secrets locations:
- CHROME_USER_DATA_DIR - disabled None
- COOKIES_FILE - disabled None
[i] Data locations:
√ OUTPUT_DIR 5 files @ valid /data
√ SOURCES_DIR 25 files valid ./sources
√ LOGS_DIR 1 files valid ./logs
√ ARCHIVE_DIR 163 files @ valid /archive
√ CONFIG_FILE 490.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 864.0 KB valid ./index.sqlite3
$ archivebox config
[i] [2023-12-20 07:18:46] ArchiveBox v0.7.1: archivebox config
> /data
IS_TTY=True
USE_COLOR=True
SHOW_PROGRESS=True
IN_DOCKER=True
IN_QEMU=False
PUID=911
PGID=911
OUTPUT_DIR=/data
CONFIG_FILE=/data/ArchiveBox.conf
ONLY_NEW=True
TIMEOUT=60
MEDIA_TIMEOUT=3600
OUTPUT_PERMISSIONS=644
RESTRICT_FILE_NAMES=windows
URL_DENYLIST=\.(css|js|otf|ttf|woff|woff2|gstatic\.com|googleapis\.com/css)(\?.*)?$
URL_ALLOWLIST=None
ADMIN_USERNAME=None
ADMIN_PASSWORD=None
ENFORCE_ATOMIC_WRITES=True
TAG_SEPARATOR_PATTERN=[,]
SECRET_KEY=oId15KJwKrY79eouYKnvvRolMsIu_XijJ5eMNOQv6R4DIX09RU
BIND_ADDR=0.0.0.0:8000
ALLOWED_HOSTS=*
DEBUG=False
PUBLIC_INDEX=False
PUBLIC_SNAPSHOTS=False
PUBLIC_ADD_VIEW=True
FOOTER_INFO=Content is hosted for personal archiving purposes only. Contact server owner for any takedown requests.
SNAPSHOTS_PER_PAGE=40
CUSTOM_TEMPLATES_DIR=None
TIME_ZONE=UTC
TIMEZONE=UTC
REVERSE_PROXY_USER_HEADER=Remote-User
REVERSE_PROXY_WHITELIST=
LOGOUT_REDIRECT_URL=/
PREVIEW_ORIGINALS=True
LDAP=False
LDAP_SERVER_URI=None
LDAP_BIND_DN=None
LDAP_BIND_PASSWORD=None
LDAP_USER_BASE=None
LDAP_USER_FILTER=None
LDAP_USERNAME_ATTR=None
LDAP_FIRSTNAME_ATTR=None
LDAP_LASTNAME_ATTR=None
LDAP_EMAIL_ATTR=None
SAVE_TITLE=True
SAVE_FAVICON=False
SAVE_WGET=True
SAVE_WGET_REQUISITES=False
SAVE_SINGLEFILE=False
SAVE_READABILITY=False
SAVE_MERCURY=False
SAVE_HTMLTOTEXT=True
SAVE_PDF=False
SAVE_SCREENSHOT=False
SAVE_DOM=False
SAVE_HEADERS=True
SAVE_WARC=False
SAVE_GIT=False
SAVE_MEDIA=False
SAVE_ARCHIVE_DOT_ORG=False
RESOLUTION=1440,2000
GIT_DOMAINS=github.com,bitbucket.org,gitlab.com,gist.github.com
CHECK_SSL_VALIDITY=True
MEDIA_MAX_SIZE=750m
CURL_USER_AGENT=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.1 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 8.4.0 (x86_64-pc-linux-gnu)
WGET_USER_AGENT=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.1 (+https://github.com/ArchiveBox/ArchiveBox/) wget/GNU Wget 1.21.3
CHROME_USER_AGENT=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)
COOKIES_FILE=None
CHROME_USER_DATA_DIR=None
CHROME_TIMEOUT=0
CHROME_HEADLESS=True
CHROME_SANDBOX=False
YOUTUBEDL_ARGS=['--write-description', '--write-info-json', '--write-annotations', '--write-thumbnail', '--no-call-home', '--write-sub', '--write-auto-subs', '--convert-subs=srt', '--yes-playlist', '--continue', '--no-abort-on-error', '--ignore-errors', '--geo-bypass', '--add-metadata', '--format=(bv*+ba/b)[filesize<=750m][filesize_approx<=?750m]/(bv*+ba/b)']
WGET_ARGS=['--no-verbose', '--adjust-extension', '--convert-links', '--force-directories', '--backup-converted', '--span-hosts', '--no-parent', '-e', 'robots=off']
CURL_ARGS=['--silent', '--location', '--compressed']
GIT_ARGS=['--recursive']
SINGLEFILE_ARGS=[]
FAVICON_PROVIDER=https://www.google.com/s2/favicons?domain={}
USE_INDEXING_BACKEND=True
USE_SEARCHING_BACKEND=True
SEARCH_BACKEND_ENGINE=ripgrep
SEARCH_BACKEND_HOST_NAME=localhost
SEARCH_BACKEND_PORT=1491
SEARCH_BACKEND_PASSWORD=SecretPassword
SEARCH_PROCESS_HTML=True
SONIC_COLLECTION=archivebox
SONIC_BUCKET=snapshots
SEARCH_BACKEND_TIMEOUT=90
FTS_SEPARATE_DATABASE=True
FTS_TOKENIZERS=porter unicode61 remove_diacritics 2
FTS_SQLITE_MAX_LENGTH=1000000000
USE_CURL=True
USE_WGET=True
USE_SINGLEFILE=False
USE_READABILITY=False
USE_MERCURY=False
USE_GIT=False
USE_CHROME=False
USE_NODE=False
USE_YOUTUBEDL=False
USE_RIPGREP=True
CURL_BINARY=curl
GIT_BINARY=git
WGET_BINARY=wget
SINGLEFILE_BINARY=/app/node_modules/.bin/single-file
READABILITY_BINARY=/app/node_modules/.bin/readability-extractor
MERCURY_BINARY=/app/node_modules/.bin/postlight-parser
YOUTUBEDL_BINARY=yt-dlp
NODE_BINARY=node
RIPGREP_BINARY=rg
CHROME_BINARY=chromium-browser
POCKET_CONSUMER_KEY=None
USER=archivebox
PACKAGE_DIR=/app/archivebox
TEMPLATES_DIR=/app/archivebox/templates
ARCHIVE_DIR=/data/archive
SOURCES_DIR=/data/sources
LOGS_DIR=/data/logs
URL_DENYLIST_PTN=re.compile('\\.(css|js|otf|ttf|woff|woff2|gstatic\\.com|googleapis\\.com/css)(\\?.*)?$', re.IGNORECASE|re.MULTILINE)
URL_ALLOWLIST_PTN=None
DIR_OUTPUT_PERMISSIONS=755
ARCHIVEBOX_BINARY=/usr/local/bin/archivebox
VERSION=0.7.1
COMMIT_HASH=None
BUILD_TIME=2023-12-18 06:57:51 1702882671
PYTHON_BINARY=/usr/local/bin/python
PYTHON_ENCODING=UTF-8
PYTHON_VERSION=3.11.7
DJANGO_BINARY=/usr/local/lib/python3.11/site-packages/django/__init__.py
DJANGO_VERSION=3.1.14 final (0)
SQLITE_BINARY=/usr/local/lib/python3.11/sqlite3/dbapi2.py
SQLITE_VERSION=2.6.0
CURL_VERSION=curl 8.4.0 (x86_64-pc-linux-gnu)
WGET_VERSION=GNU Wget 1.21.3
WGET_AUTO_COMPRESSION=True
RIPGREP_VERSION=ripgrep 13.0.0
SINGLEFILE_VERSION=None
READABILITY_VERSION=None
MERCURY_VERSION=1.0.0
GIT_VERSION=None
YOUTUBEDL_VERSION=None
CHROME_VERSION=None
NODE_VERSION=None
-
Great, this is working, thank you.
Ah, that's easy: set URL_DENYLIST to exclude URLs ending in .zip, like so: https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#url_denylist
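
To check a denylist pattern before saving it, here is a minimal Python sketch. It takes the URL_DENYLIST value from the config dump above and adds zip to the list of extensions. The flags mirror the URL_DENYLIST_PTN line in that dump. The helper name is_denied and the example URLs are made up for illustration, and using search() to apply the pattern is an assumption about how ArchiveBox matches URLs, not something confirmed in this thread.

```python
import re

# Value from the posted config, with "zip" added to the extension alternatives.
URL_DENYLIST = r'\.(css|js|otf|ttf|woff|woff2|zip|gstatic\.com|googleapis\.com/css)(\?.*)?$'

# Same flags as shown for URL_DENYLIST_PTN in the config dump.
URL_DENYLIST_PTN = re.compile(URL_DENYLIST, re.IGNORECASE | re.MULTILINE)


def is_denied(url: str) -> bool:
    """Hypothetical helper: True if the URL matches the denylist pattern."""
    return bool(URL_DENYLIST_PTN.search(url))


if __name__ == "__main__":
    # Example URLs are invented for the purpose of this check.
    for url in (
        "https://example.com/files/archive.zip",
        "https://example.com/files/archive.ZIP?download=1",
        "https://example.com/page.html",
        "https://example.com/zip-codes",
    ):
        print(url, "->", "skipped" if is_denied(url) else "archived")
```

If the pattern behaves as expected, the same string can be set as URL_DENYLIST in ArchiveBox.conf or with archivebox config --set. Note that the pattern only matches when .zip sits at the end of the URL, optionally followed by a query string, so a download link without a .zip suffix would still be archived.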