Cookies for Capture: Using ArchiveBox’s --cookie Option for Authenticated Web Content in OSINT

Open source intelligence (OSINT) often requires digging into corners of the web that aren’t publicly visible. Enter ArchiveBox, a self-hosted web archiving tool that lets you save web pages locally in various formats (HTML, PDF, WARC, etc.) (Key Features — ArchiveBox 0.8.6rc3 documentation). ArchiveBox is like your personal Wayback Machine, only smarter: it can archive both public and private web content while you retain full control of the data (Key Features — ArchiveBox 0.8.6rc3 documentation). In this post, we’ll explore how OSINT professionals can use ArchiveBox’s --cookie feature (a way to feed it your browser session cookies) to capture authenticated content behind logins. We’ll also cover how to export cookies from your browser, the steps to use them with ArchiveBox, and important ethical considerations. So grab a cup of coffee (and a cookie 🍪), and let’s dive in!

Why Archive Logged-In Content Matters in OSINT

In internet investigations, some of the most crucial information hides behind login screens or paywalls. For example, a private social media group, a gated forum, or a subscription-only news article might hold key evidence. Traditional archiving services like the Internet Archive often cannot capture these pages: they either get blocked by login requirements or miss dynamic content. In fact, the Internet Archive won’t capture Facebook pages or other logged-in areas, and it can struggle with sites that use heavy JavaScript or have robots.txt restrictions (Nixintel: Make Your Own Internet Archive With ArchiveBox). As a result, investigators risk losing critical information if they rely solely on public archiving.

By archiving authenticated content, OSINT investigators can ensure they preserve a complete picture of a webpage as they see it when logged in. This means grabbing the comments, private posts, or internal data that an anonymous scraper would never fetch. In short, being able to archive “inside” a site is often necessary to gather full context and evidence during an investigation.

Meet ArchiveBox: Your Local Web Archiving Sidekick

ArchiveBox is an open-source tool that runs on your own machine to save snapshots of webpages in multiple formats (Nixintel: Make Your Own Internet Archive With ArchiveBox). It’s like having a personal archive that captures not just raw HTML, but also things like screenshots, PDFs, and even videos or audio files embedded on the page. For OSINT work, this is a game-changer: ArchiveBox can preserve JavaScript-heavy sites and media content that other tools might miss (Nixintel: Make Your Own Internet Archive With ArchiveBox).

How does it work? You feed ArchiveBox a list of URLs (one at a time or in bulk), and it uses a variety of “extractors” (headless Chrome, wget, curl, yt-dlp, etc.) to pull down everything from the page. The output is stored in a timestamped folder, with the page saved as HTML, a screenshot PNG, a PDF, a WARC (Web ARChive format), and more. These formats ensure your data is future-proof – readable for decades to come without requiring proprietary software (Key Features — ArchiveBox 0.8.6rc3 documentation).

By default, ArchiveBox archives public pages (just like an anonymous user would see). But what if the content is private or login-required? That’s where the cookies come in!

The --cookie Option: Archiving as an Authenticated User

ArchiveBox has a nifty ability to use your browser’s session cookies to access content as if it’s you. In simpler terms, you can give ArchiveBox a “golden ticket” (your login session) so it can crawl pages that normally require you to be signed in. This is often referred to as using a cookies file (hence the --cookie option). Under the hood, ArchiveBox supports archiving content behind logins by using either a Chrome user profile or a cookies file with your session info (Security Overview · ArchiveBox/ArchiveBox Wiki · GitHub). We’ll focus on the cookies file method, as it’s straightforward and effective for most cases.

How it works: Web authentication often relies on a session cookie – a small piece of text stored in your browser after you log in, which tells the website “this user is authenticated.” ArchiveBox, when provided with your session cookie, will include it in its web requests. This tricks the target site into thinking ArchiveBox’s web scraper is your logged-in browser. As a result, ArchiveBox can see and save all the content that you are allowed to see.
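To make that mechanism concrete, here’s a minimal sketch using only Python’s standard library of how a session cookie ends up attached to an outgoing request, the same way ArchiveBox’s fetchers send it (the sessionid name, its abc123 value, and example.com are all made-up placeholders):

```python
import http.cookiejar
import urllib.request

# Build a cookie jar holding one (fake) session cookie for example.com
jar = http.cookiejar.CookieJar()
jar.set_cookie(http.cookiejar.Cookie(
    version=0, name="sessionid", value="abc123",
    port=None, port_specified=False,
    domain=".example.com", domain_specified=True, domain_initial_dot=True,
    path="/", path_specified=True, secure=False,
    expires=None, discard=True, comment=None, comment_url=None,
    rest={}, rfc2109=False,
))

# Prepare a request (nothing is actually sent) and let the jar attach the
# Cookie header -- this header is what makes the server treat the client
# as a logged-in browser
req = urllib.request.Request("https://example.com/private/page")
jar.add_cookie_header(req)
print(req.get_header("Cookie"))  # sessionid=abc123
```

The domain and path matching shown here is the same logic wget and headless Chrome apply when deciding which cookies to send with each request.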

ArchiveBox expects cookies in the Netscape cookie format (a plain-text format commonly named cookies.txt). This format is used by tools like wget and curl to supply cookies. When you configure ArchiveBox’s COOKIES_FILE setting to point to your cookies.txt, it will pass those cookies to its archiving tools. For example, the wget module will use them to fetch pages behind logins (Configuration — ArchiveBox 0.6.2 documentation). Even the headless Chrome component and yt-dlp (for videos) can utilize the session info if needed – meaning you can grab paywalled articles, private social media posts, or unlisted videos that your account has access to, all in one ArchiveBox run.
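For reference, a Netscape-format cookies.txt is just tab-separated lines like the following (the domain, cookie name, value, and expiry timestamp below are placeholders):

```
# Netscape HTTP Cookie File
# domain	include-subdomains	path	secure	expiry	name	value
.example.com	TRUE	/	TRUE	1999999999	sessionid	abc123
```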

Tip: ArchiveBox’s documentation notes that archiving content behind logins is an advanced use-case and recommends creating a dedicated account for it (Security Overview · ArchiveBox/ArchiveBox Wiki · GitHub). This is wise – you probably don’t want to use your personal Facebook account’s cookies, for example. We’ll talk more about ethical safeguards in a bit.

Exporting Session Cookies: Getting Your cookies.txt

Before we can feed ArchiveBox our cookie, we need to export the session cookie from our web browser. There are a couple of ways to do this, and we’ll outline two common methods:

  • Using a Browser Extension: The easiest method is to use a cookie export extension. For Chrome (and browsers like Edge or Brave), popular choices are EditThisCookie or Get cookies.txt. These extensions let you quickly save the cookies for the site you’re on. For example, with EditThisCookie you can click an “Export” button to copy cookies, and even choose the Netscape format (sometimes labeled “cookies.txt”) in the extension’s options (Import/Export cookies – EditThisCookie). Similarly, the “Get cookies.txt” extension will directly download a cookies.txt file for the current site.
    1. Install the extension and navigate to the website you need to archive (make sure you’re logged in and can see the content).
    2. Click the extension and look for an Export or Download cookies option. Choose the Netscape format if prompted.
    3. Save the exported text as a file named cookies.txt. This file will contain lines of cookies, including your all-important session cookie for that site.
  • Using Browser Developer Tools: If you prefer not to use an extension, you can manually grab cookies via your browser’s dev tools. In Chrome, for instance, open DevTools (F12) and go to the Application tab, then find Cookies under Storage. You can view the cookies for the site and copy their values. However, simply copying and pasting might not be in the correct format. You would have to manually format it or use a script to convert it to Netscape style. There are scripts available that take cookie details and output a Netscape cookies.txt (for example, the Node.js script dandv/convert-chrome-cookies-to-netscape-format on GitHub can do this conversion). This method is more involved, so using an extension is usually faster and less error-prone.
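If you do go the manual route, the conversion itself is simple. Here’s a hedged sketch of what such a script does (the to_netscape helper and the 30-day synthetic expiry are made up for illustration, not part of any tool):

```python
import time

def to_netscape(domain, name, value, path="/", secure=False, days=30):
    """Turn one cookie copied from DevTools into a Netscape cookies.txt line."""
    expiry = int(time.time()) + days * 86400          # synthetic expiry timestamp
    subdomains = "TRUE" if domain.startswith(".") else "FALSE"
    flag = "TRUE" if secure else "FALSE"
    # Seven tab-separated fields: domain, subdomain flag, path, secure, expiry, name, value
    return "\t".join([domain, subdomains, path, flag, str(expiry), name, value])

lines = ["# Netscape HTTP Cookie File",
         to_netscape(".example.com", "sessionid", "abc123", secure=True)]
print("\n".join(lines))
```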

Whichever method you choose, the end result should be a cookies.txt file in Netscape format. It’s a plain text file where each line corresponds to a cookie with fields like domain, path, name, value, etc.
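Before pointing ArchiveBox at your export, it’s worth sanity-checking that the file actually parses. Python’s standard-library http.cookiejar reads the same Netscape format, so a quick sketch like this works (the sample cookie values are placeholders; in practice you’d load your real browser export):

```python
import http.cookiejar
import tempfile

# Write a tiny example cookies.txt (in practice this is your browser export)
sample = (
    "# Netscape HTTP Cookie File\n"
    ".example.com\tTRUE\t/\tTRUE\t1999999999\tsessionid\tabc123\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

# MozillaCookieJar parses the same Netscape format ArchiveBox expects;
# a malformed export will raise LoadError here instead of failing silently later
jar = http.cookiejar.MozillaCookieJar(path)
jar.load(ignore_discard=True, ignore_expires=True)
for cookie in jar:
    print(cookie.domain, cookie.name, cookie.value)
```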

Important: Ensure that the cookies you export include the session/token for the target site. If you’re archiving multiple sites that each need login, you might combine cookies for all those domains into one file – or handle them one site at a time with separate cookie files. Generally, it’s simplest to do one site at a time: log in to the site, export cookies.txt for it, then run ArchiveBox on the URLs from that site using that cookies file.
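If you do end up with one combined export, splitting it back into per-site cookie sets is easy to script. A sketch, assuming Netscape-format input (split_cookies_by_domain is a hypothetical helper, not an ArchiveBox feature):

```python
from collections import defaultdict

def split_cookies_by_domain(lines):
    """Group Netscape-format cookie lines by their domain field (first column)."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                      # skip header comments and blank lines
        fields = line.split("\t")
        if len(fields) != 7:
            continue                      # not a valid cookie line
        groups[fields[0].lstrip(".")].append(line)
    return groups

combined = [
    "# Netscape HTTP Cookie File",
    ".example.com\tTRUE\t/\tTRUE\t1999999999\tsessionid\tabc123",
    ".example.org\tTRUE\t/\tFALSE\t1999999999\ttoken\txyz789",
]
groups = split_cookies_by_domain(combined)
print(sorted(groups))  # ['example.com', 'example.org']
```

Each group can then be written out (with the "# Netscape HTTP Cookie File" header re-added) as its own cookies.txt for a single-site archiving run.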

Feeding the Cookie to ArchiveBox

Now for the fun part – giving ArchiveBox its “login snack.” There are a couple of ways to tell ArchiveBox to use your cookies file. We’ll illustrate a straightforward command-line approach:

  1. Place the cookies file in a convenient location (for example, ~/cookies.txt or in your ArchiveBox project directory).
  2. Run ArchiveBox with the --cookie option (cookies file). If you installed ArchiveBox on your system (via pip), you can use an environment variable or ArchiveBox’s config to point it to the file. For a one-time use on the command line, it’s easy to prefix the command with the COOKIES_FILE variable. For example:
# Tell ArchiveBox where the cookies file is, then add a URL to archive
COOKIES_FILE=~/cookies.txt archivebox add 'https://example.com/private/page'

In this example, we set an environment variable COOKIES_FILE pointing to our cookies file, and then run the archivebox add command with the target URL. ArchiveBox will pick up the cookies and use them during archiving (Configuration — ArchiveBox 0.6.2 documentation). When the archiving process runs, you should see ArchiveBox’s usual output, but now it will retrieve content as an authenticated session. If the login was successful, the archived page will include the previously gated content (you can verify by opening the saved HTML or screenshot to ensure the private data is present).

Alternatively, you can permanently set the cookies file in ArchiveBox’s configuration if you plan to reuse it often. For instance, you can run:

archivebox config --set COOKIES_FILE=/path/to/cookies.txt

This saves the setting, so subsequent archivebox add runs will automatically use that cookies file (until you change it). This might be handy if you’re archiving a bunch of URLs from the same site that all require the same login.

Using Docker? Many OSINT folks run ArchiveBox via Docker. In that case, you’ll want to make sure the container can access your cookies file. One approach is to mount the file into the ArchiveBox container (for example, place it in your ArchiveBox data directory so it’s persisted). Then you can set the COOKIES_FILE environment variable in the Docker command or Compose file. For example, in docker-compose.yml under the ArchiveBox service, you might add:

environment:
  - COOKIES_FILE=/data/cookies.txt

and ensure you have ./cookies.txt on the host mapped to /data/cookies.txt in the container. Once configured, run archivebox add through Docker as usual. The concept is the same: ArchiveBox inside the container reads the cookies file and uses it to authenticate.
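Putting those pieces together, the relevant portion of a docker-compose.yml might look like this (the ./data and ./cookies.txt host paths are assumptions about your layout, so adjust them to match your setup):

```yaml
services:
  archivebox:
    image: archivebox/archivebox:latest
    volumes:
      - ./data:/data                        # ArchiveBox data directory
      - ./cookies.txt:/data/cookies.txt:ro  # session cookies, mounted read-only
    environment:
      - COOKIES_FILE=/data/cookies.txt
```

Mounting the cookies file read-only (:ro) is a small extra safeguard: the container can use the session but can’t modify or overwrite your exported cookies.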

Responsible (and Ethical) Use of Cookie-Based Archiving

Using your authenticated session to archive content gives you great power – and as the saying goes, with great power comes great responsibility. Here are some ethical and practical guidelines to keep in mind:

  • Use a Dedicated Account: It’s highly recommended to use a separate or “throwaway” account for archiving protected content. ArchiveBox’s developers themselves warn not to use your personal browser profile or cookies, because any session cookies you pass in could end up stored in the archive data (Security Overview · ArchiveBox/ArchiveBox Wiki · GitHub). If someone else accessed your ArchiveBox outputs, they might find your cookie and potentially reuse it to impersonate you! To avoid this, use an account that you don’t mind isolating, and still treat the resulting archive as sensitive. (You can always log out or invalidate the session after archiving to be safe.)
  • Don’t Publicly Share Sensitive Archives: When you archive content from behind a login, assume that sensitive data or identifiers might be embedded in the saved files. For instance, the HTML could contain your username or the cookie in a header dump (Security Overview · ArchiveBox/ArchiveBox Wiki · GitHub). It’s best to keep such archives in a secure location and not publish them openly. If you need to share the archived content, consider sanitizing it (removing or redacting any private tokens or personal info).
  • Respect Terms of Service and Privacy: Just because you can archive something you’re logged into doesn’t always mean you should. Make sure you are authorized to view and save that content. For example, using your own login to archive your own feed or a group you’ve lawfully joined is one thing; using someone else’s credentials without permission or scraping a site in violation of its terms could be unethical or illegal. As an OSINT professional, you likely have guidelines on what’s fair game – stick to those. The --cookie technique should be used to preserve evidence and information for the public good or your client’s legitimate needs, not for snooping where you shouldn’t.
  • Moderation is Key: Avoid archiving huge amounts of data behind a login in a way that might look like a bulk-scraping attack to the site. You don’t want your account banned. ArchiveBox does allow rate limiting and other configurations if needed. Be a polite archivist even when logged in.

By following these practices, you ensure this powerful technique remains available and doesn’t backfire. Responsible use also helps protect the reputation of OSINT investigators as ethical and law-abiding.

Benefits of ArchiveBox with Authenticated Content

Using ArchiveBox’s cookie-powered archiving can level up your investigation. Here are some key benefits of this technique:

  • Complete Content Capture: You get to save everything an authenticated user sees – comments, private posts, images, and other dynamic elements that would be missing in an anonymous archive. This provides a fuller context for analysis, as you’re not just seeing a public teaser or placeholder text. For example, an archived members-only forum thread will include all the replies and media that members can see, not just a login prompt.
  • Bypass of Paywalls & Gated Content: If you have legitimate access (subscription or account), ArchiveBox will capture the page as you would see it after logging in. This is great for preserving paywalled news articles or research papers. Instead of saving just a summary or an error page, you’ll have the full text and associated media in your offline archive.
  • Dynamic and Media Content Preservation: Many modern sites load content via JavaScript after initial page load (think of infinite scroll social feeds or interactive maps). ArchiveBox, especially when running with your session, can execute that JS and load those bits (thanks to the headless browser component). The result is a more faithful snapshot. Additionally, if there are videos or audio that require login (say an unlisted YouTube video shared in a private group), ArchiveBox’s integration with yt-dlp can download those media files when provided with the right cookies/tokens (ArchiveBox 0.4.18 documentation). You won’t have to worry about missing out on embedded content.
  • Long-Term Access & Evidence Preservation: Once you’ve archived the content, you have long-term access to it, even if it disappears from the live site or your account loses access. ArchiveBox stores multiple formats (HTML for readability, WARC for an exact record, PDF for easy sharing, etc.) which are standard and durable (Key Features — ArchiveBox 0.8.6rc3 documentation). This means years from now, when the site might be gone or changed, you can still open your archive and see the content as it was. For OSINT investigators, this is crucial for maintaining evidence. You can demonstrate what you found, when you found it, in a form that’s not dependent on the original website being up.
  • Operational Security (OpSec): Using a local archiving tool like ArchiveBox with your login can be more private than using an online service. Your data stays with you. Plus, you can run ArchiveBox in an isolated environment. No external entity sees what URLs you are archiving (unlike using a third-party archiver that might log your activity). This can protect your investigation from tipping off adversaries. Just be sure to also protect your ArchiveBox itself (secure the machine or server it runs on, as it will contain sensitive info).

Wrapping Up

ArchiveBox’s --cookie capability adds a powerful arrow to the OSINT quiver. By leveraging your authenticated session, you can capture web pages that are off-limits to normal scrapers, ensuring nothing slips through the cracks of your investigation. We covered how to get your cookies via browser extensions or dev tools, how to configure ArchiveBox to use them, and the do’s and don’ts to use this technique safely.

In a lighthearted sense, we’ve learned that sometimes to get the goodies, you’ve got to give a cookie! 🍪 ArchiveBox with cookies enables you to preserve the web exactly as you see it, which is incredibly valuable in OSINT work. So next time you’re facing a login wall or a pesky “Please subscribe to continue reading” message, remember that ArchiveBox is ready to help – just bring along your session cookie.

Happy archiving, and stay ethical out there! With great power (and cookies) comes great responsibility, but used wisely, this approach will ensure your investigative breadcrumbs don’t get lost on the web. Now go forth and archive those hard-to-get pages – your future self (and maybe the entire investigation community) will thank you.
