Tuan-Anh Tran
AI is scraping everything. Here's how to protect your privacy and security using open source tools and Linux-first approaches in an age where your data is constantly being harvested.
January 19, 2026

Digital Footprint in the AI Era


The AI era has fundamentally changed the privacy landscape. Every piece of content you create online - blog posts, social media updates, code repositories, photos - is being scraped, indexed, and fed into training datasets. Companies are building AI models on your data without consent, and the traditional privacy playbook no longer applies.

As someone who’s been pro-Linux and open source for years, I’ve always valued transparency and control over my digital life. But the AI revolution has forced me to rethink my approach to digital privacy. Here’s what I’ve learned and what you should consider.

The New Reality

AI companies are training models on everything they can get their hands on. Your public GitHub repos? Scraped. Your blog posts? Indexed. Your social media posts? Training data. Even private data isn’t safe - we’ve seen multiple instances of AI assistants leaking private conversations or training on user data that was supposed to be private.

The problem isn’t just about what you post publicly. As those leaks show, even data you hand to a service in private can end up in a training pipeline.

What You Can Do

1. Own Your Infrastructure

The best way to control your data is to own where it lives, so I’ve moved most of my services to self-hosted solutions.

Yes, it’s more work. But you have full control, and your data isn’t being mined for AI training.

I prefer to own my infrastructure where it makes sense. However, I’m practical about it - I don’t host my own email because it’s simply too much of a struggle. Email is one of those services where the operational overhead (deliverability, spam filtering, security) outweighs the benefits of self-hosting for most people. But having my own NAS gives me a solid foundation for storing and backing up my data without relying on cloud services.

2. Use Open Source Tools

Open source isn’t just about code freedom - it’s about transparency. You can audit what the software is doing with your data, and you can switch away the moment a project stops respecting you. That transparency is the foundation of my entire stack.

3. Browser Privacy

Your browser is your primary attack surface - it’s where most tracking, fingerprinting, and ad-tech data collection happens, so it’s the first thing worth hardening.
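One concrete, auditable hardening step is a Firefox `user.js` file, which pins privacy preferences across updates. The sketch below appends a few well-known prefs; the profile path is a placeholder (real profiles live under `~/.mozilla/firefox/<hash>.default-release`):

```shell
# Hedged sketch: pinning Firefox privacy prefs via user.js.
# PROFILE is an example path - substitute your actual profile directory.
PROFILE=./example-profile
mkdir -p "$PROFILE"
cat >> "$PROFILE/user.js" <<'EOF'
// Resist browser fingerprinting (standardizes many observable attributes)
user_pref("privacy.resistFingerprinting", true);
// Enable built-in tracking protection
user_pref("privacy.trackingprotection.enabled", true);
// 5 = Total Cookie Protection (isolate cookies per site)
user_pref("network.cookie.cookieBehavior", 5);
EOF
```

Because `user.js` is plain text under your control, you can diff it, version it, and audit exactly what your browser is configured to do - the same transparency argument as the rest of the open source stack.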

4. Email Privacy

Email is still the backbone of digital communication, but as noted above, it’s the one service I don’t self-host - the operational overhead is too high. Instead, I use Fastmail, a paid provider whose business model is subscriptions rather than advertising built on the contents of my inbox.

The key is choosing providers that respect your privacy and give you control over your data, even if you don’t own the infrastructure.

5. The Linux Advantage

Linux gives you control that proprietary operating systems can’t match. You have full system control - you decide what runs and what data is collected. Most Linux distributions don’t collect telemetry, and you can audit the source code of installed software. With granular firewall control via iptables or nftables, you have complete visibility and control over your network traffic. Whether you choose privacy-focused distros like Qubes OS or Tails, or just a well-configured Debian or Arch, Linux puts you in the driver’s seat of your privacy.
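As a concrete illustration of that granular control, here’s a minimal default-deny inbound ruleset built with `nft` commands. This is a sketch, not a complete firewall - the SSH rule is an assumption (drop it if you don’t run sshd), and you’d normally persist this in `/etc/nftables.conf` rather than typing it live:

```shell
# Hedged sketch: default-deny inbound firewall with nftables (requires root).
sudo nft add table inet filter
sudo nft add chain inet filter input \
    '{ type filter hook input priority 0; policy drop; }'
# Allow replies to connections this machine initiated
sudo nft add rule inet filter input ct state established,related accept
# Allow loopback traffic
sudo nft add rule inet filter input iif lo accept
# Example: allow inbound SSH - remove if you don't need it
sudo nft add rule inet filter input tcp dport 22 accept
```

Everything else inbound is silently dropped, and `sudo nft list ruleset` shows you the exact, complete policy - no hidden telemetry exceptions, no vendor-reserved ports.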

Compare this to Windows, which is pushing Copilot into every corner of the OS. You can’t escape it - it’s in the taskbar, the Start menu, the file explorer, and it’s constantly trying to collect your data to feed Microsoft’s AI services. With Linux, if you don’t want AI features, you simply don’t install them. If you do want them, you choose which ones and how they work. That’s the difference between a system that respects your choices and one that forces its agenda on you.

6. LLMs and Open Weight Models

When it comes to AI, there’s a crucial distinction between closed models and open weight models. Closed models like ChatGPT, Claude, and Gemini run on company servers. Every query you send goes to their servers, gets logged, and becomes part of their training data. You have no idea what they’re doing with your data, and you can’t audit the model itself.

Open weight models like Llama, Mistral, and Qwen are different. These models release their weights (the trained parameters) publicly, allowing you to run them locally on your own hardware. Your prompts never leave your machine, no third party logs your queries, and you decide exactly which model version you run and when it changes.

Running models locally requires hardware (a decent GPU helps), but the privacy benefits are worth it. Tools like llama.cpp or vLLM make it relatively easy to run these models on your own machine. For most tasks, open weight models are getting close to closed models in quality, and the gap is narrowing fast.
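A typical local session with llama.cpp might look like the following sketch. The binary name reflects recent llama.cpp builds (older ones shipped it as `main`), and the GGUF filename is just an example - substitute whatever quantized model you’ve downloaded:

```shell
# Hedged sketch: fully offline inference with llama.cpp.
# The model path is an example - any GGUF file you've downloaded works.
./llama-cli \
    -m ./models/mistral-7b-instruct.Q4_K_M.gguf \
    -p "Summarize the key privacy risks of cloud-hosted AI assistants." \
    -n 256        # generate up to 256 tokens, no network calls involved
```

A Q4-quantized 7B model fits comfortably in ~5 GB of RAM, so even a laptop without a discrete GPU can run this - slowly, but entirely under your control.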

However, there’s an important caveat: model bias. You can’t escape bias for now, unless models are truly open. That means everything - training data, parameters, fine-tuning datasets, and the entire training process - needs to be transparent and auditable. Most “open weight” models today only release the weights, not the training data or methodology. This means you still inherit whatever biases were baked into the training process, and you can’t fully audit or understand where those biases come from.

The open source philosophy applies to AI too. If you care about privacy and transparency, choose open weight models over closed ones. But remember: true transparency requires more than just open weights - it requires open training data, open processes, and open methodology. Your data shouldn’t be the price you pay for using AI, and you should be able to understand and challenge the biases in the models you use.

Here’s the reality: complete privacy in the AI era is nearly impossible if you want to participate in modern digital life. But you can significantly reduce your exposure and maintain control over your most sensitive data.

The key is to:

  1. Own your infrastructure where possible
  2. Use open source tools for transparency
  3. Minimize your footprint by being intentional
  4. Accept trade-offs between convenience and privacy
  5. Stay informed about new threats and solutions

Final Thoughts

I’m not advocating for complete digital isolation. The internet is an incredible tool, and sharing knowledge and connecting with others has value. But we need to be more intentional about what we share and how we share it.

The AI era has made privacy a moving target. What works today might not work tomorrow. But by using open source tools, owning your infrastructure, and being mindful of your digital footprint, you can maintain a reasonable level of privacy and control.

Remember: you can’t control what AI companies have already scraped, but you can control what they scrape going forward. Start today, and make privacy a habit, not an afterthought.

Follow me

Here's where I hang out on social media