Tuan-Anh Tran
AI is scraping everything. Here's how to protect your privacy and security using open source tools and Linux-first approaches in an age where your data is constantly being harvested.
January 19, 2026

Digital Footprint in the AI Era


The AI era has fundamentally changed the privacy landscape. Every piece of content you create online - blog posts, social media updates, code repositories, photos - is being scraped, indexed, and fed into training datasets. Companies are building AI models on your data without consent, and the traditional privacy playbook no longer applies.

As someone who’s been pro-Linux and open source for years, I’ve always valued transparency and control over my digital life. But the AI revolution has forced me to rethink my approach to digital privacy. Here’s what I’ve learned and what you should consider.

The New Reality

AI companies are training models on everything they can get their hands on. Your public GitHub repos? Scraped. Your blog posts? Indexed. Your social media posts? Training data. Even private data isn’t safe - we’ve seen multiple instances of AI assistants leaking private conversations or training on user data that was supposed to be private.

The problem isn’t just about what you post publicly. As those leaks show, even data you hand to a service in private can end up in a training pipeline.

What You Can Do

1. Own Your Infrastructure

The best way to control your data is to own where it lives, so I’ve moved most of my services to self-hosted solutions.

Yes, it’s more work. But you have full control, and your data isn’t being mined for AI training.

I prefer to own my infrastructure where it makes sense. However, I’m practical about it - I don’t host my own email because it’s simply too much of a struggle. Email is one of those services where the operational overhead (deliverability, spam filtering, security) outweighs the benefits of self-hosting for most people. But having my own NAS gives me a solid foundation for storing and backing up my data without relying on cloud services.

2. Use Open Source Tools

Open source isn’t just about code freedom - it’s about transparency. You can audit what the software is doing with your data, and you can switch away the moment a project stops respecting you. That transparency is the foundation of my entire stack.

3. Browser Privacy

Your browser is your primary attack surface - it’s where most tracking, fingerprinting, and ad-tech data collection happens, so it’s the first thing worth hardening.
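One concrete, auditable hardening step is a Firefox `user.js` file, which pins privacy preferences across updates. The sketch below appends a few well-known prefs; the profile path is a placeholder (real profiles live under `~/.mozilla/firefox/<hash>.default-release`):

```shell
# Hedged sketch: pinning Firefox privacy prefs via user.js.
# PROFILE is an example path - substitute your actual profile directory.
PROFILE=./example-profile
mkdir -p "$PROFILE"
cat >> "$PROFILE/user.js" <<'EOF'
// Resist browser fingerprinting (standardizes many observable attributes)
user_pref("privacy.resistFingerprinting", true);
// Enable built-in tracking protection
user_pref("privacy.trackingprotection.enabled", true);
// 5 = Total Cookie Protection (isolate cookies per site)
user_pref("network.cookie.cookieBehavior", 5);
EOF
```

Because `user.js` is plain text under your control, you can diff it, version it, and audit exactly what your browser is configured to do - the same transparency argument as the rest of the open source stack.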

4. Email Privacy

Email is still the backbone of digital communication, but as noted above, it’s the one service I don’t self-host - the operational overhead is too high. Instead, I use Fastmail, a paid provider whose business model is subscriptions rather than advertising built on the contents of my inbox.

The key is choosing providers that respect your privacy and give you control over your data, even if you don’t own the infrastructure.

5. The Linux Advantage

Linux gives you control that proprietary operating systems can’t match. You have full system control - you decide what runs and what data is collected. Most Linux distributions don’t collect telemetry, and you can audit the source code of installed software. With granular firewall control via iptables or nftables, you have complete visibility and control over your network traffic. Whether you choose privacy-focused distros like Qubes OS or Tails, or just a well-configured Debian or Arch, Linux puts you in the driver’s seat of your privacy.
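As a concrete illustration of that granular control, here’s a minimal default-deny inbound ruleset built with `nft` commands. This is a sketch, not a complete firewall - the SSH rule is an assumption (drop it if you don’t run sshd), and you’d normally persist this in `/etc/nftables.conf` rather than typing it live:

```shell
# Hedged sketch: default-deny inbound firewall with nftables (requires root).
sudo nft add table inet filter
sudo nft add chain inet filter input \
    '{ type filter hook input priority 0; policy drop; }'
# Allow replies to connections this machine initiated
sudo nft add rule inet filter input ct state established,related accept
# Allow loopback traffic
sudo nft add rule inet filter input iif lo accept
# Example: allow inbound SSH - remove if you don't need it
sudo nft add rule inet filter input tcp dport 22 accept
```

Everything else inbound is silently dropped, and `sudo nft list ruleset` shows you the exact, complete policy - no hidden telemetry exceptions, no vendor-reserved ports.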

Compare this to Windows, which is pushing Copilot into every corner of the OS. You can’t escape it - it’s in the taskbar, the Start menu, the file explorer, and it’s constantly trying to collect your data to feed Microsoft’s AI services. With Linux, if you don’t want AI features, you simply don’t install them. If you do want them, you choose which ones and how they work. That’s the difference between a system that respects your choices and one that forces its agenda on you.

6. LLMs and Open Weight Models

When it comes to AI, there’s a crucial distinction between closed models and open weight models. Closed models like ChatGPT, Claude, and Gemini run on company servers. Every query you send goes to their servers, gets logged, and becomes part of their training data. You have no idea what they’re doing with your data, and you can’t audit the model itself.

Open weight models like Llama, Mistral, and Qwen are different. These models release their weights (the trained parameters) publicly, allowing you to run them locally on your own hardware. Your prompts never leave your machine, no third party logs your queries, and you decide exactly which model version you run and when it changes.

Running models locally requires hardware (a decent GPU helps), but the privacy benefits are worth it. Tools like llama.cpp or vLLM make it relatively easy to run these models on your own machine. For most tasks, open weight models are getting close to closed models in quality, and the gap is narrowing fast.
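A typical local session with llama.cpp might look like the following sketch. The binary name reflects recent llama.cpp builds (older ones shipped it as `main`), and the GGUF filename is just an example - substitute whatever quantized model you’ve downloaded:

```shell
# Hedged sketch: fully offline inference with llama.cpp.
# The model path is an example - any GGUF file you've downloaded works.
./llama-cli \
    -m ./models/mistral-7b-instruct.Q4_K_M.gguf \
    -p "Summarize the key privacy risks of cloud-hosted AI assistants." \
    -n 256        # generate up to 256 tokens, no network calls involved
```

A Q4-quantized 7B model fits comfortably in ~5 GB of RAM, so even a laptop without a discrete GPU can run this - slowly, but entirely under your control.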

However, there’s an important caveat: model bias. You can’t escape bias for now, unless models are truly open. That means everything - training data, parameters, fine-tuning datasets, and the entire training process - needs to be transparent and auditable. Most “open weight” models today only release the weights, not the training data or methodology. This means you still inherit whatever biases were baked into the training process, and you can’t fully audit or understand where those biases come from.

The open source philosophy applies to AI too. If you care about privacy and transparency, choose open weight models over closed ones. But remember: true transparency requires more than just open weights - it requires open training data, open processes, and open methodology. Your data shouldn’t be the price you pay for using AI, and you should be able to understand and challenge the biases in the models you use.

Here’s the reality: complete privacy in the AI era is nearly impossible if you want to participate in modern digital life. But you can significantly reduce your exposure and maintain control over your most sensitive data.

The key is to:

  1. Own your infrastructure where possible
  2. Use open source tools for transparency
  3. Minimize your footprint by being intentional
  4. Accept trade-offs between convenience and privacy
  5. Stay informed about new threats and solutions

Final Thoughts

I’m not advocating for complete digital isolation. The internet is an incredible tool, and sharing knowledge and connecting with others has value. But we need to be more intentional about what we share and how we share it.

The AI era has made privacy a moving target. What works today might not work tomorrow. But by using open source tools, owning your infrastructure, and being mindful of your digital footprint, you can maintain a reasonable level of privacy and control.

Remember: you can’t control what AI companies have already scraped, but you can control what they scrape going forward. Start today, and make privacy a habit, not an afterthought.

Follow me

Here's where I hang out on social media