ChatGPT slurped up my Mastodon, now what?
I recently came across a post on Mastodon highlighting that LLMs were able to report a lot of info on a user persona for a given Mastodon username. Just the username, mind you, not the full handle, which includes the instance.
This troubled me, so I popped on over to ChatGPT to see what it knew. I started by trying to reproduce the results I saw from the users who originally alerted me. Using their usernames, I was unable to reproduce them at first. It seems that ChatGPT doesn’t search the web unless you are logged in. Once I logged in, I got results identical to the ones in the posts I saw on Mastodon.
“Okay so it knows atomicpoet and dansup. Those guys have huge followings. I have 50 followers on Mastodon. Surely I’m safe?” I thought. So naive… Not only did it find my Mastodon profile, it also found my personal site (the one you’re probably reading this on), my GitHub, and my dev.to account. I haven’t used dev.to in years… Why do I keep leaving breadcrumbs for these godforsaken bloodsucking cockroac… Breathe… Breathe…
Okay so they got me. My info is in ChatGPT. What now, what can I do?
If you have an OpenAI account, you will want to log in and ask it about any usernames you have used. You will see which sites are cited in the response. This is the beginning of your audit. For each site, you need to determine what level of control you have, as well as what you are willing to do.
This step is called threat modelling. When creating a threat model, you need to weigh the personal value of your data against the likelihood that it will be compromised, as well as the level of inconvenience you are willing to take on to secure it. If it is unlikely to be compromised, you probably don’t want to spend all your time securing it. However, in this case we know that OpenAI is scraping the data from the cited sites, and I have decided I want to do something about it.
When creating your threat model, ask yourself questions like the following: Can you set privacy settings? How much are you willing to share? Is there anything that you posted on those accounts that you definitely don’t want brought up by an LLM? Are you using the same username for different user personas (work, personal, etc.)?
Once you have these questions answered, and ideally a few of your own, you have your action items. For me, they were my personal site, GitHub, Mastodon, and dev.to. Let’s get started.
Personal websites, Mastodon
I believe it’s possible that they are searching the web for the username you type in and finding any sites that use it. However, I did link my personal site and GitHub in my Mastodon profile, so they could have just hopped from Mastodon to those places. Try removing links from your Mastodon profile if you think they are more trouble than they are worth. I will be keeping mine.
Funnily enough, back in 2023, OpenAI announced its web crawler and included a guide for webmasters to opt out.
I used Plagiarism Today’s example robots.txt to block many popular AI services. Simply add their robots.txt to the root of your site. This should take care of any indexed sites that you own. Add the robots.txt to your self-hosted fedi instance, or whatever, and those should be taken care of as well. Nice!
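To give a rough idea of what such a file contains, here is a minimal Python sketch that builds a blocklist robots.txt and checks it with the standard library’s urllib.robotparser. The user-agent list is an illustrative assumption, not Plagiarism Today’s actual file: GPTBot, ChatGPT-User, and OAI-SearchBot are agent names OpenAI documents for its crawlers, and CCBot and Google-Extended are two other commonly blocked AI-related agents. The function names and the example.com URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list only (an assumption, not Plagiarism Today's file).
AI_USER_AGENTS = [
    "GPTBot",
    "ChatGPT-User",
    "OAI-SearchBot",
    "CCBot",
    "Google-Extended",
]


def build_robots_txt(user_agents):
    """Return robots.txt text that disallows the whole site for each agent."""
    blocks = [f"User-agent: {agent}\nDisallow: /" for agent in user_agents]
    # Blank lines separate the per-agent groups.
    return "\n\n".join(blocks) + "\n"


def is_blocked(robots_txt, user_agent, url="https://example.com/"):
    """Check whether the given robots.txt text disallows user_agent for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


robots_txt = build_robots_txt(AI_USER_AGENTS)
print(robots_txt)
print("GPTBot blocked:", is_blocked(robots_txt, "GPTBot"))
print("Regular browser blocked:", is_blocked(robots_txt, "Mozilla/5.0"))
```

Keep in mind that robots.txt is a request, not an enforcement mechanism. It only helps with crawlers that choose to honor it.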
One thing of note: I haven’t been able to confirm that robots.txt works in this case for Mastodon, as my self-hosted instance wasn’t being scraped by OpenAI’s indexers. I believe this is because my instance hasn’t really federated with other servers yet. If someone, preferably @[email protected] and @[email protected], could do me a favor and add it, I’ll report back on this post.
Mastodon’s robots.txt lives in the /public directory of the source code. As a Mastodon administrator, you can connect to your server and edit it from the terminal using vim or nano.
If you’re not a Mastodon administrator, reach out to the admin of your instance and ask them to do this work for you. (This is what I’m doing!) You can send them this post as a guide!
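For admins who would rather script the edit than open vim or nano, the sketch below appends block rules to an existing robots.txt without duplicating agents that are already listed. It is only a sketch under assumptions: the function name, the agent list, and the path in the usage note are hypothetical, and you should check where your own install keeps the file before running anything like it.

```python
from pathlib import Path

# Illustrative agents only; extend the list to match your own threat model.
AI_USER_AGENTS = ["GPTBot", "ChatGPT-User"]


def append_block_rules(robots_path, user_agents):
    """Append 'Disallow: /' groups for agents not already named in the file.

    Returns the list of agents that were added (empty if nothing changed).
    """
    path = Path(robots_path)
    existing = path.read_text(encoding="utf-8") if path.exists() else ""

    # Collect the agents that already have a User-agent line.
    present = {
        line.split(":", 1)[1].strip().lower()
        for line in existing.splitlines()
        if line.strip().lower().startswith("user-agent:")
    }
    missing = [agent for agent in user_agents if agent.lower() not in present]
    if not missing:
        return []

    new_rules = "\n\n".join(f"User-agent: {agent}\nDisallow: /" for agent in missing)
    # Make sure a blank line separates the old rules from the new groups.
    if not existing:
        separator = ""
    elif existing.endswith("\n"):
        separator = "\n"
    else:
        separator = "\n\n"
    path.write_text(existing + separator + new_rules + "\n", encoding="utf-8")
    return missing
```

Usage would look something like append_block_rules("public/robots.txt", AI_USER_AGENTS), run from the Mastodon source directory. Running it a second time changes nothing, because every agent is already present.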
GitHub
GitHub is owned by Microsoft, which is also a major investor in OpenAI, so I doubt there is much you can do to prevent indexing of public repositories. At the very least, you’ll probably want to remove your bio, links, location, etc. If you still want people on GitHub to find you, add a link to your freshly GPT-blocked webpage for any info they might need! You can also make private any repositories that you wouldn’t want visible to AI. GitHub allows both free and paid users to have unlimited public and private repos, although some advanced features may require a paid GitHub plan.
dev.to
If you’re on dev.to, you’re on your own. I’m just deleting my account. Sowwy 👉👈
The best way to maintain privacy
The obvious method that secures everything is the one we should all use but seldom do. Delete all your accounts, rid your life of digital technology, cut every cable in your house, and start growing vegetables and raising livestock… The internet… was a MISTAKE…
But then you wouldn’t be able to follow me on RSS, Mastodon, GitHub, and dev.to… Nah, I’ll keep the beep boops and blinky lights. Maybe one day the solar flare will come and wipe the grid. Until then, same bat time, same bat channel. Thanks for reading.
If you are mentioned in this article, and would like to be removed, please reach out to me on Mastodon and I will remove you from the post.
EDIT: It’s worth noting I’ve implemented these changes on my end, but still get the same results from my username. I believe there might be some caching that needs to expire before the changes take effect. Fingers crossed.
About the Author
Originally posted at wandy.dev