llms.txt directory

(directory.llmstxt.cloud)

95 points by pizza 2 days ago | 51 comments

Yes! Please standardize the web into simple hypertext so “LLMs can use it”. I promise I won’t build any tools to read it without the ads, tracking, and JavaScript client side garbage myself. I will not partake in any such efforts to surf the web as it was intended be, before its commercialization and commodification. No, sir, I could never!

nerdix 9 hours ago | root | parent | next |

I wish this were true too but it doesn't appear that llms.txt is meant to be a markdown version of the site's content based on the example on the site and after peeking at a few llms.txt files on some of the sites in the directory.

It looks like its text meant to be fed into the llm as a system prompt specific to the site.

The most simple ones just look like a sitemap restricted to documentation: https://www.activepieces.com/docs/llms.txt

Some interesting stuff is in some of them. Like this one that prompts the LLM to explain to the user the ethical issues of using AI agents along with a disclaimer:

https://boehs.org/llms.txt

That site appears to be someone's blog and they don't seem like big fans of LLMs.

https://boehs.org/node/llms-destroying-internet

A pretty clever use of llms.txt.

Eisenstein a day ago | root | parent | prev |

The problem I see is that people will develop two versions of websites -- the LLM version, optimized to give the model a good impression of their products or services and get them into the training data, and the human version (with sub-versions for mobile and such) which will be SEOd to hell.

No one wins in the long run by creating technical solutions to human incentive problems. It is just a prolonged arms race until

* the incentives are removed

* the process is made so technically complex or expensive that only a few players can profit from them

* it is regulated such that people can make money doing other things which have better risk/reward

* most people just avoid the whole ecosystem because it becomes a cesspool

ssl-3 a day ago | root | parent |

A brief guide on avoiding this pitfall, for search engine and LLM operators:

Step 1: Punish ranking and visibility of sites whose llms.txt differs from a random sampling of actual web (HTML) content.

Step 2: There is no step 2.

Retr0id 15 hours ago | root | parent | next |

If you can reliably machine-diff format A and format B, there's no need for two different formats in the first place.

> No one wins in the long run by creating technical solutions to human incentive problems. It is just a prolonged arms race

But maybe this time the naive technical solution will work

ssl-3 a day ago | root | parent |

Everything is temporary.

You, me, LLMs, Google, humanity, Earth, Sol...

We can choose to carry on and perform [what we think are] improvements and make the best of it, or we can choose to cash it in early and simply give up.

Eisenstein a day ago | root | parent | prev |

Brief guide for site optimizers:

1. Figure out how to embed content that only LLMs see which affect their output

2. Wait for that to stop working

3. Innovate another way to get past new technical problem

riffraff a day ago | prev | next |

llms.txt has a section on "Existing standards" which completely forgets about well-known[0], there's an issue opened three months[1] ago but it seems it was ignored.

[0] https://en.wikipedia.org/wiki/Well-known_URI

[1] https://github.com/AnswerDotAI/llms-txt/issues/2

varenc a day ago | root | parent | next |

Using `/.well-known/llms.txt` was the first thought that came to my mind as well. There's a long list of other special well-known URIs using it: https://en.wikipedia.org/wiki/Well-known_URI#List_of_well-kn...

8n4vidtmkvmk a day ago | root | parent |

Why not llms.md since it's markdown?

Exactly, this fits perfectly in the `.well-known` use cases. What a shame.

jph00 a day ago | root | parent | prev |

Unfortunately this requires registration, which is not a simple process.

sneak a day ago | root | parent |

No more than throwing stuff in the root.

jph00 a day ago | prev | next |

Folks, please note that this proposal is designed to help end users who wish to use AI tools. For instance, so that when you use Cursor or vscode you can get good documentation about the libs you use when coding, for the LLM to help you better.

It’s not related to model training. Nearly all the responses so far are about model training, just like last time this came up on HN.

For instance, I provide llms.txt for my FastHTML lib so that more people can get help from AI to use it, even although it’s too new to be in the training data.

Without this, I’ve seen a lot of folks avoid newer tools and libs that AI can’t help them with. So llms.txt helps avoid lock-in of older tech.

(I wrote the llms.txt proposal and web site.)

8n4vidtmkvmk a day ago | root | parent | next |

This sounds quite good in that case. I've been attempting to convert documentation into markdown to feed to the LLMs to get fresher/more accurate responses.

In this case, we might need a versioning scheme. Libraries have multiple versions and not everyone is on the latest. I still need a way to point my LLM to the ver I'm actually using.

13 hours ago | root | parent | prev |

[deleted]

bradarner a day ago | prev | next |

Have there been any declarations by various AI companies (e.g. OpenAI, Anthropic, Perplexity) that they are actually relying upon these llms.txt files?

Is there any evidence that the presence of the llms.txt files will lead to increased inclusion in LLM responses?

ashenke a day ago | root | parent | next |

And if they are, can I put subtly incorrect data in this file to poison llm responses while keeping my content designed for humans of the best quality?

bradarner a day ago | root | parent |

I'm curious, what would be the reason for doing this?

hiccuphippo a day ago | root | parent | next |

Undermine the usefulness of llms in an attempt to force people to visit your site directly.

If one doesn’t want LLMs to scrape data and knows the LLMs will be ignoring the robots.txt file.

CamperBob2 10 hours ago | root | parent | prev |

Keep in mind you're asking this question on a site where users regularly defend the Luddites, Ted Kaczynski, and other people who thought they were doing great things for humanity but who actually weren't even doing themselves any favors.

nunodonato a day ago | root | parent | prev |

Anthropic itself publishes a bunch of its own llm.txt files. So I guess that means something

jsheard a day ago | prev | next |

It's telling that nearly every site listed is directly involved with AI in some way, unsurprisingly the broader internet isn't particularly interested in going out of its way to make its content easier for AI companies to scrape.

Deliberately putting garbage data in your llms.txt could be funny though.

spencerchubb a day ago | root | parent | next |

You seem to be misunderstanding why a website would make llms.txt

Obviously, they would not make it just for an AI company to scrape

Here's an example. Let's say I run a dev tools company, and I want users to be able to find info about me as easily as possible. Maybe a user's preferred way of searching the web is through a chatbot. If that chatbot also uses llms.txt, it's easy for me to deliver the info, and easy for them to consume. Win-win

Of course adoption is not very widespread, but such is the case for every new standard.

ppqqrr a day ago | root | parent |

The point of LLMs is they are able to make sense of the web the same way humans can (roughly speaking); so why do they get the special treatment of having direct, ad-free, plain text version of the actual info they’re looking for, while humans aren’t allowed to scroll through a salad recipe without being bombarded with 20 ads?

spencerchubb a day ago | root | parent |

A human could read the llms.txt if they want to. And a developer could put ads in llms.txt if they wanted to!

I've seen many people joke about intentionally poisoning training data but has that ever worked?

jsheard a day ago | root | parent |

It's hard to gauge the effectiveness of poisoning huge training sets since anything you do is a figurative drop in the ocean, but if you can poison the small amount of data that an AI agent requests on-the-fly to use with RAG then I would guess it's much easier to derail it.

nyrikki a day ago | root | parent |

This study shows that controlling 0.1% may be enough.

https://arxiv.org/abs/2410.13722v1

I have noticed some popular copied but incorrect leetcode examples leaking into the dataset.

I suspect it depends on domain specificity, but that seems within the ability of an SEO spammer or decentralized group of individuals.

ALittleLight a day ago | root | parent | prev |

Seems silly to put garbage data there. Like intentionally doing bad SEO so Google doesn't link you.

I think you should think about it as: I want the LLM to recognize my site as a high quality resource and direct traffic to me.

Imagine user asks ChatGPT a question. LLM has scrapped your website and answers the question. User wants some kind of follow up - read more, what's the source, how can I buy this, whatever - so the LLM links the page it got the data from.

LLMs seem like they're supplanting search. Being early to work with them is an advantage. Working to make your pages look low quality seems like an odd choice.

shreddit a day ago | root | parent | next |

That sounds like those “react youtubers” taking your content without permission and telling you that you should be grateful for the exposure.

ALittleLight a day ago | root | parent |

Is it a good response to the react YouTubers to make your content terrible? Or to provide something in your content not available on their?

Whether you like it or not LLMs are going to be how people explore the web. They simply work better than search engines - not least because they can quickly scan numerous sites simultaneously, consume and synthesize the content.

You can choose to sabotage your own content in a likely futile effort to make things worse for LLM users if you want - my point is just that it serves no purpose and misses out on the opportunities in front of you.

Are you kidding? The follow-through on attribution links, where present, is nearly zero. There’s no gains to be had here, only losses.

I'd prefer not to play that game. I'd rather lose a bit of money and traffic and not help LLMs as far as humanly possible.

a day ago | root | parent | prev |

[deleted]

Lariscus a day ago | prev | next |

Making it easier for tech companies to steal my art. Sure, I will get right to it. In what world do these thieves live? I hope they catch something nasty!

bongodongobob a day ago | root | parent |

Has your art been stolen in the past? If so, how did you get it back?

sneak a day ago | prev | next |

Poorly specified, wedging structured data into markdown, not widely supported, ignores /.well-known/.

I also don’t understand the problem it purports to solve.

whoistraitor a day ago | prev | next |

Perplexity is listed, but do they actually abide by llms.txt? And how can we prove they do? Is it all good faith? I wish there were a better way.

jsheard a day ago | root | parent |

llms.txt isn't an opt-out signal like robots.txt, it's a way to provide a simplified version of pages that are easier for LLMs to ingest. It's more of an opt-in for being scraped more accurately.

weare138 a day ago | root | parent |

Or scraped inaccurately. It seems like you could have some fun with this if you were so inclined...

jrh3 a day ago | prev | next |

For trusted sites, this is a logical next step.

Juliate a day ago | prev | next |

Why should websites implement yet another custom output format for people^Wsoftware that won’t bother to use existing, loosely yet somewhat structured, open formats?

fchief a day ago | root | parent |

In a world where people gravitate to LLMs for quick answers instead of wading through ads and whatnot, it seems like you would want an LLM to site your content for further context. If the user just wanted an answer, they probably wouldn't have spent much time on your site.

freeone3000 a day ago | root | parent |

The ads and whatnot are why the site exists! That’s the point, with the content being the hook. If people aren’t looking at the ads, it’s a loss.

vouaobrasil a day ago | prev |

This is a great resource to at least figure out all the LLMs out there and block them. I already updated my robots.txt file. Of course, that is not sufficient, but at least it's a start and hopefully the blocking can get more sophisticated as time goes on.

hiccuphippo a day ago | root | parent | next |

It looks like the opposite. It is a way to make your site easier to parse for LLMs.

vouaobrasil a day ago | root | parent |

It is, but you can use it as a list of targets for blocking.

keeganpoppen a day ago | root | parent | prev |

it's not "productive", of course, but i don't see any issue with expressing this opinion whatsoever. and i say this being about as starry-eyed a techno-llm-utopian-esque dreamer as they come... sure, the "google" version of LLMs paving over industry has already crossed the rubicon, but everyone should have to reckon the value that they are truly providing not just for consumers but for producers as well... and no one should be offended by showing up in someone's robots.txt... just as i'm sure this commenter is realistic enough to know and understand that putting entries in one's robots.txt is nothing more than a principled, aspirational statement about how the world should be, rather than any sort of real technological impediment.

(but we'll just ignore the obvious irony in that end bit about detection of bots getting smarter... wonder where all this "intelligence" will come from? probably not some natural source, but possibly some sort of... Antinatural Intelligence?)