Robots.txt Best Practices for SEO in 2025

Your website might be full of valuable content, but without the right crawl instructions, search engines can easily get lost in the noise. That’s where the robots.txt file comes in. As one of the simplest yet most powerful SEO tools, it tells search engines which parts of your site should be explored and which can be skipped.
When structured correctly, robots.txt helps prevent duplicate content, keeps error pages out of search results, and ensures that crawlers focus their attention on your most important pages. Misuse it, though, and you risk blocking critical resources or even your entire site.
In this guide, we’ll walk through how to design a safe, effective robots.txt strategy for 2025—covering file types, query strings, low-priority folders, error pages, and sitemaps—so your site stays clean, efficient, and user-focused in search.
1. Blocking Irrelevant File Types
Disallow: /*.doc$
Disallow: /*.ppt$
Not every file format on your server provides value for search visibility. Legacy documents, presentations, and backend scripts typically don’t need to be crawled, as they offer little to no benefit in search results. By excluding them, you keep crawlers focused on the pages that matter most to users.
⚠️ Important note: PDFs can sometimes be valuable, especially if they contain guides, case studies, or whitepapers. If you want to prevent specific PDFs from appearing in search, use the X-Robots-Tag: noindex HTTP header or apply a meta robots directive — these methods provide more reliable control than robots.txt alone.
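To illustrate the header approach, here is a minimal Python sketch of a helper that decides when to attach an X-Robots-Tag. The file list, path, and function name are hypothetical; in practice you would wire this into your web server or framework's response handling:

```python
# Hypothetical set of PDFs that should stay out of search results.
NOINDEX_PDFS = {"/downloads/internal-pricing.pdf"}

def x_robots_header(path: str):
    """Return an X-Robots-Tag value for files that must not be indexed,
    or None when the file may appear in search normally."""
    if path in NOINDEX_PDFS:
        return "noindex, nofollow"
    return None

print(x_robots_header("/downloads/internal-pricing.pdf"))  # noindex, nofollow
print(x_robots_header("/guides/whitepaper.pdf"))           # None
```

Because the header travels with the HTTP response, it works for file types (like PDFs) where you can't embed a meta robots tag in the markup.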
2. Handling Query Strings and Tracking Parameters
Disallow: /*?utm_*
Disallow: /*?ref=*
Campaign tracking parameters such as UTM codes and referral IDs often generate multiple versions of the same page. If crawled, these duplicates can dilute ranking signals and lead to unnecessary index bloat. By blocking common tracking parameters, you help search engines concentrate on your canonical URLs — the versions you actually want to rank.
⚠️ Best practice: Avoid disallowing all query strings (?*) in robots.txt. Some parameters (like filters, internal search, or pagination) may provide value to both users and search engines. Instead, manage them selectively with:
- Canonical tags → to point duplicate variations back to the preferred URL.
- Meta robots noindex → for parameter variations that should stay out of the index (note that Google retired the Search Console URL Parameters tool in 2022, so parameter handling can no longer be configured there).
- Consistent internal linking → so crawlers prioritise the clean version of each page.
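The same selective logic can be applied server-side when generating canonical URLs. A minimal Python sketch, assuming the tracking keys listed are the ones your site uses, that strips tracking parameters while preserving meaningful ones like pagination:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed tracking parameters; adjust the lists to match your own campaigns.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"ref", "gclid", "fbclid"}

def canonicalise(url: str) -> str:
    """Drop tracking parameters but keep functional ones (e.g. pagination)."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not k.startswith(TRACKING_PREFIXES) and k not in TRACKING_KEYS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(canonicalise("https://www.example.co.uk/blog?utm_source=x&page=2"))
# https://www.example.co.uk/blog?page=2
```

Emitting the cleaned URL in your rel=canonical tags keeps ranking signals consolidated even when tracked URLs are shared and linked externally.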
3. Streamlining Low-Priority Folders
Disallow: /assets/tmp/
Disallow: /drafts/
Disallow: /private/
Some folders are meant for internal use only. Examples include temporary files, draft content, or private directories. These don’t provide value to users in search results, so excluding them helps keep your index clean, focused, and relevant.
⚠️ Important: Do not block essential resources such as CSS, JavaScript, or fonts. Google needs to crawl these files to accurately render your pages and evaluate factors like mobile usability and Core Web Vitals. Blocking them can negatively affect how your site is indexed and ranked.
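If render-critical assets happen to live inside a folder you want to restrict, explicit Allow rules can carve them out. A sketch assuming a hypothetical /assets/ layout:

```
User-agent: *
Disallow: /assets/
Allow: /assets/css/
Allow: /assets/js/
Allow: /assets/fonts/
```

Google resolves conflicting rules by the most specific (longest) matching path, so the Allow lines take precedence for those subfolders.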
4. Keeping Error and Utility Pages Out of the Index
Disallow: /404.html
Disallow: /forbidden.html
Disallow: /internal-server-error.html
Disallow: /unauthorised.html
Disallow: /company/thanks.html
Pages like error screens and thank-you confirmations don’t provide value in search results. Excluding them helps reduce clutter in the index and ensures users only see meaningful, relevant content when they search for your brand.
⚠️ Best practice: Robots.txt prevents crawling, but it does not guarantee removal from search. To fully keep these pages out of the index:
- Serve the correct HTTP status codes (404 or 410 for missing/removed pages, 403 for restricted areas).
- Add a noindex directive on confirmation or utility pages if they must remain accessible to users; remember that crawlers can only see a noindex tag on pages they are allowed to fetch, so don’t also disallow those URLs in robots.txt.
- Use redirects where appropriate (e.g., send users from outdated URLs to a relevant live page).
This combination gives search engines a clear signal, ensuring your index remains clean and user-focused.
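As a compact Python sketch of how those signals fit together (page categories, status codes, and the helper itself are illustrative, not a universal rule):

```python
def index_control(page_type: str):
    """Return the (HTTP status, extra headers) that keep a utility page
    out of the index. Illustrative mapping for the cases discussed above."""
    if page_type == "missing":
        return 404, {}
    if page_type == "removed":
        return 410, {}
    if page_type == "restricted":
        return 403, {}
    if page_type == "thank-you":
        # the page must stay usable for visitors, so serve 200 but block indexing
        return 200, {"X-Robots-Tag": "noindex"}
    return 200, {}

print(index_control("thank-you"))  # (200, {'X-Robots-Tag': 'noindex'})
```

The key design point is that each page gets exactly one unambiguous signal: a status code when the content shouldn't exist for anyone, and a noindex header when it must remain reachable.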
5. Sitemap Directives
Sitemap: https://www.example.co.uk/sitemap.xml
Sitemap: https://www.example.co.uk/images.xml
Sitemap: https://www.example.co.uk/video-sitemap.xml
Sitemap: https://www.example.co.uk/news-sitemap.xml
Sitemap: https://www.example.co.uk/sitemaps/sitemap-products.xml
Sitemap: https://www.example.co.uk/sitemaps/sitemap-blog.xml
Sitemap: https://www.example.co.uk/sitemaps/sitemap-en.xml
Sitemap: https://www.example.co.uk/sitemaps/sitemap-fr.xml
Adding sitemap directives to robots.txt gives search engines clear paths to your most important content. This improves discoverability, ensures new pages are found faster, and helps crawlers understand your site structure.
⚠️ Note: Including sitemaps in robots.txt doesn’t guarantee indexing, but it does guide crawlers to the URLs you consider high-value. For best results, also submit your sitemaps directly in Google Search Console and keep them updated whenever content changes.
📌 Best practices:
- Always use absolute URLs (with https://).
- Keep sitemaps under 50MB (uncompressed) or 50,000 URLs each; split into multiple files if necessary.
- Submit the same sitemaps in Google Search Console for maximum reliability.
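The 50,000-URL limit is straightforward to respect programmatically. A Python sketch (the URL pattern is made up) that splits a long URL list into sitemap-sized batches, one per file:

```python
from itertools import islice

MAX_URLS = 50_000  # per-file limit from the sitemaps.org protocol

def chunk_urls(urls, size=MAX_URLS):
    """Yield lists of at most `size` URLs, one list per sitemap file."""
    it = iter(urls)
    while batch := list(islice(it, size)):
        yield batch

urls = [f"https://www.example.co.uk/p/{i}" for i in range(120_000)]
batches = list(chunk_urls(urls))
print(len(batches), len(batches[-1]))  # 3 20000
```

Each batch would then be written to its own sitemap file and listed in a sitemap index, keeping every file safely within the protocol's limits.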
6. Lessons Learned from Structuring Robots.txt
- Guide crawlers to what matters → Exclude non-essential file types and internal folders so search engines prioritise your core content.
- Maintain a clean index → Prevent duplicate URLs, error pages, and low-value screens from cluttering search results.
- Enable accurate rendering → Always allow access to CSS, JavaScript, and fonts so Google can properly evaluate mobile-friendliness, Core Web Vitals, and overall user experience.
- Use the right method for the job → robots.txt controls crawling, not indexing. Pair it with noindex directives, canonical tags, or proper HTTP status codes to fully manage how pages appear in search.
7. Final Thoughts
At DBETA, we believe even the smallest technical details shape the bigger picture of search performance. Robots.txt is more than a list of restrictions—it’s a strategic tool for guiding crawlers toward what matters most. When used thoughtfully, it helps reduce index clutter, highlight your best content, and create a smoother crawling experience.
To manage robots.txt more effectively, it’s best to generate it dynamically through a function. This allows you to create two separate files—one for the live environment and another for development. By doing so, you prevent duplicate websites from being indexed if a development site is ever hosted on a live server, protecting your search visibility and avoiding index bloat.
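As a sketch of that idea in Python (the function name, folder paths, and the "live"/"dev" flag are assumptions, not part of any specific framework):

```python
def build_robots(environment: str, host: str) -> str:
    """Return robots.txt content for the given environment.
    Anything other than 'live' is treated as a test site and fully blocked."""
    if environment != "live":
        # block everything so dev/staging copies never get indexed
        return "User-agent: *\nDisallow: /\n"
    lines = [
        "User-agent: *",
        "Disallow: /drafts/",
        "Disallow: /private/",
        f"Sitemap: https://{host}/sitemap.xml",
    ]
    return "\n".join(lines) + "\n"

print(build_robots("dev", "www.example.co.uk"))
```

Because the file is generated at request time, a development database accidentally deployed to a live server still serves the blocking version, which is exactly the safeguard described above.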
For any business managing a website, robots.txt can make a measurable difference. Combined with canonical tags, XML sitemaps, noindex directives, and strong internal linking, it helps strike the right balance between visibility and efficiency—ensuring search engines see your site the way you intend, and users find the pages that serve them best.
FAQs
Q: What is a robots.txt file?
A: Robots.txt is a text file placed at the root of your domain. It gives search engines instructions on which parts of your website they can or cannot crawl.
Q: Does robots.txt prevent pages from being indexed?
A: No. Robots.txt only blocks crawling, not indexing. If a blocked page is linked elsewhere, it can still appear in search results (usually without a description). To stop indexing, use a noindex tag on a page crawlers can fetch, or return the correct HTTP status code (404/410).
Q: Should I block CSS, JavaScript, or fonts in robots.txt?
A: No. Google needs to access CSS, JS, and fonts to render and evaluate your site for mobile-friendliness and Core Web Vitals. Blocking them can harm SEO.
Q: How should I handle UTM and tracking parameters?
A: You can disallow common tracking parameters like ?utm_ or ?ref= in robots.txt to reduce duplicate content. Avoid blocking all query strings globally, since some may be important for usability and SEO.
Q: What’s the best way to handle error or thank-you pages?
A: Serve the correct status codes (404, 410, 403) for error pages, and add a noindex tag to utility pages such as thank-you screens. If you rely on noindex, leave the page crawlable — a robots.txt block stops search engines from ever seeing the tag.
Q: Do small websites need to worry about crawl budget?
A: For most small and medium websites, crawl budget isn’t a major concern. However, keeping robots.txt clean still helps prevent index bloat and ensures search engines focus on your most valuable content.
Q: Does DBETA Bones generate robots.txt automatically?
A: Yes. DBETA Bones 8.0 automatically creates two versions of the robots.txt file — one for development and another for live environments. This ensures test sites don’t get accidentally indexed, while live sites remain fully optimised for search engines.
Let's talk about your project!