Discussion Resources
Source: DISCUSS.md
Scope note
This file captures research and discussion context, not authoritative product behavior. For the current implementation, start with README.md (humans) or AGENTS.md (agents), then read vignettes/rllmdoc.qmd, vignettes/spec-contract.qmd, and NEWS.md before using the material below as historical or exploratory context.
Read this first, before proceeding to the discussion points captured below, and refer back to it during the continued conversation where relevant.
rdocdump:
- rdocdump: Dump R Package Documentation and Vignettes into One File • rdocdump
- Quick Start: dump R docs and vignettes to text files for LLMs • rdocdump
- Get Current rdocdump Repository Options — rdd_get_repos • rdocdump
- Set rdocdump Repository Options — rdd_set_repos • rdocdump
- Set rdocdump Cache Path in the Current R Session — rdd_set_cache_path • rdocdump
- Extract R Source Code from a Package — rdd_extract_code • rdocdump
- Dump Package Source, Documentation and Vignettes into Plain Text — rdd_to_txt • rdocdump
- e-kotov/rdocdump: rdocdump: Dump ‘R’ Package Source, Documentation, and Vignettes into One File
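As a quick sanity check, rdocdump can be exercised directly from R. A minimal sketch — the argument names are assumptions from the function titles above, so verify against `?rdd_to_txt` before relying on them:

```r
# Sketch: dump everything rdocdump knows about an installed package into one
# text file. Argument names are assumptions; check ?rdd_to_txt for the real API.
library(rdocdump)

out_file <- file.path(tempdir(), "rdocdump_corpus.txt")
rdd_to_txt("rdocdump", file = out_file)

# Inspect the first lines of the combined corpus
cat(head(readLines(out_file), 20), sep = "\n")
```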
llmstxt:
- Python source – llms-txt
- Python module & CLI – llms-txt
- The specification – llms-txt
- How-to help LLMs understand – llms-txt
Tokenization/embeddings:
Goal and Purpose of this R Package
The plan for this repo is larger in scope than just llms.txt. We are exploring additional documentation generation pipelines this package could automate, as well as further integrations this package could supplement, particularly Quarto extensions for handling complex or custom Quarto projects.
Additional agent-optimized generations
Consider expanding this package to include the pipeline run by the rdocdump package. First, thoroughly research the GitHub repo "e-kotov/rdocdump". The package is also installed in your environment, so you can make a test call to inspect the produced output.
- rdocdump would be used as part of a larger automation that tokenizes/vectorizes and produces text embeddings for literally everything about a package. The corpus returned by an rdocdump call would include roxygen docs, vignettes, R source code, and even tests.
- A separate external process (not in scope) could then collect all agent-optimized project-level corpora and centralize them in a single location.
- The value of rdocdump is supplementary/complementary; it would not replace the generation of llms.txt files.
- This new pipeline could also produce the same outputs for public CRAN packages, which means I could also have an embeddings/tokenized corpus for the R packages my codebase depends on.
- In effect, rllmdoc is a package designed for all the high-value agent-optimized documentation generations we'd want for any project (Quarto or R package).
- The generations that exist now are really designed for publishing the agent-optimized doc set alongside the websites generated by Quarto or pkgdown, whereas rdocdump would support building a truly centralized internal corpus that captures all documentation for the totality of my codebase, usable by any agent developer implementing any task or project in the artalytics organizational domain.
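The corpus-to-embeddings step described above could be sketched as follows. Everything here is hypothetical: the chunk size, the character-based chunking, and the `embed_chunks()` placeholder are illustrative assumptions, not part of rllmdoc or rdocdump:

```r
# Hypothetical pipeline sketch: rdocdump corpus -> chunks -> embeddings.
library(rdocdump)

corpus_file <- file.path(tempdir(), "pkg_corpus.txt")
rdd_to_txt("rdocdump", file = corpus_file)  # the same call would work for CRAN packages

txt <- paste(readLines(corpus_file), collapse = "\n")

# Naive fixed-size chunking (character-based); a real pipeline would chunk
# by tokens and respect document boundaries.
chunk_size <- 4000
starts <- seq(1, nchar(txt), by = chunk_size)
chunks <- substring(txt, starts, pmin(starts + chunk_size - 1, nchar(txt)))

# embed_chunks() is a placeholder for whatever embedding backend is chosen
# (local model or hosted API); it is not a real function.
# embeddings <- embed_chunks(chunks)
```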
Additional functionalities to consider
See this discussion related to Quarto projects and agent integrations. Evaluate whether the tokenization/embeddings pipeline should be a Quarto extension developed and supported by rllmdoc functions:
Note that the discussion is private. Use the gh CLI, which has authentication configured via the env var GITHUB_ENV available in your environment.
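Before fetching the private discussion, it may help to confirm the gh CLI is authenticated from within R. A small sketch — note that GITHUB_ENV is the variable named in these notes, while gh more commonly reads GH_TOKEN/GITHUB_TOKEN, so treat the variable name as an assumption to confirm:

```r
# Check gh CLI authentication status from R before querying private content.
status <- system2("gh", c("auth", "status"), stdout = TRUE, stderr = TRUE)
cat(status, sep = "\n")
```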
Additional Internal Planning Documentation
Consider the following internal planning folder, which contains documents on optimizing tokenization and text embeddings for the artalytics codebase. These notes were tracked long before this package existed, so it is worth revisiting them now that we have a mechanism for incorporating any developed implementations:
Locally at $AGENTS_HOME/plans