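GraphRAG Code Walkthrough

GraphRAG documentation usually steers people straight to the command line, but the CLI requires OpenAI or Azure support and is tedious to configure, much less convenient than the Qwen API. GraphRAG has no Qwen support out of the box, so I downloaded the source code and studied it. Honestly, it is heavily wrapped, but there are not that many actual features, and it is fairly simple to use. Here is a brief rundown.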

Poetry #

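GraphRAG uses Poetry for Python package management. The Poetry concepts that matter for GraphRAG are:

Poetry builds a virtual environment from a configuration file in the project directory. The configuration file is pyproject.toml.
Poetry can define command-line scripts. GraphRAG defines one: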

[tool.poetry.scripts]
graphrag = "graphrag.cli.main:app"

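This defines a new command, graphrag, which calls the app object in main.py under the graphrag/cli directory.

Poetry has a companion task runner called poe (Poe the Poet), which helps with debugging. This is also how the GraphRAG project officially recommends running its commands. In GraphRAG's configuration it looks roughly like this: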
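To make the "module:object" notation concrete, here is a minimal sketch of how such a string can be resolved to a callable. It is not Poetry's actual implementation; load_entry_point is a hypothetical helper, and it is demonstrated on the standard library's json:dumps so that it runs without GraphRAG installed.

```python
from importlib import import_module


def load_entry_point(spec: str):
    """Resolve a 'package.module:attribute' string to the object it names.

    Hypothetical helper for illustration; installers do the equivalent when
    they generate the console script.
    """
    module_name, _, attr_path = spec.partition(":")
    obj = import_module(module_name)
    # Walk dotted attribute paths such as "SomeClass.method".
    for attr in (attr_path.split(".") if attr_path else []):
        obj = getattr(obj, attr)
    return obj


# Same shape as "graphrag.cli.main:app", but pointing at the standard library.
dumps = load_entry_point("json:dumps")
print(dumps({"ok": True}))
```

With GraphRAG installed, the same helper called with "graphrag.cli.main:app" would return the CLI application object that the graphrag command invokes.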

[tool.poe.tasks]
index = "python -m graphrag index"
update = "python -m graphrag update"
init = "python -m graphrag init"
query = "python -m graphrag query"
prompt_tune = "python -m graphrag prompt-tune"
# Pass in a test pattern

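The official command for building the index is: poetry run poe index --root contents. It resolves in three steps:

poetry run executes a command inside the Poetry environment; any Python command can follow it directly.
poetry run poe looks up the task in the configuration file and runs the matching command. Here that amounts to poetry run python -m graphrag index --root contents.
The graphrag in that command ends up in the same place as the [tool.poetry.scripts] entry described above, namely graphrag.cli.main. Strictly speaking, python -m graphrag runs the package's __main__ module rather than the console script, but both routes lead to graphrag.cli.main.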
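The expansion in the second step can be sketched as follows. This only illustrates the lookup, not how poe is implemented; POE_TASKS mirrors the task table above and expand_poe_command is a hypothetical helper.

```python
import shlex

# Mirror of the [tool.poe.tasks] table shown earlier.
POE_TASKS = {
    "index": "python -m graphrag index",
    "update": "python -m graphrag update",
    "init": "python -m graphrag init",
    "query": "python -m graphrag query",
    "prompt_tune": "python -m graphrag prompt-tune",
}


def expand_poe_command(command: str) -> list[str]:
    """Return the argv that runs inside the Poetry environment (illustrative only)."""
    args = shlex.split(command)
    if len(args) < 4 or args[:3] != ["poetry", "run", "poe"]:
        raise ValueError("expected a command of the form: poetry run poe <task> ...")
    task, extra_args = args[3], args[4:]
    if task not in POE_TASKS:
        raise KeyError(f"unknown poe task: {task}")
    # The task's command line comes first, then whatever the user appended.
    return shlex.split(POE_TASKS[task]) + extra_args


print(expand_poe_command("poetry run poe index --root contents"))
```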

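Workflows Called by the Modules #

This part of the code is confusing. I assumed the workflow list was written in a configuration file, and I could not see where the Factory got it from. It turns out to live in index.workflows.factory.PipelineFactory after all. The key piece is the register_pipeline method, which is called at the bottom of that same file with hard-coded workflow lists.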

_standard_workflows = [
    "create_base_text_units",
    "create_final_documents",
    "extract_graph",
    "finalize_graph",
    "extract_covariates",
    "create_communities",
    "create_final_text_units",
    "create_community_reports",
    "generate_text_embeddings",
]
_fast_workflows = [
    "create_base_text_units",
    "create_final_documents",
    "extract_graph_nlp",
    "prune_graph",
    "finalize_graph",
    "create_communities",
    "create_final_text_units",
    "create_community_reports_text",
    "generate_text_embeddings",
]
_update_workflows = [
    "update_final_documents",
    "update_entities_relationships",
    "update_text_units",
    "update_covariates",
    "update_communities",
    "update_community_reports",
    "update_text_embeddings",
    "update_clean_state",
]
PipelineFactory.register_pipeline(
    IndexingMethod.Standard, ["load_input_documents", *_standard_workflows]
)
PipelineFactory.register_pipeline(
    IndexingMethod.Fast, ["load_input_documents", *_fast_workflows]
)
PipelineFactory.register_pipeline(
    IndexingMethod.StandardUpdate,
    ["load_update_documents", *_standard_workflows, *_update_workflows],
)
PipelineFactory.register_pipeline(
    IndexingMethod.FastUpdate,
    ["load_update_documents", *_fast_workflows, *_update_workflows],
)
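The registration pattern used here can be reduced to a small class-level registry. The sketch below is an assumption about the general shape, not the real PipelineFactory: PipelineRegistry and get_workflows are hypothetical names, the enum values are guesses, and the workflow lists are shortened.

```python
from enum import Enum
from typing import ClassVar


class IndexingMethod(Enum):
    # Member names follow the snippet above; the string values are assumed.
    Standard = "standard"
    Fast = "fast"


class PipelineRegistry:
    """Maps an indexing method to an ordered list of workflow names."""

    _pipelines: ClassVar[dict[IndexingMethod, list[str]]] = {}

    @classmethod
    def register_pipeline(cls, method: IndexingMethod, workflows: list[str]) -> None:
        # Store a copy so later changes to the caller's list do not leak in.
        cls._pipelines[method] = list(workflows)

    @classmethod
    def get_workflows(cls, method: IndexingMethod) -> list[str]:
        if method not in cls._pipelines:
            raise KeyError(f"no pipeline registered for {method}")
        return list(cls._pipelines[method])


_standard = ["create_base_text_units", "extract_graph", "generate_text_embeddings"]
_fast = ["create_base_text_units", "extract_graph_nlp", "generate_text_embeddings"]

# Registration happens at import time, which is why the flow looks hard-coded.
PipelineRegistry.register_pipeline(IndexingMethod.Standard, ["load_input_documents", *_standard])
PipelineRegistry.register_pipeline(IndexingMethod.Fast, ["load_input_documents", *_fast])

for name in PipelineRegistry.get_workflows(IndexingMethod.Fast):
    print(name)
```

Because the calls run when the module is imported, the pipelines exist before any configuration file is read, which matches the behaviour described above.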

File Type Validation #

    def _validate_input_pattern(self) -> None:
        """Validate the input file pattern based on the specified type."""
        if len(self.input.file_pattern) == 0:
            if self.input.file_type == defs.InputFileType.text:
                self.input.file_pattern = ".*\\.(txt|md)$"
            else:
                self.input.file_pattern = f".*\\.{self.input.file_type.value}$"
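In other words, when no pattern is configured, the text type matches .txt and .md files and every other type matches its own extension. The following sketch shows the effect of those default patterns on a few file names; default_file_pattern and matching_files are hypothetical helpers, and the use of re.match is an assumption about how the pattern is applied.

```python
import re


def default_file_pattern(file_type: str) -> str:
    """Reproduce the fallback patterns from the validator above."""
    if file_type == "text":
        return ".*\\.(txt|md)$"
    return f".*\\.{file_type}$"


def matching_files(names: list[str], file_type: str, file_pattern: str = "") -> list[str]:
    """Filter file names with the configured pattern, or the default if it is empty."""
    pattern = re.compile(file_pattern or default_file_pattern(file_type))
    return [name for name in names if pattern.match(name)]


names = ["notes/readme.md", "a.txt", "data.csv", "a.txt.bak"]
print(matching_files(names, "text"))
print(matching_files(names, "csv"))
```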

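Related Operations #

Data storage. Intermediate data is handled as pandas DataFrames throughout.
create_base_text_units workflow. Splits text into chunks; the core algorithm is in graphrag/index/text_splitting/text_splitting.py. From a first look, it appears to split by maximum length.
create_final_documents workflow. Tidies up document links, column names, and other formatting. The raw chunk information contains: the chunk ID (a TOKEN-type value), the document IDs (a list of TOKEN-type values; I am not sure why it is designed this way, perhaps one chunk can span several documents, though I never saw more than one in practice), the token count, and the source text. These are normalized and joined with the original documents table, producing a table with the following columns: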

DOCUMENTS_FINAL_COLUMNS = [
    ID,
    SHORT_ID,
    TITLE,
    TEXT,
    TEXT_UNIT_IDS,
    CREATION_DATE,
    METADATA,
]
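The splitting and joining steps described above can be sketched with pandas as follows. This is a simplified illustration under several assumptions, not GraphRAG's code: whitespace-separated words stand in for real tokenizer tokens, split_by_max_tokens is a hypothetical helper, and the column names document_ids, n_tokens, and text_unit_ids are chosen for the example.

```python
import pandas as pd


def split_by_max_tokens(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split text into windows of at most max_tokens whitespace tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


documents = pd.DataFrame({
    "id": ["doc-1", "doc-2"],
    "title": ["a.txt", "b.txt"],
    "text": ["one two three four five", "six seven"],
})

# One row per chunk; document_ids is a list, as in the chunk table described above.
rows = []
for doc in documents.itertuples(index=False):
    for i, chunk in enumerate(split_by_max_tokens(doc.text, max_tokens=2)):
        rows.append({
            "id": f"{doc.id}-chunk-{i}",
            "document_ids": [doc.id],
            "n_tokens": len(chunk.split()),
            "text": chunk,
        })
text_units = pd.DataFrame(rows)

# Flatten the list column, collect chunk ids per document, then join back.
unit_ids = (
    text_units.explode("document_ids")
    .groupby("document_ids", sort=False)["id"]
    .agg(list)
    .rename("text_unit_ids")
    .reset_index()
    .rename(columns={"document_ids": "id"})
)
final_documents = documents.merge(unit_ids, on="id", how="left")
print(final_documents[["id", "title", "text_unit_ids"]])
```

Storing the document IDs as a list is what makes the explode step necessary, and it would also let a single chunk point back at several documents if that case ever occurred.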

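extract_graph workflow. This is where the LLM finally shows up. The LLM extracts entities here, and the model is obtained from ModelManager().get_or_create_chat_model. The relevant configuration file is in the content directory (where does it come from? most likely the init command listed in the poe tasks above generates it). It contains the model type and other settings; the API_KEY appears in the configuration file as a variable, and its value lives in the .env file in the content directory.

Language Models #

Models are created by factory.py under graphrag/index/language_model. The models registered by default are: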
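The way a variable in the configuration file picks up its value from .env can be sketched like this. Everything here is an assumption for illustration: the ${...} placeholder syntax, the GRAPHRAG_API_KEY name, and the api_key field are not confirmed by the text above, and parse_dotenv and resolve_placeholders are hypothetical helpers rather than GraphRAG functions.

```python
from string import Template


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_placeholders(config_text: str, env: dict[str, str]) -> str:
    """Replace ${NAME} placeholders; raises KeyError if a name is missing."""
    return Template(config_text).substitute(env)


# Assumed contents of the .env file and of the config file in the content directory.
dotenv_text = "# secrets stay out of the config file\nGRAPHRAG_API_KEY=sk-example\n"
settings_text = "models:\n  default_chat_model:\n    api_key: ${GRAPHRAG_API_KEY}\n"

print(resolve_placeholders(settings_text, parse_dotenv(dotenv_text)))
```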

ModelFactory.register_chat(
    ModelType.AzureOpenAIChat, lambda **kwargs: AzureOpenAIChatFNLLM(**kwargs)
)
ModelFactory.register_chat(
    ModelType.OpenAIChat, lambda **kwargs: OpenAIChatFNLLM(**kwargs)
)

ModelFactory.register_embedding(
    ModelType.AzureOpenAIEmbedding, lambda **kwargs: AzureOpenAIEmbeddingFNLLM(**kwargs)
)
ModelFactory.register_embedding(
    ModelType.OpenAIEmbedding, lambda **kwargs: OpenAIEmbeddingFNLLM(**kwargs)
)

The good news is that since GraphRAG 2.0.0 you can plug in a custom LLM. The official documentation says:

As of GraphRAG 2.0.0, we support model injection through the use of a standard chat and embedding Protocol and an accompanying ModelFactory that you can use to register your model implementation. This is not supported with the CLI, so you’ll need to use GraphRAG as a library.

Our Protocol is defined here

Our base implementation, which wraps fnllm, is here

We have a simple mock implementation in our tests that you can reference here

Once you have a model implementation, you need to register it with our ModelFactory:
class MyCustomModel:
    ...
    # implementation

# elsewhere...
ModelFactory.register_chat("my-custom-chat-model", lambda **kwargs: MyCustomModel(**kwargs))
Then in your config you can reference the type name you used:
models:
  default_chat_model:
    type: my-custom-chat-model


extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1
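Putting the pieces together, the chain from a workflow's model_id to a registered custom model can be sketched without GraphRAG installed. This is a stand-in under stated assumptions: SketchModelFactory, create_chat, model_for_workflow, and the single chat method are hypothetical and do not reproduce GraphRAG's real Protocol or ModelFactory internals; only register_chat and the config keys come from the text above.

```python
from typing import Any, Callable, ClassVar


class SketchModelFactory:
    """Stand-in registry: model type name -> creator callable."""

    _chat_creators: ClassVar[dict[str, Callable[..., Any]]] = {}

    @classmethod
    def register_chat(cls, model_type: str, creator: Callable[..., Any]) -> None:
        cls._chat_creators[model_type] = creator

    @classmethod
    def create_chat(cls, model_type: str, **kwargs: Any) -> Any:
        if model_type not in cls._chat_creators:
            raise ValueError(f"chat model type not registered: {model_type}")
        return cls._chat_creators[model_type](**kwargs)


class MyCustomModel:
    """Toy model that echoes the prompt; a real one would call Qwen or another API."""

    def __init__(self, *, name: str, **options: Any) -> None:
        self.name = name
        self.options = options

    def chat(self, prompt: str) -> str:
        return f"[{self.name}] echo: {prompt}"


SketchModelFactory.register_chat(
    "my-custom-chat-model", lambda **kwargs: MyCustomModel(**kwargs)
)

# Dict form of the YAML config shown above.
config = {
    "models": {"default_chat_model": {"type": "my-custom-chat-model"}},
    "extract_graph": {"model_id": "default_chat_model", "max_gleanings": 1},
}


def model_for_workflow(config: dict, workflow: str) -> Any:
    """Follow workflow -> model_id -> models[model_id].type -> registered creator."""
    model_id = config[workflow]["model_id"]
    model_type = config["models"][model_id]["type"]
    return SketchModelFactory.create_chat(model_type, name=model_id)


print(model_for_workflow(config, "extract_graph").chat("hello"))
```

Since the quoted documentation says model injection is not supported with the CLI, the registration call has to run in your own script before the indexing pipeline is started through the library.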