作者|鄭建華 更新|許嘯宇、張文驍、成誠OneFlow靜態圖的訓練效率遠高于動態圖(eager模式)。本文試圖通過一個簡單例子,結合v0.8.0版本的代碼,解讀一下靜態圖和運行時的實現機制。
在開始之前,建議先讀一下參考資料中《OneFlow框架的系統設計(https://zhuanlan.zhihu.com/p/337851255)》等系列文章。對靜態圖、運行時的基本概念和設計理念有基本的了解,會更容易理解代碼。
(相關資料圖)
1?
代碼示例
下面的示例代碼來自官方文檔(https://docs.oneflow.org/master/basics/08_nn_graph.html),是一個線性模型的前向計算。后續主要基于這段代碼進行分析。
import oneflow as flowimport oneflow.nn as nnclass ModuleMyLinear(nn.Module): def __init__(self, in_features, out_features): super().__init__() self.weight = nn.Parameter(flow.randn(in_features, out_features)) self.bias = nn.Parameter(flow.randn(out_features)) def forward(self, input): return flow.matmul(input, self.weight) + self.biaslinear_model = ModuleMyLinear(4, 3)class GraphMyLinear(nn.Graph): def __init__(self): super().__init__() # ModuleBlock self.model = linear_model def build(self, input): # ModuleBlock.__call__ return self.model(input)graph_mylinear = GraphMyLinear()input = flow.randn(1, 4)out = graph_mylinear(input)print(out)
2?
oneflow包的初始化
import oneflow在初始化包(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/__init__.py)時,與靜態圖相關的主要操作如下:
GetEnv(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/__init__.py#L228)
EnvGlobalObjectsScope::Init(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L126)
啟動各個節點的控制面(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L160-L162)網絡連接
初始化VM(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L180)
啟動各個節點的數據面網絡連接(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L184-L188)
初始化KernelObserver(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L192-L203)
NewDefaultSession(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/__init__.py#L229)
RegsiterSession(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/multi_client_session.py#L39)?創建 Session,并注冊為 default session(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/session_util.cpp#L89)
創建 Python MultiClientSession 并保存到dict(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/session_context.py#L40),但并不 TryInit
創建 C++ MultiClientSessionContext(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/multi_client_session.py#L41)?但并不 TryInit
EnvGlobalObjectsScope::Init中先創建一個全局的ProcessCtx(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/env_global_objects_scope.cpp#L132)對象。然后根據環境變量等配置,在各個進程間創建gRPC和CommNet的連接,分別負責控制面和數據面的數據傳輸。其中在Bootstrap過程中會初始化全局的ProcessCtx(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/rpc/lib/grpc.cpp#L42),給每個進程分配一個全局唯一的rank編號(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/rpc/lib/global_process_ctx.cpp#L28)(machine_id(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/rpc/lib/global_process_ctx.cpp#L24))。
本文不涉及網絡層面的操作,只討論同一進程內各線程間的交互。
3?
Module類
雖然可以直接用op和tensor構造模型,但是op的粒度太細了,直接用op構造模型會比較繁瑣。
Module(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/module.py#L54)是由op和tensor構成的、可復用的子模塊。利用Module可以更高效、更快捷的構建復雜模型。oneflow.nn(https://github.com/Oneflow-Inc/oneflow/blob/d825243aa7aff5cba8bd3a901b4cc56c2b1a36af/python/oneflow/nn/__init__.py)模塊導出了很多預定義的Module。
Module定義了自己的屬性設置邏輯(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/module.py#L262),核心邏輯是
如果value是Parameter類型,就保存到Module._parameters中
如果value是Module類型,就保存到Module._modules中
如果value是Tensor類型,就保存到Module._buffers中
否則按常規屬性處理
Module可以包含子Module,形成樹結構。因為Module通過setattr將子Module和Parameter都保存到字典結構中,可以方便的遍歷所有Module及其參數tensor。
4?
Graph類
4.1 構造函數
Graph的構造函數中GetDefaultSession(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L145)得到的session,就是導入oneflow包時NewDefaultSession(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/__init__.py#L229)構建的session。當時沒有初始化,而是在Graph構造時進行初始化(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L147)。對應的C++函數是MultiClientSessionContext::TryInit(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/multi_client_session_context.cpp#L67),執行時會創建各種全局的資源管理器,比如: ?
LazyJobBuildAndInferCtxMgr
BufferMgr
RegstMgr
ActorMsgBus
ThreadMgr
4.2?__setattr__: 將Module和Tensor封裝為Block
Graph.__setattr__ 支持通過設置屬性的方式把一個 Module 添加到 Graph 中,之后改 Module 就可以被 Graph 調用了。添加到 Graph 中的 Module,會被包裝到 Block 里面,Block 起到了代理執行的作用,它會給原 Eager 下的 Module 擴展出靜態執行需要的一些特殊功能。
添加到 Graph 中的 Module 和原 Module 共享了狀態(Parameter、Buffer)和 forward 執行邏輯。共享 forward 執行邏輯使得靜態和動態執行計算邏輯相同。共享狀態則可以使動態圖下的模型狀態被靜態圖復用?;诖?,兩個 Graph,一個用于訓練,一個用于預測,他們都復用統一模型 Module,這樣訓練和預測 Graph 也就實現了模型共享。
setattr最重要的動作就是對_add_block的調用(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L1332),_add_block中主要是調用get_block_cls并保存結果(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L1326)。get_block_cls(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L39)的作用是將Module及其所有Tensor屬性都轉為對應的Block對象。為什么要做這個動作呢?主要是靜態圖編譯需要借助Block類型來實現代理執行的功能,這些功能不適合直接寫到 eager 下的 Module 和 Tensor 上。
這個轉換是在ModuleBlock構造時調用set_origin(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L131)完成的。對于子Module,會遞歸調用get_block_cls函數(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L145),這樣所有子Module及其Tensor屬性都會被轉換為對應的Block對象。
所以,上述示例代碼中,GraphMyLinear實際存儲的是ModuleBlock,Graph.build執行時獲取的model屬性也是ModuleBlock對象,ModuleBlock.origin才是ModuleMyLinear。
Graph.__setattr__不允許將Tensor對象設置為屬性(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L1340)。Tensor只能存到Module中,因為 Module 是做狀態共享的基本單位,而 Graph 是不允許復用的。
4.3 針對不同任務,定義不同的計算圖
根據Oneflow Model Zoo的模型示例(https://github.com/Oneflow-Inc/models/blob/1b291f78d8f60e5f04ee0c5962e4611cc4bab40a/Vision/classification/image/alexnet/graph/train.py),train/eval等階段可以創建不同的Graph子類。動態圖下提供了 Module、Optimizer、Dataloader等模塊,這些模型都可以被添加到 Graph 中。不同的組合可以構建不同類型的任務。
在這些不同階段,Graph構造函數的行為、build函數的輸入輸出都有各自特點。了解這些,看后續代碼時會更容易理解各個參數的具體含義。
構造函數
train階段,需要添加Module、損失函數、優化器和dataloader
eval階段,只需要添加Module和dataloader
build函數
train
導入樣本和label
調用Module得到前向計算結果
計算損失
計算梯度
返回loss
eval
導入樣本和label
調用Module得到預估結果
返回預估結果和label
4.4 小結
上述幾個類型的關系如下:
下面描述了GraphMyLinear的構造流程
* `__init__` * `Graph.__init__` * self.model = linear_model * `Graph.__setattr__` * _add_block * get_block_cls: 遞歸地把Module轉為ModuleBlock * `ModuleBlock.__init__` * ModuleBlock.set_origin * `ModuleBlock._origin = origin` (Module) * 對origin的sub modules, parameters, buffers遞歸調用get_block_cls * `ModuleBlock.__setattr__`
5?
邏輯圖的編譯
計算機語言的編譯,是將高級語言的語句編譯為匯編或機器指令。深度學習框架對計算任務的編譯,是將用戶的特定語句操作轉換為DAG圖。oneflow中用Job(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job.proto#L30)描述邏輯的計算圖。
不同于eager模式的動態圖,靜態圖在開始執行前可以得到整個計算任務的所有信息,可以對DAG進行多輪優化。每輪優化都是輸入一個Job、得到一個新Job。
最后,根據分布式環境配置,將邏輯圖Job轉換為物理執行的計算圖Plan(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/plan.proto#L34)。在物理圖中,一個op可能分布在多個節點/進程。
啟動DAG計算需要調用Graph.__call__,這個函數的執行主要分以下幾個步驟:
__call__
_compile(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L221)?if not _is_compiled
build_graph(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L741)
__build_graph(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L759)
finish_complie_and_init_runtime(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L742)
__run(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L226)
邏輯圖編譯主要在__build_graph中進行。finish_complie_and_init_runtime會繼續做一些優化pass,然后構建物理圖、初始化運行時Actor系統。__run會啟動一次DAG的運算。
5.1 graph_build_context: 為邏輯圖編譯設置基本環境
在 Graph 中,build 函數里面的代碼執行都在 graph_build_context 的作用域下,這樣實現了動態轉靜態的功能。
__build_graph中的graph_build_context(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L851)雖然只有一行代碼,但卻做了幾件非常重要的事情。
首先在context作用域內設置全局的lazy_mode為True(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L46)。在這個context作用域內,所有op都由LazyInterpreter解釋執行。
其次,在JobBuildAndInferCtx(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L47)作用域內,JobBuildAndInferCtx_Open(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L57)調用類似如下C++代碼
// oneflow/api/python/job_build/job_build_and_infer.h// oneflow/core/job/job_build_and_infer_ctx_mgr.cpp// 如前所述,LazyJobBuildAndInferCtxMgr 在 MultiClientSessionContext::TryInit 執行時初始化。// LazyJobBuildAndInferCtxMgr mgr;mgr.OpenJobBuildAndInferCtx(job_name);
OpenJobBuildAndInferCtx會新建一個Job對象(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job_build_and_infer_ctx_mgr.cpp#L32)、一個LazyJobBuildAndInferCtx對象(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job_build_and_infer_ctx_mgr.cpp#L34)。LazyJobBuildAndInferCtx負責根據用戶定制的op等操作,修改Job,其中最主要的功能是添加新 Op。
5.2 __build_io:為計算圖添加input和output Op
self.__build_io("input",?graph_build_util.build_graph_input_arg,?*args,?**kwargs)
上面這行代碼(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L854-L856)的作用是,對于用戶傳遞給graph_mylinear(input)的input參數,針對其中的每個tensor都在邏輯計算圖中插入一個FeedInputOp(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/system_ops.h#L48)節點。也就是說,model的輸入(比如樣本tensor,具體參考4.3節),在靜態圖中也視為一個op操作。
__build_io內會用args(即input)和kwargs構造一個ArgsTree。ArgsTree 把 Python 下的輸入、輸出抽象成了一個樹,輸入、輸出可以是嵌套的 Tuple、List、Dict,元素是 Tensor,嵌套的結構剛好可以表示為樹,而 Tensor 是樹中的葉子節點。示例代碼中kwargs是空的。
遍歷ArgsTree,對args和kwargs的每個tensor都調用傳入的build_func,對于input來說,就是build_graph_input_arg(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L206)。后面會看到,model的output也會調用__build_io,所以這個函數名的意思應該就是對model的輸入、輸出進行靜態圖的構圖工作。
build_graph_input_arg內部會構造一個FeedInputOpExpr(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L213),提交給解釋器執行。因為是在lazy作用域內,由LazyInterpreter解釋執行(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L471),LazyInterpreter會將對應的op插入靜態圖。
附:build input時ArgsTree的內部結構
__build_io(input)?中 ArgsTree 的內部數據組織示意
_named_io_args: NamedArg
_value: tuple
[0]: NamedArg
_value: tuple of NamedArg
[0]: NamedArg
_value: args tensor from?Graph.__call__
[1]: NamedArg
_value: empty kwargs from?Graph.__call__
通過pdb命令可以查看變量:?p args_tree._named_io_args._value[0]._value[0]._value.to_numpy()
5.2.1 將op添加到邏輯圖
LazyInterpreter::ApplyImpl(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L471)在執行時,GetCurInferCtx()(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L500)返回的就是graph_build_context中OpenJobBuildAndInferCtx(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L57)創建的那個LazyJobBuildAndInferCtx對象,這個對象負責邏輯圖的構建。添加op的主要調用流程如下:
infer_ctx->AddAndInferConsistentOp(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L503)
AddAndInferOp(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job_build_and_infer_ctx.cpp#L563)
ConstructOp(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job_build_and_infer_ctx.cpp#L580)
CheckAndConstructOp(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/operator/operator.cpp#L1216)
NewObj(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/operator/operator.cpp#L51)
OperatorConf中,多種op配置共享op_type字段(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/operator/op_conf.proto#L412),protobuf oneof的op_type_case常量作為注冊NewObj的key。
系統預定義的op在oneflow/core/operator(https://github.com/Oneflow-Inc/oneflow/tree/release/v0.8.0/oneflow/core/operator)下,例如UserOp(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/operator/user_op.h#L24)。
AddAndInferOp將返回的Operator保存到LazyJobBuildAndInferCtx的字典中。后續的函數調用,主要是進行推導并修改靜態圖Job,使得各個節點構成一個DAG。
JobBuildAndInferCtx相關的類關系如下:
5.2.2 lazy tensor 和 eager tensor 的區別
LazyInterpreter::ApplyImpl的最后,會調用BuildTensor(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L518)構造一個lazy tensor,作為build_graph_input_arg的返回值(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L216)。所以__build_io返回的lazy_args(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L854)是lazy tensor,它將替代eager的args(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L828)(也就是用戶輸入的input)參與后續的計算圖構建。
那么lazy tensor和eager tensor的區別是什么呢?eager tensor是要即時計算的,所以需要真實數據;而lazy tensor僅在靜態圖編譯階段用于推導,只需要描述性質的元信息。靜態圖編譯是在lazy模式下運行,只是使用lazy tensor 做計算機構圖和校驗。
后面會看到,靜態圖的運行期已經沒有tensor的概念。運行期看到的只是更廣義的Regst存儲,可能代表tensor/blob,也可能是其它控制信息。靜態圖運行時的輸入,是直接讀取外部 eager tensor的內存數據到到regst;輸出應該是op寫到regst,通過blob構造eager tensor。
5.3 build: 將UserOp和FeedVariableOp添加到邏輯圖
__build_graph中的self.build()(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L861)會調用GraphMyLinear.build(),以及ModuleMyLinear.forward()。因為是在lazy模式下運行,matmul和add都會調用UserOpExpr重載版本的LazyInterpreter::ApplyImpl(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L832),進而調用AddAndInferConsistentOp(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L940)進行構圖操作。
需要說明的是,在引用Module的Parameter屬性時(如weight/bias),會觸發FeedVariableOp的構圖操作(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L226)、調用對應版本的LazyInterpreter::ApplyImpl(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L527)。這個是怎么執行的呢?
__build_graph中,在進入lazy模式之前,先調用了_create_states_builder(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L843)。其中self._state()(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L667)返回所有Module的所有Parameter(包括子Module)。
state_block的類型是TensorBlock(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L631)。所有的state_block的lazy_origin_builder().method(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L647)都被設置為調用build_graph_state(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L683-L688)。
給build_graph_state(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L220)設置個斷點能讓整個調用過程顯形,主要的調用棧如下:
-> out = graph_mylinear(input) /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/graph.py(221)__call__()-> self._compile(*args, **kwargs) /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/graph.py(741)_compile()-> _, eager_outputs = self.build_graph(*args, **kwargs) /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/graph.py(759)build_graph()-> outputs = self.__build_graph(*args, **kwargs) /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/graph.py(864)__build_graph()-> outputs = self.build(*lazy_args, **lazy_kwargs) /mnt/project/machine-learning/oneflow/oneflow/test.py(21)build()-> return self.model(input) /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(234)__call__()-> result = self.__block_forward(*args, **kwargs) /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(266)__block_forward()-> result = self._origin.__class__.forward(self, *args, **kwargs) /mnt/project/machine-learning/oneflow/oneflow/test.py(11)forward()-> return flow.matmul(input, self.weight) + self.bias /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(483)__getattr__()-> p_state = self._get_from_states(name, "_parameters") /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(521)_get_from_states()-> _s_block.try_build() /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(679)try_build()-> self._lazy_origin_builder.try_build(self) /usr/local/lib64/python3.6/site-packages/oneflow/nn/graph/block.py(627)try_build()-> self.result = self.method()> /usr/local/lib64/python3.6/site-packages/oneflow/framework/graph_build_util.py(227)build_graph_state()-> op_name, var_conf_str, ["in_0"], ["out_0"]
這個調用過程比較容易困擾的是,執行對象會在Grpah、GraphMyLinear、ModuleMyLinear、ModuleBlock之間切換。
前面在討論Graph的構造時已經提過,執行self.model(input)時,Graph.__getattr__返回的屬性model是ModuleBlock對象,所以實際調用的是ModuleBlock.__call__。
在這個函數內調用__block_forward(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L234),其中的_origin(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L266)是ModuleMyLinear,進入到它的forward方法,執行到flow.matmul(input, self.weight) + self.bias時,matmul 會被LazyOpInterpreter 所執行,在 LazyOpInterpreter 中調用 AddAndInferConsistentOp(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L503)
,在 Job 中添加一個 matmul operator。同理后面的加法會在 job 中添加一個 add operator。
self.weight 和 self.bias 會觸發調用ModuleBlock.__getattr__,進而調用_get_from_states(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L483),調用TensorBlock.try_build()(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/block.py#L521)。這里執行的就是進入lazy模式之前設置的build_graph_state(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/framework/graph_build_util.py#L220)。從而增加一個FeedVariableOp到計算圖(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L527)。為什么設置和調用會距離這么遠呢?主要是為了讓參數盡量和消費參數的 Operator 在一個作用域下,所以實現成了惰性求值來達到延遲計算的目的。
再后面的步驟就是調用__build_io(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L869-L875)插入FetchOutputOp(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp#L589)。也就是說,獲取model的output也是一個op。
到目前為止,前向計算圖就構建完成了。它的json表示可以參考附錄。net.op是計算圖的節點,通過input等屬性可以看出節點之間的連接關系。
示例代碼的前向計算圖如下。從這個圖可以看到,input、output、weights等都是op。
5.4 邏輯圖優化
在__build_graph中會調用CurJobBuildAndInferCtx_Complete對靜態圖進行多輪優化(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L923),對應的C++函數是LazyJobBuildAndInferCtx::Complete()(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/job_build_and_infer_ctx.cpp#L975)。
這之后生成的Job是full_job。本文的示例代碼比較簡單,并不是典型的計算場景,其forwar和ful計算圖的拓撲是一樣的。實際大部的圖優化都實現在這個階段,如 Op fusion、AMP、ZeRO、常量折疊等等。
到這里,邏輯圖構建的主體部分就結束了。
隨后會構建一個CNNGraph對象(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L947),對應的C++類型是NNGraph(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.h#L33)。這個對象將負責構建物理計算圖Plan。它也是整個運行時的擁有者和維護者。這個對象析構時,整個運行時也會有序終止并釋放資源。
5.5 物理圖的編譯
接下來就是執行finish_complie_and_init_runtime(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L742),其中的核心調用是self._c_nn_graph.complie_and_init_runtime()(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/python/oneflow/nn/graph/graph.py#L802),對應的C++函數是NNGraph::CompileAndInitRuntime(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L265)。
在這個函數中,JobCompleter().Complete()(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L280)會繼續對邏輯圖做幾輪修改優化,補全 Runtime 執行所需要的附加信息,Compiler().Compile()(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L285)將邏輯圖轉為分設備的物理圖,并繼續對Plan進行修改優化。
Plan的編譯是在master節點進行的(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L282)。master節點會將Plan通過gRPC推送給各個worker節點(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L308),worker節點從master拉取物理計算圖(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L310)。
之后調用NewRuntimeBuffers創建Buffer對象(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L322),Buffer應該是主要用于進程內的信息同步。
然后就準備初始化運行時了。
示例代碼生成的compiled_job和物理圖Plan的json參見附錄。
最終生成的compiled邏輯圖如下??蚣茏詣硬迦肓撕芏嘞到y控制節點。
5.6 Plan的結構
示例代碼輸出的Plan json數據見附錄。
Plan在邏輯上和compiled_job是等價的。這里主要關注task/op之間的關系。
Plan.task中的每個元素是一個task,其中的exec_sequence.exec_node對應job中的op,通常只有一個op(數組可以支持sub graph)。
exec_node.kernel_conf.op_attribute描述了op信息。其中op_conf包含op name信息。
kernel_conf.op_attribute.op_conf就是Job中的OperatorConf。
kernel_conf.op_attribute.arg_signature.bn_in_op2lbi體現了task/op之間的連接關系。
bn_in_op就是blob name in op,即op輸入的blob name。
以System-AutoTick-DstSubsetTick_21為例
{ "out": { "op_name": "System-AutoTick-DstSubsetTick_21", "blob_name": "out" }, "in_0": { "op_name": "System-EagerCriticalSection-Interface-End-Tick-19", "blob_name": "out" }, "in_1": { "op_name": "System-AutoTick-SrcSubsetTick_20", "blob_name": "out" }}
exec_node.bn_in_op2regst_desc_id在task層面體現了連接關系。這個map中的key表示輸入輸出,value是register id。
{"out": "29","in_0": "27","in_1": "28"}
task.produced_regst_desc描述了對應task生產的register,consumer_task_id是消費者,
produced_regst_desc.out.regst_desc_type.data_regst_desc.lbi2blob_desc.lbi就是這個register的logic blob id。
task.consumed_regst_desc_id描述了對應task消費的register信息
6?
運行時的初始化
NNGraph::CompileAndInitRuntime中,new Runtime這行代碼會初始化運行時(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/framework/nn_graph.cpp#L331)。主要做的事情包括:
創建Thread
通知Thread創建Actor,Actor會創建Regst和Kernel
給沒有輸入的source_tasks發送啟動信號kStart
6.1 Runtime創建Thread
在Runtime的構造函數中,DumpThreadIdsFromPlan(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L65)會將Plan中屬于當前進程的task的thread id存入thread_ids_變量。AddThreads創建這些Thread對象(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L69)。
Thread在構造時會創建一個物理線程(?https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/thread/thread.cpp#L39),線程執行的是PollMsgChannel方法(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/thread/thread.cpp#L44),Thread就是在這里持續等待需要處理的新消息。
Thread只處理兩類命令消息:線程終止消息,創建Actor的消息。其它消息交給Actor::ProcessMsg處理(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/thread/thread.cpp#L83)。
6.2 Runtime通知Thread創建Actor
在Runtime的構造函數中,tasks被分為兩類:source_tasks和other_tasks。在示例代碼中,source_tasks(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L84-L85)是沒有輸入邊的task。
從代碼邏輯看,在Plan proto中,task的consumed_regst_desc_id字段是一個map。如果這個map的所有key都是in_ctrl(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L54),這個task就是source_tasks。
一些source_tasks的示例如下:
System-Src-WaitAndSendIds_16
System-AutoTick-AppendDeviceTick_9
System-EagerCriticalSection-Interface-End-Tick-19
System-EagerCriticalSection-Interface-End-Tick-25
Runtime調用HandoutTasks函數(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L100-L101)會給ActorMsgBus發送構建Actor的kConstructActor消息(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/runtime.cpp#L49)。
6.3 ActorMsgBus和Thread的消息處理
從接口看,ActorMsgBus?(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor_message_bus.cpp#L24)負責消息的發送(Actor通過ActorMsgBus發送消息),Thread::PollMsgChannel(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/thread/thread.cpp#L60) 負責消息的接收和處理。
相關實體的協作關系如下
Actor是自調度的基本單元,接受消息然后工作,工作完后再繼續發送消息。
actor_id就是task_id,是在編譯Plan時就確定的。task是編譯時概念,actor是對等的運行時概念。
task_id有特定的編碼格式(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/graph/task_id.cpp#L21-L29),從中可以解析出machine_id(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/graph/task_id.cpp#L73)和thread_id(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/graph/task_id.cpp#L77)。
在跨網絡的整個物理圖Plan中,actor id相當于地址,通過它可以定位唯一的actor實體。
Actor 通過 ActorMsgBus::SendMsg(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor_message_bus.cpp#L24) 發送 ActorMsg(https://github.com/Oneflow-Inc/oneflow/blob/4856d691051accd72f13f4139d281e411977b297/oneflow/core/lazy/actor/actor_message.h#L34) 消息。
ActorMsg包含源和目的actor id(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor_message.h#L84-L85)。
如果是進程內通訊(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor_message_bus.cpp#L26),將通過 ActorMsgBus::SendMsgWithoutCommNet?(https://github.com/Oneflow-Inc/oneflow/blob/4856d691051accd72f13f4139d281e411977b297/oneflow/core/lazy/actor/actor_message_bus.cpp#L49)把 ActorMsg 朝目的 actor 所在的 thread 入隊消息(https://github.com/Oneflow-Inc/oneflow/blob/4856d691051accd72f13f4139d281e411977b297/oneflow/core/thread/thread.h#L40)。
Thread::EnqueueActorMsg 會判斷當前 thread 是否是 actor thread,如果是則入本地隊列,否則則入 actor thead 的 channel 隊列。
如果ActorMsg是跨進程消息,ActorMsgBus通過CommNet發送消息(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor_message_bus.cpp#L42-L44),接收方的CommNet應該會根據actor id獲得線程id,從ThreadMgr查到Thread,將消息交給Thread處理。
Thread::PollMsgChannel(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/thread/thread.cpp#L60) 負責消息的接收和處理。
如果線程本地隊列local_msg_queue_為空,則從thread的channel隊列中取出全部ActorMsg放入本地隊列(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/thread/thread.cpp#L63)。
從本地隊列中取出一個ActorMsg,然后開始處理。
處理一些特殊的kCmdMsg消息(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/thread/thread.cpp#L67-L79),然后普通消息交給Actor自行處理(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/thread/thread.cpp#L83)。
Actor收到消息后,會判斷是否滿足了Act的條件,如果滿足,則會執行Act,從而調用LaunchKernel執行計算,Act執行結束后通過ActorMsgBus發消息通知上下游Actor。
這些對象之間的消息傳遞關系如下圖所示
6.4 激活source Actor
目前的實現中,Actor全部是自調度的,只能接受來自其他Actor的消息。Actor中有一類比較特殊的source actors,它們與source tasks對應。
source actors 沒有上游 actor,它們會朝下游actor發送消息從而激活所有的Actor運行。
source actors 本身是如何執行的呢?它們在接受到 kStart 消息后就會一直 Act 直到進入退出流程。但是其 kernel 會阻塞在 Buffer(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/common/buffer.h#L26) 處,一直等待其他線程往 buffer 中添加數據后,阻塞會被激活,然后 kernel 執行讀取,kernel 完成后,actor 的 Act 結束,往下游發送消息。
source actors 由于會發生阻塞,所以其必須有單獨的 actor thread。
Runtime 初始化的的最后一步就是朝各 source actors 發送 kStart 消息用以激活它們,但 source actors 只有接受到 buffer 的數據后才會往下執行,然后朝下游 actors 發送消息,使所有的 actors 都執行起來。
7?
Actor
7.1 Actor的創建
Thread在創建Actor時,會先嘗試創建為LightActor(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L104),如果不成功,再嘗試用預先注冊的工廠創建Actor。
有幾種TaskType可以用于LightActor(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/light_actor.cpp#L677-L689):
kNormalForward,比如matmul、add等user op。
kCopyHd
kTick
kCollectiveBoxingGeneric
目前大約有20多種Actor的子類型。其它Actor類型根據TaskType(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/job/task.proto#L8)預先注冊。例如WaitAndSendIdsActor。
示例代碼的各個節點對應的actor類型參見附錄。
Actor相關的類關系如下(包含關系只是表示可以訪問到相關信息,并不意味著創建或著擁有該類型對象)
7.2 Actor的初始化
Actor的構造函數一般都是空的,構建之后需要執行Init(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L129)函數進行初始化。
LightActor繼承自ActorBase,不是Actor的子類,有自己的Init函數實現。這里只討論Actor的初始化。
在Actor::Init(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L129)中,首先調用ConstructKernel(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L138)創建kernel實例。和Operator類似,kernel也是以OpTypeCase作為注冊的key,例如WaitAndSendIdsKernel(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/kernel/wait_and_send_ids_kernel.cpp#L51)。一個Actor通常只有一個kernel。
之后調用NewRegsts創建Regst(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L152)。Tensor是用戶側的概念。對應的運行時概念是Regst(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/register/register.h#L24),它持有Kernel需要讀寫的內存。Regst的概念比Tensor更寬泛,比如框架自動添加的控制Op也會用到Regst。
Actor將自己創建的Regst保存到produced_regsts_(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L153)。
TakeOverNaiveConsumed(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L182)只記錄需要消費的regst id,但并不push到consumed_regsts_。
TakeOverNaiveProduced(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L183)既記錄生產的regst id,也push到naive_produced_rs_(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L249)。這種區別是為了首次執行計算時,actor能順利執行。后面分析Actor的消息處理時會再回過頭來討論一下。
調用InitBnInOp2BlobInfo會初始化BlobInfo(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L184)。
之后就是調用VirtualActorInit(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.cpp#L185),這里允許各個Actor子類定制自己的初始化邏輯。通常會調用OF_SET_MSG_HANDLER宏(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.0/oneflow/core/lazy/actor/actor.h#L76-L80)設置Actor的消息處理函數。
7.3 Actor的消息處理
LightActor 首先會根據消息類型分別處理 kRegstMsg 和 kEordMsg 消息。HandleRegstMsg(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/lazy/actor/light_actor.cpp#L424) 中根據 RegstMsg 的 type (kProduced 或 kComsumed) 來分別處理各種讀寫狀態計數。
然后判斷讀寫計數是否達到了判斷條件,如果達到了意味著滿足了讀寫 regst 的條件,然后就 執行 ActOnce(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/lazy/actor/light_actor.cpp#L451)。
LightActor::ActOnce 會在第一次執行時去 InitBnInOp2Blob 和 InitActMsg。InitBnInOp2Blob 初始化 resgt 中的 bn 與 Blob 的映射關系,為 kernel 提供通過 bn 訪問 Blob 的功能。InitActMsg 會初始化好所有需要發送的消息避免后繼發消息時重復的構建消息。
然后就是 LaunchKernel,接著會 ResetState 重置 regst 狀態。
LaunchKernel 后就會把之前構建好的消息發送出去,同步消息會直接入隊 thread 消息隊列,異步消息通過 callback 發送到 ActorMsgBus。
普通 Actor::ProcessMsg 會調用 msg handler 來處理消息,最常見的 msg handler 就是 Actor::HandlerNormal(https://github.com/Oneflow-Inc/oneflow/blob/release/v0.8.1/oneflow/core/lazy/actor/actor.cpp#L329)。
Actor::HandlerNormal 中流程跟 LightActor 中類似,會根據不同的 regst 類型來分別處理,Actor 中對 regst 的狀態管理方式與 LightActor 不同,LightActor 中的方式更加高效,Actor 中能處理一些特殊情況。
消息處理完畢后,就會調用 ActUntilFail,ActUntilFail 會判斷 IsReadReady 和 IsWriteReady 來決定是否可以進行 Act。
最常見的 NaiveActor::Act() 就是執行 AsyncLaunchKernel。
Act 完成后,就開始朝上下游發送 regst 消息。
還有一些特殊的 Actor,我們以WaitAndSendIdsActor為例,觀察一下這類Actor的消息處理機制。
之所以選擇這個例子,一是這個Actor比較簡單;二是這是一個典型的source task,想看一下計算圖是怎么被觸發啟動計算的。
Thread收到的消息如果不是kStopThread或kConstructActor,就調用Actor::ProcessMsg(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L83),將消息轉給Actor處理。
ProcessMsg函數只是簡單的將消息轉給handler處理(https://github.com/Oneflow-Inc/oneflow/blob/b6bf3f8843679111eb1edf79deefce814d250f4e/oneflow/core/lazy/actor/actor.h#L38)。
WaitAndSendIdsActor::VirtualActorInit中,handler被設置為HandlerWaitToStart(https://github.com/Oneflow-Inc/oneflow/blob/22f70a1719f371a54512633bb92086580d9c3c89/oneflow/core/lazy/actor/wait_and_send_ids_actor.cpp#L53)。
Runtime的構造函數中,發送的第一批消息是給source_tasks的kStart消息,這個消息就由HandlerWaitToStart函數處理。
HandlerWaitToStart校驗消息類型后,將handler設置為HandlerNormal(https://github.com/Oneflow-Inc/oneflow/blob/b17a9cd6b930b5817c63623fb682bd708377a93b/oneflow/core/job/runtime.cpp#L109)(這也是大部分Actor的默認handler),然后調用ProcessMsg(https://github.com/Oneflow-Inc/oneflow/blob/22f70a1719f371a54512633bb92086580d9c3c89/oneflow/core/lazy/actor/wait_and_send_ids_actor.cpp#L74),實際就是調用新設置的handler HandlerNormal。
HandlerNormal中,如果是kCmdMsg,只允許是kStart(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L377)。通過消息類型校驗后,會直接調用ActUntilFail(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L378)。
7.4 Act執行的條件
LightActor 和 Actor 判斷能否進行 Act 采用了不同的策略,LightActor 的效率更高,Actor 能處理一些特殊情況。
對于 LightActor,當在讀的register計數 total_reading_cnt_ 歸 0,可消費的register計數 ready_consumed_ 增加到 max_ready_consumed_,前者表示所有的消費者已經讀取當前 LightActor 的 Regst,后者表示當前 LightActor 消費的所有 Regst 已經到達(由上游發送的 Regst 消息)。
對于 Actor,Actor::ActUntilFail中,Act方法(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L424)是各個子類自己實現的,一般主要是啟動kernel計算。
但是在執行Act之前,需要先確認:
Act執行依賴的數據是否都已經就緒?(IsReadReady)
Act生產出來的數據,消費方是否已經用完、并收到ack消息確認?(IsWriteReady)
Actor有4個與此相關的成員變量
RegstSlot naive_produced_rs_;
RegstSlot inplace_produced_rs_;
RegstSlot naive_consumed_rs_;
RegstSlot inplace_consumed_rs_;
xx_produced_rs_存儲的是當前Actor的下游consumer返回的、已經使用完畢的ack regst信息。(當前Actor生產的Regst存儲在produced_regsts_中。)
運行時在初始化的過程中,所有Actor都沒有運行過,任何Actor都不可能收到ack消息,所以在Actor初始化時,要預先填充xx_produced_rs_,這樣才能保證Actor在首次運行前是WriteReady的,才能順利啟動執行。
xx_consumed_rs_存儲的是上游依賴發來的數據。它不需要預先填充。因為source_tasks沒有輸入依賴,自然就是ReadReady的;而xx_produced_rs_在初始化時的預先填充又保證它是WriteReady的,所以source_tasks可以直接運行。source_tasks的輸出消息發給下游,下游也會變為ReadReady,而下游在初始化后也保證是WriteReady的。整個Actor系統就可以這樣運轉起來了。
7.5 Actor上下游之間的通知機制
Act執行完畢后,需要將結果數據發給下游consumer。以 WaitAndSendIds 的 Naive Produced 為例,ActUntilFail中的調用流程如下:
AsyncSendNaiveProducedRegstMsgToConsumer(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L427)
VirtualAsyncSendNaiveProducedRegstMsgToConsumer(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L441)
HandleProducedNaiveDataRegstToConsumer(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L446)
HandleRegstToConsumer(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L577)
EnqueueAsyncMsg(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L523)
如果目標線程是當前線程,ActorMsgBus::SendMsg(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L662)
否則,將消息加入async_msg_queue_(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L664)
增加 total_reading_cnt_(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L526)(這個變量表示已經發消息給下游、但未收到的ack數量)
naive_produced_rs_.PopFrontRegsts(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L581)
AsyncSendProducedCtrlRegstMsgToConsumer
注意naive_produced_rs_.PopFrontRegsts(https://github.com/Oneflow-Inc/oneflow/blob/06a6af1c7f760ba4b12d2dfb8f73d7fda5c7dbab/oneflow/core/lazy/actor/register_slot.cpp#L53)會將Regst指針從隊列中刪掉,相應的可用(https://github.com/Oneflow-Inc/oneflow/blob/06a6af1c7f760ba4b12d2dfb8f73d7fda5c7dbab/oneflow/core/lazy/actor/register_slot.cpp#L49)register計數減1(https://github.com/Oneflow-Inc/oneflow/blob/06a6af1c7f760ba4b12d2dfb8f73d7fda5c7dbab/oneflow/core/lazy/actor/register_slot.cpp#L49)。
而在Actor::HandlerNormal中處理收到的kRegstMsg消息(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L340)時,如果是consumer發來的ack消息,會調用TryUpdtStateAsProducedRegst(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L355),將Regst再添加到 naive_produced_rs_ 中(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L654),以保證當前Actor在收到所有ack后是WriteReady的;同時遞減在讀的 register 計數total_reading_cnt_。
Actor對依賴的上游消息的處理是類似的。通過以下函數調用給上游發送ack消息、通知 register 已經用完,可以繼續更新了:
AsyncSendNaiveConsumedRegstMsgToProducer(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L431)
AsyncRetInplaceConsumedRegstIfNoConsumer(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L432)在Actor::HandlerNormal中收到kRegstMsg消息后,將消息添加到consumed_rs_(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L344),以保證當前Actor在收到所有依賴數據后是ReadReady的。
LightActor有自己的消息處理機制(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/light_actor.cpp#L299),大致原理應該是差不多的。
7.6 Act執行的動作
根據上述討論,Actor收到kRegstMsg后也會進入ActUntilFail執行。如果讀寫都是Ready,就執行Act(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L424)。以WaitAndSendIdsActor為例,主要調用鏈路如下:
AsyncLaunchKernel(https://github.com/Oneflow-Inc/oneflow/blob/22f70a1719f371a54512633bb92086580d9c3c89/oneflow/core/lazy/actor/wait_and_send_ids_actor.cpp#L58)
ek.kernel->Launch(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L562),啟動Kernel計算
Forward(https://github.com/Oneflow-Inc/oneflow/blob/eae9ff38f074479d79ce24b0f6e0594f82126171/oneflow/core/kernel/kernel.cpp#L52)
ForwardDataContent(https://github.com/Oneflow-Inc/oneflow/blob/eae9ff38f074479d79ce24b0f6e0594f82126171/oneflow/core/kernel/kernel.cpp#L65)
buffer->Pull(https://github.com/Oneflow-Inc/oneflow/blob/b17a9cd6b930b5817c63623fb682bd708377a93b/oneflow/core/kernel/wait_and_send_ids_kernel.cpp#L40)
給regst的存儲地址mut_dptr賦值(https://github.com/Oneflow-Inc/oneflow/blob/b17a9cd6b930b5817c63623fb682bd708377a93b/oneflow/core/kernel/wait_and_send_ids_kernel.cpp#L47)
buffer->Pull會等待條件變量的通知(https://github.com/Oneflow-Inc/oneflow/blob/49f60e682518436dfeb37344a15902a959e0e4f2/oneflow/core/common/buffer.h#L60)?,F在,看上去所有Actor都已準備就緒,只等發令槍一響就開跑了。
8?
啟動靜態圖的計算
Graph.__run(https://github.com/Oneflow-Inc/oneflow/blob/81edd938826a7ea903174d682348847658b64653/python/oneflow/nn/graph/graph.py#L226)會扣動發令槍的板機,啟動計算圖的一輪計算。
主要調用流程如下:
RunLazyNNGraph(https://github.com/Oneflow-Inc/oneflow/blob/81edd938826a7ea903174d682348847658b64653/python/oneflow/nn/graph/graph.py#L1076)
builder->LaunchLazyJob(https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/nn_graph.cpp#L568)
LaunchLazyJobInstructionType(https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/instructions_builder.cpp#L179)
Buffer::Push(https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/instructions_builder.cpp#L179)
這里的Buffer::Push就是WaitAndSendIdsKernel在等待的起跑信號。
9?
運行時的退出機制
整個運行時包含很多對象和資源,安全有序的退出是龐雜而又細致的工作。這里僅以WaitAndSendIds為例,從一個側面觀察一下運行時的退出機制。
運行時的退出始于NNGraph對象的析構(https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/nn_graph.cpp#L76)。
9.1 Actor的退出
NNGraph在析構時,會關閉所有的Buffer對象(https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/nn_graph.cpp#L82)。
Buffer在關閉時,會設置is_closed_ = true并通知所有監聽者(https://github.com/Oneflow-Inc/oneflow/blob/49f60e682518436dfeb37344a15902a959e0e4f2/oneflow/core/common/buffer.h#L81)。但是Pull會繼續處理完已經提交的計算。
所以,Buffer應該是主要用于進程內的通信和異步協調的一個類。
WaitAndSendIdsKernel這時候正在等待新一輪計算開始(https://github.com/Oneflow-Inc/oneflow/blob/b17a9cd6b930b5817c63623fb682bd708377a93b/oneflow/core/kernel/wait_and_send_ids_kernel.cpp#L40),結果收到Pull返回的kBufferStatusErrorClosed(https://github.com/Oneflow-Inc/oneflow/blob/49f60e682518436dfeb37344a15902a959e0e4f2/oneflow/core/common/buffer.h#L61)。
WaitAndSendIdsActor::IsCustomizedReadReady以后就一直返回false(https://github.com/Oneflow-Inc/oneflow/blob/22f70a1719f371a54512633bb92086580d9c3c89/oneflow/core/lazy/actor/wait_and_send_ids_actor.cpp#L68),IsReadReady也返回false(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L533)。
這之后,ActUntilFail只會執行異步消息發送(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L437)(不再進入while循環)
WaitAndSendIdsActor::HandlerNormal仍然會處理其它Actor發來的消息(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L340)。但因為IsCustomizedReadReady返回false,會進入AsyncSendEORDMsgForAllProducedRegstDesc(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L394)執行。它會給每個下游發送kEordMsg消息(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L614)。
Actor在收到上游發來的kEordMsg消息后,遞減remaining_eord_cnt_(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L331)。
remaining_eord_cnt_被初始化為Actor的輸入regst的數量(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L171)。
total_reading_cnt_是當前Actor生產的、已經發給consumer、但尚未收到ack的消息數量。
Actor目前仍可以正常接收consumer發來的ack消息。
當上述2個變量都為0時(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L395),意味著所有上游都發出了kEordMsg消息,也收到了所有下游的ack消息。Actor就給Thread返回1(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L397)。
如果上述兩個變量有不為0的,就修改handler,由HandlerZombie(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/lazy/actor/actor.cpp#L399)處理后續收到的消息。
Thread收到Actor返回的1后(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L84),將它從自己的存儲中刪除(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L89),并遞減運行Actor的數量。
9.2 Thread的退出
NNGraph重置runtime_導致運行時對象被析構(https://github.com/Oneflow-Inc/oneflow/blob/8f672eea116cae4a73bb7309e7496b08d7ec9a32/oneflow/core/framework/nn_graph.cpp#L83)。
Runtime刪除所有Thread(https://github.com/Oneflow-Inc/oneflow/blob/b17a9cd6b930b5817c63623fb682bd708377a93b/oneflow/core/job/runtime.cpp#L117)。
ThreadMgr給所有Thread發送kStopThread消息(https://github.com/Oneflow-Inc/oneflow/blob/c8c6d351fa28c5ebce948d69c06670a783f83f74/oneflow/core/thread/thread_manager.cpp#L64)。同時,重置指針導致Thread析構(https://github.com/Oneflow-Inc/oneflow/blob/c8c6d351fa28c5ebce948d69c06670a783f83f74/oneflow/core/thread/thread_manager.cpp#L66)。
Thread的物理線程退出PollMsgChannel循環(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L68)。
Thread等待物理線程結束,關閉channel(https://github.com/Oneflow-Inc/oneflow/blob/55b822e4d3c88757d11077d7546981309125c73f/oneflow/core/thread/thread.cpp#L52)。
10?
分布式場景的靜態圖
分布式的compile_job、物理圖Plan和單機場景有明顯變化。
比如,每個進程都有一套WaitAndSendIds等控制節點。這也容易理解,因為每個節點都要執行__run和Buffer::Push/Pull,都要啟動本進程的Actors執行計算。
matmul和broadcast_add等user op也會在兩個節點進行計算。 ?
10.1 示例代碼
啟動方式參考Global Tensor的官方文檔。
import oneflow as flowimport oneflow.nn as nnP0 = flow.placement("cpu", ranks=[0, 1])a0_sbp = flow.sbp.split(0)class ModuleMyLinear(nn.Module): def __init__(self, in_features, out_features): super().__init__() self.weight = nn.Parameter(flow.randn(in_features, out_features, placement=P0, sbp=flow.sbp.broadcast)) self.bias = nn.Parameter(flow.randn(1, out_features, placement=P0, sbp=flow.sbp.broadcast)) def forward(self, input): return flow.matmul(input, self.weight) + self.biaslinear_model = ModuleMyLinear(4, 3)class GraphMyLinear(nn.Graph): def __init__(self): super().__init__() # ModuleBlock self.model = linear_model def build(self, input): # ModuleBlock.__call__ return self.model(input)graph_mylinear = GraphMyLinear()input = flow.randn(5, 4, placement=P0, sbp=flow.sbp.split(1))out = graph_mylinear(input)print(out)
11?
附錄
11.1 斷點
11.1.1 Python斷點示例
# python3 -m pdb test.pybreak test.py:25break oneflow/nn/graph/graph.py:221break oneflow/nn/graph/graph.py:741break oneflow/nn/graph/graph.py:745break oneflow/nn/graph/graph.py:759break oneflow/nn/graph/graph.py:828break oneflow/nn/graph/graph.py:777break oneflow/nn/graph/graph.py:1066break oneflow/nn/graph/graph.py:1133break oneflow/framework/graph_build_util.py:227
11.1.2 C++斷點示例
啟動命令
source /mnt/oneflow/build/source.shgdb --args python3 /mnt/oneflow/test.py# set breakpoints# run
斷點示例
set breakpoint pending onbreak oneflow::ActorMsg::BuildEordMsgbreak oneflow/core/common/buffer.h:80break oneflow::(anonymous namespace)::CheckAndConstructOpbreak oneflow::WaitAndSendIdsActor::Actbreak oneflow::WaitAndSendIdsActor::HandlerWaitToStartbreak oneflow/core/lazy/actor/light_actor.cpp:452break oneflow/core/lazy/actor/light_actor.cpp:485break oneflow::ForeignInputKernel::ForwardDataContentbreak oneflow::vm::LaunchLazyJobInstructionType::Compute
11.2 靜態圖的json表示
forward(https://quip.com/OMc4A0HOOr0C)
full(https://quip.com/JLaMAHGBLXmK)
compiled(https://quip.com/tXjuAiS3J0Ab)
plan(https://quip.com/a0DMAAIte6PQ)
11.3 actor type
naive_actor
System-AutoTick-AppendDeviceTick_9System-AutoTick-DstSubsetTick_12System-AutoTick-DstSubsetTick_21System-AutoTick-DstSubsetTick_27System-AutoTick-Prepend-DeviceTick_7System-AutoTick-SrcSubsetTick_20System-AutoTick-SrcSubsetTick_26System-AutoTick-SrcSubsetTick_8System-AutoTick-Tick_11System-AutoTick-Tick_13System-EagerCriticalSection-Callback-23System-EagerCriticalSection-Callback-29System-EagerCriticalSection-Interface-Begin-Tick-18System-EagerCriticalSection-Interface-Begin-Tick-24System-EagerCriticalSection-Interface-End-Tick-19System-EagerCriticalSection-Interface-End-Tick-25System-EagerCriticalSection-Wait-22System-EagerCriticalSection-Wait-28
light_actor
_GraphMyLinear_0_input.0.0_2_GraphMyLinear_0_output.0.0_2model.biasmodel-broadcast_add-1model-matmul-0model.weightSystem-AutoTick-SinkTick_15System-SyncAllRanksSinkTick_14
wait_and_send_ids_actor
???System-Src-WaitAndSendIds_16
call_back_notify_actor
???System-Sink-CallbackNotify_17
12?
參考資料
oneflow v0.8.0(https://github.com/Oneflow-Inc/oneflow/tree/release/v0.8.0)
OneFlow框架的系統設計(上篇)(https://zhuanlan.zhihu.com/p/337851255)
OneFlow框架的系統設計(中篇)(https://zhuanlan.zhihu.com/p/338699487)
OneFlow框架的系統設計(下篇)(https://zhuanlan.zhihu.com/p/339208452)
一個Job在OneFlow中的執行過程—上篇(https://zhuanlan.zhihu.com/p/344531540)
一個Job在OneFlow中的執行過程—中篇(https://zhuanlan.zhihu.com/p/355654002)
一個Job在OneFlow中的執行過程—下篇(https://zhuanlan.zhihu.com/p/363689736)
靜態圖模塊 nn.Graph(https://docs.oneflow.org/master/basics/08_nn_graph.html)
OneFlow系統設計(https://docs.oneflow.org/v0.4.0/basics_topics/essentials_of_oneflow.html)
torch.nn.Module(https://pytorch.org/docs/1.10/generated/torch.nn.Module.html)
其他人都在看
OneFlow源碼解析:自動微分機制
ChatGPT的一小步,NLP范式轉變的一大步
李白:你的模型權重很不錯,可惜被我沒收了
OpenAI掌門Sam Altman:AI下一個發展階段
32篇年度最佳AI論文;Python編譯器Codon開源
比快更快,開源Stable Diffusion刷新作圖速度
OneEmbedding:單卡訓練TB級推薦模型不是夢
關鍵詞: