# Dynamic Graph on Fluid

PaddlePaddle Fluid targets autodiff without a tape, which is very challenging, and we are still a long way from there. DyNet and PyTorch provide a good design idea, the tape, that significantly eases the challenge. DyNet also provides a C++ API that is as convenient as Python but more efficient, and that integrates conveniently with industrial/production systems. This package, `tape`, combines the best of:

- the tape from PyTorch and DyNet;
- the C++ API and core from DyNet;
- the rich set of operators from PaddlePaddle.
## Overview

We can implement a DyNet-like `Tape` (see this survey) by wrapping Paddle Fluid's `Operator` and `Variable`. The user API is straightforward since:

- it is imperative and uses the host language's control-flow logic;
- it avoids extra concepts such as `Scope` and `Executor`.

All of these benefits come at the cost of adding just one line, `reset_global_tape`, at every iteration.
## Code Structure

In short, the `Tape` contains a vector of `OpHandle`s, and an `OpHandle` contains its type, the pointers to the `Variable`s, and the necessary attributes.
```c++
class Variable;
using VariableHandle = std::shared_ptr<Variable>;

class Variable {
 public:
  VariableHandle Grad();  // returns its gradient variable

 private:
  framework::VarDesc desc_;   // compile-time InferShape; necessary for lazy execution
  framework::Variable var_;   // run-time variable, holds the data memory
};

struct OpHandle {
  std::string type_;
  std::map<std::string, std::vector<VariableHandle>> inputs_;
  std::map<std::string, std::vector<VariableHandle>> outputs_;
  framework::AttributeMap attrs_;
};

class Tape {
 public:
  void AddOp(OpHandle);  // add an op to the tape
  void Forward();        // execute the tape_
  void Backward();       // execute the backward pass of the tape_

 private:
  std::vector<OpHandle> tape_;
};
```
We use `Function` to represent layers. A `Function` takes care of parameter initialization and of adding ops to the tape when it is called.

## User API
```c++
// Model functions
paddle::tape::Linear linear1(3, 3, "relu");  // init weight and bias
paddle::tape::Linear linear2(3, 3, "relu");  // init weight and bias
paddle::tape::Mean mean;

// Optimizer
paddle::tape::SGD sgd(0.001);

// Data feeder
paddle::tape::Fill data_feeder(...);
VariableHandle input(new paddle::tape::Variable("input"));
VariableHandle label(new paddle::tape::Variable("label"));

for (int i = 0; i < 2; ++i) {
  reset_global_tape();

  data_feeder(input, label);

  auto loss = softmax(linear2(linear1(input)), label);  // compile-time InferShape & InferVarType
  LOG(INFO) << loss.value();  // run forward up to loss

  // Run backward; the gradient of w is stored at w->Grad()
  get_global_tape().Backward(loss);

  // Update w
  sgd(linear1.Params());
  sgd(linear2.Params());
}
```
## Code Reuse

We want to stay as close to Paddle Fluid as possible.

### Reuse All Operators

Since all ops are registered in `OpInfoMap`, the effort of adding a new `Function` is about 10 lines of code, similar to exposing an operator to Python.
### Reuse Compile-Time InferShape and InferVarType

All the symbolic information is stored in `tape::Variable::desc_` instead of `ProgramDesc.block.vars`, so we create a temporary `BlockDesc` to run `InferShape` and `InferVarType` every time we `AddOp` to the tape.
### Reuse Operator::Run

We use smart pointers, instead of a `Scope`, to manage memory, so we create a temporary `Scope` for every `Operator::Run()`.
## Possible Features

### Release Memory on Backward

We can release memory aggressively: during backward, we can delete an `OpHandle` once we have finished its backward pass. Since all variables are managed by smart pointers, the memory is automatically released when a variable's reference count drops to 0.
### Kernel Fusion

Since a symbolic representation of the tape is constructed before the actual execution, it is possible to perform graph optimizations. One use case is kernel fusion.