Learning to use PyTorch is easy, but have you ever dug into its internals? Recently, Christian, who has 14 years of ML experience, gave a talk introducing the kernel mechanisms of PyTorch. Although this knowledge is not required in everyday use, exploring the PyTorch kernel can greatly improve our intuition and understanding of the code, and it is often the experts who dig into the underlying implementation.
As its builders describe it, PyTorch's philosophy is imperative: computational graphs are built and run on the fly. This fits Python's own programming style, where code can be run in a Jupyter Notebook as soon as it is defined. As a result, PyTorch's workflow is very close to that of Python's scientific computing library NumPy.
Christian shows that much of what makes PyTorch so convenient comes from its "genes", its inner workings. The talk does not describe how to use PyTorch's basic modules or how to train a neural network with PyTorch. Instead, Christian focuses on introducing PyTorch's core mechanisms in an intuitive form, that is, how each module works under the hood.
Christian said on Reddit that, due to a recording problem, a video of this talk cannot be uploaded, so for now he can only share the slides. But Christian will be giving another talk on this topic soon, so we can look forward to a future video introduction to PyTorch internals.
Speech PPT address: https://speakerdeck.com/perone/pytorch-under-the-hood
Baidu cloud address: https://pan.baidu.com/s/1aaE0I1geF7VwEnQRwmzBtA
The main agenda of the talk is as follows; it introduces the underlying mechanisms mainly from the perspective of tensors and the JIT compiler:
Before discussing the mechanism of each PyTorch component, we need to understand the overall workflow. PyTorch uses a paradigm called imperative or eager execution: each line of code builds a part of the complete computational graph. Even before the complete graph has been constructed, we can execute these small graphs independently as components. This style of dynamic computational graph is called the "define-by-run" approach.
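A minimal sketch of this define-by-run style: each statement below executes immediately and records its own small piece of the graph, with no separate graph-definition step.

```python
import torch

# Each line executes immediately; no full graph needs to be declared first.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x * 2).sum()   # a small graph (mul -> sum) is built and run on the fly
y.backward()        # gradients flow back through the graph recorded so far
print(x.grad)       # tensor([2., 2., 2.])
```

Because the graph is recorded as the code runs, ordinary Python control flow (loops, conditionals) works transparently.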
In fact, beginners can start using PyTorch after understanding just this overall flow, but knowing the underlying mechanisms helps us understand and control the code.
Conceptually, a tensor is a generalization of vectors and matrices; a tensor in PyTorch is a multidimensional array whose elements all share the same data type. Although PyTorch's interface is Python, the underlying layer is mainly implemented in C++, and in Python, integrating C++ code is usually called an "extension".
Tensors mainly carry data and perform calculations. PyTorch's tensor computation is built on ATen, its lowest-level tensor operation library, and its automatic differentiation engine Autograd is also built on top of the ATen framework.
In order to define a new Python object type in C/C++, you need to define a structure like THPVariable below. The first field, the PyObject_HEAD macro, standardizes Python objects: it expands to a structure containing a pointer to the object's type and a reference-count field.
The Python API provides two macros, Py_INCREF() and Py_DECREF(), that are used to increment and decrement the reference count of Python objects.
In Python, everything is an object: variables, data structures, and functions alike.
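Reference counting can be observed from the Python side as well. This sketch uses `sys.getrefcount` to show the counter that Py_INCREF()/Py_DECREF() manipulate under the hood (note that passing the object to `getrefcount` itself adds one temporary reference):

```python
import sys

a = []                      # a new list object
print(sys.getrefcount(a))   # at least 2: the name `a` plus the call's temporary reference

b = a                       # binding another name does Py_INCREF under the hood
print(sys.getrefcount(a))   # one higher than before

del b                       # Py_DECREF under the hood; the count drops again
```

When the count reaches zero, CPython frees the object immediately, which is why zero-copy sharing (discussed next) must be careful about object lifetimes.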
Since NumPy arrays are so widely used, we often need to convert between NumPy arrays and PyTorch tensors. PyTorch therefore provides two methods, from_numpy() and numpy(), to convert between the two.
Because tensor storage can be large, copying the data during the above conversion would be very expensive in memory. One advantage of PyTorch tensors is that from_numpy() keeps a pointer to the internal NumPy array data instead of copying it. This means the tensor shares the same memory region as the NumPy array object.
This zero-copy scheme does save a lot of memory, but it makes the distinction between in-place and standard operations important. If you use np_array = np_array + 1.0, a new array is created and the memory seen by torch_array will not change; but if you use np_array += 1.0, the shared buffer is modified in place and torch_array will change as well.
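A small sketch of the two cases. `from_numpy` shares the buffer, so only the in-place `+=` is visible through the tensor:

```python
import numpy as np
import torch

np_array = np.ones(3)
torch_array = torch.from_numpy(np_array)  # shares memory, no copy

np_array += 1.0            # in-place: the shared buffer is mutated
print(torch_array)         # tensor([2., 2., 2.], dtype=torch.float64)

np_array = np_array + 1.0  # rebinds the name to a NEW array; shared buffer untouched
print(torch_array)         # still tensor([2., 2., 2.], dtype=torch.float64)
```

The same asymmetry applies in the other direction via tensor.numpy(), which also returns a view of the tensor's memory.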
CPU/GPU memory allocation
The actual raw data of a tensor is not stored directly in the tensor structure, but in a separate "Storage", which the tensor structure points to. Through an Allocator, tensor storage can live either in main memory (CPU) or in video memory (GPU).
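One consequence of this tensor/storage split can be seen directly from Python: a view is a new tensor structure over the same underlying storage. A small sketch:

```python
import torch

t = torch.arange(6, dtype=torch.float32)
v = t.view(2, 3)                     # a view: new tensor metadata, same storage
print(t.data_ptr() == v.data_ptr())  # True: both point at the same raw buffer

# Moving to another device asks a different Allocator for memory
# (only works if a GPU is actually present):
# t_gpu = t.to("cuda")
```

Mutating `v` in place would therefore also change `t`, since they share one storage.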
THE BIG PICTURE
Finally, the abstraction of PyTorch's THTensor structure can be summarized as in the following figure. The main part of THTensor is the tensor metadata, which keeps information such as sizes, strides, dimensions, and offsets, and it also points to the THStorage that holds the data.
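The metadata mentioned above (sizes, strides, offsets) is exposed on every tensor, which makes the structure easy to inspect. A small sketch:

```python
import torch

t = torch.zeros(2, 3)
print(t.size())             # torch.Size([2, 3])
print(t.stride())           # (3, 1): stepping one row skips 3 elements in storage
print(t.storage_offset())   # 0 for a freshly allocated tensor

row = t[1]                  # slicing adjusts the metadata, not the data
print(row.storage_offset()) # 3: row 1 starts 3 elements into the storage
```

This is why many operations (view, slice, transpose) are cheap: they only rewrite this metadata while reusing the same storage.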
Because PyTorch runs eagerly, it is easy to debug and inspect code. PyTorch 1.0 introduced torch.jit, a set of compiler tools whose main goal is to bridge the gap between research and production deployment. The JIT includes a language called Torch Script, a subset of Python. Code written in Torch Script can be heavily optimized and can be serialized for later use from the C++ API.
The eager mode commonly used with Python is shown below, alongside script mode, which can also be run directly. Eager mode is suitable for prototyping and experimentation, while script mode is suitable for optimization and deployment.
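A minimal sketch of script mode: the `torch.jit.script` decorator compiles a function written in the Torch Script subset of Python into a graph that can be optimized and serialized, while still being callable like an ordinary function.

```python
import torch

@torch.jit.script
def relu_sum(x: torch.Tensor) -> torch.Tensor:
    # Torch Script: a statically analyzable subset of Python
    return torch.relu(x).sum()

x = torch.tensor([-1.0, 2.0, 3.0])
print(relu_sum(x))      # tensor(5.)
print(relu_sum.graph)   # the intermediate representation the JIT works on
```

The same function still runs in eager mode if the decorator is removed, which is what makes the two modes easy to move between.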
So why use TorchScript? Christian gives the following reasons:
PyTorch JIT main process
As shown below, the JIT takes as input either source code or a Python abstract syntax tree (AST), where the AST represents the syntactic structure of the Python source as a tree. Parsing produces the syntax structure and computational graph; this is followed by syntax checking, then code optimization, and finally compilation and execution.
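The AST stage can be illustrated with Python's own `ast` module, which parses source code into the same kind of tree structure the JIT consumes. A small sketch:

```python
import ast

source = "y = x * 2 + 1"
tree = ast.parse(source)
# The assignment becomes a tree: Assign -> BinOp(Add) -> BinOp(Mult), ...
print(ast.dump(tree, indent=2))
```

The JIT walks a tree like this to build its intermediate representation before running any optimization passes.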
These optimizations rewrite the computational graph, for example by unrolling loops. In the peephole optimization shown below, the compiler improves the generated code only within one or a few basic blocks, applying transformation rules based on the characteristics of the instructions. In this way, peephole optimization improves code performance through local analysis and instruction rewriting.
For example, transposing a matrix twice yields the matrix itself, so the pair of transposes shown below should be optimized away.
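A sketch of that case: the double transpose below is semantically a no-op, exactly the kind of pattern a peephole pass can eliminate from the scripted graph.

```python
import torch

@torch.jit.script
def double_transpose(x: torch.Tensor) -> torch.Tensor:
    return x.t().t()   # transposing twice is a no-op the JIT can eliminate

x = torch.randn(3, 4)
print(torch.equal(double_transpose(x), x))  # True: same values either way
# double_transpose.graph shows the IR; the optimized graph can drop both t() calls.
```

Whether the pass fires is an implementation detail of the JIT version; the observable result of the function is unchanged either way.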
Just as the Python interpreter executes Python code, PyTorch has an interpreter that executes the intermediate representation instructions produced during the JIT process:
Finally, Christian also covered many other internal mechanisms; because they are more advanced and there is no video explanation for the time being, interested readers can refer to the slides for the details.