Want to read the underlying code of PyTorch? This introduction to its internals is for you

Learning to use PyTorch is easy, but how much do you know about its internals? Christian Perone, who has 14 years of ML experience, recently gave a talk introducing PyTorch's inner workings. This knowledge is not required for everyday use, but exploring the PyTorch internals can greatly improve our intuition about and command of the code; after all, it is the real experts who dig into the underlying implementation.

PyTorch's developers say its philosophy is imperative: computational graphs are built and run on the fly. This fits Python's own programming philosophy, where code can be run in a Jupyter Notebook as soon as it is defined. As a result, PyTorch's workflow is very close to that of Python's scientific computing library NumPy.

Christian argues that much of what makes PyTorch so convenient comes from its "genes": its inner workings. This talk does not cover how to use PyTorch's basic modules or how to train a neural network with it; instead, Christian focuses on presenting PyTorch's core mechanisms in an intuitive form, that is, how each module works under the hood.

Christian said on Reddit that the video of this talk could not be uploaded due to a recording problem, so for now he can only share the slides. He will be giving another talk on the topic soon, however, so we can look forward to a future video introduction to PyTorch's internals.

Slides: https://speakerdeck.com/perone/pytorch-under-the-hood

Baidu Cloud mirror: https://pan.baidu.com/s/1aaE0I1geF7VwEnQRwmzBtA

The talk mainly introduces the underlying mechanisms from two perspectives: tensors and the JIT compiler.

Before discussing the mechanism behind each PyTorch component, we need to understand the overall workflow. PyTorch uses an imperative, or "eager", paradigm: each line of code builds a part of the complete computational graph, and these small graphs can be executed independently as components even before the complete graph is finished. This style of dynamic computational graph is called the "define-by-run" approach.
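A minimal sketch of define-by-run in action: every line below executes immediately and, as a side effect, extends the recorded autograd graph.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 2        # runs immediately; we can inspect it right away
print(y)         # tensor([2., 4.], grad_fn=<MulBackward0>)
z = y.sum()      # the graph keeps growing line by line
z.backward()     # differentiate whatever graph was recorded so far
print(x.grad)    # tensor([2., 2.])
```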

In fact, beginners can start using PyTorch once they understand this overall flow, but knowing the underlying mechanisms helps us understand and take control of the code.

Tensor

Conceptually, a tensor is a generalization of vectors and matrices: in PyTorch, a tensor is a multidimensional array whose elements all share the same data type. Although PyTorch's interface is Python, the underlying implementation is mostly C++; in Python, integrating C++ code this way is usually called an "extension".

Tensors mainly carry data and perform computations. PyTorch's tensor computations rely on ATen, its lowest-level tensor operation library, while automatic differentiation is handled by Autograd, which is itself built on top of ATen.
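As a rough illustration of this layering, a single operation below is computed by an ATen kernel while Autograd records a backward node for it (a sketch of the observable behavior, not the internal call path):

```python
import torch

a = torch.ones(3, requires_grad=True)
b = a.exp()       # the forward computation is dispatched to an ATen kernel
print(b.grad_fn)  # <ExpBackward0 ...> -- the node Autograd recorded for backward
```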

Python object

In order to define a new Python object type in C/C++, you need to define a structure like THPVariable below. The PyObject_HEAD macro at the start of the structure standardizes Python objects: it expands to another structure containing a pointer to the object's type and a reference-count field.

There are two additional macros in the Python API, called Py_INCREF() and Py_DECREF(), that can be used to increment and decrement the reference count of Python objects.

In Python, everything is an object: variables, data structures, and functions alike.
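We cannot call Py_INCREF()/Py_DECREF() from Python directly, but sys.getrefcount lets us watch the same reference-count field change (the exact counts may vary by interpreter version):

```python
import sys
import torch

t = torch.zeros(2, 2)
print(sys.getrefcount(t))  # e.g. 2: `t` itself plus getrefcount's own argument
alias = t                  # binding a new name bumps the count (Py_INCREF)
print(sys.getrefcount(t))  # e.g. 3
del alias                  # releasing the reference decrements it (Py_DECREF)
print(sys.getrefcount(t))  # e.g. 2
```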

ZERO-COPYING Tensor

Since NumPy arrays are so widely used, we often need to convert between NumPy arrays and PyTorch tensors. PyTorch therefore provides two methods, from_numpy() and numpy(), for converting in each direction.
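A quick round trip between the two libraries looks like this:

```python
import numpy as np
import torch

np_array = np.array([1.0, 2.0, 3.0])
torch_tensor = torch.from_numpy(np_array)  # NumPy array -> PyTorch tensor
back_to_numpy = torch_tensor.numpy()       # PyTorch tensor -> NumPy array
print(torch_tensor, back_to_numpy)
```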

Because tensor storage is relatively expensive, copying the data during the conversion above would consume a lot of memory. One advantage of PyTorch is that from_numpy() keeps a pointer to the NumPy array's internal data instead of copying it, so the resulting tensor and the NumPy array object share the same memory region.

Zero-copying saves a lot of memory, but as the slide shows, it can blur the distinction between in-place and standard operations. If you write np_array = np_array + 1.0, the memory backing torch_array does not change, but if you write np_array += 1.0, it does.
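The difference is easy to demonstrate (variable names follow the slide):

```python
import numpy as np
import torch

np_array = np.ones(3)
torch_array = torch.from_numpy(np_array)  # shares np_array's buffer

np_array += 1.0             # in-place: mutates the shared memory
print(torch_array)          # tensor([2., 2., 2.], ...) -- it changed too

np_array = np_array + 1.0   # rebinds the name to a brand-new array
print(torch_array)          # tensor([2., 2., 2.], ...) -- unchanged
```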

CPU/GPU memory allocation

The actual raw data of a tensor is not stored in the tensor structure itself but in what is called a "Storage", which the tensor structure points to. Tensor storage can be placed in main memory (CPU) or video memory (GPU) through an Allocator.
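We can peek at the storage from Python. Tensor.storage() is the historical accessor (newer releases steer toward untyped_storage()), and the CUDA branch below only runs if a GPU is present:

```python
import torch

t = torch.arange(4.0)
print(t.storage())       # the flat buffer that actually holds the values
print(t.data_ptr())      # raw address of that buffer in CPU memory

if torch.cuda.is_available():
    g = t.to("cuda")     # a CUDA allocator places this copy in GPU memory
    print(g.device, g.data_ptr())
```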

THE BIG PICTURE

Finally, PyTorch's THTensor structure can be summarized as in the slide below. The main part of THTensor is the tensor metadata, which keeps information such as sizes, strides, dimensions, and offsets, and it also holds a pointer to the THStorage that contains the actual data.
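This metadata/storage split is visible from Python: a view gets fresh sizes, strides, and offset while pointing at the same storage.

```python
import torch

t = torch.arange(6.0).reshape(2, 3)
v = t[1]                            # a view: new metadata, same storage
print(t.size(), t.stride())         # torch.Size([2, 3]) (3, 1)
print(v.size(), v.stride())         # torch.Size([3]) (1,)
print(v.storage_offset())           # 3 -- the view starts 3 elements in
print(t.storage().data_ptr() == v.storage().data_ptr())  # True: shared storage
```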

JIT

Because PyTorch runs in eager mode, it is easy to debug and inspect code as you go. PyTorch 1.0 introduced torch.jit, a set of compilation tools whose main goal is to bridge the gap between research and production deployment. The JIT includes a language called Torch Script, a subset of Python; code written in Torch Script can be heavily optimized and serialized for later use from the C++ API.

The Eager mode commonly used to run code with Python is shown below, alongside the Script mode. Eager mode is suitable for prototyping and experimentation, while Script mode is suitable for optimization and deployment.
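A minimal contrast between the two modes (the function here is made up for illustration, and the file name f.pt is arbitrary):

```python
import torch

def f(x):
    return x * 2 + 1

# Eager mode: plain Python, runs line by line, easy to debug.
print(f(torch.ones(2)))          # tensor([3., 3.])

# Script mode: compile the same function to Torch Script.
scripted = torch.jit.script(f)
print(scripted(torch.ones(2)))   # same result, no Python needed to run it
scripted.save("f.pt")            # serialized, loadable from the C++ API
```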

So why use Torch Script? Christian gives several reasons in the slides.

PyTorch JIT main process

As the slides show, the JIT takes Python source code or its Abstract Syntax Tree (AST) as input, where the AST represents the syntactic structure of the source code as a tree. The code is then parsed into an intermediate-representation graph, checked, optimized, and finally compiled and executed.
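We can see the front half of this pipeline directly: scripting a function parses its Python AST and produces an IR graph (a toy function for illustration):

```python
import torch

@torch.jit.script
def add_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return (x + y) * 2

# The IR graph the JIT frontend produced from the Python AST:
print(add_mul.graph)
```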

Optimizations can then be applied to the computational graph, such as loop unrolling. In peephole optimization, as shown in the slide, the compiler improves performance by examining a small window of generated code within one or a few basic blocks and, using knowledge of the instructions and a set of transformation rules, replacing redundant sequences with simpler, faster equivalents.

In the example below, transposing a matrix twice yields the matrix itself, so the redundant pair of transposes should be optimized away.
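The same double-transpose example can be reproduced with Torch Script. The peephole pass is invoked here through torch._C, a private interface whose signature may change between versions, so treat this strictly as a sketch:

```python
import torch

@torch.jit.script
def double_transpose(x: torch.Tensor) -> torch.Tensor:
    return x.t().t()  # transposing twice is a no-op

graph = double_transpose.graph.copy()
torch._C._jit_pass_peephole(graph)  # private API: runs the peephole pass
print(graph)  # the redundant aten::t pair should be folded away
```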

Execution

Just as the Python interpreter executes ordinary Python code, PyTorch has an interpreter of its own that executes the intermediate-representation instructions produced during the JIT process.
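For a scripted function, we can inspect both the reconstructed Torch Script source and the IR this interpreter works from:

```python
import torch

@torch.jit.script
def f(x: torch.Tensor) -> torch.Tensor:
    return x + 1

print(f.code)   # the Torch Script source the compiler reconstructed
print(f.graph)  # the intermediate representation behind it
```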

Finally, Christian also introduced many more internal mechanisms, but since they are harder to follow without the video explanation for now, readers can turn to the slides for the details.
