
GT DL L8

Readings

None!

Scaling Deep Learning from Experiment to Production

PyTorch offers profiling and debugging tools such as:

import torch
from torch import autograd

torch.autograd.profiler.profile  # context manager that records per-op performance metrics
# torch.utils.bottleneck -- run as: python -m torch.utils.bottleneck script.py
#                           (summarizes the cProfile and autograd profilers)
torch.autograd.detect_anomaly    # context manager that raises as soon as a NaN is produced
  • autograd profiler
    • shows various performance metrics per operation
  • bottleneck
    • summarizes both the autograd and cProfile profilers
  • autograd detect_anomaly
    • raises an error as soon as a NaN is detected during computation
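
A minimal usage sketch of these tools (the tensors here are illustrative, and the second block intentionally triggers the anomaly error):

import torch

x = torch.randn(8, 8, requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    y = (x @ x).sum()
    y.backward()
print(prof.key_averages().table(sort_by="cpu_time_total"))  # per-op metrics

with torch.autograd.detect_anomaly():
    w = torch.tensor(-1.0, requires_grad=True)
    out = torch.sqrt(w)  # NaN in the forward pass
    out.backward()       # raises immediately, pointing at the sqrt above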

In PyTorch there are two execution modes:

  • Eager mode
    • You can type commands which are executed immediately.
    • This enables simple, debuggable Python programs for prototyping deep learning code.
  • Script mode
    • Programs are converted to and run by a lean just-in-time (JIT) interpreter.
      • This is useful for running in production.

The just-in-time (JIT) compiler can perform various types of optimization (see the sketch after this list):

  • Algebraic rewriting: Constant folding, common subexpression elimination, dead code elimination, loop unrolling, etc.
  • Out-of-order execution: Re-ordering operations to reduce memory pressure and make efficient use of cache locality
  • Kernel fusion: Combining several operators into a single kernel to avoid per-op overhead
  • Target-dependent code generation: Compiling parts of the program for specific hardware. Integration also ongoing with TVM, Halide, Glow, XLA
  • Runtime: no Python global interpreter lock; fork-and-wait parallelism
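
For instance, a function compiled with torch.jit.script can be inspected and serialized for the JIT runtime (a minimal sketch; the function itself is illustrative):

import torch

@torch.jit.script
def scaled_tanh(x: torch.Tensor) -> torch.Tensor:
    # compiled to TorchScript IR, where the JIT can fold constants and fuse ops
    return torch.tanh(x * (2.0 * 3.0))

print(scaled_tanh.graph)            # inspect the compiled IR
scaled_tanh.save("scaled_tanh.pt")  # serialize for the Python-free runtime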

Behind the scenes, PyTorch uses different optimized C++ libraries for CPU and GPU. We can extend it with custom C++ operations, registered much as with pybind11; unlike plain pybind11 bindings, custom operators have the advantage of jitability, i.e. being visible to TorchScript.
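
A minimal sketch of that workflow using torch.utils.cpp_extension.load_inline (assumes a local C++ toolchain; the scaled_add operator is a made-up example):

from torch.utils import cpp_extension

cpp_source = """
#include <torch/extension.h>
torch::Tensor scaled_add(torch::Tensor a, torch::Tensor b, double s) {
  return a + s * b;
}
"""

ext = cpp_extension.load_inline(
    name="scaled_add_ext",
    cpp_sources=cpp_source,
    functions=["scaled_add"],  # pybind11-style bindings are generated automatically
)

For jitability, the operator would instead be registered with the TORCH_LIBRARY macro so that TorchScript can dispatch to it.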

Ingest Data

The DataLoader is compatible with two standard Python data representations:

  • The iterable-style dataset
    • Provides each data point one at a time through an iteration protocol.
    • Suits streams or storage that yield one data point at a time.
  • The map-style dataset
    • Provides a map interface that lets us access items in any order.
    • Because we know the length, we can also sample randomly; this abstraction provides that flexibility.
    • Because of this flexibility, users often insert preprocessing or data augmentation directly into dataset methods.

For more information, refer to the documentation.
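
A minimal sketch of both styles (class names and data are made up for illustration):

import torch
from torch.utils.data import Dataset, IterableDataset, DataLoader

class SquaresMap(Dataset):
    # map-style: known length and random access, so shuffling works
    def __len__(self):
        return 100

    def __getitem__(self, i):
        return torch.tensor(float(i)) ** 2  # preprocessing/augmentation could live here

class SquaresStream(IterableDataset):
    # iterable-style: yields one data point at a time, e.g. from a stream
    def __iter__(self):
        return (torch.tensor(float(i)) ** 2 for i in range(100))

loader = DataLoader(SquaresMap(), batch_size=10, shuffle=True)  # shuffling requires map-style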

Note: you can set pin_memory=True on the DataLoader so that batches are stored in page-locked RAM.

  • Host-to-GPU copies are faster directly from RAM. To prevent paging, pin tensors to page-locked RAM.
  • Once a tensor is pinned, use asynchronous GPU copies with to(device, non_blocking=True) to overlap data transfers with computation.
  • In script mode (TorchScript), a single Python process can saturate multiple GPUs, even with the global interpreter lock.
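
A minimal sketch of this pattern (assumes a CUDA device and reuses the map-style dataset sketched above):

import torch
from torch.utils.data import DataLoader

loader = DataLoader(SquaresMap(), batch_size=10, pin_memory=True)
device = torch.device("cuda")
for batch in loader:
    batch = batch.to(device, non_blocking=True)  # async copy from pinned RAM
    # ... compute here overlaps with the next host-to-GPU transfer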

Use Multiple GPUs and Machines

There are two different types of parallelism:

  • Data parallel
    • data is distributed across devices
  • Model parallel
    • model is distributed across devices

Look up the necessary documentation depending on which type of parallelism you require.
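
Minimal sketches of both (assumes two GPUs; the layers are illustrative):

import torch
import torch.nn as nn

# Data parallel: replicate the model and split each batch across GPUs
dp_model = nn.DataParallel(nn.Linear(10, 10).cuda())

# Model parallel: place different layers on different GPUs
class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 10).to("cuda:0")
        self.fc2 = nn.Linear(10, 10).to("cuda:1")

    def forward(self, x):
        h = torch.relu(self.fc1(x.to("cuda:0")))
        return self.fc2(h.to("cuda:1"))  # activations hop between devices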

Distributed Model Parallel

The case for fully distributed model parallelism requires more flexibility: we want to shard the model across GPUs living on different machines.

RPC stands for remote procedure call; on top of it live a reference-counting protocol, remote references (RRefs) for accessing remote objects, distributed autograd, and the distributed optimizer.
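
A minimal sketch of the RPC layer (assumes two processes named "worker0" and "worker1" with MASTER_ADDR/MASTER_PORT set in the environment; this is the rank-0 side):

import torch
import torch.distributed.rpc as rpc

rpc.init_rpc("worker0", rank=0, world_size=2)
# run torch.add on worker1 and hold a remote reference (RRef) to the result
rref = rpc.remote("worker1", torch.add, args=(torch.ones(2), torch.ones(2)))
print(rref.to_here())  # fetch the remote value back to the caller
rpc.shutdown()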

Again, look up the relevant documentation if you need to (unlikely in this course), based on each of the following scenarios:

  • Single-device training: if the data and model fit on one GPU and training speed is not a concern
  • Single-machine multi-GPU DataParallel: if there are multiple GPUs on the server and you want to speed up training with minimal code changes
  • Single-machine multi-GPU DistributedDataParallel: for a further training speed-up, if you are willing to write a little more code
  • Multi-machine DistributedDataParallel: to scale across machine boundaries
  • TorchElastic: if errors are expected or resources can join and leave dynamically
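
A minimal sketch of single-machine DistributedDataParallel (assumes two GPUs and a launch like torchrun --nproc_per_node=2 train.py; the model is illustrative):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # torchrun supplies rank and world size via env vars
rank = dist.get_rank()
model = torch.nn.Linear(10, 10).to(rank)
ddp_model = DDP(model, device_ids=[rank])  # gradients are all-reduced across processes
# ... train as usual; each process sees its own shard of the data
dist.destroy_process_group()
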
This post is licensed under CC BY 4.0 by the author.