GPU Support
GPU support is still an experimental feature that is actively being worked on. Currently, the WeaklyCompressibleSPHSystem, TotalLagrangianSPHSystem and BoundarySPHSystem support GPU execution. We have tested GPU support on Nvidia, AMD and Apple GPUs. Note that most Apple GPUs do not support Float64. See below on how to run single precision simulations.
To run a simulation on a GPU, use the FullGridCellList as the cell list for the GridNeighborhoodSearch. Unlike the default cell list, which assumes an unbounded domain, this cell list requires a bounding box for the domain. For simulations that are bounded by a closed tank, we can simply use the boundary of the tank to obtain the bounding box as follows.
min_corner = minimum(tank.boundary.coordinates, dims=2)
max_corner = maximum(tank.boundary.coordinates, dims=2)
cell_list = FullGridCellList(; min_corner, max_corner)
We then need to pass this cell list to the neighborhood search and the neighborhood search to the Semidiscretization.
semi = Semidiscretization(fluid_system, boundary_system,
                          neighborhood_search=GridNeighborhoodSearch{2}(; cell_list))
At this point, we should run the simulation and make sure that it still works and that the bounding box is large enough. For some simulations where particles move outside the initial tank coordinates, for example when the tank is not closed or when the tank is moving, an appropriate bounding box has to be specified.
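In such cases, the bounding box can be enlarged manually, for example by adding a margin around the tank extents. The following is a minimal sketch; the padding value is purely illustrative and has to be chosen based on how far particles are expected to travel.
# Enlarge the bounding box by a margin around the tank (value for illustration only)
padding = 0.5
min_corner = minimum(tank.boundary.coordinates, dims=2) .- padding
max_corner = maximum(tank.boundary.coordinates, dims=2) .+ padding
cell_list = FullGridCellList(; min_corner, max_corner)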
Then, we only need to specify the data type that is used for the simulation. On an Nvidia GPU, we specify:
using CUDA
ode = semidiscretize(semi, tspan, data_type=CuArray)
On an AMD GPU, we use:
using AMDGPU
ode = semidiscretize(semi, tspan, data_type=ROCArray)
Now, we can run the simulation as usual. All data is transferred to the GPU during initialization and all loops over particles and their neighbors will be executed on the GPU as kernels generated by KernelAbstractions.jl. Data is only copied to the CPU for saving VTK files via the SolutionSavingCallback.
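Running the simulation itself works just as on the CPU. The following sketch assumes that a callbacks object (for example containing the SolutionSavingCallback) has already been set up as in the regular example files; the solver and tolerances are only illustrative.
using OrdinaryDiffEq
# Solve the ODE on the GPU; data is only copied back to the CPU when saving output
sol = solve(ode, RDPK3SpFSAL35(),
            abstol=1e-6, reltol=1e-4,
            save_everystep=false, callback=callbacks)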
Run an existing example file on the GPU
The example file examples/fluid/dam_break_2d_gpu.jl demonstrates how to run an existing example file on a GPU. It first loads the variables from examples/fluid/dam_break_2d.jl without executing the simulation. This is achieved by overwriting the line that starts the simulation with trixi_include(..., sol=nothing). Next, a GPU-compatible neighborhood search is defined, and the original example file is included with the new neighborhood search. This requires the assignments neighborhood_search = ... and data_type = ... to be present in the original example file. Note that in examples/fluid/dam_break_2d.jl, we specifically set data_type=nothing, even though this is the default value, so that we can use trixi_include to replace this value.
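A simplified sketch of such a wrapper file is shown below; it follows the structure described above, but is not the exact content of examples/fluid/dam_break_2d_gpu.jl.
# Simplified sketch of a GPU wrapper file
using TrixiParticles

# Load the variables from the original example, but skip the simulation
# by overwriting `sol` with `nothing`
trixi_include(@__MODULE__, joinpath(examples_dir(), "fluid", "dam_break_2d.jl"),
              sol=nothing)

# Define a GPU-compatible neighborhood search from the tank extents
min_corner = minimum(tank.boundary.coordinates, dims=2)
max_corner = maximum(tank.boundary.coordinates, dims=2)
cell_list = FullGridCellList(; min_corner, max_corner)

# Include the original example again with the new neighborhood search.
# The assignment `data_type = nothing` in the original file can later be
# overridden when including this wrapper file.
trixi_include(@__MODULE__, joinpath(examples_dir(), "fluid", "dam_break_2d.jl"),
              neighborhood_search=GridNeighborhoodSearch{2}(; cell_list))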
To run this simulation on a GPU, simply update data_type to match the array type of the installed GPU. We can run this simulation on an Nvidia GPU as follows.
using CUDA
trixi_include(joinpath(examples_dir(), "fluid", "dam_break_2d_gpu.jl"), data_type=CuArray)
For AMD GPUs, use
using AMDGPU
trixi_include(joinpath(examples_dir(), "fluid", "dam_break_2d_gpu.jl"), data_type=ROCArray)
For Apple GPUs (which don't support double precision, see below), use
using Metal
trixi_include_changeprecision(Float32,
                              joinpath(examples_dir(), "fluid", "dam_break_2d_gpu.jl"),
                              data_type=MtlArray)
Single precision simulations
All GPU-supported features can also be used with single precision, which is significantly faster on most GPUs and required for many Apple GPUs.
To run a simulation with single precision, all Float64 literals in an example file must be converted to Float32 (e.g. 0.0 to 0.0f0). TrixiParticles provides a function to automate this conversion:
TrixiBase.trixi_include_changeprecision — Function

trixi_include_changeprecision(T, [mod::Module=Main,] elixir::AbstractString; kwargs...)

include the elixir elixir and evaluate its content in the global scope of module mod. You can override specific assignments in elixir by supplying keyword arguments, similar to trixi_include.
The only difference to trixi_include is that the precision of floating-point numbers in the included elixir is changed to T. More precisely, the package ChangePrecision.jl is used to convert all Float64 literals, operations like / that produce Float64 results, and functions like ones that return Float64 arrays by default, to the desired type T. See the documentation of ChangePrecision.jl for more details.
The purpose of this function is to conveniently run a full simulation with Float32, which is orders of magnitude faster on most GPUs than Float64, by just including the elixir with trixi_include_changeprecision(Float32, elixir). Many constructors in the Trixi.jl framework are written in a way that changing all floating-point arguments to Float32 will change the element type to Float32 as well. In TrixiParticles.jl, including an elixir with this function should be sufficient to run the full simulation with single precision.
To run the previous example with single precision, use the following:
using CUDA
trixi_include_changeprecision(Float32,
                              joinpath(examples_dir(), "fluid", "dam_break_2d_gpu.jl"),
                              data_type=CuArray)