# Heterogeneous computing
Support for heterogeneous computing is currently being worked on.
## The use of Adapt.jl
Adapt.jl is a package in the JuliaGPU family that allows for the translation of nested data structures. The primary goal is to allow the substitution of `Array` at the storage level with a GPU array like `CuArray` from CUDA.jl.
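As a quick illustration of what this translation looks like, `adapt` walks a nested structure and replaces every array it encounters. The following minimal sketch uses only plain CPU arrays in a named tuple (on a CUDA-capable machine, `CuArray` could be passed as the target instead of `Array{Float32}`):

```julia
using Adapt

nested = (coordinates = zeros(2, 4), weights = ones(4))

# adapt recurses through the named tuple and converts each array it finds,
# while leaving non-array fields untouched.
adapt(Array{Float32}, nested)  # both arrays now have eltype Float32
```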
To facilitate this, data structures must be parameterized, so instead of:
```julia
struct Container <: Trixi.AbstractContainer
    data::Array{Float64, 2}
end
```
they must be written as:
```julia
struct Container{D<:AbstractArray} <: Trixi.AbstractContainer
    data::D
end
```
Furthermore, we need to define a function that allows for the conversion of the storage of our types:
```julia
using Adapt

function Adapt.adapt_structure(to, C::Container)
    return Container(adapt(to, C.data))
end
```
or simply

```julia
Adapt.@adapt_structure(Container)
```
Additionally, we must define `Adapt.parent_type`:
```julia
function Adapt.parent_type(::Type{<:Container{D}}) where D
    return D
end
```
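With this method in place, the underlying storage type of a container can be read off from its type alone, as in this small sketch that builds only on the definitions above:

```julia
C = Container(zeros(3))
Adapt.parent_type(typeof(C))  # Vector{Float64}
```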
Putting it all together, we can use this machinery to convert the storage of a container:
```julia
julia> C = Container(zeros(3))
Container{Vector{Float64}}([0.0, 0.0, 0.0])

julia> Trixi.storage_type(C)
Array

julia> using CUDA

julia> GPU_C = adapt(CuArray, C)
Container{CuArray{Float64, 1, CUDA.DeviceMemory}}([0.0, 0.0, 0.0])

julia> Trixi.storage_type(GPU_C)
CuArray
```
### Element-type conversion with `Trixi.trixi_adapt`
We can use `Trixi.trixi_adapt` to perform both an element-type and a storage-type adaptation:
```julia
julia> C = Container(zeros(3))
Container{Vector{Float64}}([0.0, 0.0, 0.0])

julia> Trixi.trixi_adapt(Array, Float32, C)
Container{Vector{Float32}}(Float32[0.0, 0.0, 0.0])

julia> Trixi.trixi_adapt(CuArray, Float32, C)
Container{CuArray{Float32, 1, CUDA.DeviceMemory}}(Float32[0.0, 0.0, 0.0])
```
Note that `adapt(Array{Float32}, C)` is tempting, but it will do the wrong thing in the presence of `SVector`s and similar arrays from StaticArrays.jl.
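To see the pitfall, consider what a bare `adapt` call does to a static vector. The following is a minimal sketch, assuming only Adapt.jl's generic fallback for `AbstractArray`s (it is not Trixi.jl code):

```julia
using Adapt, StaticArrays

v = SVector(1.0, 2.0, 3.0)

# The generic rule treats the SVector like any other AbstractArray, so the
# expected result is a heap-allocated Vector{Float32} rather than an
# SVector{3, Float32}, i.e. the static size information is lost.
adapt(Array{Float32}, v)
```

Hence, for combined element-type and storage-type changes, `Trixi.trixi_adapt` should be used instead.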
## Writing GPU kernels
Offloading computations to the GPU is done with KernelAbstractions.jl, allowing for vendor-agnostic GPU code.
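For readers unfamiliar with KernelAbstractions.jl, the following self-contained sketch (not Trixi.jl code; the kernel and array names are made up for illustration) shows the basic pattern of defining and launching a backend-agnostic kernel:

```julia
using KernelAbstractions

# Element-wise copy kernel; @Const marks read-only arguments.
@kernel function copy_kernel!(dst, @Const(src))
    i = @index(Global)
    @inbounds dst[i] = src[i]
end

src = rand(64)
dst = zeros(64)

# get_backend returns CPU() for regular arrays and a GPU backend for device
# arrays, so the same kernel definition runs on any supported hardware.
backend = get_backend(dst)
copy_kernel!(backend)(dst, src, ndrange = length(dst))
KernelAbstractions.synchronize(backend)
```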
### Example
Given the following Trixi.jl code, which would typically be called from within `rhs!`:
```julia
function trixi_rhs_fct(mesh, equations, solver, cache, args)
    @threaded for element in eachelement(solver, cache)
        # code
    end
end
```
1. Put the inner code in a new function `rhs_fct_per_element`. Besides the index `element`, pass all required fields as arguments, but make sure to `@unpack` them from their structs in advance.

2. Where `trixi_rhs_fct` is called, get the backend, i.e., the hardware we are currently running on, via `trixi_backend(x)`. This will, e.g., work with `u_ode`. Internally, KernelAbstractions.jl's `get_backend` will be called, i.e., KernelAbstractions.jl has to know the type of `x`.

   ```julia
   backend = trixi_backend(u_ode)
   ```

3. Add a new argument `backend` to `trixi_rhs_fct` used for dispatch. When `backend` is `nothing`, the legacy implementation should be used:

   ```julia
   function trixi_rhs_fct(backend::Nothing, mesh, equations, solver, cache, args)
       @unpack unpacked_args = cache
       @threaded for element in eachelement(solver, cache)
           rhs_fct_per_element(element, unpacked_args, args)
       end
   end
   ```

4. When `backend` is a `Backend` (a type defined by KernelAbstractions.jl), write a KernelAbstractions.jl kernel (a combined call site is sketched after this list):

   ```julia
   function trixi_rhs_fct(backend::Backend, mesh, equations, solver, cache, args)
       nelements(solver, cache) == 0 && return nothing # return early when there are no elements
       @unpack unpacked_args = cache
       kernel! = rhs_fct_kernel!(backend)
       kernel!(unpacked_args, args, ndrange = nelements(solver, cache))
       return nothing
   end

   @kernel function rhs_fct_kernel!(unpacked_args, args)
       element = @index(Global)
       rhs_fct_per_element(element, unpacked_args, args)
   end
   ```
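Putting the pieces together, the call site then looks roughly like this (a minimal sketch reusing the names from the example above):

```julia
# Query the hardware backend once and dispatch on it; `backend` is `nothing`
# on the legacy code path and a KernelAbstractions.jl `Backend` otherwise.
backend = trixi_backend(u_ode)
trixi_rhs_fct(backend, mesh, equations, solver, cache, args)
```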