Support slice for device code
Revival of https://github.com/rust-accel/accel/pull/16
Objective
----------
Make slice arguments work in kernel functions, so that a kernel can be written as a safe function:
```rust
#[kernel]
pub fn add(a: &[f64], b: &[f64], c: &mut [f64]) {
    let i = accel_core::index() as usize;
    if i < c.len() {
        c[i] = a[i] + b[i];
    }
}
```
Semantics
-----------
`a`, `b`, `c`, and any other slice arguments are passed as an element count (number of elements, not bytes) and a pointer to the actual data. The pointer must be one of the following:
- a device pointer (i.e. cast from `CUdeviceptr`)
- a host pointer that the device can access directly
  - e.g. host memory allocated by [cuMemAllocHost](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1gdd8311286d2c2691605362c689bc64e0)
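
The (element count, pointer) representation described above can be illustrated on the host alone. This is a minimal sketch, not the code in this PR: `RawSlice` and its methods are hypothetical names introduced for illustration, and the round trip uses only ordinary host memory and `std::slice::from_raw_parts`.

```rust
use std::mem;
use std::slice;

/// Hypothetical flattened form of a `&[T]` kernel argument (illustration only):
/// a pointer to the first element plus the number of elements, not bytes.
#[repr(C)]
pub struct RawSlice<T> {
    ptr: *const T,
    len: usize,
}

impl<T> RawSlice<T> {
    /// Split a slice into its pointer and element count.
    pub fn from_slice(s: &[T]) -> Self {
        RawSlice { ptr: s.as_ptr(), len: s.len() }
    }

    /// Number of elements.
    pub fn len(&self) -> usize {
        self.len
    }

    /// Size in bytes, derived from the element count.
    pub fn size_in_bytes(&self) -> usize {
        self.len * mem::size_of::<T>()
    }

    /// Rebuild the slice on the receiving side.
    ///
    /// # Safety
    /// `ptr` must be valid for reads of `len` elements of `T` for the whole
    /// lifetime `'a`, and the memory must not be mutated during that time.
    pub unsafe fn as_slice<'a>(&self) -> &'a [T] {
        unsafe { slice::from_raw_parts(self.ptr, self.len) }
    }
}
```

Carrying the length as an element count means the receiving side can rebuild `&[T]` without knowing anything beyond `T` itself, and bounds checks such as `i < c.len()` keep working unchanged.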
TODO
-----
- [x] Add a custom trait for device-sendable types, so that primitive types and slices can be handled differently
  - `DeviceSend` trait in !41
- [ ] Implement `DeviceSend` for `&[T]` and `&mut [T]`
- [ ] Test that slices can be accessed from device code
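
To make the trait-based dispatch in the list above concrete, here is a rough sketch of how one trait can give primitives and slices different argument representations. It is not the `DeviceSend` trait from !41, whose definition is not shown here: `KernelArg`, `Raw`, and `to_raw` are hypothetical names chosen for illustration.

```rust
/// Hypothetical trait (NOT the real `DeviceSend`): maps a host-side value
/// to the representation that would be handed to a kernel launch.
pub trait KernelArg {
    type Raw;
    fn to_raw(&mut self) -> Self::Raw;
}

/// Primitive types are passed by value, unchanged.
impl KernelArg for f64 {
    type Raw = f64;
    fn to_raw(&mut self) -> f64 {
        *self
    }
}

/// Shared slices become (const pointer, element count).
impl<'a, T> KernelArg for &'a [T] {
    type Raw = (*const T, usize);
    fn to_raw(&mut self) -> Self::Raw {
        (self.as_ptr(), self.len())
    }
}

/// Mutable slices become (mutable pointer, element count).
impl<'a, T> KernelArg for &'a mut [T] {
    type Raw = (*mut T, usize);
    fn to_raw(&mut self) -> Self::Raw {
        (self.as_mut_ptr(), self.len())
    }
}
```

An associated type lets each implementation pick its own representation, so supporting slices does not change how existing primitive arguments are passed.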