% % colors: % _blue_text text_ % _red_text text_ % ==== ====[plain] %%\transdissolve <[center] <[columns] [[[.3\textwidth]]] <<>> [[[.3\textwidth]]] <<>> [[[.3\textwidth]]] <<>> [columns]> [center]> ==== Samba... ==== <[center] <<>> [center]> ==== Short History ==== * 1.9.17: 1996/08 * 2.0: 1999/01: domain-member, +SWAT * 2.2: 2001/04: NT4-DC * 3.0: 2003/09: AD-member, Samba4 project started * 3.2: 2008/07: GPLv3, experimental clustering * 3.3: 2009/01: clustering * 3.4: 2009/07: merged S3+S4 code * 3.5: 2010/03: experimental SMB 2.0 * 3.6: 2011/09: SMB 2.0 * 4.0: 2012/12: AD/DC, SMB 2.0 durable handles, 2.1, 3.0 * 4.1: 2013/10: stability * 4.2: soon: AD trusts, performance, scalability, CTDB included ==== Release Stream ==== <[center] <<>> [center]> ==== ====[plain] %%\transdissolve <<>> ==== Samba File Server Topics / Challenges ==== * scalable file server / performance ** scale-up: exhaust powerful boxes ** scale-out: flexible all-active clusters ** scale-down: perform well on low-end boxes * server workloads / SMB features ** small \# of connections, threaded applications ** Hyper-V, ... ** SMB3 (clustering, RDMA, ...) * interop / multi-protocol access (nfs, afp, ...) * special file systems support (gluster, ceph, gpfs, btrfs, ...) * cloud / openstack?... %* (samba $\leftrightarrow$ cifs.ko alternative to nfs?...) %% ==== Samba File Serving Topics ==== %% %% * Performance %% * Clustering (CTDB) %% * SMB features (SMB3...) %% * Interop (protocols, NFS, AFP, ...) %% * special file systems support (gluster, ceph, gpfs, btrfs...) %% * ... %%==== Other Samba Topics ==== %% %%* Auth/Domain Member %%* RPC server %%* AD Sever %%* ... 
==== Performance - low end systems ==== Reduction of CPU usage for low profile platforms like arm (SMB2) * Samba 4.0: ** didn't saturate 1G nic (arm), CPU 100\% * reduced memory allocations * instrument SMB 2.1 multi-credit / large MTU * Samba 4.2: ** saturates 1G nic (arm), CPU $<$ 100\% ==== Performance - DB performance ==== <[block]{TDB} * trivial database * used for IPC (smbd processes) * cluster (CTDB): local copies [block]> <[block]{hot databases} * @locking.tdb@ (open files) * @brlock.tdb@ (byte range locks) * @notify\_index.tdb@ (for change notify) [block]> ==== Performance - DB performance ==== <[block]{problem 1} * fcntl byte range locks for record locks * contention via single kernel spinlock [block]> <[block]{solution} * alternative to fcntl: pthread robust mutexes * ==> massive speedup * ==> included in TDB 1.3.1, Samba 4.2 [block]> ==== Performance - DB performance ==== <[block]{problem 2} * freelist: ** single chain, contended (@locking.tdb@) ** gets fragmented (singly linked) * especially a problem in ctdb-cluster: vacuuming [block]> <[block]{improvements} * make use of small per-record freelists (dead records) * add automatic defragmentation upon traversal * ==> included in TDB 1.3.1, Samba 4.2 [block]> ==== Performance - DB performance ==== <[block]{problem 3} * change notify not scalable [block]> <[block]{first improvement} * restructured @notify.tdb@ to ** global @notify\_index.tdb@ and ** local @notify.tdb@ ** ==> better but still not good enough for some workloads [block]> <[block]{next steps} * replace DB-approach by new scalable, async notify daemon using messaging * some false positives do no harm * ==> TODO [block]> ==== Performance - scaling ==== <[block]{parallelism} * samba is multi-process: ** smbd child process $\leftrightarrow$ TCP connection ** event-loop in one process * within a smbd process: ** pthread-pool jobs for potentially blocking syscalls ** ==> parallelism for reads/writes ** default for async I/O since Samba 4.0 [block]> 
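The pthread-pool idea above can be illustrated with a minimal sketch. It is not Samba's implementation (smbd does this in C with its own pool); it only shows the pattern of handing potentially blocking reads to worker threads so that several of them proceed in parallel. The names @pread\_chunks@ and @demo@, the chunk size and the pool size are all assumptions made for this illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def pread_chunks(fd, chunk_size, total_size, pool):
    """Submit one blocking pread per chunk to the pool, reassemble in order.

    Hypothetical helper for illustration; not part of Samba.
    """
    offsets = range(0, total_size, chunk_size)
    # Each os.pread blocks only the worker thread that runs it, so
    # several reads can be in flight at the same time.
    futures = [pool.submit(os.pread, fd, chunk_size, off) for off in offsets]
    # A real event loop would pick up completions asynchronously;
    # this sketch simply waits for every job in submission order.
    return b"".join(f.result() for f in futures)


def demo():
    """Write known data to a temp file and read it back through the pool."""
    payload = bytes(range(256)) * 64  # 16 KiB of known data
    with tempfile.TemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()
        with ThreadPoolExecutor(max_workers=4) as pool:
            data = pread_chunks(tmp.fileno(), 4096, len(payload), pool)
    return data == payload
```

Using positional reads (@pread@) rather than seek plus read matters here: the jobs share one file descriptor, and a shared file offset would make concurrent reads race.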
==== Performance - scaling ==== <[block]{messaging} * classical messaging: ** messages.tdb and signals between processes ** does not scale well * new messaging in Samba 4.2: ** fast and scalable messaging based on unix datagram messages ** ==> WIP: integrate with AD/DC messaging ** ==> features fd-passing for sockets (SMB3 multi-channel) ** ==> TODO: integrate into CTDB inter-node-messaging [block]> ==== Interop ==== * multi-protocol access ** nfs (kernel, ganesha, ...) ** afp: netatalk * client-specific ** apple: *** @vfs\_fruit@ *** spotlight *** AAPL ** SMB2+ unix-extensions ==== ====[plain] old ==== File Server Layout/Scope ==== <[center] <<>> [center]> ==== SMB Features ==== * SMB 2.0: ** durable file handles [4.0] * SMB 2.1: ** multi-credit / large MTU [4.0] ** dynamic reauthentication [4.0] ** leasing [WIP++] ** resilient file handles [ever?] * SMB 3.0: ** new crypto (sign/encrypt) [4.0] ** secure negotiation [4.0] ** durable handles v2 [4.0] ** persistent file handles [planning] ** multi-channel [WIP+] ** SMB direct [designed/starting] ** cluster features [designing] *** witness [WIP] ** storage features [WIP] * SMB 3.1: [WIP] ==== Clustered Samba / CTDB (SOFS since 2007) ==== <[center] <<>> [center]> %%% === SMB 3.0 ==== %%% %%% \transdissolve %%% %%% +<2->{ %%% * new crypto (signing, transport encryption) %%% * persistent file handles %%% * multi-channel %%% * RDMA transport (SMB direct) %%% * storage features %%% * clustering %%% ** witness %%% ** transparent failover (continuous availability) %%% ** all-active (scale-out) %%% } %%% %%% ==== SMB3 - Goals ==== %%% %%% \transdissolve %%% %%% +<2->{ %%% * fault tolerance / reliability %%% * performance / throughput / scaling %%% * focus on support for server workloads \\ % %%% (as opposed to workstation workloads) %%% * especially support for: %%% ** Hyper-V %%% ** MS-SQL %%% * goals: %%% ** replace block storage in data center %%% ** block (SCSI) over SMB %%% } %%% %%% ==== Requirements for Hyper-V ==== %%% 
%%% \transdissolve %%% %%% +<2->{ %%% * minimum requirements: %%% ** SMB 3.0 %%% ** is that really all??? - maybe resilient file handles.. %%% } %%% +<3->{ %%% * desired features: %%% ** cluster ($\ge 2$ nodes) %%% ** CA / persistent handles %%% ** RDMA / SMB direct %%% ** multi channel %%% } %%% ==== SMB Protocol in Samba ==== %%% %%% \transdissolve %%% %%% +<2->{ %%% * Samba $<$ 3.5: %%% ** SMB 1 %%% * Samba 3.5: %%% ** experimental incomplete support for SMB 2.0 %%% * Samba 3.6: %%% ** official support for SMB 2.0 %%% ** missing: durable handles %%% ** default server max proto: SMB 1 %%% * Samba 4.0: %%% ** SMB 2.0: complete with durable handles %%% ** SMB 2.1: basis, multi-credit, dynamic reauthentication %%% ** SMB 3.0: basis, crypto, secure negotiation, durable v2 %%% ** default server max proto: SMB 3.0 %%% * Samba 4.1 %%% ** SMB 3.02: basic %%% } ==== ==== [plain] <[center] {\Large Technical Details... } [center]> %%% ==== ====[plain] %%% %%% \transdissolve %%% %%% <<>> %%% %%% %%% ==== Multi-Channel - Windows/Protocol ==== +<2->{ * find interfaces with interface discovery: \\ % @FSCTL\_QUERY\_NETWORK\_INTERFACE\_INFO@ * bind additional TCP (or RDMA) connection (channel) to established SMB3 session (session bind) * bind (TCP) connections of same quality * bind only to a single node * replay / retry mechanisms, epoch numbers } ==== Multi-Channel - Samba ==== +<2->{ * samba/smbd: multi-process ** process $\Leftrightarrow$ tcp connection ** ==> transfer new connection to existing smbd ** use fd-passing (sendmsg/recvmsg) } +<3->{ * preparation: messaging rewrite using unix dgm sockets with sendmsg [DONE,Volker] * add fd-passing [WIP] * transfer connection already in negprot (ClientGUID) [TODO] * implement channel epoch numbers [started] * implement interface discovery [TODO] } ==== SMB Direct (RDMA) ==== +<2->{ * windows: ** requires multi-channel ** start with TCP, bind an RDMA channel ** reads and writes use RDMA write/read ** protocol/metadata via 
send/receive } +<3->{ * wireshark dissector: [DONE (Metze)] } +<4->{ * samba (TODO): ** prereq: multi-channel / fd-passing ** buffer / transport abstractions [TODO] ** central daemon (or kernel module) to serve as RDMA "proxy" \\ % (libraries: not fork safe and no fd-passing) } ==== SMB Direct (RDMA) - Plan ==== +<2->{ * smbd-d (?) listens for RDMA connection * main smbd listens for TCP connection * main smbd listens (for RDMA) via unix socket connect to smbd-d * client connects via TCP --> smbd forks child smbd (c1) * client connects via RDMA to smbd-d * smbd-d notifies main smbd and transfers connection info * smbd forks child (c2) that inherits connection to smbd-d * c2 smbd passes [connection to smbd-d] to c1 (via ClientGUID) and exits * c1 establishes mmap area with smbd-d * client does rdma calls to smbd-d ** metadata and protocol calls are transferred via socket to tcp-smbd ** rdma read/write directly to tcp-smbd via mmap area } %%% ==== Persistent Handles ==== %%% %%% \transdissolve %%% %%% +<2->{ %%% * like durable file handles with strong guarantees %%% * framework is already there in samba (by support for durable v2) %%% ** ==> easy to satisfy at the protocol level %%% } %%% +<3->{ %%% * the difficulty lies in implementing the guarantees %%% ** need make metadata persistent %%% ** but don't kill performance! %%% ** persistent tdbs !would! kill performance %%% ** ideas: %%% *** need to be sync %%% *** record-level transactions (instead of db-level) %%% *** only replicate to some nodes, not all %%% } ==== Clustering Concepts (Windows) ==== \transdissolve +<2->{ * Cluster: ** (``traditional'') failover cluster (active-passive) ** protocol: @SMB2\_SHARE\_CAP\_CLUSTER@ ** Windows: *** runs off a cluster (failover) volume *** offers the Witness service } +<3->{ * Scale-Out (SOFS): ** scale-out cluster (all-active!) 
** protocol: @SMB2\_SHARE\_CAP\_SCALEOUT@ ** no client caching ** Windows: runs off a cluster shared volume (implies cluster) } +<4->{ * Continuous Availability (CA): ** transparent failover, persistent handles ** protocol: @SMB2\_SHARE\_CAP\_CONTINUOUS\_AVAILABILITY@ ** can be turned on independently on any cluster share (failover or scale-out) ** ==> changed client retry behaviour! } %%% ==== Clustering -- Controlling Flags from Windows ==== %%% %%% \transdissolve %%% %%% +<2->{ %%% * a share on a cluster carries %%% ** @SMB2\_SHARE\_CAP\_CLUSTER@ $\Leftrightarrow$ the shared FS is a cluster volume. %%% } %%% +<3->{ %%% * a share on a cluster carries %%% ** @SMB2\_SHARE\_CAP\_SCALEOUT@ $\Leftrightarrow$ the shared FS is a CSV %%% *** implies @SMB2\_SHARE\_CAP\_CLUSTER@ %%% } %%% +<4->{ %%% * independently settable on a clustered share: %%% ** @SMB2\_SHARE\_CAP\_CONTINUOUS\_AVAILABILITY@ %%% *** implies @SMB2\_SHARE\_CAP\_CLUSTER@ %%% } %%% ==== Clustering -- Server Behaviour ==== \transdissolve +<2->{ * @SMB2\_SHARE\_CAP\_CLUSTER@: ** run witness service (RPC) ** client can register and get notified about resource changes } +<3->{ * @SMB2\_SHARE\_CAP\_SCALEOUT@: ** do not grant batch oplocks, write leases, handle leases ** ==> no durable handles unless also CA } +<4->{ * @SMB2\_SHARE\_CAP\_CONTINUOUS\_AVAILABILITY@: ** offer persistent handles ** timeout from durable v2 request } ==== Clustering -- Client Behaviour (Win8) ==== \transdissolve +<2->{ * @SMB2\_SHARE\_CAP\_CLUSTER@: ** clients happily work if witness is not available } +<3->{ * @SMB2\_SHARE\_CAP\_SCALEOUT@: ** clients happily connect if @CLUSTER@ is not set. ** clients DO request oplocks/leases/durable handles ** clients are not confused if they get these } +<4->{ * @SMB2\_SHARE\_CAP\_CONTINUOUS\_AVAILABILITY@: ** clients happily connect if @CLUSTER@ is not set. 
** clients typically request persistent handle with RWH lease } %%%+<5->{ %%%* Note:\\ % %%%Win8 sends @SMB2\_FLAGS\_REPLAY\_OPERATION@ in writes and reads (from 2nd in a row) \\ % %%%$\Leftrightarrow$ \\ % %%%The server announces @SMB2\_CAP\_PERSISTENT\_HANDLES@. %%%} %%% ==== Clustering -- Client Behaviour (Win8) : Retries ==== %%% %%% +<2->{ %%% * Test: Win8 against slightly pimped Samba (2 IPs) %%% } %%% +<3->{ %%% * Server-Matrix (on/off): %%% ** persistent handle cap %%% ** durable handles %%% ** cluster share cap %%% ** scale out cap %%% ** ca share cap %%% } %%% +<4->{ %%% * The test: %%% ** connect to share with explorer %%% ** start copying file (2G) %%% ** kill smbd %%% ** wait for the client to pop up an error dialog %%% ** click cancel %%% ** stop capture %%% } %%% %%% ==== Clustering -- Client Behaviour (Win8) : Retries ==== %%% %%% +<2->{ %%% * only two different retry characteristics: CA $\leftrightarrow$ non-CA %%% } %%% +<3->{ %%% * non-CA-case %%% ** 3 consecutive attempt rounds: %%% *** for each of the two IPs: \\ % %%% arp IP \\ % %%% three tcp syn attempts to IP with 0.5 sec breaks %%% ** ==> some 2.1 seconds for 1 round %%% ** between attempts: %%% ** dns, ping, arp ... 5.8 seconds %%% ** ==> _red_18 seconds_ %%% } %%% +<4->{ %%% * CA-Case %%% ** retries attempt rounds from above for _red_14 minutes_ %%% } %%% %%% %%% %%% ==== ====[plain] %%% %%% \transdissolve %%% %%% <[center] %%% <<>> %%% [center]> %%% %%% ==== Clustering with Samba/CTDB ==== +<2->{ * all-active SMB-cluster with Samba and CTDB... \\ % +<3->{...since 2007! \smiley } } +<4->{ * transparent for the client ** CTDB: *** metadata and messaging engine for Samba in a cluster *** plus cluster resource manager (IPs, services...) ** client only sees one ``big'' SMB server ** we could not change the client!... 
** works ``well enough'' } +<5->{ * challenge: ** how to integrate SMB3 clustering with Samba/CTDB ** good: rather orthogonal ** ctdb-clustering transparent mostly due to management } ==== Witness Service ==== +<2->{ * an RPC service ** monitoring of availability of resources (shares, NICs) ** server asks client to move to another resource } +<3->{ * remember: ** available on a Windows SMB3 share $\Leftrightarrow$ @SMB2\_SHARE\_CAP\_CLUSTER@ ** but clients happily connect w/o witness } +<4->{ * status in Samba [WIP (Metze, Gregor Beck)]: ** async RPC: WIP, good progress ($\Rightarrow$ Metze's talk) ** wireshark dissector: essentially done ** client: in @rpcclient@ - done ** server: dummy PoC / tracer bullet implementation done ** CTDB: changes / integration needed } %%% ==== ====[plain] %%% %%% <[center] %%% {\Large %%% !@https://wiki.samba.org/index.php/SMB3@! %%% } %%% [center]> %%% %%% ==== ====[plain] %%% %%% \transdissolve %%% %%% <[center] %%% <[columns] %%% [[[.6\textwidth]]] %%% %%% [[[.3\textwidth]]] %%% <<>> %%% [columns]> %%% [center]> %%% ==== ====[plain] \transdissolve <[center] <[columns] [[[.6\textwidth]]] {\Large Questions? --*4em-- @obnox\@samba.org@ @ma\@sernet.de@ } [[[.3\textwidth]]] <<>> [columns]> [center]> %%% %%%% <[center] %%% %%%% %%% %%%% {\Large %%% %%%% %%% %%%% @obnox\@samba.org / ma\@sernet.de@ %%% %%%% %%% %%%% \vspace*{1em} %%% %%%% %%% %%%% %%%<<>> %%% %%%% <<>> %%% %%%% } %%% %%%% [center]>