1471 Memory Mapped IO

A request to memory-map a file into an address space is handled by the file system vnode method vop_map() and the seg_vn memory segment driver (see Section 14.7.4). A process requests that a file be mapped into its address space. Once the mapping is established, the address space represented by the file appears as regular memory, and the file system can perform I/O by simply accessing that memory. Memory mapping of files hides the real work of reading and writing the file because the seg_vn memory...
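From the process side, the effect described above can be seen with the standard mmap(2) interface. The following is a minimal user-level sketch, assuming a POSIX system; the function name sum_file_bytes is hypothetical, and nothing here shows the vop_map() or seg_vn internals.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sum the bytes of a file by mapping it and reading it as ordinary memory.
 * No read(2) call is issued; the pages are brought in by the VM system when
 * the loop touches them. Returns -1 on any error or for an empty file.
 * (sum_file_bytes is a hypothetical name used only for this illustration.)
 */
long
sum_file_bytes(const char *path)
{
	struct stat st;
	unsigned char *p;
	long sum = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return (-1);
	if (fstat(fd, &st) != 0 || st.st_size == 0) {
		close(fd);
		return (-1);
	}
	p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);		/* the mapping remains valid after close */
	if (p == MAP_FAILED)
		return (-1);
	for (off_t i = 0; i < st.st_size; i++)
		sum += p[i];	/* plain memory access, not file I/O calls */
	munmap(p, (size_t)st.st_size);
	return (sum);
}
```

Calling sum_file_bytes() on a three-byte file containing "abc" returns 294, the sum of the three character codes.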

12231 The Translation Table Entry

Each entry of the TLB consists of a Translation Table Entry (TTE), which describes the mapping and provides details of its associated properties. The TTE may be thought of as corresponding to a page table entry, or PTE, in the sun4m architecture. A TTE is made up of two components, the tag and the translation data, each of length 64 bits. The TTE tag contains the encoded virtual address and context ID (Figure 12.7), and the TTE data contains the corresponding physical address together with...

131 Solaris Kernel Architecture

The Solaris kernel is grouped into several key components and is implemented in a modular fashion: System call interface. The system call interface allows user processes to access kernel facilities. The kernel then performs specific tasks on behalf of the calling process, such as reading or writing a file, or establishing a network connection. The system call layer consists of a common system call handler, which vectors execution into the appropriate kernel modules. Process execution and...

2101 Procfs Implementation

Procfs is implemented as a dynamically loadable kernel module, /kernel/fs/procfs, and is loaded automatically by the system at boot time. /proc is mounted during system startup by virtue of the default /proc entry in the /etc/vfstab file. The mount phase causes the invocation of the procfs prinit() (initialize) and prmount() file-system-specific functions, which initialize the vfs structure for procfs and create and initialize a vnode for the top-level directory file, /proc. The kernel memory space for the...

242 User Area

The role of the user area (traditionally referred to as the uarea) has changed somewhat in the Solaris environment when compared with traditional implementations of UNIX. The uarea was linked to the proc structure through a pointer and thus was a separate data structure. The uarea was swappable if the process was not executing and memory space was tight. Today, the uarea is embedded in the process structure. The process kernel stack, which was traditionally maintained in the uarea, is now...

intrstat

device     cpu0 %tim    cpu1 %tim    cpu2 %tim    cpu3 %tim
ata#1         0  0.0       4  0.0       0  0.0       0  0.0
bge#0         1  0.0       0  0.0       0  0.0       0  0.0
mpt#0         0  0.0   12661  4.8       0  0.0       0  0.0

device     cpu0 %tim    cpu1 %tim    cpu2 %tim    cpu3 %tim
ata#1         0  0.0       0  0.0       0  0.0       0  0.0
bge#0         6  0.0       0  0.0       0  0.0       0  0.0
mpt#0         0  0.0   12630  4.7       0  0.0       0  0.0

... means it is likely that CPU 1 has device interrupt bindings. The intrstat(1M) data shows us that device mpt#0 (mpt is the device nomenclature, #0 refers to instance 0 of the device) is generating interrupts to CPU 1, which...

Chapter 12 Hardware Address Translation

The hardware address translation (HAT) layer controls the hardware that manages mapping of virtual memory to physical memory. The HAT layer interfaces implement the creation and destruction of mappings between virtual and physical memory and probe and control the MMU. The HAT layer also implements all the low-level trap handlers to manage page faults and memory exceptions. Figure 12.1 shows the logical demarcation between elements of the HAT layer. Figure 12.1. Role of the HAT Layer in...

Figure 177 Turnstiles

typedef struct turnstile_chain {
        turnstile_t *tc_first;    /* first turnstile on hash chain */
        disp_lock_t tc_lock;      /* lock for this hash chain */
} turnstile_chain_t;

turnstile_chain_t turnstile_table[2 * TURNSTILE_HASH_SIZE];

Each entry in the chain has its own lock, tc_lock, so chains can be traversed concurrently. The turnstile itself has a different lock; each chain has an active list (ts_next) and a free list (ts_free). There are also a count of threads waiting on the sync object (ts_waiters), a pointer to the synchronization object (ts_sobj), a thread...

6103 Kstats

The kstat framework is used extensively by applications that monitor the system's performance, including mpstat(1M), vmstat(1M), iostat(1M), and kstat(1M). Unlike on traditional Solaris systems, the values reported in a particular kstat may not be relevant or accurate in non-global zones. When executing in a zone with the pools facility active, mpstat(1M) provides information only for those processors that are members of the processor set of the pool to which the zone is bound. The...

678 TCP Connection Teardown

The ioctl TCP_IOC_ABORT_CONN can abort existing TCP connections and unconnected TCP endpoints without waiting for a timeout. It is used by the Sun Cluster and Netra High Availability (HA) Suite products to allow quick failover of IP addresses from one node to another. The tcp_ioc_abort_conn_t structure has been extended to include a zone ID, allowing all connections associated with a given zone to be terminated; this is used internally as part of zone shutdown. You can preserve the previous behavior...

192 Solaris Resource Management

The naming convention for resource management changes after Solaris 8. For Solaris 8, an unbundled product called Solaris Resource Manager, or SRM, is available. In Solaris 9 and 10, resource management is integrated into the operating system. Generically speaking, resource management refers to specific software components and utilities used to manage hardware resources. Each release builds on the functions and features of the previous release; whereas Solaris 8 requires the installation of SRM...

1884 Checksum Offload

Solaris 10 improved the hardware checksum offload capability further to improve overall performance for most applications. A 16-bit, one's complement checksum offload framework has existed in Solaris for some time. It was originally added as a requirement for Zero Copy TCP/IP in the Solaris 2.6 release but was only recently extended to handle other protocols. Solaris 10 defines two classes of checksum offload: Full. Complete checksum calculation in the hardware, including pseudo-header checksum...
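For reference, the 16-bit one's complement checksum that the hardware computes on the host's behalf can be sketched in software as follows. This is a minimal illustration of the standard Internet checksum (RFC 1071) over a plain byte buffer, not the Solaris offload framework itself; the function name demo_inet_checksum is hypothetical, and the pseudo-header is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * 16-bit one's complement checksum over a byte buffer (RFC 1071 style).
 * Bytes are taken as big-endian 16-bit words; an odd trailing byte is
 * padded with zero. demo_inet_checksum is a hypothetical name.
 */
uint16_t
demo_inet_checksum(const uint8_t *data, size_t len)
{
	uint64_t sum = 0;

	while (len > 1) {
		sum += (uint32_t)((data[0] << 8) | data[1]);
		data += 2;
		len -= 2;
	}
	if (len == 1)
		sum += (uint32_t)(data[0] << 8);

	/* Fold the carries back into the low 16 bits (end-around carry). */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return ((uint16_t)~sum);
}
```

A receiver can verify a buffer by running the same function over the data with the checksum appended; an intact buffer yields 0.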

1111 Kernel Address Space

The kernel virtual memory layout differs from platform to platform, mostly based on the platform's MMU architecture. On x86, and platforms earlier than the sun4u, the kernel uses the top 256 Mbytes or 512 Mbytes of a common virtual address space, shared by the process and kernel (see Section 9.4). Sharing the kernel address space with the process address space limits the amount of usable kernel virtual address space to 256 Mbytes and 512 Mbytes, respectively, which is a substantial limitation on...

User Process Address Space Kernel Address Space

A read system call invokes the file-system-dependent vop_read function. The vop_read method calls into the seg_map segment to locate a virtual address in the kernel address space (via segkpm) for the file and offset requested with the segmap_getmapflt() function. The seg_map driver determines whether it already has a slot for the page of the file at the given offset by looking into its hashed list of mapping slots. Once a slot is located or created, an address for the page is located, and segmap then...

Figure 112 Kernel Address Space

The full list of segment drivers the kernel uses to create and manage kernel mappings is shown in Table 11.3. The majority of the kernel segments are manually calculated and placed for each platform, with the base address and offset hard-coded into a platform-specific header file. See Appendix A for a complete reference of platform-specific kernel allocation and address maps.

Table 11.3. Solaris Kernel Memory Segment Drivers

Allocates and maps...

1224 The Translation Storage Buffer TSB

Since searching the HME hash chains for a translation on every TLB miss would be very expensive, Solaris caches the TTEs in a software-controlled cache (the TSB). In Solaris 10 a process can have up to two TSBs that are allocated, grown, and shrunk on demand. Each TSB in the system is represented by its own tsb_info structure, and the HAT maintains a list of tsb_info structures for TSBs used by a process. Let's look at an actual TSB. We can get the list of tsb_info structures from the hat structure...

Figure 38 Dispatcher Global Priorities

ts_quantum. The time quantum: the amount of time that a thread at this priority is allowed to run before it must relinquish the processor, have its priority reset, and be assigned a new time quantum. Be aware that the ts_dptbl(4) man page, as well as other references, indicates that the value in the ts_quantum field is in ticks. A tick is a unit of time that can vary from platform to platform. In Solaris, there are 100 ticks per second, so a tick occurs every 10 milliseconds. The value in...
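The tick arithmetic in the paragraph above can be written out as a small helper. This is only a sketch of the unit conversion (100 ticks per second gives 10 milliseconds per tick); the function name demo_ticks_to_ms is hypothetical, and it makes no claim about how the ts_quantum field itself is ultimately interpreted, which the text goes on to qualify.

```c
#include <stdint.h>

/*
 * Convert a count of clock ticks to milliseconds, given the tick rate
 * in ticks per second. Returns 0 if hz is 0 to avoid dividing by zero.
 * demo_ticks_to_ms is a hypothetical name used for illustration.
 */
uint64_t
demo_ticks_to_ms(uint64_t ticks, uint32_t hz)
{
	if (hz == 0)
		return (0);
	return (ticks * 1000 / hz);
}
```

With the 100 ticks per second stated above, one tick converts to 10 milliseconds and 20 ticks to 200 milliseconds.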

685 Zone Console Design

Figure 6.1 demonstrates that zones export a virtualized console. More generally, the system's console is an important and widely referenced notion; as seen in previous examples, the zone console is a natural and familiar extension of the system for administrators. While zone consoles are similar to the traditional system console, they are not identical. In general, the notion of a system console has the following properties: Applications may open and write data to the console device. The console...

111 Solaris 10

The list of new technologies integrated into Solaris 10 represents some of the most innovative work done in a volume production operating system. Predictive self-healing. The term predictive self-healing describes the benefits derived from the integration of the Solaris Fault Manager and Solaris Service Manager technologies. Predictive self-healing maximizes the availability of a Solaris system and the services it provides when hardware and software faults occur. Facilities for event detection,...

1892 Dynamic Switch between Interrupt vs Polling Mode

The popularity of 10-Gbit NICs is increasing in the data center because, apart from throughput, 10-Gbit NICs simplify the data center wiring and offer better latencies. To handle the interrupt load from a 10-Gbit NIC, the NIC vendors use interrupt coalescing schemes, by which they interrupt the CPU according to either n number of packets received or t time elapsed. This scheme was employed by several 1-Gbit NICs as well. It suffers from the fact that under lower load, when interrupts are firing...

11333 Importing from Another Arena

Vmem allows one arena to import its resources from another. vmem_create specifies the source arena and the functions to allocate and free from that source. The arena imports new spans as needed and gives them back when all their segments have been freed. The power of importing lies in the side effects of the import functions and is best understood by example. In Solaris, the function segkmem_alloc invokes vmem_alloc to get a virtual address and then backs it with physical pages. Therefore, we...

14103 DNLC Negative Cache

The DNLC has support for negative caching. Some applications repeatedly test for the existence or nonexistence of a file (for example, a lock file or a results file). In addition, many shell PATH variables list directories that don't exist. For these applications, caching the fact that the file doesn't exist (negative caching) is a performance boost. The DNLC negative cache follows the...

1023 Free List and Cache List

The free list and the cache list hold pages that are not mapped into any address space and that have been freed by page_free(). The sum of these lists is reported in the free column in vmstat. Even though vmstat reports these pages as free, they can still contain a valid page from a vnode/offset and hence are still part of the global page cache. Memory on the cache list is not really free; it is a valid cache of a page from a file. However, pages will be moved from the cache list to the free list...

96 Anonymous Memory

At many points we have mentioned anonymous memory. Anonymous memory refers to pages that are not directly associated with a vnode. Such pages are used for a process's heap space, its stack, and copy-on-write pages. In the Solaris kernel, two subsystems are dedicated to managing anonymous memory: the anon layer and the swapfs file system. The anonymous memory allocated to the heap of a process is a result of a zero-fill-on-demand (ZFOD) operation. The ZFOD operation is how we allocate new pages. A...
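The zero-fill behavior is visible from user level. The sketch below is a minimal illustration, assuming a system that provides MAP_ANON (Solaris, Linux, and the BSDs do): it creates an anonymous mapping and checks that every byte reads as zero before anything has been written. The function name is hypothetical, and the anon layer and swapfs mechanics behind the mapping are not shown.

```c
#define	_DEFAULT_SOURCE		/* glibc: expose MAP_ANON in strict modes */
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory and confirm that the pages are
 * zero-filled on first touch and writable afterward.
 * Returns 1 if so, 0 if not, -1 if the mapping could not be created.
 * (demo_anon_is_zero_filled is a hypothetical name.)
 */
int
demo_anon_is_zero_filled(size_t len)
{
	unsigned char *p;
	int ok = 1;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p == MAP_FAILED)
		return (-1);

	/* No file backs these pages; each first touch yields zeros. */
	for (size_t i = 0; i < len; i++) {
		if (p[i] != 0) {
			ok = 0;
			break;
		}
	}

	/* A store makes the page a private, modified anonymous page. */
	p[0] = 0x5a;
	if (p[0] != 0x5a)
		ok = 0;

	munmap(p, len);
	return (ok);
}
```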

1831 Vertical Perimeter

An squeue guarantees that only a single thread can process a given connection at any given time, thus serializing access to the TCP connection structure by multiple threads (both from the read and write side) in the merged TCP/IP module. It is similar to the STREAMS QPAIR perimeter, but instead of just protecting a module instance, it protects the whole connection state from IP to sockfs (the socket file system, the implementation of sockets in Solaris introduced in the Solaris 8 release). Vertical...

2114 MDB Components and Their Implementation in kmdb

We now use our earlier discussion of mdb to motivate our review of the major subsystems used by kmdb. Recall that the three subsystems discussed were the target layer, module management, and terminal management (termio). The implementation of kmdb is largely the story of the replacement of support libraries with subsystems designed to work in kmdb's unique environment. Figure 21.2 shows how these replacement subsystems relate to the core debugger. The target layer itself is unchanged in kmdb. What...

55 Least Privilege Interfaces

In this section we describe the details of how the implementation of process privileges is layered in the kernel and the interfaces offered between various subsystems.

5.5.1. The Conspiracy of Bit Sets and Constants

The most convenient data structures for privileges and privilege sets are integers and bit sets: a few words of memory indexed by bit, with the privilege number as the index. This is how the implementation in Trusted Solaris works as well as most implementations that are based on the...
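The bit-set representation described above can be sketched in a few lines. This is a generic illustration, not the Solaris privilege implementation: the type demo_privset_t, the demo_priv_* functions, and the count of 48 privileges are all hypothetical choices made for the example.

```c
#include <stdint.h>
#include <string.h>

#define	DEMO_NPRIV	48	/* hypothetical number of privileges */
#define	DEMO_WORDS	((DEMO_NPRIV + 31) / 32)

/* A privilege set: a few words of memory, one bit per privilege number. */
typedef struct {
	uint32_t bits[DEMO_WORDS];
} demo_privset_t;

void
demo_priv_empty(demo_privset_t *s)
{
	memset(s, 0, sizeof (*s));
}

/* Add privilege number priv to the set; out-of-range numbers are ignored. */
void
demo_priv_add(demo_privset_t *s, unsigned priv)
{
	if (priv < DEMO_NPRIV)
		s->bits[priv / 32] |= (uint32_t)1 << (priv % 32);
}

/* Return 1 if privilege number priv is in the set, else 0. */
int
demo_priv_ismember(const demo_privset_t *s, unsigned priv)
{
	if (priv >= DEMO_NPRIV)
		return (0);
	return ((s->bits[priv / 32] >> (priv % 32)) & 1);
}

/* Return 1 if every privilege in a is also in b, else 0. */
int
demo_priv_issubset(const demo_privset_t *a, const demo_privset_t *b)
{
	for (int i = 0; i < DEMO_WORDS; i++) {
		if (a->bits[i] & ~b->bits[i])
			return (0);
	}
	return (1);
}
```

Membership and subset tests reduce to a handful of word operations, which is one reason this layout is convenient for privilege checks.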

1336 CPU Specific Large Page Support

The TLB configurations are quite different across versions of UltraSPARC processors, but they share a few items in common. UltraSPARC I through IV support four page sizes: 8 Kbytes, 64 Kbytes, 512 Kbytes, and 4 Mbytes. In addition, there are separate TLBs for the instruction and data paths. UltraSPARC I and II. The UltraSPARC I and II microprocessors (143 MHz to 480 MHz) have two TLBs, one for the instruction path and one for the data path. Each TLB is a 64-entry, fully associative TLB that supports...
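The four page sizes listed above form a geometric series: each is eight times the previous one. A short sketch makes the arithmetic explicit. The 0 through 3 "size code" used as the index here is an assumption made for illustration, as are the function names.

```c
#include <stddef.h>

/*
 * Page size in bytes for an illustrative size code 0..3:
 * 0 -> 8 Kbytes, 1 -> 64 Kbytes, 2 -> 512 Kbytes, 3 -> 4 Mbytes.
 * Each step multiplies by 8 (a shift of 3 bits). Returns 0 for any
 * other code. (demo_page_bytes is a hypothetical name.)
 */
size_t
demo_page_bytes(unsigned code)
{
	if (code > 3)
		return (0);
	return ((size_t)8192 << (3 * code));
}

/* Number of 8-Kbyte base pages covered by one TLB entry of that size. */
size_t
demo_base_pages_per_entry(unsigned code)
{
	return (demo_page_bytes(code) / 8192);
}
```

One 4-Mbyte entry covers the same range as 512 entries of 8 Kbytes, which is the motivation for large pages when a TLB holds only 64 entries.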

942 Address Space Callbacks

An address space callback is a facility that informs clients of specific events pertaining to address space management. An example of such an event is an address space unmap request: to prevent holding the address space's lock (a_lock) for a large amount of time during an unmap (which can cause ps(1) and other tools to hang), the unmap is performed as a callback without holding the a_lock. As one example, we use this facility to prevent an NFS server timeout from causing such hangs. A...

Figure 1413 Solaris DNLC

The lookup algorithm uses a rotor pointing to a hash chain; the rotor switches chains for each invocation of dnlc_enter() that needs a new entry. The algorithm starts at the end of the chain and takes the first entry that has a vnode reference count of 1 or no pages in the page cache. In addition, during lookup, entries are moved to the front of the chain so that each chain is sorted in LRU order. The DNLC was enhanced to use the kernel memory allocator to allocate a variable length string for the...

dtrace -l -n change-pri

ID PROVIDER MODULE FUNCTION NAME

Here's a simple script that enables all the change-pri probes, uses the count() aggregating function, and keys the aggregation on the probe function name, the name of the executable, the thread ID, and the thread priority.

Door Server Process

On the server side, a function defined in the process can be made available to external client processes by creation of a door (door_create(3X)). The server must also bind the door to a file in the file system namespace. This is done with fattach(3C), which binds a STREAMS-based or door file descriptor to a file system path name. Once the binding has been established, a client can issue an open to the path name and use the returned file descriptor in door_call(3X). Doors are implemented in the...

11492 Finding References to Data

When trying to diagnose a memory corruption problem, you should know what other kernel entities hold a copy of a particular pointer. This is important because it can reveal which thread accessed a data structure after it was freed. It can also make it easier to understand what kernel entities are sharing knowledge of a particular valid data item. You use the ::whatis and ::kgrep dcmds to answer these questions. You can apply ::whatis to a value of interest.

705d8640 is 705d8640+0, allocated from...

152 System V IPC

Three types of IPC originally developed for System V UNIX have become standard across all UNIX implementations: shared memory, message passing, and semaphores. These facilities provide the common IPC mechanism used by the majority of applications today. System V Shared Memory. Processes can create a segment of shared memory. Changes within the area of shared memory are immediately available to other processes that attach to the same shared memory segment. System V Message Queues. A message queue...
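The shared memory behavior described above can be demonstrated with the standard shmget(2), shmat(2), shmdt(2), and shmctl(2) calls. The sketch below is a simplification: it attaches the same segment twice within one process, with the second attachment standing in for a second process. The function name is hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Create a private System V shared memory segment, attach it twice, and
 * check that a write through one attachment is visible through the other.
 * Returns 1 if visible, 0 if not, -1 on error.
 * (demo_shm_shared_view is a hypothetical name.)
 */
int
demo_shm_shared_view(void)
{
	int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
	char *a, *b;
	int same = -1;

	if (id < 0)
		return (-1);

	a = shmat(id, NULL, 0);		/* first attachment */
	b = shmat(id, NULL, 0);		/* second view of the same segment */
	if (a != (char *)-1 && b != (char *)-1) {
		strcpy(a, "hello");
		same = (strcmp(b, "hello") == 0);
	}

	if (a != (char *)-1)
		shmdt(a);
	if (b != (char *)-1)
		shmdt(b);
	shmctl(id, IPC_RMID, NULL);	/* remove the segment */
	return (same);
}
```

In real use the two attachments would be made by separate processes that agree on a key, for example one produced by ftok(3C), rather than IPC_PRIVATE.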

151 Traditional UNIX IPC

Solaris implements the traditional UNIX IPC facilities: pipes, named pipes, and UNIX domain sockets. A pipe directly channels data flow between two related processes through an object that operates like a file. Data is inserted at one end of the pipe and travels to the receiving process in a first-in, first-out order. Data is read and written on a pipe with the standard file I/O system calls. Pipes are created with the pipe(2) system call. Named pipes, also commonly known as FIFOs (which stands...
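A pipe between two related processes, as described above, can be sketched with pipe(2), fork(2), and the ordinary read(2) and write(2) calls. This is a minimal illustration with a hypothetical function name; the child writes a message and the parent reads it back in the order written.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that writes msg into a pipe; the parent reads it back.
 * out receives a NUL-terminated copy (truncated to outlen - 1 bytes).
 * Returns the number of bytes read, or -1 on error.
 * (demo_pipe_from_child is a hypothetical name.)
 */
ssize_t
demo_pipe_from_child(const char *msg, char *out, size_t outlen)
{
	int fds[2];
	size_t got = 0;
	ssize_t n;
	pid_t pid;

	if (outlen == 0 || pipe(fds) != 0)
		return (-1);

	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return (-1);
	}
	if (pid == 0) {
		/* Child: the writing end of the pipe. */
		close(fds[0]);
		if (write(fds[1], msg, strlen(msg)) < 0)
			_exit(1);
		_exit(0);
	}

	/* Parent: the reading end. Read until EOF or the buffer is full. */
	close(fds[1]);
	while (got < outlen - 1 &&
	    (n = read(fds[0], out + got, outlen - 1 - got)) > 0)
		got += (size_t)n;
	close(fds[0]);
	(void) waitpid(pid, NULL, 0);

	out[got] = '\0';
	return ((ssize_t)got);
}
```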

12364 Physical Memory and DMA

The lists used to track free physical memory pages are broken up into four ranges by address that roughly track the sort of legacy DMA ranges needed in the past on the PC architecture. Memory allocations for devices with DMA range limitations are directed to the appropriate memory range. All other memory allocations go to the highest range with available memory. Once physical pages are allocated and mapped into the kernel for DMA, the I/O system tracks the memory, using power-of-two allocation...

752 What Is an rctl

The resource control framework generalizes the rlimit-process relationship to a resource control-entity relationship. In this sense, a resource control is some amount of information associated with the entity pertinent to resource management operations. The kernel subsystem publishing the resource control can associate default actions to be taken at various thresholds of the resource usage; these actions can be modified, or new actions on new thresholds can be introduced, by the user process or by...

133 Configuring for Multiple Page Sizes

Once we determine that our application warrants the use of large pages, we need to construct a strategy for determining what parts of our application to enhance to use large pages. For example, should we attempt to enable large pages for our target process's heap, stack, text, etc.? The trapstat utility gives us a little information about the parts of our address space that incur the TLB misses. The instruction TLB miss information is likely a result of the process's text and library text...

154 Solaris Doors Advanced Solaris IPC

Solaris Doors are a new, fast, lightweight mechanism for calling procedures between processes. Doors are a low latency method of invoking a procedure in a different process on the same system. A door server contains a thread that sleeps, waiting for an invocation from the door client. A client makes a call to the server through the door, along with a small 16-Kbyte payload. When the call is made from a door client to a door server, scheduling control is passed directly to the thread in the door...

172 The Cyclic Page Cache

All modern operating systems implement some form of file system caching, such that frequently referenced files have their contents in physical memory, providing significantly faster access. The virtual memory system and file system page cache have been tightly integrated in Solaris from the very beginning, originating in SunOS 4.0. In Solaris OS, all of free physical memory can be used to cache file system data. The original page cache design implemented a special kernel address space segment, called...

1223 The Translation Table

There are many ways to implement translation tables or page tables. The older SPARC sun4m and sun4d architectures employ a three-level page table as described in the SPARC Reference MMU (SRMMU) specification. The first-level page table consists of 256 entries that point to 256 second-level tables. In turn, each second-level table points to 64 third-level tables that contain the actual page table entries. The problem with multilevel page tables in general is that they are inefficient in terms of...
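The three-level walk can be pictured as slicing a 32-bit virtual address into three table indexes and a page offset. The sketch below assumes the usual SRMMU split of 8 + 6 + 6 index bits with 4-Kbyte pages; the 256-entry first level and 64-entry second level come from the text above, while the third-level width and page size are assumptions taken from the SRMMU layout rather than from this excerpt. All names are hypothetical.

```c
#include <stdint.h>

/* The four fields of a 32-bit virtual address under the assumed split. */
typedef struct {
	unsigned l1;		/* index into the 256-entry first-level table */
	unsigned l2;		/* index into a 64-entry second-level table */
	unsigned l3;		/* index into a 64-entry third-level table */
	unsigned offset;	/* byte offset within the 4-Kbyte page */
} demo_va_parts_t;

/* Split a virtual address into its table indexes and page offset. */
demo_va_parts_t
demo_split_va(uint32_t va)
{
	demo_va_parts_t p;

	p.l1 = va >> 24;		/* top 8 bits */
	p.l2 = (va >> 18) & 0x3f;	/* next 6 bits */
	p.l3 = (va >> 12) & 0x3f;	/* next 6 bits */
	p.offset = va & 0xfff;		/* low 12 bits */
	return (p);
}
```

For the address 0x12345678 this yields first-level index 18, second-level index 13, third-level index 5, and offset 0x678.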

282 A Tour through a System Call

copy 1st arg, clear high bits
copy 2nd arg, clear high bits
ldx [%g1 + CPU_STATS_SYS_SYSCALL], %g2
stx %g2, [%g1 + CPU_STATS_SYS_SYSCALL]

The libc wrapper placed up to the first 6 arguments in %o0 through %o5, with the rest, if any, on stack. During sys_trap, a SAVE instruction obtained a new register window, so those arguments are now available in the corresponding input registers, despite our not performing a save in syscall_trap32 itself. We're going to call the real handler, so we prepare the...

1752 Solaris Mutex Lock Implementation

The kernel defines different data structures for the two types of mutex locks, adaptive and spin, as shown below. Public interface to mutual exclusion locks. See mutex(9F) for details. The basic mutex type is MUTEX_ADAPTIVE, which is expected to be used in almost all of the kernel. MUTEX_SPIN provides interrupt blocking and must be used in interrupt handlers above LOCK_LEVEL. The block cookie argument to mutex_init() encodes the interrupt level to block. The block cookie must be NULL for adaptive...

922 Address Spaces on SPARC Systems

The process address space on SPARC systems varies across different SPARC platforms according to the MMU on that platform. SPARC has three different address space layouts: The SPARC V7 combined 32-bit kernel and process address space, found on sun4c, sun4d, and sun4m machines. Note that support for SPARC V7 exists only in Solaris 9 and earlier. The SPARC V9 32-bit separated kernel and process address space model, found on sun4u machines. The SPARC V9 64-bit separated kernel and process address...

1424 File Descriptor Limits

Each process has a hard and soft limit for the number of files it can have opened at any time; these limits are administered through the Resource Controls infrastructure by process.max-file-descriptor (see Section 7.5 for a description of Resource Controls). The limits are checked during falloc(). Limits can be viewed with the prctl command.

sol9$ prctl -n process.max-file-descriptor
NAME    PRIVILEGE       VALUE    FLAG   ACTION     RECIPIENT
        privileged      65.5K       -   deny
        system          2.15G     max   deny

If no resource controls are...
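A process can also inspect its own soft and hard descriptor limits through the traditional getrlimit(2) interface. The sketch below is a minimal illustration with a hypothetical function name; it assumes that RLIMIT_NOFILE reflects the same limits that the resource control above administers, and it does not use the resource controls API itself.

```c
#include <sys/resource.h>

/*
 * Fetch the calling process's soft and hard limits on open file
 * descriptors. Returns 0 on success, -1 on failure.
 * (demo_fd_limits is a hypothetical name.)
 */
int
demo_fd_limits(unsigned long long *soft, unsigned long long *hard)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return (-1);
	*soft = (unsigned long long)rl.rlim_cur;	/* soft limit */
	*hard = (unsigned long long)rl.rlim_max;	/* hard limit */
	return (0);
}
```

The soft limit is the value enforced on the process; an unprivileged process may raise it with setrlimit(2) only up to the hard limit.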

951 The vnode Segment segvn

The most widely used segment driver is the vnode segment driver, seg_vn. The seg_vn driver maps files (or vnodes) into a process address space, using physical memory as a cache. The seg_vn segment driver also creates anonymous memory within the process address space for the heap and stack and provides support for System V non-ISM shared memory. (See Section 4.4.) The seg_vn segment driver manages the following mappings into process address space: executable text, executable data, heap and stack...

2811 Handling a System Call Trap

A SPARC trap instruction (ta n) executed in userland by the wrapper function results in a trap type 0x100 + n being taken, and we move from trap-level 0 (TL0), where all userland and most kernel code executes, to trap-level 1 (TL1) in nucleus context. Code that executes in nucleus context has to be hand-crafted in assembler since nucleus context does not comply with the ABI conventions and is generally much more restricted in what it can do. The task of the trap handler executing at TL1 is to provide the...

38 Dispatcher Functions

The dispatcher's primary functions are to decide which runnable thread gets executed next, to manage the context switching of threads on and off processors, and to provide a mechanism for inserting into a dispatch queue kthreads that become runnable. Other dispatcher functions handle initialization and scheduling class loading, the lock functions previously discussed, preemption, and support for user and administrative commands, such as dispadmin(1M) and priocntl(1), that monitor or change...

12236 hmeblk Hash Tables

The sun4u kernel maintains two hashed tables of hme_blk structures: one for the kernel address space and one for all user address spaces. The kernel table is pointed to by the kernel variable khme_hash, and the user table is pointed to by uhme_hash. These tables are represented as an array of hmehash_bucket structures. The number of buckets in the user hash is defined by the variable uhmehash_num; the number of buckets in the kernel hash is defined by khmehash_num.

uint64_t hmeh_nextpa; /* physical...

1575 Rolling the Log

Occasionally, the data in the log needs to be written back to the file system, a procedure called log rolling. Log rolling occurs for the following reasons: to update the on-disk file system with committed metadata deltas; to free space in the log for new deltas; to roll the entire log to disk at unmount; to partially roll the on-disk log when it is getting full; to completely roll the log with the _FIOFFS ioctl (file system flush); to partially roll the log every 5 seconds when no new deltas exist in...