Handbook of Floating-Point Arithmetic

This handbook is a definitive guide to the effective use of modern floating-point arithmetic, which has evolved considerably, from the frequently inconsistent floating-point number systems of early computing to the recent IEEE 754-2008 standard. Most of computational mathematics depends on floating-point numbers, and understanding their various implementations will allow readers to develop programs specifically tailored to the standard’s technical features. Algorithms for floating-point arithmetic are presented throughout the book and illustrated where possible by example programs that show how these techniques appear in actual coding and design.

The volume breaks its core topic into four parts: the basic concepts and history of floating-point arithmetic; methods of analyzing floating-point algorithms and optimizing them; implementations of IEEE 754-2008 in hardware and software; and useful extensions to the standard floating-point system, such as interval arithmetic, double- and triple-word arithmetic, operations on complex numbers, and formal verification of floating-point algorithms.

This new edition updates chapters to reflect recent changes to programming languages and compilers and the prevalence of GPUs in recent years. The revisions also add material on the fused multiply-add instruction and on methods of extending floating-point precision. As supercomputing becomes more common, more numerical engineers will need to use number representation to account for trade-offs between various parameters, such as speed, accuracy, and energy consumption. The Handbook of Floating-Point Arithmetic is designed for students and researchers in numerical analysis, programmers of numerical algorithms, compiler designers, and designers of arithmetic operators.