matlab代写-CSCC43
时间:2022-10-20
CSCC43 - Summary 2 FLOATING POINT ARITHMETIC 1 What is Numerical Analysis? Physical Systems are typically transformed into Mathematical Models by engineers, chemists, physi- cists... Mathematical Models can then have a closed form solution (rarely!), however most times there will only be a numerical solution which requires Numerical Analysis 2 Floating Point Arithmetic 2.1 Representation of non-negative integers Decimal System (Base 10): 350 = (350)10 = 3 · 102 + 5 · 101 + 0 · 100 Binary System (Base 2): 350 = (101011110)2 = 1 · 28 + 0 · 27 + · · ·+ 1 · 21 + 0 · 20 In general, we have Base b System, for b > 0, b 2 N x = (dndn 1 . . . d0)b = dn · bn + dn 1 + · · ·+ d0 · d0 where x 0, x 2 N, 0 di < b, i 2 [1, n] Hexadecimal System (Base 16): Symbols 0, 1, . . . , 8, 9, A,B,C,D,E, F (8FA)16 = 8 · 162 + F · 161 +A · 160 = 8 · 162 + 15 · 161 + 10 · 160 = (2298)10 Converting decimal to base b Example, (350)10 in base b = 2 Numerator Denominator Quotient Remainder 350 2 175 0 175 2 87 1 87 2 43 1 43 2 21 1 21 2 10 1 10 2 5 0 5 2 2 1 2 2 1 0 1 2 0 1 Once we hit 0 in the Quotients column, we simply read the Remainder column from bottom up and we have our conversion. Thus (350)10 = (101011110)2. This algorithm is safe (will always terminate). Where overflow is the only potential problem. 2.2 Representation of real numbers if x 2 R then x = ±(xI .xF )b = ±(dndn 1 . . . d0.d 1d 2 . . . )b Where xI is the integral part and xF is the fractional part. The sign (+ or ) is a single bit 0 or 1 xI is the non-negative integer talked about above, and xF can have infinite digits. Example: (0.77 . . . )10 = (0.7¯)10 = 7 · 10 1 + 7 · 10 2 + · · · Binary System (Base 2): (0.77 . . . )10 = (.000110011 . . . )2 = (0.0 ¯0011)2 = 0·2 1+0·2 2+0·2 3+1·2 4+· · · In general, we have Base b System, for b > 0, b 2 N xF = (.d 1d 2d 3 . . . )b = d 1 · b 1 + d 2 · b 2 + · · · = 1X i=1 d i · b i We say that a Binary Fraction with n digits is terminating if it does not have any cycling and is finite, furthermore it has a terminating decimal representation. Page 1 CSCC43 - Summary 2 FLOATING POINT ARITHMETIC Converting Decimal Fractions to base b Example, (0.625)10 in base b = 2 Multiplier Base Product Integral Fraction 0.625 2 1.25 1 0.25 .25 2 0.5 0 0.5 0.5 2 1 1 0 Once we hit 0 in the fractions column, we simply read the Integral column from top down and we have our conversions. Thus (0.625)10 = (.101)2 However you may not always hit 0 as this algorithm is not safe (won’t always terminate)! For example: What is (0.1)10 in binary? Multiplier Base Product Integral Fraction 0.1 2 0.2 0 0.2 0.2 2 0.4 0 0.4 0.4 2 0.8 0 0.8 0.8 2 1.6 1 0.6 0.6 2 1.2 1 0.2 We see that 0.2 occurs again as a fraction. This means that there must be a cycle! Thus (0.1)10 = (0.0 ¯0011)2 2.3 Machine Representation of Reals Reals represented in computers as Floating Point numbers (FP for short). A FP x in base b has the form: x = (F )b ·b(e)b , where F is the fraction and has the form F = ±(d1, d2 · · · dt)b such F is also called the mantissa. and e = ±(cscs 1 . . . c1)b is called the exponent A FP number is normalized if d1 6= 0 unless d1 = d2 = · · · = dt = 0 Significant Digits: (Of a nonzero FP) are the digits following and including the first nonzero digit. In the case of a normalized FP, all digits of the mantissa is significant (storing useful information). Absolute value of mantissa is always 0 and < 1 Exponent is limited: e 2 [ M,M ] where M = (asas 1 · · · a1)b with each ai = (b 1) Largest FP number in absolute value defined by this system is (.aa . . . a)b · b(aa...a)b where each a = (b 1) Smallest nonzero normalized FP number in absolute value is (0.10 . . . 0)b · b (aa...a)b , where each a = (b 1) Non-normalized FPs allow us to get very very close to 0, consider (.00 . . . 01)b · b (aa...a)b Rb(t, s) denotes the set of all floating point numbers with t digit mantissa and s digit exponent. And any Rb(t, s) is finite, while R is infinite. Overflow orUnderflow occurs whenever a nonzero floating point number with absolute value outside these ranges must be stored on the computer. Note that when the number is too close to zero it’s called underflow. not when its largely negative. Furthermore, R is compact, where as Rb(t, s) is not. This is to say that between any two real numbers, there is an infinite number of reals in between. (infinite density). A real number x = ±(xI .xF )b = ±(dkdk 1 . . . d0.d 1d 2 . . . )b We say we convert a real x to a fp number with the following notation, x 2 R! Fl(x) 2 Rb(t, s) Page 2 f砻 0 1X26 1xzz 以27 1Xp Nz8 惢 CSCC43 - Summary 2 FLOATING POINT ARITHMETIC We can represent this in Rb(t, s) by the following algorithm: 1. Normalizing the mantissa. We shift the decimal point, and readjust with the exponent x = ±(dkdk 1 . . . d0.d 1d 2 . . . )b = ±(.D1D2 . . . )b · bk+1 2. chop or round the mantissa (a) chopping - chop after digit t of the mantissa. (b) rounding - chop after digit t then round Dt depending on whether Dt+1 b/2 or Dt+1 < b/2. A more e cient way to round is to add b/2 to Dt+1, then simply chop. Example Consider Fl(3/2) 2 R10(2, 4) = ( +0.66 if chop +0.67 if round Round o↵ Error Precisely, this is the di↵erence between an x 2 R and FL(x) 2 Rb(t, s). Typically measured relative to x as x FL(x) x = or FL(x) = x(1 ) where is the relative round o↵.