Professional Web Applications Themes

Problem loading data files having null values. - MySQL

I had started another thread focussed on data truncation errors when loading my data feed. While I learned something, it didn't solve the problem. On reading further, I am now wondering if the problem is related to how "LOAD DATA" handles null values. You see, the file contains tab delimited values, and if a field is empty, that is reflected in the file as two consecutive tabs. But on reading this page: [url]http://dev.mysql.com/doc/refman/5.0/en/load-data.html[/url] It looks like MySQL doesn't know how to handle this. Instead, it looks like is requires something to be between the tabs, such as \N or null. ...

  1. #1

    Default Problem loading data files having null values.

    I had started another thread focussed on data truncation errors when
    loading my data feed. While I learned something, it didn't solve the
    problem.

    On reading further, I am now wondering if the problem is related to how
    "LOAD DATA" handles null values. You see, the file contains tab
    delimited values, and if a field is empty, that is reflected in the
    file as two consecutive tabs. But on reading this page:

    [url]http://dev.mysql.com/doc/refman/5.0/en/load-data.html[/url]

    It looks like MySQL doesn't know how to handle this. Instead, it looks
    like is requires something to be between the tabs, such as \N or null.

    Have I misunderstood this, or is LOAD DATA really that crippled? I
    mean, in those applications where I have written my own code for
    loading and saving data, I routinely used consecutive tabs (or commas,
    for csv files) to represent missing values. Parsing that is trivially
    easy using C++. I did it so often, I eventually wrote several utility
    functions (generic functions intensively using the STL) to do it for
    me.

    If my hunch is right, then I should be able to fix my problem by
    writing a data sanitizer to insert \N in those fields where values are
    missing (and fixing date formats where necessary).


    As an aside, the data provider has done a bad job of quality control
    because in the date fields, I have seen several formats so far. I am
    surprised that there are people out there who don't take the time to do
    it right. Would you believe, another data provider has defined the
    data entirely as variable length strings regardless of the fact most of
    the data is numeric and a significant portion of the rest are dates.
    There are even several boolean fields represented as CHAR(1)! I would
    be ashamed to use such definitions in such cirstances!

    Ted

    Ted Guest

  2. #2

    Default Re: Problem loading data files having null values.

    I don't know if this helps but did you look at

    [url]http://asktom.oracle.com/pls/ask/f?p=4950:8:::::F4950_P8_DISPLAYID:533222350291[/url]

    strawberry Guest

  3. #3

    Default Re: Problem loading data files having null values.

    LOAD DATA is not crippled, but the doentation is a bit hard to read,
    especially for frequently-encountered problems like this. As I
    mentioned in my reply to your other thread, your empty fields are
    converted to 0 for decimal columns. If you want NULL instead, you must
    indeed replace empty input fields by \N (with no quotes, unless you
    have a FIELDS ENCLOSED BY clause).

    Here's the relevant excerpt of the manual:
    --- begin quote ---
    An empty field value is interpreted differently than if the field
    value is missing:
    * For string types, the column is set to the empty string.
    * For numeric types, the column is set to 0.
    * For date and time types, the column is set to the appropriate
    "zero" value for the type. See Section 11.3, "Date and Time
    Types".

    These are the same values that result if you assign an empty string
    explicitly to a string, numeric, or date or time type explicitly in an
    INSERT or UPDATE statement.
    --- end quote ---

    Also, to skip over records which yield a warning or an error, just add
    the "IGNORE" keyword after your "LOAD DATA INFILE" command.

    Good luck!

    JFG

    JFG Guest

  4. #4

    Default Re: Problem loading data files having null values.

    Thanks. It is quite interesting, and suggests even Oracle has issues
    with NULL values. The thing is that MySQL defaults to using tabs as
    field delimiters, but it doesn't behave quite correctly. The only
    options I have found so far are to write a little program to sanitize
    the data, correcting inconsistencies in date format and insert "\N"
    into empty fields, on the one hand, and to use "IGNORE" in the LOAD
    DATA statement on the other.

    I don't find either option quite acceptable, even though I must use one
    or the other. It is a waste of time to have to write extra code to
    handle what ought to be a routine data stream parsing issue, on the one
    hand, and in my experience with application programming, it is often a
    mistake to ignore warnings if one wants god quality, robust code.

    But, one does what one can with the tools at his disposal.

    Thanks.

    Ted

    Ted Guest

  5. #5

    Default Re: Problem loading data files having null values.

    Thanks for your help JFG.

    I do appreciate it.

    I must disagree with you though regarding whether or not LOAD DATA is
    crippled. I do not mean to disparage it. It is a priceless resource
    WRT the speed of loading data. However, in terms of handling missing
    data, it is crippled. If data is missing, then NULL is the most
    obvious, and usually correct, choice. Sometimes, other choices can be
    useful, such as using an integer to represent natural numbers and -1 to
    represent missing values, but in all of the programming I've done, the
    norm has been to use null in memory, and to leave nothing between the
    delimiters used in a file for data stored on disk. NULL is definitely
    not the same thing as an empty string or 0. When loading data from a
    text file, the default behaviour ought to be to treat consecutive
    delimiters as indicating the field is empty and to assign to it a null
    value. And of course, that default should be something that can be
    over-ridden as needed. But that is just my opinion based on my
    experience working with data.

    Given what you have quoted, though, it seems that LOAD DATA is not only
    crippled but broken since it does not seem to do what the material you
    quoted says. Instead of setting the values in question to an empty
    string or to 0 and moving on, it just stops the load of the data
    altogether and gives the error message. Or is this a case of relevant
    material hiding elsewhere in the doentation?

    What is the rationale for treating an empty field value differently
    from when the field value is missing? I am not sure I even understand
    how an "empty field value" is different from a "missing field value."
    Perhaps if that distinction can be made clear, and how that factors
    into the various options for handling missing data when loading data,
    it may be easier to understand why MySQL does what it does in the LOAD
    DATA statement.

    Thanks again,

    Ted

    Ted Guest

  6. #6

    Default Re: Problem loading data files having null values.

    Ted wrote:
    > When loading data from a
    > text file, the default behaviour ought to be to treat consecutive
    > delimiters as indicating the field is empty and to assign to it a null
    > value.
    I really do not want this to be the default behavior.

    Defaulting an empty string to NULL results in a real error if the column
    it's inserting into has a NOT NULL constraint. However, it's not an
    error to put a zero-length string into such a column. It is appropriate
    that the default behavior is that an empty string means an empty string.

    There will always be tricky behavior regarding NULL vs. empty string,
    because NULL is not a value, it's a state. It really deserves to have
    some kind of flag separate from the value, and in fact in RDBMS storage,
    there is. But when loading records from a text file, there are only
    values; there is no place to keep a NULL flag.

    The doc for LOAD DATA says that things are simplified if you load data
    using FIELDS ENCLOSED BY. Then you can have NULL in your text file, as
    distinct from "NULL", and the difference is more easily recognized by
    the LOAD DATA pr.

    When not using FIELDS ENCLOSED BY, it's difficult for the pr to
    recognize the difference between NULL and the literal string "NULL". So
    they introduced that backslash-sequence \N to try to do it. However, I
    notice that there are a few bugs on file at bugs.mysql.com related to
    this. So their implementation is not perfect. There are also confusing
    cases, like if you change the FIELDS ESCAPED BY character.

    Regards,
    Bill K.
    Bill Karwin Guest

  7. #7

    Default Re: Problem loading data files having null values.

    Hi Bill,

    I can understand your position, although I am not sure I'd agree.
    Rather, thinking of what the data, or lack of it, means, it seems to me
    that an empty string means that there is no information about what the
    value really is, and so it means the same thing as a null value. It
    seems to me then that null means the same thing as an empty string. If
    a field is defined to be NOT NULL, allowing an empty string in a NOT
    NULL field hides a problem in the data. Either the definition of the
    field as NOT NULL is a mistake, or the use of an empty string is a
    mistake. Either way, it is a problem to be examined and rectified,
    possibly by specifying a default value in the field definition if a
    case can be made for a default should real data be missing (a default
    value of an empty string should also handle the situation you
    describe). As I understand it, a field is null if there is no data for
    it, and it is not null if there is data for it, so being null is
    perfectly correlated with the availability of data and I see, at
    present, no value in providing extra storage for it. It can be
    inferred by the presence or absence of data in the field. I'd agree.
    therefore that an empty string means an empty string, but it appears to
    me that it ALSO means the same thing as null.

    I had tried using FIELDS ENCLOSED BY '\t', realizing that is the
    default, but alas that did not help me. :-(

    Cheers,

    Ted

    Ted Guest

  8. #8

    Default Re: Problem loading data files having null values.

    Ted wrote:
    > Hi Bill,
    >
    > I can understand your position, although I am not sure I'd agree.
    > Rather, thinking of what the data, or lack of it, means, it seems to me
    > that an empty string means that there is no information about what the
    > value really is, and so it means the same thing as a null value. It
    > seems to me then that null means the same thing as an empty string. If
    > a field is defined to be NOT NULL, allowing an empty string in a NOT
    > NULL field hides a problem in the data. Either the definition of the
    > field as NOT NULL is a mistake, or the use of an empty string is a
    > mistake. Either way, it is a problem to be examined and rectified,
    > possibly by specifying a default value in the field definition if a
    > case can be made for a default should real data be missing (a default
    > value of an empty string should also handle the situation you
    > describe). As I understand it, a field is null if there is no data for
    > it, and it is not null if there is data for it, so being null is
    > perfectly correlated with the availability of data and I see, at
    > present, no value in providing extra storage for it. It can be
    > inferred by the presence or absence of data in the field. I'd agree.
    > therefore that an empty string means an empty string, but it appears to
    > me that it ALSO means the same thing as null.
    >
    > I had tried using FIELDS ENCLOSED BY '\t', realizing that is the
    > default, but alas that did not help me. :-(
    >
    > Cheers,
    >
    > Ted
    >
    Ted,

    FWIW. I agree with Bill.

    There is a definite difference between an empty string and an absence of data.
    NULL means the value does not exist. An empty string means it exists, but
    contains nothing.

    For instance - and employee database may have a null value in the employee_id
    field, indicating this employee has not been assigned an id (possibly a new
    employee who has not made it completely through the process). Or, a reports_to
    field which may be NULL (currently reporting to no one - changing managers?).

    This is different than a second line of an address, which may be empty (i.e no
    apartment number). The second line of the address is there - but it doesn't
    contain anything

    This is a common theme in all relational databases; distinguishing between the
    two can be very important.



    T

    --
    ==================
    Remove the "x" from my email address
    Jerry Stuckle
    JDS Computer Training Corp.
    [email]jstucklexattglobal.net[/email]
    ==================
    Jerry Stuckle Guest

  9. #9

    Default Re: Problem loading data files having null values.

    Ted wrote:
    > that an empty string means that there is no information about what the
    > value really is, and so it means the same thing as a null value.
    Well, that's fine, you're entitled to your opinion like anyone else, but
    the ANSI SQL standard defines NULL and an empty string to be two
    different things. So it doesn't matter what you or I think it should
    be! :)

    According the standard, an empty string _is_ a legal value for data.
    It's not NULL, and does not imply NULL. It may mean something
    equivalent to "absence of data" for some columns in some applications.
    In fact, this might be pretty common.

    But if there are any applications for which a zero-length string value
    is significant and is distinct from "no value", then the language should
    support that.

    Another reason an empty string value is not NULL is to preserve
    orthogonality in the language. In other SQL datatypes such as integers,
    no single value in the domain is the same as NULL. That is, not 0 nor
    999 nor -MAXINT is an integer value that signifies NULL.

    That's the point of NULL, to separate the characteristic of "no value"
    from the domain of values, so that all the legitimate values that we
    might need in our applications are available to us. That's why the
    mantra is: NULL is a state, not a value.

    So an empty string, being a value, _can't_ signify NULL for string
    columns, any more than 0 can signify NULL for integer columns.
    > I had tried using FIELDS ENCLOSED BY '\t', realizing that is the
    > default, but alas that did not help me. :-(
    FIELDS ENCLOSED BY is different from FIELDS TERMINATED BY. Commonly,
    ENCLOSED BY is used for quote characters. In the following line, the
    fields are ENCLOSED BY '"' TERMINATED BY '\t'.

    "one"\t"two"\t"three"

    Regards,
    Bill K.
    Bill Karwin Guest

  10. #10

    Default Re: Problem loading data files having null values.

    Thanks Bill

    I have not read the ANSi SQL standard. Does it give the two reasons
    you give as the reason for the treatment given NULLs. I ask because in
    the problem domains in which I normally work, the meaning I describe is
    the norm. Hence it s hard for me to see a situation in which an empty
    string is useful or meaningful, having ameaning different from null
    with regard to useful data. My background in in natural science and
    so, given that for most variables that can be measured, there are
    infinitely many mathematically possible values hat are physically
    impossible, and so almost all of the numeric methods with which I am
    familiar just use one of the physically impossible values (such as
    negative mass, or a negative integer for a population count) to
    indicate missing data. For the usual purposes of statistical ysis
    in this context, it only matters if the variable has been measured or
    not, and if so what is the value. Then, the yses can be applied
    only to records having complete data for the model being assessed.

    I don't mean to be argumentative. Rather, I am just struggling to
    understand what strikes me as odd; well out of my range of experience.

    Also, for me, a DB product such as MySQL is just a tool, a means to an
    end. My primary interest is in what yses I can perform and what I
    can therefore learn from the data. Perhaps an exmple where an empty
    string has a useful meaning in the context of data ysis that is
    independant of the meaning null would have. Perhaps my imagination is
    limited, but I can't yet imagine a situation in which an empty string
    means anything other than a dearth of data.

    Thanks again,

    Ted

    Ted Guest

  11. #11

    Default Re: Problem loading data files having null values.

    >I have not read the ANSi SQL standard. Does it give the two reasons
    >you give as the reason for the treatment given NULLs.
    Does the ANSI SQL standard govern the operation of mysqlimport?
    I don't think so. However, it does give a distinction between
    SQL NULL and an empty string, and it should.
    >I ask because in
    >the problem domains in which I normally work, the meaning I describe is
    >the norm.
    In the problem domains in which you work, is string data meaningful
    at all?
    >Hence it s hard for me to see a situation in which an empty
    >string is useful or meaningful, having ameaning different from null
    >with regard to useful data.
    What dialing prefix do you dial on your phone to get an outside
    line? On *most* phones, this is an empty digit sequence, but on
    business PBX systems, it might be 9 or some other code. It
    is not NULL, which would suggest you can't even *GET* an outside line.

    What suffix do you use on a root word to make the plural?
    In English, for a lot of words, it's 's'. But the plural of 'deer'
    is 'deer'. How do you enter the suffix? I'd use an empty string.
    NULL suggests that you can't use the suffix method, e.g. the
    plural of mouse is mice. Spelling checker programs use this kind
    of stuff.

    NULL is often used to indicate *missing* data (which may cause you
    to dispatch someone to go measure it or look it up). What is the middle
    initial of a person who doesn't have a middle name? It's not that
    you don't KNOW, it's that there isn't one. A null string used for
    the middle initial looks reasonable when you try to print it.

    What title do you use for a person? Usually, it's Mr., Mrs., or
    Miss, but some cultures don't use them and consider them insulting,
    so you don't want to use one sending mail to them. For these people,
    the data is NOT missing. And the title used sure looks like a null
    string when it's printed on an envelope.

    >My background in in natural science and
    Does natural science even USE strings as variables?
    >so, given that for most variables that can be measured, there are
    >infinitely many mathematically possible values hat are physically
    >impossible, and so almost all of the numeric methods with which I am
    >familiar just use one of the physically impossible values (such as
    >negative mass, or a negative integer for a population count) to
    >indicate missing data.
    That can get embarassing when the physically impossible value turns
    out to be possible. Didn't people used to think that you couldn't
    travel faster than the speed of sound?

    It also means you have to know the data type of the variable in
    order to figure out whether the data is missing (something which I
    suspect may be accident-prone, especially if combined with roundoff
    error). And merging data from two sources that used different
    "impossible" values can be messy.
    >For the usual purposes of statistical ysis
    >in this context, it only matters if the variable has been measured or
    >not, and if so what is the value. Then, the yses can be applied
    >only to records having complete data for the model being assessed.
    How do you apply such ysis to, for example, the name of the owner
    of a given square meter of the Earth's surface? What do you do with
    that value for ocean, which has no owner, but the data is *not* missing?
    I'd be interested in knowing how you compute the average and standard
    deviation of an owner's name.
    >I don't mean to be argumentative. Rather, I am just struggling to
    >understand what strikes me as odd; well out of my range of experience.
    >Also, for me, a DB product such as MySQL is just a tool, a means to an
    >end. My primary interest is in what yses I can perform and what I
    >can therefore learn from the data. Perhaps an exmple where an empty
    >string has a useful meaning in the context of data ysis that is
    >independant of the meaning null would have. Perhaps my imagination is
    >limited, but I can't yet imagine a situation in which an empty string
    >means anything other than a dearth of data.
    You are doing a survey of the types of trees in an area. In this
    field, you are supposed to enter a list of the types of trees found
    in each grid square (Yes, this is poor normalization.) What do
    you enter for the grid square at the center of Lake Benbrook, which
    has no trees? Keep in mind that you do NOT have a list of all
    tree names, and that joker you work with threatened to name the next
    tree species he discovered the "none" tree. Also keep in mind that
    they made you rent a helicopter to go make sure there really are
    no trees in the center of Lake Benbrook, and you don't want them
    to refuse your expense report because you didn't get the data.

    Gordon L. Burditt
    Gordon Burditt Guest

  12. #12

    Default Re: Problem loading data files having null values.

    Ted wrote:
    > Thanks Bill
    >
    > I have not read the ANSi SQL standard. Does it give the two reasons
    > you give as the reason for the treatment given NULLs.
    I admit, I don't have a current copy of the ANSI standard. I learned
    SQL from books that paraphrase the standard, including "Understanding
    the New SQL: A Complete Guide" by Melton & Simon (covers the SQL-92
    standard) and "SQL-99 Complete, Really" by Peter Gulutzan and Trudy
    Pelzer. I've also read a draft copy of the SQL-92 spec. But the final
    doent costs hundreds of dollars, so I don't have it.

    So I'll get out the latter book, and describe the reasoning presented by
    Gulutan & Pelzer.

    They make a point that NULLs _are_ a value, because a null has a
    datatype. The NULL is in the domain of values of whatever column you
    put it in, string, numeric, etc. NULL must have a datatype, because you
    can put it in a column, and every value in a given column must have the
    uniform datatype. You _can't_ say "SELECT NULL FROM sometable" because
    the DBMS cannot infer the datatype (columns of a result set must have a
    datatype).

    If you compare NULL to any other value, you can't say that it's equal
    to, greater than, or less than that value. This is not true for an
    empty string; there's definitely an answer to whether an empty string is
    equal to some other string.

    You can't represent a NULL with a literal, because then you'd have
    something you could compare to other literals, and say something about
    it being greater than or less than, etc. Their book gives an example
    using integers, I suppose because it's easier to show the case for.

    They also make the point that NULL is used for two principle cases:
    first, when the value is unknown, or unspecified, pending, etc. Second,
    when the value is inapplicable or irrelevant in the context.

    Further, they talk about what NULL _doesn't_ mean: it doesn't mean "not
    a number" (NaN). It doesn't mean zero, despite languages like C and
    Java which treat 0 and false and null as logically equivalent, and may
    even use the symbol null for the value zero. And it isn't '' (the empty
    string). This has often been used in the past for "unknown"s.
    > it's hard for me to see a situation in which an empty
    > string is useful or meaningful, having a meaning different from null
    > with regard to useful data.
    It is harder to think of a reasonable example where the difference
    between a string NULL and an empty string is important, than it is to
    think of an example where integer NULL and zero are different. Here's
    what I came up with:

    Say you're doing a JOIN of two tables, people and tax returns, using a
    string column for the join.

    SELECT p.firstname, p.lastname, t.taxrefund
    FROM persons AS p JOIN taxes AS t
    ON p.ssn = t.ssn

    The SSN (social security number) field is a string, and you've got tax
    records from way back before the IRS required SSN on every form, so some
    of the rows are missing values in the ssn columns. Likewise, there are
    some people who don't have SSN's assigned, because they lived in the
    19th century or something.

    Now, if you use '' as the signifier for missing data in both cases, the
    JOIN will incorrectly match all those civil war veterans with the tax
    records that lack ssn values, because '' = '' is TRUE. In fact, it'll
    almost be like a Cartesian product, because every civil war veteran will
    be matched to every such tax record!

    However, if you used NULL to signify missing data, the JOIN comparison
    of NULL = NULL is always FALSE, so these records will not appear in the
    result set at all.

    This may or may not be what you want in a given situation. But the
    point is that NULL and '' behave differently, because of SQL's
    definition that NULL compared to anything, including another NULL, is FALSE.
    > almost all of the numeric methods with which I am
    > familiar just use one of the physically impossible values (such as
    > negative mass, or a negative integer for a population count) to
    > indicate missing data.
    The problem with that technique is that someday, someone (a computer
    program, most likely) might not be aware of your convention, and
    ignorantly multiply the negative mass by its accelaration, and get a
    really curious result for the force of some object. Probably causing
    the next Mars probe to plunge into the sun.

    Using an unused value in the domain of real numbers to signify "unknown"
    is useful if all you've got for notation is the set of real numbers.
    But it's a convention, and it's domain-specific. A different domain,
    say, temperature, might need to use that negative real number value for
    something meaningful. It would have to choose a different value that
    isn't used in _its_ domain. But that value might mean something to some
    other domain. You'd eventually have to publish a reference book,
    listing which value is conventionally used for "missing value" in each
    conceivable domain.

    Not to say that the use of NULL in SQL isn't also a convention. But at
    least the semantics of the language do support NULL as a value that
    cannot be used in comparisons, arithmetic, etc. Every domain and
    datatype treats NULL similarly in this respect, and consistency is
    pretty handy.

    Regards,
    Bill K.
    Bill Karwin Guest

  13. #13

    Default Re: Problem loading data files having null values.

    Gordon Burditt wrote:
    > What suffix do you use on a root word to make the plural?
    > In English, for a lot of words, it's 's'. But the plural of 'deer'
    > is 'deer'. How do you enter the suffix? I'd use an empty string.
    > NULL suggests that you can't use the suffix method, e.g. the
    > plural of mouse is mice.
    This is a good example of the other usage of NULL, which is to signify
    "inapplicable", instead of "unknown".

    Regards,
    Bill K.
    Bill Karwin Guest

  14. #14

    Default Re: Problem loading data files having null values.

    Bill Karwin wrote:
    > Ted wrote:
    >
    >> Thanks Bill
    >>
    >> I have not read the ANSi SQL standard. Does it give the two reasons
    >> you give as the reason for the treatment given NULLs.
    >
    >
    > I admit, I don't have a current copy of the ANSI standard. I learned
    > SQL from books that paraphrase the standard, including "Understanding
    > the New SQL: A Complete Guide" by Melton & Simon (covers the SQL-92
    > standard) and "SQL-99 Complete, Really" by Peter Gulutzan and Trudy
    > Pelzer. I've also read a draft copy of the SQL-92 spec. But the final
    > doent costs hundreds of dollars, so I don't have it.
    >
    > So I'll get out the latter book, and describe the reasoning presented by
    > Gulutan & Pelzer.
    >
    > They make a point that NULLs _are_ a value, because a null has a
    > datatype. The NULL is in the domain of values of whatever column you
    > put it in, string, numeric, etc. NULL must have a datatype, because you
    > can put it in a column, and every value in a given column must have the
    > uniform datatype. You _can't_ say "SELECT NULL FROM sometable" because
    > the DBMS cannot infer the datatype (columns of a result set must have a
    > datatype).
    >
    > If you compare NULL to any other value, you can't say that it's equal
    > to, greater than, or less than that value. This is not true for an
    > empty string; there's definitely an answer to whether an empty string is
    > equal to some other string.
    >
    > You can't represent a NULL with a literal, because then you'd have
    > something you could compare to other literals, and say something about
    > it being greater than or less than, etc. Their book gives an example
    > using integers, I suppose because it's easier to show the case for.
    >
    > They also make the point that NULL is used for two principle cases:
    > first, when the value is unknown, or unspecified, pending, etc. Second,
    > when the value is inapplicable or irrelevant in the context.
    >
    > Further, they talk about what NULL _doesn't_ mean: it doesn't mean "not
    > a number" (NaN). It doesn't mean zero, despite languages like C and
    > Java which treat 0 and false and null as logically equivalent, and may
    > even use the symbol null for the value zero. And it isn't '' (the empty
    > string). This has often been used in the past for "unknown"s.
    >
    >> it's hard for me to see a situation in which an empty
    >> string is useful or meaningful, having a meaning different from null
    >> with regard to useful data.
    >
    >
    > It is harder to think of a reasonable example where the difference
    > between a string NULL and an empty string is important, than it is to
    > think of an example where integer NULL and zero are different. Here's
    > what I came up with:
    >
    > Say you're doing a JOIN of two tables, people and tax returns, using a
    > string column for the join.
    >
    > SELECT p.firstname, p.lastname, t.taxrefund
    > FROM persons AS p JOIN taxes AS t
    > ON p.ssn = t.ssn
    >
    > The SSN (social security number) field is a string, and you've got tax
    > records from way back before the IRS required SSN on every form, so some
    > of the rows are missing values in the ssn columns. Likewise, there are
    > some people who don't have SSN's assigned, because they lived in the
    > 19th century or something.
    >
    > Now, if you use '' as the signifier for missing data in both cases, the
    > JOIN will incorrectly match all those civil war veterans with the tax
    > records that lack ssn values, because '' = '' is TRUE. In fact, it'll
    > almost be like a Cartesian product, because every civil war veteran will
    > be matched to every such tax record!
    >
    > However, if you used NULL to signify missing data, the JOIN comparison
    > of NULL = NULL is always FALSE, so these records will not appear in the
    > result set at all.
    >
    > This may or may not be what you want in a given situation. But the
    > point is that NULL and '' behave differently, because of SQL's
    > definition that NULL compared to anything, including another NULL, is
    > FALSE.
    >
    >> almost all of the numeric methods with which I am
    >> familiar just use one of the physically impossible values (such as
    >> negative mass, or a negative integer for a population count) to
    >> indicate missing data.
    >
    >
    > The problem with that technique is that someday, someone (a computer
    > program, most likely) might not be aware of your convention, and
    > ignorantly multiply the negative mass by its accelaration, and get a
    > really curious result for the force of some object. Probably causing
    > the next Mars probe to plunge into the sun.
    >
    > Using an unused value in the domain of real numbers to signify "unknown"
    > is useful if all you've got for notation is the set of real numbers. But
    > it's a convention, and it's domain-specific. A different domain, say,
    > temperature, might need to use that negative real number value for
    > something meaningful. It would have to choose a different value that
    > isn't used in _its_ domain. But that value might mean something to some
    > other domain. You'd eventually have to publish a reference book,
    > listing which value is conventionally used for "missing value" in each
    > conceivable domain.
    >
    > Not to say that the use of NULL in SQL isn't also a convention. But at
    > least the semantics of the language do support NULL as a value that
    > cannot be used in comparisons, arithmetic, etc. Every domain and
    > datatype treats NULL similarly in this respect, and consistency is
    > pretty handy.
    >
    > Regards,
    > Bill K.
    Bill,

    All I can say is those authors don't understand NULL. Null is not a value, and
    cannot be compared to anything.

    For instance - the following are all false:

    NULL = 1 (Of course)
    NULL = 0 (Yep)
    NULL = NULL (!!!!)

    To test for null you have to say something like

    WHERE colname IS NULL

    As for choosing a value which isn't in the specific domain - all that is is a
    workaround for the lack of a NULL value.

    --
    ==================
    Remove the "x" from my email address
    Jerry Stuckle
    JDS Computer Training Corp.
    [email]jstucklexattglobal.net[/email]
    ==================
    Jerry Stuckle Guest

  15. #15

    Default Re: Problem loading data files having null values.

    Peter H. Coffin wrote:
    > On Sat, 29 Apr 2006 08:55:44 -0400, Jerry Stuckle wrote:
    >
    >>Bill,
    >>
    >>All I can say is those authors don't understand NULL. Null is not a value, and
    >>cannot be compared to anything.
    >>
    >>For instance - the following are all false:
    >>
    >> NULL = 1 (Of course)
    >> NULL = 0 (Yep)
    >> NULL = NULL (!!!!)
    >
    >
    > No, they are NOT false.
    >
    > mysql> select 1 = 0;
    > +-------+
    > | 1 = 0 |
    > +-------+
    > | 0 |
    > +-------+
    > 1 row in set (0.00 sec)
    >
    > That is false. The above examples are NULL, which is not false.
    >
    > mysql> select NULL = NULL;
    > +-------------+
    > | NULL = NULL |
    > +-------------+
    > | NULL |
    > +-------------+
    > 1 row in set (0.01 sec)
    >
    > To demonstrate the difference further, true and false are testable:
    >
    > mysql> select (1 = 0) = 0;
    > +-------------+
    > | (1 = 0) = 0 |
    > +-------------+
    > | 1 |
    > +-------------+
    > 1 row in set (0.01 sec)
    >
    > That is, it is true that 1=0 is false.
    >
    > mysql> select (NULL <> NULL) = 0;
    > +--------------------+
    > | (NULL <> NULL) = 0 |
    > +--------------------+
    > | NULL |
    > +--------------------+
    > 1 row in set (0.00 sec)
    >
    > mysql> select (NULL <> NULL) = 1;
    > +--------------------+
    > | (NULL <> NULL) = 1 |
    > +--------------------+
    > | NULL |
    > +--------------------+
    > 1 row in set (0.00 sec)
    >
    > The NULL <> NULL is not true, and NULL <> NULL is not false.
    >
    > mysql> select (NULL <> NULL) IS NULL;
    > +------------------------+
    > | (NULL <> NULL) IS NULL |
    > +------------------------+
    > | 1 |
    > +------------------------+
    > 1 row in set (0.00 sec)
    >
    > It is true that NULL <> NULL is NULL.
    >
    >
    >>To test for null you have to say something like
    >>
    >> WHERE colname IS NULL
    >>
    >>As for choosing a value which isn't in the specific domain - all that is is a
    >>workaround for the lack of a NULL value.
    >
    >
    > This is correct.
    >
    Correct - I should have said "not true". Any comparison with NULL produces a
    NULL result.

    --
    ==================
    Remove the "x" from my email address
    Jerry Stuckle
    JDS Computer Training Corp.
    [email]jstucklexattglobal.net[/email]
    ==================
    Jerry Stuckle Guest

  16. #16

    Default Re: Problem loading data files having null values.

    Gordon,

    While the examples you give illustrate a theoretical difference between
    empty strings and null, they are lacking something that is present in
    Bill's example. In each of your examples, if I were writing
    application code for them, I could handle them with a simple practice
    of "use the data if present and ignore the field if it isn't". In
    Bill's example, one gets incorrect results if the theoretical
    difference between null and an empty string is ignored. That is what I
    was after.

    BTW, in numerical taxonomy, and sometimes other subdisciplines in
    ecology and biogeography, we do make use of qualitative data, and for
    these data, strings are often appropriate. And while it is absurd to
    consider applying a mean to an address or a name, there are many
    different yses designed to work with qualitatibve data. Further,
    there are a number of functions that assume a value of 1 if a certain
    condition is true and 0 otherwise. One must always be sure to use
    yses that are defined on the kind of data you have. Unfortunately,
    many people use the wrong ysis, either because it is inapprpriate
    for either the questions to be answered or the kind of data one can
    obtain, but that takes us well beyond the scope of this forum.

    For your tree survey, the answer to the question "What do you enter for
    the grid square at the center of Lake Benbrook, which has no trees?, is
    "nothing would be entered for the middle of a lake since such a
    location would never be used in the sample design." That is, unless
    there is an island in the middle of the lake Any competent forest
    ecologist will either have as complete a flora as exists for the region
    in which he is working, or he'll hire, or consult with, someone who
    knows the region in which he'll be working. But that is beside the
    point. If he encounters a new species, he'll just give it an
    identifier, take appropriate samples, and have it identified later. He
    can still record data representing the abundance of each species. All
    species for which abundance can not be measured would be assigned a
    value of 0. So then, the data collected in the survey will never have
    null values since those trees found on a given site will have a
    positive value representing a population count or density while those
    not found on a site would have values of 0. And it would not matter if
    one of the unknown specimens was called the 'none' tree, by a joker
    working on the team, since each specimen that he doesn't recognise will
    be given a provisional but unique identifier. One need not, and ndeed
    can not, have a complete list of trees. One adds unique identifiers
    for anything new that is encountered, and that has little impact on how
    one would record population densities or population size. The data
    collected in pant ecology, whether qualitative or quantitative, is
    almost never normalized, although in days gone by, when computr storage
    was much more limited, some quantitative plant ecologists invested much
    time and resources into procedures for storing more data in less space,
    and in retrospect, some of the things we did did bear some resemblance
    to normalization even though few if any of us would never have
    described it in those terms.

    The reason a rational sampling strategy would exclude sites such as the
    middle of the lake is that the data collected there would be really
    boring, having values of 0 for all tree species. But if we translocate
    your example to a place such as the far northern parts of Ontario, in
    the Hudson Bay lowland, the process would be somewhat different since
    one can find bog plants growing out over the lake, and, while one
    wouldn't find large trees on the quaking bog, one can find moderate
    sized shrubs and other plants that will grow in wetlands. But this too
    takes us well beyond the scope of this forum.

    Cheers,

    Ted

    Ted Guest

  17. #17

    Default Re: Problem loading data files having null values.

    Thanks Bill,

    That is what I was looking for. You gave an example where null and
    empty strings behave differently. All of the other examples I have
    seen related ot this could be handled with suitable application code
    (in a client of the DB backend). In all the examples involving client
    aplication code, null and empty strings can be made to behave in the
    same way.

    The last problem you mention above is normally handled by doenting
    the data that has been stored, and this includes the conventions used.
    This kind of doentation has been available for all of the
    environmental data sets to which I have had access.

    Thanks again.

    Ted

    Ted Guest

  18. #18

    Default Problem loading data files having null values.

    A year later.

    Thanks Ted for your post which led me to the right process on my end.

    I fear that this original post went way off track into what a NULL is
    and lost sight of the original point - treatment of empty fields
    during LOAD. It really does not matter how you feel about how a NULL
    should be represented, it matters about how open you can be and accept
    what other vendors are providing in load files to represent NULL. In
    a pure Windows environment (with no unix tools), I'm stuck with a
    problem of converting empty fields to \N. Any loader worth it's salt
    should care more about dealing with a basic problem like this as
    opposed to an esoteric one like 'record start' (first time I've ever
    seen that in a loader).

    j.

    jack.parker4@verizon.net Guest

Similar Threads

  1. problem in loading webservice data in datagrid
    By anitha in forum Macromedia Flex General Discussion
    Replies: 0
    Last Post: April 5th, 07:13 AM
  2. library of flash files, loading, and data assignment?
    By jeturcotte in forum Macromedia Flash Data Integration
    Replies: 0
    Last Post: December 20th, 04:35 PM
  3. Please HELP! Problem with NULL values...
    By tacojohnellenich in forum Macromedia Dynamic HTML
    Replies: 0
    Last Post: June 11th, 06:55 PM
  4. flashplayer problem in 2000 loading external .xml files
    By Wes Tatters in forum Macromedia Flash
    Replies: 4
    Last Post: March 18th, 04:12 AM
  5. Data::Dumper && null values
    By Sherm Pendley in forum PERL Modules
    Replies: 1
    Last Post: February 7th, 06:02 PM

Bookmarks

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139