Database Issue Summary
Problem Summary
Chandler needs to store data across successive invocations of the program. This data needs to be conveniently and quickly available to the program, both remotely and locally. This discussion is about the database, which stores the actual Chandler items. The schema of the items is the data model, which is addressed elsewhere.
Complete History
We've discussed the issue at length in many meetings and the topic has been exhaustively written about:
Much of this documentation is wordy and unclear, and it does not distinguish wishes from requirements, but it discusses the issues completely.
Issue Summary
- While a general consensus seems to exist, that is not obvious from reading the existing documentation.
Risks
- Our entire project builds upon the foundation of the database
- If we have to build our own database instead of using or adapting an existing solution, we might significantly increase the scope of our project
- If we don't build confidence that we know what we're doing, we could hamper potential efforts that don't bear fruit quickly.
- Our requirements may be so extensive that no practical solution exists.
- Traditional database solutions are inadequate for our needs.
- This project may require more than one person.
Known Answers
- Data issues now seem to be well divided between Data Model, Python API, PIM Schema and Repository, with each area having an owner
- In the past, there has been general consensus about much of the architecture:
  - We have a backend database, most likely built on a reliable, existing open source project (e.g. SleepyCat). This database may be either remote or local to the machine Chandler is running on, but in either case it lives in its own process.
  - The application communicates with this backend database via the RAP protocol.
  - An object cache exists between the application and RAP.
  - There are two main ways objects in the database are accessed: transparently as Python objects (which can be complete or partial), or via iterators which provide sequential or random access to objects that are too large to load at once.
  - A query API exists that returns a list of object references which, when large, may be accessed with an iterator.
  - When two or more users modify the same object in the database, the user resolves the collision, instead of the computer.
  - A mechanism to preload objects, synchronously or asynchronously.
- Since much of Chandler's access to data is via Python objects, this insulates us from changes to the object cache, the protocol, and the backend database. To the extent that we can standardize on a simple API for iterators, we could insulate this access as well.
- ZODB, an open source project, seems (according to John) to be most of what we need. If we could modify ZODB for our use, or hire ZODB experts that could adapt it for us, it might solve our problem.
- We know we need the following features:
  - Access to data much larger than memory.
  - Efficient network access.
  - Indexing, including fulltext and a query mechanism.
  - References and associated strategies that don't burden the application with housekeeping functions like garbage collection.
  - A mechanism for removing unused data from the database, e.g. compaction.
  - Schema verification and a strategy for schema evolution.
  - Transactions.
  - Multiple connections to a database.
  - Replication.
  - Access control lists (ACLs).
- It would be nice to have the following features:
  - Transactions for multiple levels of undo.
  - Automatic backup and/or corrupted database recovery.
  - A rich set of reference types including weak references.
  - Automatic data upgrading/downgrading to different schema versions.
  - A simple way to move databases between machines with different byte orders.
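To make the layering in the consensus architecture concrete, here is a minimal Python sketch of the two access paths: transparent objects that load on first use through an object cache, and an iterator over query results that hands out references without loading everything. All names here (Backend, ObjectCache, Item, iter_query) are hypothetical stand-ins invented for illustration. In particular, Backend is an in-memory substitute for the out-of-process database reached via RAP; it is not the RAP protocol or any ZODB API.

```python
from typing import Callable, Dict, Iterator, List


class Backend:
    """Hypothetical stand-in for the out-of-process database behind RAP."""

    def __init__(self, items: Dict[int, dict]):
        self._items = items
        self.fetches = 0  # counts simulated round trips to the database

    def fetch(self, item_id: int) -> dict:
        self.fetches += 1
        return dict(self._items[item_id])

    def query(self, predicate: Callable[[dict], bool]) -> List[int]:
        # Returns object references (ids), not loaded objects.
        return [i for i, v in sorted(self._items.items()) if predicate(v)]


class ObjectCache:
    """Sits between the application and the backend; loads each item once."""

    def __init__(self, backend: Backend):
        self._backend = backend
        self._loaded: Dict[int, dict] = {}

    def get(self, item_id: int) -> dict:
        if item_id not in self._loaded:
            self._loaded[item_id] = self._backend.fetch(item_id)
        return self._loaded[item_id]


class Item:
    """Transparent proxy: the first attribute read triggers the load."""

    def __init__(self, cache: ObjectCache, item_id: int):
        self._cache = cache
        self._id = item_id

    def __getattr__(self, name: str):
        # Only called when normal lookup fails, i.e. for stored attributes.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._cache.get(self._id)[name]
        except KeyError:
            raise AttributeError(name) from None


def iter_query(backend: Backend, cache: ObjectCache,
               predicate: Callable[[dict], bool]) -> Iterator[Item]:
    """Yield unloaded proxies one at a time for a possibly large result."""
    for item_id in backend.query(predicate):
        yield Item(cache, item_id)
```

Because the application only sees Item attributes and a plain Python iterator, the cache, the protocol, and the backend could in principle be swapped without touching application code, which is the insulation argument made above.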
Largest Unknowns
- The exact query API
- The indexing architecture
- Synchronization strategy
- The form of application notification of changes to data, including collision resolution.
- Iterator API
- Overhead, efficiency and performance
- ACL architecture
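Since the form of change notification and collision resolution is still open, the following Python sketch shows only one possible shape, not a decision: each stored value carries a version number, a write based on a stale version raises a collision, and the application hands both values to the user through a callback. VersionedStore, Collision, save and ask_user are hypothetical names made up for this example.

```python
from typing import Callable, Dict, Tuple


class Collision(Exception):
    """Raised when a write is based on an out-of-date version."""

    def __init__(self, key: str, mine: str, theirs: str):
        super().__init__(key)
        self.key, self.mine, self.theirs = key, mine, theirs


class VersionedStore:
    """Hypothetical store that keeps (version, value) for each key."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, str]] = {}

    def read(self, key: str) -> Tuple[int, str]:
        return self._data.get(key, (0, ""))

    def write(self, key: str, base_version: int, value: str) -> int:
        current_version, current_value = self.read(key)
        if base_version != current_version:
            # Someone else committed since this writer last read the item.
            raise Collision(key, value, current_value)
        self._data[key] = (current_version + 1, value)
        return current_version + 1


def save(store: VersionedStore, key: str, base_version: int, value: str,
         ask_user: Callable[[str, str, str], str]) -> int:
    """Try to write; on collision let the user, not the computer, choose."""
    try:
        return store.write(key, base_version, value)
    except Collision as c:
        chosen = ask_user(c.key, c.mine, c.theirs)
        latest_version, _ = store.read(key)
        return store.write(key, latest_version, chosen)
```

In a real client the ask_user callback would presumably open a dialog; here it is just a function that receives the key, the user's value, and the stored value, and returns the value to keep.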
Dependencies
--
JohnAnderson - 08 May 2003
Discussion
the documentation referred to is wordy, and lacking in clarity, and lacking in distinction between wishes and requirements (as is much of our documentation).
our high level data goals that have driven us to our solution are not clearly spelled out
the risks of the various ways of achieving these goals are not clearly spelled out
our plan for addressing the risks for our choice is not clearly spelled out
if i am wrong, the document should at least contain clear pointers to this information.
if i am right, these questions belong in the "Issue Summary" section of this document
--
MichaelToy - 19 May 2003
I have to agree about what seemed to be an effort at early documentation. I can see it being foisted on users -- (Subtext) "Here, I struggled with this stuff, now you have to!"
To quote RussellBeattie's recent critique of Cocoon:
"To me all this stuff we spend so much time doing is very, very simple: There's data in the database. You need to grab it and show it to someone so they can do something with it. They read, they write, they delete. That's it. The more crap you put between you and the database is just bad... for each layer, there better be a damn good reason for it."
Russell Beattie Notebook
I think you need to stop trying to be so smart and try being dumb for a change. If RichardFeynman were around, he would tell you the same thing only not as nicely.
--
JonathanSmith - 20 May 2003
"If we don't build confidence that we know what we're doing, we could hamper potential efforts that don't bear fruit quickly."
I think:
- When fear replaces vision as motivation for collaborative effort, then you are in trouble
- If you are looking for somebody to "take charge and push it through" it will cost you in the end
- If the answers are not flowing freely, then discussion was cut off too soon
Committing to ZODB meant posting a big sign -> ROAD CONSTRUCTION AHEAD, e.g., ERP5, yet you are unhappy that there is a lack of smooth database highway on which Chandler can cruise. Sounds like an IndustryLearningExperience. In other words, LetsUseAnObjectOrientedDatabase is more than a notion.
A year ago Bill Seitz and Paul Snively had an interesting discussion about some of the issues you still face. What happened? Where are the interesting discussions about databases and the Web? Are any of the semantic web thinkers stopping by OSAF to give brief talks on their visions? Who might Aleks, John, Michael, Morgen or Rys suggest as someone to stimulate thinking? How about:
- MarkBaker
- DavidBeckett
- TimBray
- DanBrickley
- RaelDornfest
- EddDumbill
- RoyFielding
- RamanathanGuha
- DavidHart
- Uche Ogbuji
- SeanPalmer
- CarlosPerez
- ClayShirky
- DaveSifry
- AaronSwarz
- DaveWiner
To whom do the above people look for inspiration about web databases? Does anyone know of a staff member from FGDC -- Federal Geographic Data Committee to add to the above list? How about that person's European Union equivalent?
What would it cost to have a guerilla team do a PHP/MySQL hack in the meantime?
Lastly, is it priorities or fear that stops you from adding an RSS feed for this wiki?
--
JonathanSmith - 21 May 2003
Dave Winer, in his blog entry "Is It The Syntax?", notes:
"I've been reading Tim Bray's recent RDF writings, and responses from SjoerdVisscher and Main.DannyAyers, with pleasure. They're all doing a great job of arguing their various positions. A fantastic demo of what's possible when discourse leaves the mail lists and heads to the Web."
Gosh, silly me, I thought that was what this wiki was about!
--
JonathanSmith - 23 May 2003