Saturday 27 January 2018

PL/SQL moving average


William Robertson - Setting up PL/SQL Developer, part 2

This article was written for PL/SQL Developer 8.0.4 using Oracle 11.2 and Windows XP, in a Parallels virtual machine on my Mac, which is why the screenshots show a mixture of XP Silver and Aqua windows. PL/SQL Developer is one of several integrated development environments (IDEs) available for Oracle. One of the things I like about it is how configurable it is - you can change almost anything, and with add-ons such as Browser Extender you can add your own functionality. After moving computers several times and having to reinstall PL/SQL Developer each time, I found there were some customisations I could not live without, and I thought I'd document them. Part 1 covered preferences such as fonts and screen layout; part 2 covers the Session Browser. I was going to include the custom right-click actions I've added using Browser Extender, but there is so much you can do with the Session Browser that I'll have to leave that for part 3.

Extending the Session Browser

1. Make the Session Browser easy to find. In the default layout it is buried way down under the Tools menu, but as something you will use all the time it is much better to have a button for it. In case you missed it in part 1, you can customise the toolbar by adding an icon for the Session Browser. Here is the sort of thing you can do: default toolbar vs. customised toolbar - note the crossed-spanners icon, second from the left, in the customised toolbar.

2. Look at the default settings. Now open the Session Browser and take a look at the default setup. (Actually it is not quite the default - I've changed the font to Corbel 8pt, which fits more information on the screen as well as being more attractive than the default, in my opinion. Tahoma also works well, and you will be looking at this screen a lot, after all.) The screen is a master-detail report, with both the master and the detail queries configurable using the spanner icon. The master query is defined at the top of the window under Filters, and a few variations on a select from v$session are provided. Under Details, there are four fairly basic queries for open cursors, current SQL, session statistics and locks; note the bind variable :sid in the Cursors query. The cool thing about the Session Browser detail queries is that you can refer to the current value of any column from the top section as a bind variable in the detail queries. So, as long as the master query includes a column named SID, we can use expressions like WHERE session_id = :sid in any detail query. (This does mean, however, that you may need to include a few columns in the master query just so they can be used as keys in the detail queries.) One other point to note about the detail query box is that adding a 'serial' comment after the query makes PL/SQL Developer join all the output lines into one big block; while a neat feature, this also prevents scrolling, so I find it a mixed blessing.
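As a minimal sketch of the bind-variable mechanism (the column names come from the standard v$ views and are not taken from the article's own screenshots), a custom "Current statement" detail tab could be defined along these lines:

    -- Hypothetical "Current statement" detail query.
    -- :sid is bound automatically from the SID column of the master query above.
    SELECT s.sql_id,
           s.sql_child_number,
           q.sql_text
    FROM   v$session s
           JOIN v$sql q
                ON  q.sql_id       = s.sql_id
                AND q.child_number = s.sql_child_number
    WHERE  s.sid = :sid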
3. Write your own v$session queries.

Active sessions. The default queries are all defined as a select from v$session, which of course is a sensible default that will work across all Oracle versions. New and useful attributes are added to v$session in every release, though, and coding them explicitly into joins and lookups means the query may not work on an earlier version. If you work with multiple Oracle versions you may need to save more than one query in the Filters section and pick the appropriate one as needed (unfortunately PL/SQL Developer cannot check the version and choose for you). Here is my best "Active sessions" query for Oracle 10.2.0.2 onwards (note that columns such as plsql_entry_object_id and plsql_entry_subprogram_id, among others, were added in that release, so it will not work in earlier 10g versions). It lists all sessions that are currently active (excluding Oracle background processes such as the log writer), that are blocking other sessions regardless of their status, or that belong to you, and for each one it shows: the parallel query coordinator, if the session is part of a parallel query; the object currently being waited for (usually a table or index), looked up from dba_objects using row_wait_obj#; the current PL/SQL entry point and procedure, looked up from dba_procedures using the plsql_* columns added in Oracle 10.2.0.2; and some statistics about CPU, reads, memory use and parsing, from v$sessmetric. When viewing the results you can click on these columns to sort the sessions by CPU usage, for example. The instance ID column matters on RAC, in multi-node clusters; if you only have a single instance it will always be 1 (you might want to move it to the end of the list to make room for other columns). A note on the RAC-enabled gv$ views: all of the v$ views (which are actually synonyms for sys.v_$ views) have g-prefixed versions - for example, gv$session - that include the instance number, for use on RAC systems. For single-instance systems this will always be 1. The documentation lists only the v$ versions, so if you want to know about gv$session, say, just look up v$session and assume there is one extra column called inst_id. I have used the ordinary v$ and the RAC-ready gv$ names interchangeably. Copy the query below into the query box (after testing it in a SQL window to make sure it works with your Oracle version and permissions - to access the v$ views you need SELECT_CATALOG_ROLE). Note that there is no semicolon at the end. You may also want to review it against v$session in case there are other columns that would be useful to you.

My sessions. On a busy system you sometimes want to see only your own sessions and exclude everything else. For this I use a "My sessions" query, which is the same as the one above apart from the WHERE clause.

All sessions. It is also sometimes useful to have a version that shows every session, including the Oracle log writer, process monitor and so on. Make another copy of the query above and leave out the WHERE clause altogether.
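The article's own query is much fuller, but a cut-down sketch along the same lines gives the idea (gv$session is used so it also works on RAC; the column list here is illustrative, without the object, PL/SQL and metric lookups described above):

    -- Simplified "Active sessions" filter: active foreground sessions,
    -- plus anything that is blocking another session.
    SELECT s.inst_id,
           s.sid,
           s.serial#,
           s.username,
           s.status,
           s.sql_id,
           s.event,
           s.blocking_session,
           s.seconds_in_wait
    FROM   gv$session s
    WHERE  s.type = 'USER'
    AND    (   s.status = 'ACTIVE'
            OR s.sid IN (SELECT blocking_session FROM gv$session) )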
4. Now add your own detail tabs.

SQL statistics. This uses v$sqlstats to display detailed execution statistics about the session's current SQL statement (identified by sql_id). Note that it refers to all executions of the cursor, not just this session's current call. (Also, since the v$ views only reflect what is in memory right now, it may differ from what you see in the dba_hist views if you have the Diagnostics Pack.) The idea of the percentage columns is to show how the total elapsed time breaks down into CPU, I/O, concurrency waits and so on. They are only approximate and do not always add up to 100, because there can be other factors that are not accounted for, such as network transfer time and application processing, but they give you an idea of how the statement is performing. You should now get a SQL Statistics tab like the one below for any session that is executing SQL. (Binds, Prev SQL and the rest are other tabs I'll define in a moment.)

Perf history of this cursor. If a SQL statement is taking a long time, you may want to check its performance history (from dba_hist_sqlstat) to see whether this is normal for the cursor or whether something has changed. The first query below gives the distinct execution plans and their runtime statistics, aggregated over the cursor's whole history, so you can see the average execution time and whether there have been multiple plans. (Note the join to gv$sql_plan - the g indicating the RAC-enabled version - which I find to be the most reliable way of locating the execution plan currently in use, as it includes the child number. Since v$sqlstats reports only one row per distinct sql_id, it may not show the plan for the version currently executing.) The second version - which I label "Perf history of this cursor by date" - breaks the same information down by day, so you can tell whether it ran quickly last Tuesday, or whether the plan changed this morning.

Binds. The next query lists any bind variables held in v$sql_bind_capture for the current SQL statement; I've filtered the results to exclude duplicates. Note that Oracle does not capture every single bind value: it only holds the last value captured within each _cursor_bind_capture_interval, and only up to _cursor_bind_capture_area_size, depending on how much space is available. The alternative is to get the bind data used at parse time from v$sql_plan, although that takes some decoding as it is held in raw form inside an XML column - see Jonathan Lewis's blog post on bind capture, which links to "creating test scripts with bind variables" by Kerry Osborne and "tracing bind values" by Dion Cho. This led me to the following query, using an idea from Kyle Hailey in the comments on Jonathan Lewis's post; in my tests using Oracle 11.2.0.2 it omits the bind names. In any case, capturing bind values is a big subject, so I'll leave you with the queries above to experiment with, and move on.

Prev SQL, Prev SQL statistics. Sometimes it is useful to see what the previous statement was. v$session contains several prev_ columns, so just duplicate the detail tabs for SQL Text and SQL Statistics but substitute prev_sql_id and prev_child_number.

Object statistics. When chasing a performance problem, you often want to check the current state of the statistics on the tables involved in the query. The query below joins v$sql_plan_statistics_all to dba_tab_statistics to list this information. It is not perfect if partitioned tables are involved, since the problem may lie with the statistics for an individual partition or subpartition, but it is a start.

Cursors. Replace the default Cursors query (select from v$open_cursor where sid = :sid) with the following to add some activity statistics. (Note that the executions statistic refers to all sessions, not just the current one.)

Current plan. PL/SQL Developer's built-in Explain Plan tool (F5) is all well and good, but it can only be as good as explain plan. That is, it uses the explain plan facility to predict the execution plan, and then displays the result graphically. Sometimes this is not the same as the actual runtime plan. When looking at currently executing sessions, I would rather use dbms_xplan.display_cursor() to see what the database is actually doing. Define the Current Plan tab using the following:
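A minimal sketch of such a detail query, assuming the master query exposes SQL_ID and SQL_CHILD_NUMBER columns to bind (the article's exact version may differ):

    -- "Current plan" detail tab: the actual execution plan from the library cache.
    -- Add 'ALLSTATS LAST' as a third argument for the gather-plan-statistics
    -- variant described below.
    SELECT plan_table_output
    FROM   TABLE(dbms_xplan.display_cursor(:sql_id, :sql_child_number))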
A 'serial' comment after the query makes PL/SQL Developer wrap all the output lines from the query into one big block. This makes it easier to read, although it also prevents scrolling, so I am not sure how useful that is here. (Unfortunately you cannot specify a monospaced font for an individual item, so the default display is not that great.) The best way to read it is to copy and paste it into a new SQL window; this is easier if you define a hotkey such as Alt-S for Window > New SQL Window, as suggested in part 1. (I also have a Browser Extender add-on to do this in a single right-click, which I'll come to later.) I also use another version of this query, which I've labelled "Current plan GS" (for "gather plan statistics" - although perhaps "extended plan" would be a better name, now I think about it). This uses ALLSTATS LAST in the format argument of dbms_xplan.display_cursor to get the estimated and actual row counts (cardinalities), provided the query uses the gather_plan_statistics hint or the statistics_level parameter is set to ALL for the session. The slightly tricky part is that you cannot use it until the query completes (because the actual row counts are not known until then), but once it completes it is no longer the currently executing query, so it disappears from v$session, and when you refresh the Session Browser it is gone. Instead you need to refresh the screen while the query is executing, but wait until it completes before going to the Current Plan GS tab.

Current wait. Although the session's current wait status is already shown in the master query above, I like to have the information in its own tab as well. I've labelled the wait object looked up from row_wait_obj# as "possibly unrelated", as a reminder that although this is the last object the session waited for, processing may now have moved on to something else (sorting the output, for example, or waiting for the application to process the query results) and the session may not actually be accessing that object at the moment.

Last 10 waits. The following gives a quick view of the session's recent activity using v$session_wait_history (the wait time is in hundredths of a second):

Long ops. v$session_longops shows the status of various operations that run for longer than six seconds. These currently include many backup and recovery functions, statistics gathering and query execution, and more operations are added with every Oracle release. If a query uses hash or sort operations, full table scans, partition operations and others that take more than six seconds, those operations will appear in v$session_longops and you can track their progress. (Note that it is the individual operations that are tracked, not whole queries.) Many of Oracle's own long-running processes are also instrumented, as the documentation points out; some not mentioned above include database restarts, SQL Performance Analyzer runs (11g) and Data Pump import/export jobs - and of course any of your own processes in which you have included dbms_application_info.set_session_longops calls to record the total work and the amount processed so far. I define a "Long ops" tab with this query, plus a copy of it with an extra filter to restrict it to the current operation:

ASH summary - session. v$active_session_history is a snapshot of v$session taken once a second, held for a limited period (typically 30 to 60 minutes) and then stored in dba_hist_active_sess_history. (To use this you need the Diagnostics Pack, so make sure you are licensed even if it works - you do not want your boss to get an unexpected bill after an Oracle audit.) There are many creative ways to mine this information; I use three queries to track what a currently running session is doing. Since ASH samples every second, it can be useful to summarise it by SQL statement and list the results by time taken. If you are watching a procedure or a batch job that calls several statements, this gives an overview of where the session is spending its time (rather like a session trace). The following query gives one row per sql_id, in descending order of total time, with the total at the bottom.
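A sketch of that kind of summary (the column list is illustrative rather than the article's own):

    -- ASH time per statement for the selected session; one sample is roughly one second.
    SELECT ash.sql_id,
           COUNT(*)             AS ash_secs,
           MIN(ash.sample_time) AS first_seen,
           MAX(ash.sample_time) AS last_seen
    FROM   v$active_session_history ash
    WHERE  ash.session_id = :sid
    GROUP BY ash.sql_id
    ORDER BY ash_secs DESC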
ASH summary - by execution. I also have a more detailed version, made possible in 11g by the sql_exec_start column of v$active_session_history, which lets me see individual executions of a SQL statement rather than one aggregated row.

ASH summary - cursor object wait time. A report showing the objects waited for by each of the selected session's SQL statements. This gives a quick way of seeing what a session has spent its time on, in terms of objects rather than queries.

ASH summary for this query, with caller. Next I have a group-by query for the current sql_id, in descending order of sample count. The idea is to see where the time is going within the statement currently being executed (rather than which statements have taken the time in the current session). Since active session history uses a one-second polling interval, something that appears in ten samples can be assumed to have taken around ten seconds. Note that it filters only on sql_id, so multiple executions of the same query by the session are grouped together. (In 11g you could use the new sql_exec_id column to distinguish between executions.) Bear in mind also that ASH may record a sample as being on CPU alongside a database object - that just means the object most recently accessed at the time the sample was taken, not that the CPU time was necessarily connected with that object. I have two flavours of this one, with and without the details of the calling PL/SQL procedure.

ASH summary for this query, SQL only. This is the same as the previous query, but without the PL/SQL call details, to give a clearer view of the database access.

ASH detail for this session. Finally, I have a straight listing from v$active_session_history for the current session, so you can get an idea of what it is doing right now.

Locks. The default Locks tab does not give much detail. Say a session executes the following: the default Locks tab displays this; changing it to the following provides rather more detail:

Non-default optimizer settings. I find this useful for checking the optimizer settings in use by a particular session (which may not be the same as your own session's settings or the instance defaults). It joins v$sys_optimizer_env (the optimizer-related system parameters) to v$ses_optimizer_env (the optimizer-related session parameters, initially inherited from the system-wide settings but reflecting any changes made with ALTER SESSION commands) and reports the differences.

Temp space. How much temporary space is this session using for hash joins, sorts and so on?

Copyright William Robertson 2011

22 SQL for Analysis and Reporting

Oracle has enhanced SQL's analytical processing capabilities by introducing a new family of analytic SQL functions. These analytic functions enable you to calculate: rankings and percentiles; moving window calculations; lag/lead analysis; first/last analysis; linear regression statistics. Ranking functions include cumulative distributions, percent rank, and N-tiles. Moving window calculations allow you to find moving and cumulative aggregations, such as sums and averages. Lag/lead analysis enables direct inter-row references so you can calculate period-to-period changes. First/last analysis enables you to find the first or last value in an ordered group. Other enhancements to SQL include the CASE expression and partitioned outer join. CASE expressions provide if-then logic useful in many situations. Partitioned outer join is an extension to ANSI outer join syntax that allows users to selectively densify certain dimensions while keeping others sparse. This allows reporting tools to selectively densify the dimensions, for example the ones that appear in their cross-tabular reports, while keeping others sparse. To enhance performance, analytic functions can be parallelized: multiple processes can simultaneously execute all of these statements. These capabilities make calculations easier and more efficient, thereby enhancing database performance, scalability, and simplicity. Analytic functions are classified as described in Table 22-1, Analytic Functions and Their Uses. To perform these operations, the analytic functions add several new elements to SQL processing. These elements build on existing SQL to allow flexible and powerful calculation expressions. With just a few exceptions, the analytic functions have these new elements. The processing flow is represented in Figure 22-1, Processing Order. The essential concepts used in analytic functions are the following. Processing order: query processing using analytic functions takes place in three stages. First, all joins and the WHERE, GROUP BY and HAVING clauses are performed. Second, the result set is made available to the analytic functions, and all their calculations take place. Third, if the query has an ORDER BY clause at its end, the ORDER BY is processed to allow for precise output ordering. The processing order is shown in Figure 22-1. Result set partitions: the analytic functions allow users to divide query result sets into groups of rows called partitions. Note that the term partitions used with analytic functions is unrelated to the table partitions feature; throughout this chapter, the term partitions refers only to the meaning related to analytic functions. Partitions are created after the groups defined with GROUP BY clauses, so they are available to any aggregate results such as sums and averages. Partition divisions may be based upon any desired columns or expressions. A query result set may be partitioned into just one partition holding all the rows, a few large partitions, or many small partitions holding just a few rows each. Windows: for each row in a partition, you can define a sliding window of data. This window determines the range of rows used to perform the calculations for the current row.
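To make the window-over-a-partition idea concrete before the details that follow, here is a sketch of a three-month moving average, written against what is presumably the SH sample schema used by this chapter's examples; the windowing options themselves are described next:

    -- Moving average of monthly sales over the current month and the two
    -- preceding months, computed separately within each product (the partition).
    SELECT prod_id,
           calendar_month_desc,
           SUM(amount_sold)                               AS month_sales,
           AVG(SUM(amount_sold)) OVER (
               PARTITION BY prod_id
               ORDER BY     calendar_month_desc
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)  AS moving_avg_3m
    FROM   sales s
           JOIN times t ON t.time_id = s.time_id
    GROUP BY prod_id, calendar_month_desc
    ORDER BY prod_id, calendar_month_desc;

Note that the analytic AVG is applied after the GROUP BY aggregation, which is exactly the processing order described above.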
Window sizes can be based on either a physical number of rows or a logical interval such as time. The window has a starting row and an ending row. Depending on its definition, the window may move at one or both of its ends. For instance, a window defined for a cumulative sum function would have its starting row fixed at the first row of its partition, and its ending row would slide from the starting point all the way to the last row of the partition. In contrast, a window defined for a moving average would have both its starting and ending points slide so that they maintain a constant physical or logical range. A window can be set as large as all the rows in a partition or as small as a sliding window of one row within a partition. When a window is near a border, the function returns results for the available rows only, rather than warning you that the results are not what you want. When using window functions, the current row is included during calculations, so you should only specify (n-1) when you are dealing with n items. Current row: each calculation performed with an analytic function is based on a current row within a partition. The current row serves as the reference point determining the start and end of the window. For instance, a centered moving average calculation could be defined with a window that holds the current row, the six preceding rows, and the following six rows. This would create a sliding window of 13 rows, as shown in Figure 22-2, Sliding Window Example.

Ranking, Windowing, and Reporting Functions. This section illustrates the basic analytic functions for ranking, windowing, and reporting.

Linear Regression Calculation Example. In this example, we compute an ordinary-least-squares regression line that expresses the quantity sold of a product as a linear function of the product's list price. The calculations are grouped by sales channel. The values SLOPE, INTCPT and RSQR are the slope, intercept, and coefficient of determination of the regression line, respectively. The (integer) value COUNT is the number of products in each channel for which both quantity sold and list price data are available.
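A sketch of that calculation against the SH sample schema (the exact select list in the chapter's own example may differ):

    -- Least-squares regression of quantity sold against list price, per channel.
    SELECT s.channel_id,
           REGR_SLOPE(s.quantity_sold, p.prod_list_price)     AS slope,
           REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) AS intcpt,
           REGR_R2(s.quantity_sold, p.prod_list_price)        AS rsqr,
           REGR_COUNT(s.quantity_sold, p.prod_list_price)     AS num_products
    FROM   sales s
           JOIN products p ON p.prod_id = s.prod_id
    GROUP BY s.channel_id;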
Statistical Aggregates. Oracle provides a set of SQL statistical functions and a statistics package, DBMS_STAT_FUNCS. This section lists some of the new functions along with basic syntax. Descriptive statistics: you can calculate the median of a data set and the mode of a data set. You can also calculate correlation statistics such as Spearman's rho coefficient and Kendall's tau-b coefficient. In addition to the functions, this release contains a PL/SQL package, DBMS_STAT_FUNCS. It contains the descriptive statistical function SUMMARY along with functions to support distribution fitting. The SUMMARY function summarizes a numerical column of a table with a variety of descriptive statistics. The five distribution fitting functions support normal, uniform, Weibull, Poisson, and exponential distributions.

User-Defined Aggregates. Oracle offers a facility for creating your own functions, called user-defined aggregate functions. These functions are written in programming languages such as PL/SQL, Java, and C, and can be used as analytic functions or aggregates in materialized views. See Oracle Database Data Cartridge Developer's Guide for further information regarding syntax and restrictions. The advantages of these functions are: highly complex functions can be programmed using a fully procedural language; higher scalability than other techniques when user-defined functions are programmed for parallel processing; and object data types can be processed. As a simple example of a user-defined aggregate function, consider the skew statistic. This calculation measures whether a data set has a lopsided distribution about its mean. It will tell you whether one tail of the distribution is significantly larger than the other. If you create a user-defined aggregate called udskew and apply it to the credit limit data in the prior example, the SQL statement and results might look like this: Before building user-defined aggregate functions, you should consider whether your needs can be met in regular SQL. Many complex calculations are possible directly in SQL, particularly by using the CASE expression. Staying with regular SQL will enable simpler development, and many query operations are already well-parallelized in SQL. Even the earlier example, the skew statistic, can be created using standard, albeit lengthy, SQL.

Pivoting Operations. The data frequently sought by business intelligence queries is often most usable if presented in a crosstabular format. The pivot_clause of the SELECT statement lets you write crosstabulation queries that rotate rows into columns, aggregating data in the process of the rotation. Pivoting is a key technique in data warehouses: in it, you transform multiple rows of input into fewer and generally wider rows in the data warehouse. When pivoting, an aggregation operator is applied for each item in the pivot column value list. The pivot column cannot contain an arbitrary expression; if you need to pivot on an expression, you should alias the expression in a view before the PIVOT operation. The basic syntax is as follows: To illustrate the use of pivoting, create the following view as a basis for later examples:

Example: Pivoting. The following statement illustrates a typical pivot on the channel column: Note that the output has created four new aliased columns, DIRECT_SALES, INTERNET_SALES, CATALOG_SALES, and TELE_SALES, one for each of the pivot values. The output is a sum. If no alias is provided, the column headings will be the IN-list values.
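As a hedged sketch of the basic form (sales_view, its columns, and the channel values and aliases below are assumptions standing in for the view created above):

    -- Rotate rows into columns: one column per sales channel, values summed.
    SELECT *
    FROM   (SELECT prod_id, channel, sales FROM sales_view)
    PIVOT  (SUM(sales)
            FOR channel IN ('Direct Sales' AS direct_sales,
                            'Internet'     AS internet_sales,
                            'Catalog'      AS catalog_sales,
                            'Tele Sales'   AS tele_sales))
    ORDER BY prod_id;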
Pivoting on Multiple Columns. You can pivot on more than one column. The following statement illustrates a typical multiple-column pivot: Note that this example specifies a multi-column IN-list with column headings designed to match the IN-list members.

Pivoting: Multiple Aggregates. You can pivot with multiple aggregates, as shown in the following example: Note that the query creates column headings by concatenating the pivot values (or aliases) with the alias of the aggregate function, plus an underscore.

Distinguishing PIVOT-Generated Nulls from Nulls in Source Data. You can distinguish between the null values that are generated by the use of PIVOT and those that exist in the source data. The following example illustrates the nulls that PIVOT generates. The query returns rows with five columns: the column PROD_ID and the pivot-resulting columns Q1, Q1_COUNT_TOTAL, Q2 and Q2_COUNT_TOTAL. For each unique value of PROD_ID, Q1_COUNT_TOTAL returns the total number of rows whose QTR value is Q1, and Q2_COUNT_TOTAL returns the total number of rows whose QTR value is Q2. Assume we have a table sales2 with the following structure: From the result, we know that for PROD_ID 100 there are 2 sales rows for quarter Q1 and 1 sales row for quarter Q2, while for PROD_ID 200 there is 1 sales row for quarter Q1 and no sales row for quarter Q2. So, in Q2_COUNT_TOTAL, you can identify that NULL<1> comes from a row in the original table whose measure is of null value, while NULL<2> is due to no row being present in the original table for PROD_ID 200 in quarter Q2.

Unpivoting Operations. An UNPIVOT does not reverse a PIVOT operation. Instead, it rotates data from columns into rows. If you are working with pivoted data, an UNPIVOT operation cannot reverse any aggregations that have been made by PIVOT or any other means. To illustrate unpivoting, first create a pivoted table that includes four columns, for the quarters of the year: The table's contents resemble the following: The following UNPIVOT operation rotates the quarter columns into rows. For each product there will be four rows, one for each quarter. Note the use of INCLUDE NULLS in this example; you can also use EXCLUDE NULLS, which is the default setting. In addition, you can also unpivot using two columns, as in the following:

Wildcard and Subquery Pivoting with XML Operations. If you want to use a wildcard argument or a subquery in your pivoting columns, you can do so with PIVOT XML syntax. With PIVOT XML, the output of the operation is properly formatted XML. The following example illustrates using the wildcard keyword, ANY. It outputs XML that includes all channel values in sales_view. Note that the keyword ANY is available in PIVOT operations only as part of an XML operation. This output includes data only for those cases where the channel exists in the data set. Also note that aggregation functions must specify a GROUP BY clause to return multiple values, yet the pivot_clause does not contain an explicit GROUP BY clause; instead, the pivot_clause performs an implicit GROUP BY. The following example illustrates using a subquery. It outputs XML that includes all channel values and the sales data corresponding to each channel: The output densifies the data to include all possible channels for each product.

Data Densification for Reporting. Data is normally stored in sparse form. That is, if no value exists for a given combination of dimension values, no row exists in the fact table. However, you may want to view the data in dense form, with rows for all combinations of dimension values displayed even when no fact data exists for them. For example, if a product did not sell during a particular time period, you may still want to see the product for that time period with a zero sales value next to it. Moreover, time series calculations can be performed most easily when data is dense along the time dimension. This is because dense data will fill a consistent number of rows for each period, which in turn makes it simple to use the analytic windowing functions with physical offsets. Data densification is the process of converting sparse data into dense form. To overcome the problem of sparsity, you can use a partitioned outer join to fill the gaps in a time series or any other dimension. Such a join extends the conventional outer join syntax by applying the outer join to each logical partition defined in a query. Oracle logically partitions the rows in your query based on the expression you specify in the PARTITION BY clause. The result of a partitioned outer join is a UNION of the outer joins of each of the partitions in the logically partitioned table with the table on the other side of the join. Note that you can use this type of join to fill the gaps in any dimension, not just the time dimension; most of the examples here focus on the time dimension because it is the dimension most frequently used as a basis for comparisons.

Partition Join Syntax. The syntax for partitioned outer join extends the ANSI SQL JOIN clause with the phrase PARTITION BY followed by an expression list. The expressions in the list specify the group to which the outer join is applied. The following are the two forms of syntax normally used for partitioned outer join: Note that FULL OUTER JOIN is not supported with a partitioned outer join.

Sample of Sparse Data. A typical situation with a sparse dimension is shown in the following example, which computes the weekly sales and year-to-date sales for the product Bounce for weeks 20-30 in 2000 and 2001: In this example, we would expect 22 rows of data (11 weeks each from 2 years) if the data were dense. However, we get only 18 rows because weeks 25 and 26 are missing in 2000, and weeks 26 and 28 in 2001.

Filling Gaps in Data. We can take the sparse data of the preceding query and do a partitioned outer join with a dense set of time data. In the following query, we alias our original query as v and we select data from the times table, which we alias as t. Here we retrieve 22 rows because there are no gaps in the series. The four added rows each have their sales value set to 0 by using the NVL function. Note that in this query, a WHERE condition was placed for weeks between 20 and 30 in the inline view for the time dimension. This was introduced to keep the result set small.
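A sketch of that kind of densifying query, written against the SH sample schema (the alias names v and t follow the discussion above; the exact select list in the chapter's example may differ):

    -- Densify sparse weekly sales for 'Bounce' with a partitioned outer join:
    -- v is the sparse aggregate, t is the dense set of weeks for 2000-2001.
    SELECT v.product_name, t.yr, t.wk, NVL(v.sales, 0) AS dense_sales
    FROM  (SELECT p.prod_name              AS product_name,
                  tm.calendar_year         AS yr,
                  tm.calendar_week_number  AS wk,
                  SUM(s.amount_sold)       AS sales
           FROM   sales s
                  JOIN times    tm ON tm.time_id = s.time_id
                  JOIN products p  ON p.prod_id  = s.prod_id
           WHERE  p.prod_name = 'Bounce'
           AND    tm.calendar_year IN (2000, 2001)
           AND    tm.calendar_week_number BETWEEN 20 AND 30
           GROUP BY p.prod_name, tm.calendar_year, tm.calendar_week_number) v
          PARTITION BY (v.product_name)
          RIGHT OUTER JOIN
          (SELECT DISTINCT calendar_year        AS yr,
                           calendar_week_number AS wk
           FROM   times
           WHERE  calendar_year IN (2000, 2001)
           AND    calendar_week_number BETWEEN 20 AND 30) t
          ON (v.yr = t.yr AND v.wk = t.wk)
    ORDER BY v.product_name, t.yr, t.wk;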
Filling Gaps in Two Dimensions. N-dimensional data is typically displayed as a dense 2-dimensional cross tab of (n - 2) page dimensions. This requires that all dimension values for the two dimensions appearing in the cross tab be filled in. The following is another example where the partitioned outer join capability can be used for filling the gaps on two dimensions: In this query, the WITH subquery factoring clause v1 summarizes sales data at the product, country, and year level. This result is sparse, but users may want to see all the country and year combinations for each product. To achieve this, we take each partition of v1 based on product values and outer join it on the country dimension first. This gives us all values of country for each product. We then take that result, partition it on product and country values, and outer join it on the time dimension. This gives us all time values for each product and country combination.

Filling Gaps in an Inventory Table. An inventory table typically tracks the quantity of units available for various products. This table is sparse: it only stores a row for a product when there is an event. For a sales table the event is a sale, and for the inventory table the event is a change in the quantity available for a product. For example, consider the following inventory table: The inventory table now has the following rows: For reporting purposes, users may want to see this inventory data differently. For example, they may want to see all values of time for each product. This can be accomplished using a partitioned outer join. In addition, for the newly inserted rows of missing time periods, users may want to see the value of the quantity-of-units column carried over from the most recent existing time period. The latter can be accomplished using the analytic window function LAST_VALUE. Here is the query and the desired output: The inner query computes a partitioned outer join on time within each product. The inner query densifies the data on the time dimension (meaning the time dimension will now have a row for each day of the week). However, the measure column quantity will have nulls for the newly added rows (see the output in the column quantity in the following results). The outer query uses the analytic function LAST_VALUE. Applying this function partitions the data by product and orders the data on the time dimension column (time_id). For each row, the function finds the last non-null value in the window, due to the IGNORE NULLS option, which you can use with both LAST_VALUE and FIRST_VALUE. We see the desired output in the column repeated_quantity in the following output:
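A sketch of the outer LAST_VALUE step (dense_inventory here is a stand-in name for the partitioned-outer-join inner query described above, with NULL quantity on the gap-filling rows):

    -- Carry the last known quantity forward over the densified inventory rows.
    SELECT product,
           time_id,
           quantity,
           LAST_VALUE(quantity IGNORE NULLS) OVER (
               PARTITION BY product
               ORDER BY     time_id)  AS repeated_quantity
    FROM   dense_inventory
    ORDER BY product, time_id;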
Computing Data Values to Fill Gaps. The examples in the previous section illustrate how to use partitioned outer join to fill gaps in one or more dimensions. However, the result sets produced by partitioned outer join have null values for columns that are not included in the PARTITION BY list. Typically, these are measure columns. Users can make use of analytic SQL functions to replace those null values with a non-null value. For example, the following query computes monthly totals for the products 64MB Memory card and DVD-R Discs (product IDs 122 and 136) for the year 2000. It uses a partitioned outer join to densify data for all months. For the missing months, it then uses the analytic SQL function AVG to compute the sales and units as the average of the months when the product was sold. If you are working in SQL*Plus, the following two commands wrap the column headings for greater readability of results:

Time Series Calculations on Densified Data. Densification is not just for reporting purposes. It also enables certain types of calculations, especially time series calculations. Time series calculations are easier when data is dense along the time dimension. Dense data has a consistent number of rows for each time period, which in turn makes it simple to use analytic window functions with physical offsets. To illustrate, let us first take the example in Filling Gaps in Data and add an analytic function to that query. In the following enhanced version, we calculate weekly year-to-date sales alongside the weekly sales. The NULL values that the partitioned outer join inserts in making the time series dense are handled in the usual way: the SUM function treats them as 0s.

Period-to-Period Comparison for One Time Level: Example. How do we use this feature to compare values across time periods? Specifically, how do we calculate a year-over-year sales comparison at the week level? The following query returns, on the same row, for each product, the year-to-date sales for each week of 2001 together with that of 2000. Note that in this example we start with a WITH clause. This improves readability of the query and lets us focus on the partitioned outer join. If you are working in SQL*Plus, the following command wraps the column headings for greater readability of results: In the FROM clause of the inline view densesales, we use a partitioned outer join of the aggregate view v and the time view t to fill gaps in the sales data along the time dimension. The output of the partitioned outer join is then processed by the analytic function SUM ... OVER to compute the weekly year-to-date sales (the weeklyytdsales column). Thus, the view densesales computes the year-to-date sales data for each week, including those missing from the aggregate view. The inline view yearoveryearsales then computes the year-ago weekly year-to-date sales using the LAG function. The LAG function labelled weeklyytdsalesprioryear specifies a PARTITION BY clause that pairs rows for the same week of years 2000 and 2001 into a single partition. We then pass an offset of 1 to the LAG function to get the weekly year-to-date sales for the prior year. The outermost query block selects data from yearoveryearsales with the condition yr = 2001, and thus the query returns, for each product, its weekly year-to-date sales in the specified weeks of years 2001 and 2000.
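A sketch of just the LAG step described above, using the view and column names from that discussion (they are assumed here, since the full query text is not shown):

    -- Pair the same week of consecutive years and pull the prior year's
    -- year-to-date figure onto the current row with LAG.
    SELECT product_name,
           yr,
           week,
           weeklyytdsales,
           LAG(weeklyytdsales, 1) OVER (
               PARTITION BY product_name, week
               ORDER BY     yr)  AS weeklyytdsalesprioryear
    FROM   densesales
    ORDER BY product_name, yr, week;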
Period-to-Period Comparison for Multiple Time Levels: Example. While the prior example shows us a way to create comparisons for a single time level, it would be even more useful to handle multiple time levels in a single query. For example, we could compare sales versus the prior period at the year, quarter, month and day levels. How can we create a query which performs a year-over-year comparison of year-to-date sales for all levels of our time hierarchy? We will take several steps to perform this task. The goal is a single query with comparisons at the day, week, month, quarter, and year level. The steps are as follows: We will create a view called cubeprodtime, which holds a hierarchical cube of sales aggregated across times and products. Then we will create a view of the time dimension to use as an edge of the cube. The time edge, which holds a complete set of dates, will be partitioned outer joined to the sparse data in the view cubeprodtime. Finally, for maximum performance, we will create a materialized view, mvprodtime, built using the same definition as cubeprodtime. For more information regarding hierarchical cubes, see Chapter 21, SQL for Aggregation in Data Warehouses. The materialized view is defined in Step 1 in the following section.

Step 1: Create the hierarchical cube view. The materialized view shown in the following may already exist in your system; if not, create it now. If you must generate it, note that we limit the query to just two products to keep processing time short. Because this view is limited to two products, it returns just over 2200 rows. Note that the column HierarchicalTime contains string representations of time from all levels of the time hierarchy. The CASE expression used for the HierarchicalTime column appends a marker (0, 1, ...) to each date string to denote the time level of the value: 0 represents the year level, 1 is quarters, 2 is months, and 3 is days. Note that the GROUP BY clause is a concatenated ROLLUP which specifies the rollup hierarchy for the time and product dimensions. The GROUP BY clause is what determines the hierarchical cube contents.

Step 2: Create the view edgetime, which is a complete set of date values. edgetime is the source for filling time gaps in the hierarchical cube using a partitioned outer join. The column HierarchicalTime in edgetime will be used in a partitioned join with the HierarchicalTime column in the view cubeprodtime. The following statement defines edgetime:

Step 3: Create the materialized view mvprodtime to support faster performance. The materialized view definition is a duplicate of the view cubeprodtime defined earlier. Because it is a duplicate query, references to cubeprodtime will be rewritten to use the mvprodtime materialized view. The following materialized view may already exist in your system; if not, create it now. If you must generate it, note that we limit the query to just two products to keep processing time short.

Step 4: Create the comparison query. We have now set the stage for our comparison query. We can obtain period-to-period comparison calculations at all time levels. It requires applying analytic functions to a hierarchical cube with dense data along the time dimension. Some of the calculations we can achieve for each time level are: sum of sales for the prior period at all levels of time; variance in sales over the prior period; sum of sales in the same period a year ago at all levels of time; variance in sales over the same period last year. The following example performs all four of these calculations. It uses a partitioned outer join of the views cubeprodtime and edgetime to create an inline view of dense data called densecubeprodtime. The query then uses the LAG function in the same way as the prior single-level example. The outer WHERE clause specifies time at three levels: the days of August 2001, the entire month, and the entire third quarter of 2001. Note that the last two rows of the results contain the month-level and quarter-level aggregations. Note that, to make the results easier to read if you are using SQL*Plus, the column headings should be adjusted with the following commands, which fold the column headings to reduce line length: Here is the query comparing current sales to prior and year-ago sales: The first LAG function (salespriorperiod) partitions the data on gidp, cat, subcat, prod, gidt and orders the rows on all the time dimension columns. It gets the sales value of the prior period by passing an offset of 1. The second LAG function (salessameperiodprioryear) partitions the data on the additional columns qtrnum, monnum, and daynum and orders it on yr so that, with an offset of 1, it can compute the year-ago sales for the same period. The outermost SELECT clause computes the variances.
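Going back to Step 1, here is a sketch of the kind of CASE-plus-concatenated-ROLLUP construction it describes, written against the SH sample schema; the marker format, the exact columns rolled up, and the handling of the top-level rows are assumptions, and the real view also restricts the products and adds the grouping-id columns used later:

    -- Hierarchical cube sketch: label each time level (0=year, 1=quarter,
    -- 2=month, 3=day) and build the cube with concatenated ROLLUPs.
    SELECT CASE
             WHEN GROUPING(t.calendar_quarter_desc) = 1 THEN TO_CHAR(t.calendar_year) || '_0'
             WHEN GROUPING(t.calendar_month_desc)   = 1 THEN t.calendar_quarter_desc  || '_1'
             WHEN GROUPING(t.time_id)               = 1 THEN t.calendar_month_desc    || '_2'
             ELSE TO_CHAR(t.time_id, 'YYYY-MM-DD')                                    || '_3'
           END                 AS hierarchical_time,
           p.prod_category,
           SUM(s.amount_sold)  AS sales
    FROM   sales s
           JOIN times    t ON t.time_id = s.time_id
           JOIN products p ON p.prod_id = s.prod_id
    GROUP BY ROLLUP (t.calendar_year, t.calendar_quarter_desc,
                     t.calendar_month_desc, t.time_id),
             ROLLUP (p.prod_category, p.prod_subcategory, p.prod_id);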
Creating a Custom Member in a Dimension: Example. In many analytical SQL tasks, it is helpful to define custom members in a dimension. For instance, you might define a specialized time period for analyses. You can use a partitioned outer join to temporarily add a member to a dimension. Note that the new SQL MODEL clause is suitable for creating more complex scenarios involving new members in dimensions; see Chapter 23, SQL for Modeling, for more information on this topic. As an example of a task, what if we want to define a new member for our time dimension? We want to create a 13th member of the Month level in our time dimension. This 13th month is defined as the summation of the sales for each product in the first month of each quarter of year 2001. The solution has two steps. Note that we will build this solution using the views and tables created in the prior example. First, create a view with the new member added to the appropriate dimension; the view uses a UNION ALL operation to add the new member. Second, to query using the custom member, use a CASE expression and a partitioned outer join. Our new member for the time dimension is created with the following view: In this statement, the view timec is defined by performing a UNION ALL of the edgetime view (defined in the prior example) and the user-defined 13th month. The gidt value of 8 was chosen to differentiate the custom member from the standard members. The UNION ALL specifies the attributes for a 13th-month member by doing a SELECT from the DUAL table. Note that the grouping id, column gidt, is set to 8, and the quarter number is set to 5. Then, the second step is to use an inline view of the query to perform a partitioned outer join of cubeprodtime with timec. This step creates sales data for the 13th month at each level of product aggregation. In the main query, the analytic function SUM is used with a CASE expression to compute the 13th month, which is defined as the summation of the first month's sales of each quarter. The SUM function uses a CASE to limit the data to months 1, 4, 7, and 10 within each year. Due to the tiny data set, with just 2 products, the rollup values of the results are necessarily repetitions of lower-level aggregations. For a more realistic set of rollup values, you can include more products from the Game Console and Y Box Games subcategories in the underlying materialized view.

Miscellaneous Analysis and Reporting Capabilities. This section illustrates the following additional analytic capabilities.

WIDTH_BUCKET Function. For a given expression, the WIDTH_BUCKET function returns the bucket number that the result of this expression will be assigned to after it is evaluated. You can generate equiwidth histograms with this function. Equiwidth histograms divide data sets into buckets whose interval size (highest value to lowest value) is equal. The number of rows held by each bucket will vary. A related function, NTILE, creates equiheight buckets. Equiwidth histograms can be generated only for numeric, date or datetime types, so the first three parameters should be all numeric expressions or all date expressions. Other types of expressions are not allowed. If the first parameter is NULL, the result is NULL. If the second or the third parameter is NULL, an error message is returned, as a NULL value cannot denote any end point (or any point) for a range in a date or numeric value dimension.
The last parameter (number of buckets) should be a numeric expression that evaluates to a positive integer value; 0, NULL, or a negative value will result in an error. Buckets are numbered from 0 to (n+1). Bucket 0 holds the count of values less than the minimum, and bucket (n+1) holds the count of values greater than or equal to the maximum specified value.

WIDTH_BUCKET Syntax. The WIDTH_BUCKET function takes four expressions as parameters. The first parameter is the expression that the equiwidth histogram is for. The second and third parameters are expressions that denote the end points of the acceptable range for the first parameter. The fourth parameter denotes the number of buckets. Consider the following data from the table customers, which shows the credit limits of 17 customers. This data is gathered in the query shown in Example 22-24. In the table customers, the column cust_credit_limit contains values between 1500 and 15000, and we can assign the values to four equiwidth buckets, numbered from 1 to 4, by using WIDTH_BUCKET(cust_credit_limit, 0, 20000, 4). Ideally each bucket is a closed-open interval of the real number line; for example, bucket number 2 is assigned to scores between 5000.0000 and 9999.9999, sometimes denoted [5000, 10000) to indicate that 5,000 is included in the interval and 10,000 is excluded. To accommodate values outside the range [0, 20,000), values less than 0 are assigned to a designated underflow bucket which is numbered 0, and values greater than or equal to 20,000 are assigned to a designated overflow bucket which is numbered 5 (num_buckets + 1 in general). See Figure 22-3, Bucket Assignments, for a graphical illustration of how the buckets are assigned. You can specify the bounds in the reverse order, for example, WIDTH_BUCKET(cust_credit_limit, 20000, 0, 4). When the bounds are reversed, the buckets will be open-closed intervals. In this example, bucket number 1 is (15000, 20000], bucket number 2 is (10000, 15000], and bucket number 4 is (0, 5000]. The overflow bucket will be numbered 0 (20000, +infinity), and the underflow bucket will be numbered 5 (-infinity, 0]. It is an error if the bucket count parameter is 0 or negative.

Example 22-24 WIDTH_BUCKET. The following query shows the bucket numbers for the credit limits in the customers table for both cases, where the boundaries are specified in regular or reverse order. We use a range of 0 to 20,000.
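A minimal sketch of such a query against the SH sample schema's customers table (Example 22-24 itself may select additional columns or restrict the rows shown):

    -- Assign credit limits to 4 equiwidth buckets over [0, 20000),
    -- and show the reversed-bounds variant alongside.
    SELECT cust_credit_limit,
           WIDTH_BUCKET(cust_credit_limit, 0, 20000, 4)  AS bucket,
           WIDTH_BUCKET(cust_credit_limit, 20000, 0, 4)  AS reversed_bucket
    FROM   customers
    ORDER BY cust_credit_limit;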
Linear Algebra. Linear algebra is a branch of mathematics with a wide range of practical applications. Many areas have tasks that can be expressed using linear algebra; here are some examples from several fields: statistics (multiple linear regression and principal components analysis), data mining (clustering and classification), bioinformatics (analysis of microarray data), operations research (supply chain and other optimization problems), econometrics (analysis of consumer demand data), and finance (asset allocation problems). Various libraries for linear algebra are freely available for anyone to use. Oracle's UTL_NLA package exposes matrix PL/SQL data types and wrapper PL/SQL subprograms for two of the most popular and robust of these libraries, BLAS and LAPACK. Linear algebra depends on matrix manipulation. Performing matrix manipulation in PL/SQL in the past required inventing a matrix representation based on PL/SQL's native data types and then writing matrix manipulation routines from scratch. This required substantial programming effort and the performance of the resulting implementation was limited. If developers chose to send data to external packages for processing rather than create their own routines, data transfer back and forth could be time-consuming. Using the UTL_NLA package lets data stay within Oracle, removes the programming effort, and delivers a fast implementation.

Example 22-25 Linear Algebra. Here is an example of how Oracle's linear algebra support could be used for business analysis. It invokes a multiple linear regression application built using the UTL_NLA package. The multiple regression application is implemented in an object called OLSRegression. Note that sample files for the OLS Regression object can be found in $ORACLE_HOME/plsql/demo. Consider the scenario of a retailer analyzing the effectiveness of its marketing program. Each of its stores allocates its marketing budget over the following possible programs: media advertisements (media), promotions (promo), discount coupons (disct), and direct mailers (dmail). The regression analysis builds a linear relationship between the amount of sales that an average store has in a given year (sales) and the spending on the four components of the marketing program. Suppose that the marketing data is stored in the following table: Then you can build the following sales-marketing linear model using coefficients: This model can be implemented as the following view, which refers to the OLS regression object: Using this view, a marketing program manager can perform an analysis such as "Is this sales-marketing model reasonable for year 2004 data? That is, is the multiple correlation greater than some acceptable value, say, 0.9?" The SQL for such a query might be as follows: You could also solve questions such as "What is the expected baseline sales revenue of a store without any marketing programs in 2003?" or "Which component of the marketing program was the most effective in 2004? That is, a dollar increase in which program produced the greatest expected increase in sales?" See Oracle Database PL/SQL Packages and Types Reference for further information regarding the use of the UTL_NLA package and linear algebra.

CASE Expressions. Oracle now supports simple and searched CASE statements. CASE statements are similar in purpose to the DECODE statement, but they offer more flexibility and logical power. They are also easier to read than traditional DECODE statements, and offer better performance as well. They are commonly used when breaking categories into buckets like age (for example, 20-29, 30-39, and so on). The syntax for simple CASE statements is: Simple CASE expressions test whether the expr value equals the comparison_expr. The syntax for searched CASE statements is: You can use any kind of condition in a searched CASE expression, not just an equality test. You can specify only 65,535 arguments, and each WHEN ... THEN pair counts as two arguments. To avoid exceeding this limit, you can nest CASE expressions so that the return_expr itself is a CASE expression.

Example 22-26 CASE. Suppose you wanted to find the average salary of all employees in the company. If an employee's salary is less than 2000, you want the query to use 2000 instead. Without a CASE statement, you might choose to write this query as follows: Note that this runs against the hr sample schema. In this query, foo is a function that returns its input if the input is greater than 2000, and returns 2000 otherwise. The query has performance implications because it needs to invoke a function for each row. Writing custom functions can also add to the development load. Using CASE expressions in the database without PL/SQL, this query can be rewritten as:
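A sketch of that rewrite, using the hr sample schema's employees table as the example above does:

    -- Average salary, treating anything under 2000 as 2000; no custom function needed.
    SELECT AVG(CASE
                 WHEN e.salary > 2000 THEN e.salary
                 ELSE 2000
               END) AS avg_sal_adjusted
    FROM   hr.employees e;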
Using a CASE expression lets you avoid developing custom functions and can also perform faster.

Example 22-27 CASE for Aggregating Independent Subsets. Using CASE inside aggregate functions is a convenient way to perform aggregates on multiple subsets of data when a plain GROUP BY will not suffice. For instance, the preceding example could have included multiple AVG columns in its SELECT list, each with its own CASE expression. We might have a query that finds the average salary for all employees in the salary ranges 0-2000 and 2000-5000. It would look like: Although this query places the aggregates of independent subsets of data into separate columns, by adding a CASE expression to the GROUP BY clause we can display the aggregates as the rows of a single column. The next section shows the flexibility of this approach with two approaches to creating histograms with CASE.

Creating Histograms. You can use the CASE statement when you want to obtain histograms with user-defined buckets (both in number of buckets and width of each bucket). The following are two examples of histograms created with CASE statements. In the first example, the histogram totals are shown in multiple columns and a single row is returned. In the second example, the histogram is shown with a label column and a single column for totals, and multiple rows are returned. Example 22-28 Histogram Example 1. Example 22-29 Histogram Example 2.

Frequent Itemsets. Instead of counting how often a given event occurs (for example, how often someone has purchased milk at the grocery), you may find it useful to count how often multiple events occur together (for example, how often someone has purchased both milk and cereal together at the grocery store). You can count these multiple events using what is called a frequent itemset, which is, as the name implies, a set of items. Some examples of itemsets could be all of the products that a given customer purchased in a single trip to the grocery store (commonly called a market basket), the web pages that a user accessed in a single session, or the financial services that a given customer utilizes. The practical motivation for using a frequent itemset is to find those itemsets that occur most often. If you analyze a grocery store's point-of-sale data, you might, for example, discover that milk and bananas are the most commonly bought pair of items. Frequent itemsets have thus been used in business intelligence environments for many years, with the most common application being market basket analysis in the retail industry. Frequent itemset calculations are integrated with the database, operating on top of relational tables and accessed through SQL. This integration provides the following key benefits: applications that previously relied on frequent itemset operations now benefit from significantly improved performance as well as simpler implementation, and SQL-based applications that did not previously use frequent itemsets can now be easily extended to take advantage of this functionality. Frequent itemsets analysis is performed with the PL/SQL package DBMS_FREQUENT_ITEMSET. See Oracle Database PL/SQL Packages and Types Reference for more information. In addition, there is an example of frequent itemset usage in "Frequent Itemsets".

PostgreSQL vs. MS SQL Server
0. What's this all about? I work as a data analyst in a global professional services firm (one you have certainly heard of). I have been doing this for about a decade. I have spent that decade dealing with data, database software, database hardware, database users, database programmers and data analysis methods, so I know a fair bit about these things. I frequently come into contact with people who know very little about these things - although some of them don't realise it. Over the years I have discussed the issue of PostgreSQL vs. MS SQL Server many, many times. A well-known principle in IT says: if you're going to do it more than once, automate it. This document is my way of automating that conversation. Unless otherwise stated I am referring to PostgreSQL 9.3 and MS SQL Server 2014, even though my experience with MS SQL Server is with versions 2008 R2 and 2012 - for the sake of fairness and relevance I want to compare the latest version of PostgreSQL to the latest version of MS SQL Server. Where I have made claims about MS SQL Server I have done my best to check that they apply to version 2014 by consulting Microsoft's own documentation - although, for reasons I will get to, I have also had to rely largely on Google, Stack Overflow and the users of the internet. I know it's not scientifically rigorous to do a comparison like this when I don't have equal experience with both databases, but this is not an academic exercise - it's a real-world comparison. I have done my honest best to get my facts about MS SQL Server right - we all know it is impossible to bullshit the whole internet. If I find out that I've got something wrong, I'll fix it. I am comparing the two databases from the point of view of a data analyst. Maybe MS SQL Server kicks PostgreSQL's arse as an OLTP backend (although I doubt it), but that's not what I'm writing about here, because I'm not an OLTP developer/DBA/sysadmin. Finally, there is an email address at top right. Do please use it if you wish; I will do my best to respond. DISCLAIMER: all the subjective opinions in here are strictly my own.

1. Why PostgreSQL is way, way better than MS SQL Server Oops, spoiler alert. This section is a comparison of the two databases in terms of features relevant to data analytics.

1.1. CSV support CSV is the de facto standard way of moving structured (i.e. tabular) data around. All RDBMSes can dump data into proprietary formats that nothing else can read, which is fine for backups, replication and the like, but no use at all for migrating data from system X to system Y. A data analytics platform has to be able to look at data from a wide variety of systems and produce outputs that can be read by a wide variety of systems. In practice, this means that it needs to be able to ingest and excrete CSV quickly, reliably, repeatably and painlessly. Let's not understate this: a data analytics platform which cannot handle CSV robustly is a broken, useless liability. PostgreSQL's CSV support is top notch. The COPY TO and COPY FROM commands support the spec outlined in RFC4180 (which is the closest thing there is to an official CSV standard) as well as a multitude of common and not-so-common variants and dialects. These commands are fast and robust. When an error occurs, they give helpful error messages. Importantly, they will not silently corrupt, misunderstand or alter data. If PostgreSQL says your import worked, then it worked properly. The slightest whiff of a problem and it abandons the import and throws a helpful error message. (This may sound fussy or inconvenient, but it is actually an example of a well-established design principle. It makes sense: would you rather find out your import went wrong now, or a month from now when your client complains that your results are off?)
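For example, a CSV round trip in PostgreSQL is a one-liner in each direction (the table name and file paths below are purely illustrative):

    -- Server-side import and export; use psql's \copy instead if the file
    -- lives on the client machine rather than the database server.
    COPY my_table FROM '/tmp/input.csv' WITH (FORMAT csv, HEADER true);
    COPY (SELECT * FROM my_table WHERE amount > 0)
      TO '/tmp/output.csv' WITH (FORMAT csv, HEADER true);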
MS SQL Server can neither import nor export CSV. Most people don't believe me when I tell them this. Then, at some point, they see for themselves. Usually they observe something like: MS SQL Server silently truncating a text field; MS SQL Server's text encoding handling going wrong; MS SQL Server throwing an error message because it doesn't understand quoting or escaping (contrary to popular belief, quoting and escaping are not exotic extensions to CSV - they are fundamental concepts in literally every human-readable data serialisation specification; don't trust anyone who doesn't know what these things are); MS SQL Server exporting broken, useless CSV; Microsoft's horrendous documentation. How did they manage to overcomplicate something as simple as CSV? This is especially baffling because CSV parsers are trivially easy to write (I wrote one in C and plumbed it into PHP a year or two ago, because I wasn't happy with its native CSV-handling functions. The whole thing took perhaps 100 lines of code and three hours - two of which were spent getting to grips with SWIG, which was new to me at the time). If you don't believe me, download this correctly-formatted, standards-compliant UTF-8 CSV file and use MS SQL Server to calculate the average string length (i.e. number of characters) of the last column in this file (it has 50 columns). Go on, try it. (The answer you're looking for is exactly 183.895.) Naturally, determining this is trivially easy in PostgreSQL - in fact, the most time-consuming bit is creating a table with 50 columns to hold the data. Poor understanding of CSV seems to be endemic at Microsoft; that file will break Access and Excel too. Sad but true: some database programmers I know recently spent a lot of time and effort writing Python code which sanitises CSV in order to allow MS SQL Server to import it. They weren't able to avoid changing the actual data in this process, though. This is as crazy as spending a fortune on Photoshop and then having to write some custom code to get it to open a JPEG, only to find that the image has been altered slightly.

1.2. Ergonomics Every data analytics platform worth mentioning is Turing complete, which means, give or take, that any one of them can do anything that any other one can do. There is no such thing as "you can do X in software A but you can't do X in software B". You can do anything in anything - all that varies is how hard it is. Good tools make the things you need to do easy; poor tools make them hard. That's what it always boils down to. (This is all conceptually true, if not literally true - for example, no RDBMS I know of can render 3D graphics. But any one of them can emulate any calculation a GPU can perform.) PostgreSQL is clearly written by people who actually care about getting stuff done. MS SQL Server feels like it was written by people who never have to actually use MS SQL Server to achieve anything. Here are a few examples to back this up: PostgreSQL supports DROP TABLE IF EXISTS, which is the smart and obvious way of saying "if this table doesn't exist, do nothing, but if it does, get rid of it". Something like this: Here's how you have to do it in MS SQL Server: Yes, it's only one extra line of code, but notice the mysterious second parameter to the OBJECT_ID function.
You need to replace that with N'V' to drop a view. It's N'P' for a stored procedure. I haven't learned all the different letters for all the different types of database objects (why should I have to?). Notice also that the table name is repeated unnecessarily. If your concentration slips for a moment, it's dead easy to check for the existence of one table and then name a different one in the DROP. See what's happened there? This is a reliable source of annoying, time-wasting errors.

PostgreSQL supports DROP SCHEMA CASCADE, which drops a schema and all the database objects inside it. This is very, very important for a robust analytics delivery methodology, where tear-down-and-rebuild is the underlying principle of repeatable, auditable, collaborative analytics work. There is no such facility in MS SQL Server. You have to drop all the objects in the schema manually, and in the right order, because if you try to drop an object on which another object depends, MS SQL Server simply throws an error. This gives an idea of how cumbersome this process can be.

PostgreSQL supports CREATE TABLE AS: you write an ordinary SELECT and put CREATE TABLE ... AS in front of it (there is a sketch of this and a couple of the other features at the end of this list). This means you can highlight everything but the first line and execute it, which is a useful and common task when developing SQL code. In MS SQL Server, table creation uses SELECT ... INTO instead, with the INTO clause sitting in the middle of the query. So, to execute the plain SELECT statement, you have to comment out or remove the INTO bit. Yes, commenting out two lines is easy; that's not the point. The point is that in PostgreSQL you can perform this simple task without modifying the code and in MS SQL Server you can't, and that introduces another potential source of bugs and annoyances.

In PostgreSQL, you can execute as many SQL statements as you like in one batch; as long as you've ended each statement with a semicolon, you can execute whatever combination of statements you like. For executing automated batch processes or repeatable data builds or output tasks, this is critically important functionality. In MS SQL Server, a CREATE PROCEDURE statement cannot appear halfway through a batch of SQL statements. There's no good reason for this; it's just an arbitrary limitation. It means that extra manual steps are often required to execute a large batch of SQL. Manual steps increase risk and reduce efficiency.

PostgreSQL supports the RETURNING clause, allowing UPDATE, INSERT and DELETE statements to return values from affected rows. This is elegant and useful. MS SQL Server has the OUTPUT clause, which requires a separate table variable definition to function. This is clunky and inconvenient and forces a programmer to create and maintain unnecessary boilerplate code.

PostgreSQL supports dollar-quoting of string literals. This is extremely useful for generating dynamic SQL because (a) it allows the user to avoid tedious and unreliable manual quoting and escaping when literal strings are nested, and (b) since text editors and IDEs tend not to recognise $$ as a string delimiter, syntax highlighting remains functional even in dynamic SQL code.

PostgreSQL lets you use procedural languages simply by submitting code to the database engine: you write procedural code in Python or Perl or R or JavaScript or any of the other supported languages (see below) right next to your SQL, in the same script. This is convenient, quick, maintainable, easy to review, easy to reuse and so on. In MS SQL Server, you can either use the lumpy, slow, awkward T-SQL procedural language, or you can use a .NET language to make an assembly and load it into the database. This means your code is in two separate places and you have to go through a sequence of GUI-based manual steps to alter it. It makes packaging up all your stuff into one place harder and more error-prone.

And there are plenty more examples out there.
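To make a few of the items above concrete, here is a minimal sketch (made-up table, column and message names; illustrative rather than exhaustive):

    -- CREATE TABLE AS: the SELECT is a complete statement on its own
    CREATE TABLE brisbane_customers AS
    SELECT * FROM customers WHERE city = 'Brisbane';

    -- RETURNING: get the affected rows back without a second query
    UPDATE accounts SET balance = balance - 10 WHERE id = 42
    RETURNING id, balance;

    -- Dollar-quoting plus an inline procedural block (PL/pgSQL in this case)
    DO $$
    BEGIN
        RAISE NOTICE 'procedural code inline, with no escaping headaches';
    END
    $$;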
Each of these things, in isolation, may seem like a relatively minor niggle; however, the overall effect is that getting real work done in MS SQL Server is significantly harder and more error-prone than in PostgreSQL, and data analysts spend valuable time and energy on workarounds and manual processes instead of focusing on the actual problem.

Update: it was pointed out to me that one really useful feature MS SQL Server has which PostgreSQL lacks is the ability to declare variables in SQL scripts; a rough sketch of what that looks like follows below. PostgreSQL can't do this. I wish it could, because there are an awful lot of uses for such a feature.
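For reference, the T-SQL feature in question looks roughly like this (made-up names and values):

    DECLARE @start_date DATE = '2014-01-01';
    DECLARE @min_qty    INT  = 100;

    SELECT * FROM orders
    WHERE order_date >= @start_date AND qty >= @min_qty;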
1.3. You can run PostgreSQL in Linux, BSD etc. (and, of course, Windows)

Anyone who follows developments in IT knows that cross-platform is a thing now. Cross-platform support is arguably the killer feature of Java, which is actually a somewhat lumpy, ugly programming language, but nonetheless enormously successful, influential and widespread. Microsoft no longer has the monopoly it once enjoyed on the desktop, thanks to the rise of Linux and Apple. IT infrastructures are increasingly heterogeneous thanks to the flexibility of cloud services and easy access to high-performance virtualisation technology. Cross-platform software is about giving the user control over their infrastructure. (At work I currently manage several PostgreSQL databases, some in Windows and some in Ubuntu Linux. I and my colleagues freely move code and database dumps between them. We use Python and PHP because they also work in both operating systems. It all just works.)

Microsoft's policy is and always has been vendor lock-in. They don't open-source their code; they don't provide cross-platform versions of their software; they even invented a whole ecosystem, .NET, designed to draw a hard line between Microsoft users and non-Microsoft users. This is good for them, because it safeguards their revenue. It is bad for you, the user, because it restricts your choices and creates unnecessary work for you. (Update: a couple of days after I published this, Microsoft made me look like a prat by announcing that it was open-sourcing .NET. That's a great step, but let's not crack open the Bollinger just yet.)

Now, this is not a Linux vs. Windows document, although I'm sure I'll end up writing one of those at some point. Suffice it to say that, for real IT work, Linux (and the UNIX-like family: Solaris, BSD etc.) leaves Windows in the dust. UNIX-like operating systems dominate the server market, cloud services, supercomputing (in this field it's a near-monopoly) and technical computing, and with good reason – these systems are designed by techies for techies. As a result they trade user-friendliness for enormous power and flexibility. A proper UNIX-like OS is not just a nice command line – it is an ecosystem of programs, utilities, functionality and support that makes getting real work done efficient and enjoyable. A competent Linux hacker can achieve in a single throwaway line of Bash script a task which would be arduous and time-consuming in Windows.

(Example: the other day I was looking through a friend's film collection and he said he thought the total number of files in the file system was high, considering how many films he had, and he wondered if maybe he had accidentally copied a large folder structure into one of his film folders. I did a recursive count of files per folder for him with a Bash one-liner. The whole thing took about a minute to write and a second to run. It confirmed that some of his folders had a problem and told him which ones they were. How would you do this in Windows?)

For data analytics, an RDBMS doesn't exist in a vacuum; it is part of a tool stack. Therefore its environment matters. MS SQL Server is restricted to Windows, and Windows is simply a poor analytics environment.

1.4. Procedural language features

This is a biggie. Pure declarative SQL is good at what it was designed for – relational data manipulation and querying. You quickly reach its limits if you try to use it for more involved analytical processes, such as complex interest calculations, time series analysis and general algorithm design. SQL database providers know this, so almost all SQL databases implement some kind of procedural language. This allows a database user to write imperative-style code for more complex or fiddly tasks.

PostgreSQL's procedural language support is exceptional. It's impossible to do justice to it in a short space, but here's a sample of the goods. Any of these procedural languages can be used for writing stored procedures and functions or simply dumped into a block of code to be executed inline.

PL/pgSQL: this is PostgreSQL's native procedural language. It's like Oracle's PL/SQL, but more modern and feature-complete.

PLV8: the V8 JavaScript engine from Google Chrome is available in PostgreSQL. This engine is stable, feature-packed and absurdly fast – often approaching the execution speed of compiled, optimised C. Combine that with PostgreSQL's native support for the JSON data type (see below) and you have ultimate power and flexibility in a single package. Even better, PLV8 supports global (i.e. cross-function call) state, allowing the user to selectively cache data in RAM for fast random access. Suppose you need to use 100,000 rows of data from table A on each of 1,000,000 rows of data from table B. In traditional SQL, you either need to join these tables (resulting in a 100bn row intermediate table, which will kill any but the most immense server) or do something akin to a scalar subquery (or, worse, cursor-based nested loops), resulting in crippling IO load if the query planner doesn't read your intentions properly. In PLV8 you simply cache table A in memory and run a function on each of the rows of table B – in effect giving you RAM-quality access (negligible latency and random access penalty; no non-volatile IO load) to the 100k-row table. I did this on a real piece of work recently – my PostgreSQL/PLV8 code was about 80 times faster than the MS T-SQL solution and the code was much smaller and more maintainable. Because it took about 23 seconds instead of half an hour to run, I was able to run 20 run-test-modify cycles in an hour, resulting in feature-complete, properly tested, bug-free code. Look here for more detail on this. (All those run-test-modify cycles were only possible because of DROP SCHEMA CASCADE and freedom to execute CREATE FUNCTION statements in the middle of a statement batch, as explained above. See how nicely it all fits together?)

PL/Python: you can use full Python in PostgreSQL.
Python 2 or Python 3, take your pick, and yes, you get the enormous ecosystem of libraries for which Python is justifiably famous. Fancy running an SVM from scikit-learn or some arbitrary-precision arithmetic provided by gmpy2 in the middle of a SQL query? No problem.

PL/Perl: Perl has been falling out of fashion for some time, but its versatility earned it a reputation as the Swiss army knife of programming languages. In PostgreSQL you have full Perl as a procedural language.

PL/R: R is the de facto standard statistical programming environment in academia and data science, and with good reason – it is free, robust, fully-featured and backed by an enormous library of high-quality plugins and add-ons. PostgreSQL lets you use R as a procedural language.

Java, Lua, sh, Tcl, Ruby and PHP are also supported as procedural languages in PostgreSQL.

C: doesn't quite belong in this list because you have to compile it separately, but it's worth a mention. In PostgreSQL it is trivially easy to create functions which execute compiled, optimised C (or C++ or assembler) in the database backend. This is a power user feature which provides unrivalled speed and fine control of memory management and resource usage for tasks where performance is critical. I have used this to implement a complex, stateful payment processing algorithm operating on a million rows of data per second – and that was on a desktop PC.

MS SQL Server's inbuilt procedural language (part of their T-SQL extension to SQL) is clunky, slow and feature-poor. It is also prone to subtle errors and bugs, as Microsoft's own documentation sometimes acknowledges. I have never met a database user who likes the T-SQL procedural language.

What about the fact that you can make assemblies in .NET languages and then use them in MS SQL Server? This doesn't count as procedural language support because you can't submit this code to the database engine directly. Manageability and ergonomics are critically important. Inserting some Python code inline in your database query is easy and convenient; firing up Visual Studio, managing projects and throwing DLL files around (all in GUI-based processes which cannot be properly scripted, version-controlled, automated or reviewed) is awkward, error-prone and non-scalable. In any case, this mechanism is limited to .NET languages.

1.5. Native regular expression support

Regular expressions (regexen or regexes) are as fundamental to analytics work as arithmetic – they are the first choice (and often the only choice) for a huge variety of text processing tasks. A data analytics tool without regex support is like a bicycle without a saddle – you can still use it, but it's painful.

PostgreSQL has smashing out-of-the-box support for regex. Some example tasks it handles cleanly, sketched below: get all lines starting with a repeated digit followed by a vowel; get the first isolated hex string occurring in a field; break a string on whitespace and return each fragment in a separate row; case-insensitively find all words in a string with at least 10 letters.

MS SQL Server has LIKE, SUBSTRING, PATINDEX and so on, which are not comparable to proper regex support (if you doubt this, try implementing the above examples using them). There are third-party regex libraries for MS SQL Server; they're just not as good as PostgreSQL's support, and the need to obtain and install them separately adds admin overhead. Note also that PostgreSQL's extensive procedural language support also gets you several other regex engines and their various features – e.g. Python's regex library provides the added power of positive and negative lookbehind assertions.
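Rough sketches of the four example tasks listed above (hypothetical tables lines and t with text columns; the exact patterns are illustrative):

    -- All lines starting with a repeated digit followed by a vowel
    SELECT line FROM lines WHERE line ~ '^(\d)\1[aeiou]';

    -- The first isolated hex string occurring in a field
    SELECT substring(field FROM '\m[0-9a-fA-F]+\M') FROM t;

    -- Break a string on whitespace, returning one fragment per row
    SELECT regexp_split_to_table(my_string, '\s+') FROM t;

    -- Case-insensitively find all words with at least 10 letters
    SELECT (regexp_matches(my_string, '\m[a-z]{10,}\M', 'gi'))[1] FROM t;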
This is in keeping with the general theme of PostgreSQL giving you all the tools you need to actually get things done.

1.6. Custom aggregate functions

This is a feature that, technically, is offered by both PostgreSQL and MS SQL Server. The implementations differ hugely, though. In PostgreSQL, custom aggregates are convenient and simple to use, resulting in fast problem-solving and maintainable code (the general shape of the feature is sketched at the end of this section). A custom aggregate is specified in terms of an internal state and a way to modify that state when we push new values into the aggregate function. In the example I use for this – computing monthly compounding interest on bank accounts – we start each customer off with zero balance and no interest accrued, and on each day we accrue interest appropriately and account for payments and withdrawals. We compound the interest on the 1st of every month. The aggregate accepts an ORDER BY clause (since, unlike SUM, MAX and MIN, this aggregate is order-dependent), and PostgreSQL provides operators for extracting values from JSON objects. So, in 28 lines of code we've created the framework for monthly compounding interest on bank accounts and used it to calculate final balances. If features are to be added to the methodology (e.g. interest rate modifications depending on debit/credit balance, detection of exceptional circumstances), it's all right there in the transition function and is written in an appropriate language for implementing complex logic. (Tragic side-note: I have seen large organisations spend tens of thousands of pounds over weeks of work trying to achieve the same thing using poorer tools.)

MS SQL Server, on the other hand, makes it absurdly difficult. Incidentally, the examples in the second link are for implementing a simple string concatenation aggregate. Note the huge amount of code and gymnastics required to implement this simple function (which, incidentally, PostgreSQL provides out of the box – probably because it's useful). MS SQL Server also does not allow an order to be specified in the aggregate, which renders this function useless for my kind of work – with MS SQL Server, the order of string concatenation is random, so the results of a query using this function are non-deterministic (they might change from run to run) and the code will not pass a quality review. The lack of ordering support also breaks code such as the interest calculation example above. As far as I can tell, you just can't do this using an MS SQL Server custom aggregate. (It is actually possible to make MS SQL Server do a deterministic string concatenation aggregation in pure SQL, but you have to abuse the RECURSIVE query functionality to do it. Although an interesting academic exercise, this results in slow, unreadable, unmaintainable code and is not a real-world solution.)
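My 28-line interest aggregate is too long to reproduce here, but the general shape of a PostgreSQL custom aggregate is roughly this (a toy running-product aggregate, with made-up names):

    -- A state type, a transition function and an initial condition are all an aggregate needs.
    CREATE FUNCTION product_step(state numeric, next_value numeric)
    RETURNS numeric
    LANGUAGE sql
    AS $$ SELECT state * next_value $$;

    CREATE AGGREGATE product(numeric) (
        SFUNC    = product_step,
        STYPE    = numeric,
        INITCOND = '1'
    );

    -- Usage, including an ORDER BY inside the call (irrelevant for a product,
    -- but essential for order-dependent aggregates like interest accrual):
    SELECT product(1 + daily_rate ORDER BY day) FROM daily_rates;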
1.7. Unicode support

Long gone are the days when ASCII was universal, "character" and "byte" were fungible terms and foreign (from an Anglocentric standpoint) text was an exotic exception. Proper international language support is no longer optional. The solution to all this is Unicode. There are a lot of misconceptions about Unicode out there. It's not a character set, it's not a code page, it's not a file format and it's nothing whatsoever to do with encryption. An exploration of how Unicode works is fascinating but beyond the scope of this document – I heartily recommend Googling it and working through a few examples.

The key points about Unicode that are relevant to database functionality are these. Unicode-encoded text (for our purposes this means either UTF-8 or UTF-16) is a variable-width encoding. In UTF-8 a character can take one, two, three or four bytes to represent; in UTF-16 it's either two or four. This means that operations like taking substrings and measuring string lengths need to be Unicode-aware to work properly. Not all sequences of bytes are valid Unicode, and manipulating valid Unicode without knowing it's Unicode is likely to produce something that is not valid Unicode. UTF-8 and UTF-16 are not compatible: if you take one file of each type and concatenate them, you (probably) end up with a file which is neither valid UTF-8 nor valid UTF-16. And for text which mostly fits into ASCII, UTF-8 is about twice as space-efficient as UTF-16.

PostgreSQL supports UTF-8. Its CHAR, VARCHAR and TEXT types are, by default, UTF-8, meaning they will only accept UTF-8 data, and all the transformations applied to them, from string concatenation and searching to regular expressions, are UTF-8-aware. It all just works.

MS SQL Server 2008 does not support UTF-16; it supports UCS-2, a deprecated subset of UTF-16. What this means is that most of the time it will look like it's working fine, and occasionally it will silently corrupt your data. Since it interprets text as a string of wide (i.e. 2-byte) characters, it will happily cut a 4-byte UTF-16 character in half. At best, this results in corrupted data. At worst, something else in your toolchain will break badly and you'll have a disaster on your hands. Apologists for MS are quick to point out that this is unlikely because it would require the data to contain something outside Unicode's basic multilingual plane. This is completely missing the point. A database's sole purpose is storing, retrieving and manipulating data. A database which can be broken by putting the wrong data in it is as useless as a router that breaks if you download the wrong file.

MS SQL Server versions since 2012 have supported UTF-16 properly, if you ensure you select a UTF-16-compliant collation for your database. It is baffling that this is (a) optional and (b) implemented as late as 2012. Better late than never, I suppose.

1.8. Data types that work properly

A common misconception is that all databases have the same types – INT, CHAR, DATE and so on. This is not true. PostgreSQL's type system is really useful and intuitive, free of annoyances which introduce bugs or slow work down and, as usual, apparently designed with productivity in mind. MS SQL Server's type system, by comparison, feels like beta software. It can't touch the feature set of PostgreSQL's type system and it is beset with traps waiting to ensnare the unwary user. Let's take a look.

CHAR, VARCHAR and family

PostgreSQL: the docs actively encourage you to simply use the TEXT type. This is a high-performance, UTF-8 validated text storage type which stores strings up to 1GB in size. It supports all the text operations PostgreSQL is capable of: simple concatenation and substringing; regex searching, matching and splitting; full-text search; casting; character transformation; and so on. If you have text data, stick it in a TEXT field and carry on. Moreover, since anything in a TEXT field (or, for that matter, CHAR or VARCHAR fields) must be UTF-8, there is no issue with encoding incompatibility. Since UTF-8 is the de facto universal text encoding, converting text to it is easy and reliable.
Since UTF-8 is a superset of ASCII, this conversion is often trivially easy or altogether unnecessary. It all just works.

MS SQL Server: it's a pretty sad story. The TEXT and NTEXT types exist and stretch to 2GB. Bafflingly, though, they don't support casting. Also, don't use them, says MS – they will be removed in a future version of MS SQL Server. You should use CHAR, VARCHAR and their N-prefixed versions instead. Unfortunately, VARCHAR(MAX) has poor performance characteristics and VARCHAR(8000) (the next biggest size, for some reason) tops out at 8,000 bytes. (It's 4,000 characters for NVARCHAR.) Remember how PostgreSQL's insistence on a single text encoding per database makes everything work smoothly? Not so in MS-land: "As with earlier versions of SQL Server, data loss during code page translations is not reported." (link) In other words, MS SQL Server might corrupt your data, and you won't know about it until something else goes wrong. This is, quite simply, a deal-breaker. A data analytics platform which might silently change, corrupt or lose your data is an enormous liability. Consider the absurdity of forking out for a server using expensive ECC RAM as a defence against data corruption caused by cosmic rays, and then running software on it which might corrupt your data anyway.

Date and time types

PostgreSQL: you get DATE, TIME, TIMESTAMP and TIMESTAMP WITH TIME ZONE, all of which do exactly what you would expect. They also have fantastic range and precision, supporting microsecond resolution from the 5th millennium BC to almost 300 millennia in the future. They accept input in a wide variety of formats and the last one has full support for time zones. They can be converted to and from Unix time, which is very important for interoperability with other systems.

They can also take the special values infinity and -infinity. This is not a metaphysico-theologico-philosophical statement, but a hugely useful semantic construction. For example, set a user's password expiry date to infinity to denote that they do not have to change their password. The standard way of doing this is to use NULL or some date far in the future, but these are clumsy hacks – they both involve putting inaccurate information in the database and writing application logic to compensate. What happens when a developer sees NULL or 3499-12-31? If you're lucky, he knows the secret handshakes and isn't confused by it. If not, he assumes either that the date is unknown or that it really does refer to the 4th millennium, and you have a problem. The cumulative effect of hacks, workarounds and kludges like this is unreliable systems, unhappy programmers and increased business risk. Helpful semantics like infinity and -infinity allow you to say what you mean and write consistent, readable application logic. They also support the INTERVAL type, which is so useful it has its own section right after this one.

Casting and conversion of date and time types is easy and intuitive – you can cast any type to TEXT, and the to_char and to_timestamp functions give you ultimate flexibility, allowing conversion in both directions using format strings; an example in each direction is sketched below. As usual, it just works. As a data analyst, I care very much about a database's date-handling ability, because dates and times tend to occur in a multitude of different formats and they are usually critical to the analysis itself.
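A round trip with to_char and to_timestamp might look roughly like this (format strings chosen purely for illustration):

    SELECT to_char(TIMESTAMP '2001-02-03 04:05:06', 'FMDay DD Mon YYYY');
    -- 'Saturday 03 Feb 2001'

    SELECT to_timestamp('03 Feb 2001 04:05', 'DD Mon YYYY HH24:MI');
    -- 2001-02-03 04:05:00, as a timestamp with time zone in the session's zone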
MS SQL Server: dates can only have positive 4-digit years, so they are restricted to 0001 AD to 9999 AD. They do not support infinity and -infinity. They do not support interval types, so date arithmetic is tedious and clunky. You can convert them to and from Unix time, but it's a hack involving adding seconds to the Unix epoch, 1970-01-01T00:00:00Z, which you therefore have to know and be willing to hardcode into your application.

Date conversion deserves a special mention, because even by MS SQL Server's shoddy standards it's bloody awful. The CONVERT function takes the place of PostgreSQL's to_char and to_timestamp, but it works by taking a magic "style" number as its third argument: you're simply expected to know that 126 is the code for converting strings in ISO 8601 format to a datetime. MSDN provides a table of these magic numbers. I didn't give the same example as for PostgreSQL because I couldn't find a magic number corresponding to the right format for "Saturday 03 Feb 2001". If someone gave you data with such dates in it, I guess you'd have to do some string manipulation (pity the string manipulation facilities in MS SQL Server are almost non-existent).

INTERVAL

PostgreSQL: the INTERVAL type represents a period of time, such as 30 microseconds or 50 years. It can also be negative, which may seem counterintuitive until you remember that the word "ago" exists. PostgreSQL also knows about "ago", in fact, and will accept strings like '1 day ago' as interval values (this will be internally represented as an interval of -1 days). Interval values let you do intuitive date arithmetic and store time durations as first-class data values. They work exactly as you expect and can be freely cast and converted to and from anything which makes sense.

MS SQL Server: no support for interval types.

Arrays

PostgreSQL: arrays are supported as a first-class data type, meaning fields in tables, variables in PL/pgSQL, parameters to functions and so on can be arrays. Arrays can contain any data type you like, including other arrays. This is very, very useful. Here are some of the things you can do with arrays: store the results of function calls with arbitrarily-many return values, such as regex matches; represent a string as integer word IDs, for use in fast text matching algorithms; aggregate multiple data values across groups, for efficient cross-tabulation; perform row operations using multiple data values without the expense of a join; accurately and semantically represent array data from other applications in your tool stack; and feed array data to other applications in your tool stack. I can't think of any programming languages which don't support arrays, other than crazy ones like Brainfuck and Malbolge. Arrays are so useful that they are ubiquitous. Any system, especially a data analytics platform, which doesn't support them is crippled.

MS SQL Server: no support for arrays.

JSON

PostgreSQL: full support for JSON, including a large set of utility functions for transforming between JSON types and tables (in both directions), retrieving values from JSON data and constructing JSON data. Parsing and stringification are handled by simple casts, which as a rule in PostgreSQL are intelligent and robust. The PLV8 procedural language works as seamlessly as you would expect with JSON – in fact, a JSON-type internal state in a custom aggregate (see this example) whose transition function is written in PLV8 provides a declarative/imperative best-of-both-worlds so powerful and convenient it feels like cheating. A couple of quick array and JSON illustrations are sketched below.
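Two minimal illustrations of the array and JSON types (made-up values and table names):

    -- Arrays are first-class values: build them, index them, aggregate into them, unnest them
    SELECT (ARRAY['red','green','blue'])[2];          -- 'green' (arrays are 1-indexed)
    SELECT array_agg(DISTINCT city) FROM customers;   -- collapse a column into an array
    SELECT unnest(ARRAY[1,2,3]);                      -- one row per element

    -- JSON values can be parsed with a cast and picked apart with operators
    SELECT '{"customer": "alice", "payments": [50, 75]}'::json -> 'payments' ->> 0;  -- '50'
    SELECT row_to_json(t) FROM (SELECT 1 AS id, 'x' AS label) AS t;  -- {"id":1,"label":"x"}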
JSON (and its variants, such as JSONB) is of course the de facto standard data transfer format on the web and in several other data platforms, such as MongoDB and ElasticSearch, and in fact in any system with a RESTful interface. Aspiring Analytics-as-a-Service providers take note.

MS SQL Server: no support for JSON.

HSTORE

PostgreSQL: HSTORE is a PostgreSQL extension which implements a fast key-value store as a data type. Like arrays, this is very useful because virtually every high-level programming language has such a concept (and virtually every programming language has such a concept because it is very useful). JavaScript has objects, PHP has associative arrays, Python has dicts, C++ has std::map and std::unordered_map, Go has maps. And so on and so forth. In fact, the notion of a key-value store is so important and useful that there exists a whole class of NoSQL databases which use it as their main storage paradigm. They're called, uh, key-value stores.

There are also some fun unexpected uses of such a data type. A colleague recently asked me if there was a good way to deduplicate a text array. The trick I came up with, sketched below, is to put the array into both the keys and the values of an HSTORE, forcing a dedupe to take place (since key values are unique), then retrieve the keys from the HSTORE. There's that PostgreSQL versatility again.
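A minimal sketch of the trick (assuming the hstore extension is installed):

    -- hstore(keys, values) builds a key-value store from two text arrays; duplicate
    -- keys collapse, and akeys() hands the surviving keys back as an array.
    SELECT akeys(hstore(ARRAY['a','b','a','c'], ARRAY['a','b','a','c']));
    -- {a,b,c} (the order of the keys is not guaranteed)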
MS SQL Server: no support for key-value storage.

Range types

PostgreSQL: range types represent, well, ranges. Every database programmer has seen fields called start_date and end_date, and most of them have had to implement logic to detect overlaps. Some have even found, the hard way, that joins to ranges using BETWEEN can go horribly wrong, for a number of reasons. PostgreSQL's approach is to treat time ranges as first-class data types. Not only can you put a range of time (or INTs or NUMERICs or whatever) into a single data value, you can use a host of built-in operators to manipulate and query ranges safely and quickly. You can even apply specially-developed indices to them to massively accelerate queries that use these operators. In short, PostgreSQL treats ranges with the importance they deserve and gives you the tools to work with them effectively. I'm trying not to make this document a mere list of links to the PostgreSQL docs, but just this once, I suggest you go and see for yourself. (Oh, and if the pre-defined types don't meet your needs, you can define your own ones. You don't have to touch the source code; the database exposes methods to allow you to do this.)

MS SQL Server: no support for range types.

NUMERIC and DECIMAL

PostgreSQL: NUMERIC (and DECIMAL – they're synonyms) is near-as-dammit arbitrary precision: it supports 131,072 digits before the decimal point and 16,383 digits after the decimal point. If you're running a bank, doing technical computation, landing spaceships on comets or simply doing something where you cannot tolerate rounding errors, you're covered.

MS SQL Server: NUMERIC (and DECIMAL – they're synonyms) supports a maximum of 38 decimal places of precision in total.

XML

PostgreSQL: XML is supported as a data type and the database offers a variety of functions for working with XML. XPath querying is supported.

MS SQL Server: finally, some good news! MS SQL Server has an XML data type too, and offers plenty of support for working with it. (Shame XML is going out of style.)

1.9. Scriptability

PostgreSQL can be driven entirely from the command line, and since it works in operating systems with proper command lines (i.e. everything except Windows), this is highly effective and secure. You can SSH to a server and configure PostgreSQL from your mobile phone, if you have to (I have done so more than once). You can automate deployment, performance-tuning, security, admin and analytics tasks with scripts. Scripts are very important because, unlike GUI processes, they can be copied, version-controlled, documented, automated, reviewed, batched and diffed. For serious work, text editors and command lines are king.

MS SQL Server is driven through a GUI. I don't know to what extent it can be automated with PowerShell; I do know that if you Google for help and advice on getting things done in MS SQL Server, you get a lot of people saying "right-click on your database, then click on Tasks...". GUIs do not work well across low-bandwidth or high-latency connections; text-based shells do. As I write I am preparing to do some sysadmin on a server 3,500 miles away, on a VPN via a shaky WiFi hotspot, and thanking my lucky stars it's an Ubuntu/PostgreSQL box. (Who on Earth wants a GUI on a server anyway?)

1.10. Good external language bindings

PostgreSQL is very, very easy to connect to and use from programming environments, because libpq, its external API, is very well-designed and very well-documented. This means that writing utilities which plug into PostgreSQL is very easy and convenient, which makes the database more versatile and a better fit in an analytics stack. On many occasions I have knocked up a quick program in C or C++ which connects to PostgreSQL, pulls some data out and does some heavy calculations on it, e.g. using multithreading or special CPU instructions – stuff the database itself is not suitable for. I have also written C programs which use setuid to allow normal users to perform certain administrative tasks in PostgreSQL. It is very handy to be able to do this quickly and neatly.

MS SQL Server's external language bindings vary. Sometimes you have to install extra drivers. Sometimes you have to create classes to store the data you are querying, which means knowing at compile time what that data looks like. Most importantly, the documentation is a confusing, tangled mess, which makes getting this done unnecessarily time-consuming and painful.

1.11. Documentation

Data analytics is all about being a jack of all trades. We use a very wide variety of programming languages and tools. (Off the top of my head, the programming/scripting languages I currently work with are PHP, JavaScript, Python, R, C, C++, Go, three dialects of SQL, PL/pgSQL and Bash.) It is hopelessly unrealistic to expect to learn everything you will need to know up front. Getting stuff done frequently depends on reading documentation. A well-documented tool is more useful and allows analysts to be more productive and produce higher-quality work.

PostgreSQL's documentation is excellent. Everything is covered comprehensively, but the documents are not merely reference manuals – they are full of examples, hints, useful advice and guidance. If you are an advanced programmer and really want to get stuck in, you can also simply read PostgreSQL's source code, all of which is openly and freely available. The docs also have a sense of humour: "The first century starts at 0001-01-01 00:00:00 AD, although they did not know it at the time. This definition applies to all Gregorian calendar countries. There is no century number 0, you go from -1 century to 1 century. If you disagree with this, please write your complaint to: Pope, Cathedral Saint-Peter of Roma, Vatican."
MS SQL Server's documentation is all on MSDN, which is an unfriendly, sprawling mess. Because Microsoft is a large corporation and its clients tend to be conservative and humourless, the documentation is "business appropriate" – i.e. officious, boring and dry. Not only does it lack amusing references to the historical role of Catholicism in the development of date arithmetic, it is impenetrably stuffy and hidden behind layers of unnecessary categorisation and ostentatiously capitalised official terms. Try this: go to the product documentation page for MS SQL Server 2012 and try to get from there to something useful. Or try reading this gem (not cherry-picked, I promise): "A report part definition is an XML fragment of a report definition file. You create report parts by creating a report definition, and then selecting report items in the report to publish separately as report parts." Has the word "report" started to lose its meaning yet?

(And, of course, MS SQL Server is closed source, so you can't look at the source code. Yes, I know source code is not the same as documentation, but it is occasionally surprisingly useful to be able to simply grep the source for a relevant term and cast an eye over the code and the comments of the developers. It's easy to think of our tools as magical black boxes and to forget that even something as huge and complex as an RDBMS engine is, after all, just a list of instructions written by humans in a human-readable language.)

1.12. Logging that's actually useful

MS SQL Server's logs are spread across several places – error logs, Windows event log, profiler logs, agent logs and setup log. To access these you need varying levels of permissions and you have to use various tools, some of which are GUI-only. Maybe things like Splunk can help to automate the gathering and parsing of these logs. I haven't tried, nor do I know anyone else who has. Google searches on the topic produce surprisingly little information, surprisingly little of which is of any use.

PostgreSQL's logs, by default, are all in one place. By changing a couple of settings in a text file, you can get it to log to CSV (and since we're talking about PostgreSQL, it's proper CSV, not broken CSV). You can easily set the logging level anywhere from "don't bother logging anything" to full profiling and debugging output. The documentation even contains DDL for a table into which the CSV-format logs can be conveniently imported. You can also log to stderr or the system log or to the Windows event log (provided you're running PostgreSQL in Windows, of course).

The logs themselves are human-readable and machine-readable and contain data likely to be of great value to a sysadmin. Who logged in and out, at what times, and from where? Which queries are being run and by whom? How long are they taking? How many queries are submitted in each batch? Because the data is well-formatted CSV, it is trivially easy to visualise or analyse it in R or PostgreSQL itself or Python's matplotlib or whatever you like. Overlay this with the wealth of information that Linux utilities like top, iotop and iostat provide and you have easy, reliable access to all the server telemetry you could possibly need.

1.13. Support

How is PostgreSQL going to win this one? Everyone knows that expensive flagship enterprise products by big commercial vendors have incredible support, whereas free software doesn't have any. Of course, this is nonsense. Commercial products have support from people who support it because they are paid to.
They do the minimum amount necessary to satisfy the terms of the SLA. As I type this, some IT professionals I know are waiting for a major hardware vendor to help them with a performance issue in a £40,000 server. They've been discussing it with the vendor for weeks; they've spent time and effort running extensive tests and benchmarks at the vendor's request; and so far the vendor's reaction has been a mixture of incompetence, fecklessness and apathy. The £40,000 server is sitting there performing very, very slowly, and its users are working 70-hour weeks to try to stay on schedule.

Over the years I have seen many, many problems with expensive commercial software – everything from bugs to performance issues to incompatibility to insufficient documentation. Sometimes these problems cause a late night or a lost weekend for the user; sometimes they cause missed deadlines and angry clients; sometimes it goes as far as legal and reputational risk. Every single time, the same thing happens: the problem is fixed by the end users, using a combination of blood, sweat, tears, Google and late nights. I have never seen the vendor swoop to the rescue and make everything OK.

So what is the support for PostgreSQL like? On the two occasions I have asked the PostgreSQL mailing list for help, I have received replies from Tom Lane within 24 hours. Take a moment to click on the link and read the wiki – the guy is not just a lead developer of PostgreSQL, he's a well-known computer programmer. Needless to say, his advice is as good as advice gets. On one of the occasions, where I asked a question about the best way to implement cross-function call persistent memory allocation, Lane replied with the features of PostgreSQL I should study and suggested solutions to my problem – and for good measure he threw in a list of very good reasons why my tentative solution (a C static variable) was rubbish. You can't buy that kind of support, but you can get it from a community of enthusiastic open source developers. Oh, did I mention that the total cost of the database software and the helpful advice and recommendations from the acclaimed programmer was £0.00?

Note that by "support" I mean help getting it to work properly. Some people (usually people who don't actually use the product) think of support contracts more in terms of legal coverage – they're not really interested in whether help is forthcoming or not, but they like that there's someone to shout at and, more importantly, blame. I discuss this too, here.

(And if you're really determined to pay someone to help you out, you can of course go to any of the organisations which provide professional support for PostgreSQL. Unlike commercial software vendors, whose support functions are secondary to their main business of selling products, these organisations live or die by the quality of the support they provide, so it is very good.)

1.14. Flexible, scriptable database dumps

I've already talked about scriptability, but database dumps are very important, so they get their own bit here. PostgreSQL's dump utility is extremely flexible, command-line driven (making it easily automatable and scriptable) and well-documented (like the rest of PostgreSQL). This makes database migration, replication and backups – three important and scary tasks – controllable, reliable and configurable. Moreover, backups can be in a space-efficient compressed format or in plain SQL, complete with data, making them both human-readable and executable.
A backup can be of a single table or of a whole database cluster. The user gets to do exactly as he pleases. With a little work and careful selection of options, it is even possible to make a DDL-only plain SQL PostgreSQL backup executable in a different RDBMS. MS SQL Server's backups are in a proprietary, undocumented, opaque binary format.

1.15. Reliability

Neither PostgreSQL nor MS SQL Server are crash-happy, but MS SQL Server does have a bizarre failure mode which I have witnessed more than once: its transaction logs become enormous and prevent the database from working. In theory the logs can be truncated or deleted, but the documentation is full of dire warnings against such action. PostgreSQL simply sits there working and getting things done. I have never seen a PostgreSQL database crash in normal use.

PostgreSQL is relatively bug-free compared to MS SQL Server. I once found a bug in PostgreSQL 8.4 – it was performing a string distance calculation algorithm wrongly. This was a problem for me because I needed to use the algorithm in some fuzzy deduplication code I was writing for work. I looked up the algorithm on Wikipedia, gained a rough idea of how it works, found the implementation in the PostgreSQL source code, wrote a fix and emailed it to one of the PostgreSQL developers. In the next release of PostgreSQL, version 9.0, the bug was fixed. Meanwhile, I applied my fix to my own installation of PostgreSQL 8.4, re-compiled it and kept working. This will be a familiar story to many of the users of PostgreSQL, and indeed of any large piece of open source software. The community benefits from high-quality free software, and individuals with the appropriate skills do what they can to contribute. Everyone wins. With a closed-source product, you can't fix it yourself – you just raise a bug report, cross your fingers and wait. If MS SQL Server were open source, section 1.1 above would not exist, because I (and probably thousands of other frustrated users) would have damn well written a proper CSV parser and plumbed it in years ago.

1.16. Ease of installing and updating

Does this matter? Well, yes. Infrastructure flexibility is more important than ever and that trend will only continue. Gone are the days of the big fat server install which sits untouched for years on end. These days it's all about fast, reliable, flexible provisioning and keeping up with cutting-edge features. Also, as the saying goes, time is money.

I have installed MS SQL Server several times. I have installed PostgreSQL more times than I can remember – probably at least 50 times. Installing MS SQL Server is very slow. It involves immense downloads (who still uses physical install media?) and lengthy, important-sounding processes with stately progress bars. It might fail if you don't have the right version of .NET or the right Windows service pack installed. It's the kind of thing your sysadmin needs to find a solid block of time for.

Installing PostgreSQL the canonical way – from a Linux repo – is as easy as typing a single command. How long does it take? I just tested this by spinning up a cheap VM in the cloud and installing PostgreSQL that way. It took 16 seconds. That's the total time for the download and the install. As for updates, any software backed by a Linux repo is trivially easily patched and updated by pulling updates from the repo. Because repos are clever and PostgreSQL is not obscenely bloated, downloads are small and fast and application of updates is efficient.
I don't know how easy MS SQL Server is to update. I do know that a lot of production MS SQL Server boxes in certain organisations are still on version 2008 R2, though.

1.17. The contrib modules

As if the enormous feature set of PostgreSQL is not enough, it comes with a set of extensions called contrib modules. There are libraries of functions, types and utilities for doing certain useful things which don't quite fall into the core feature set of the server. There are libraries for fuzzy string matching, fast integer array handling, external database connectivity, cryptography, UUID generation, tree data types and loads, loads more. A few of the modules don't even do anything except provide templates to allow developers and advanced users to develop their own extensions and custom functionality. Of course, these extensions are trivially easy to install. For example, to install the fuzzystrmatch extension you do this:
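(The standard one-liner, as given in the PostgreSQL documentation:)

    CREATE EXTENSION fuzzystrmatch;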
1.18. It's free

PostgreSQL is free as in freedom and free as in beer. Both types of free are extremely important. The first kind, free as in freedom, means PostgreSQL is open-source and very permissively licensed. In practical terms, this means that you can do whatever you want with it, including distributing software which includes it or is based on it. You can modify it in whatever way you see fit, and then you can distribute the modifications to whomever you like. You can install it as many times as you like, on whatever you like, and then use it for any purpose you like. The second kind, free as in beer, is important for two main reasons. The first is that if, like me, you work for a large organisation, spending that organisation's money involves red tape. Red tape means delays, and delays sap everyone's energy and enthusiasm and suppress innovation. The second reason is that because PostgreSQL is free, many developers, experimenters, hackers, students, innovators, scientists and so on (the brainy-but-poor crowd, essentially) use it, and it develops a wonderful community. This results in great support (as I mentioned above) and contributions from the intellectual elite. It results in a better product, more innovation, more solutions to problems and more time and energy spent on the things that really matter.

2. The counterarguments

For reasons which have always eluded me, people often like to ignore all the arguments and evidence above and try to dismiss the case for PostgreSQL using misconceptions, myths, red herrings and outright nonsense. Stuff like this:

2.1. But a big-name vendor provides a safety net

No it doesn't. This misconception is a variant of the old adage "no-one ever got fired for buying IBM". Hilariously, if you type that into Google, the first hit is the Wikipedia article on fear, uncertainty and doubt – and even more hilariously, the first entry in the examples section is Microsoft. I promise I did not touch the Wikipedia article; I simply found it like that.

In client-serving data analytics, you just have to get it right. If you destroy your reputation by buggering up an important job, your software vendor will not build you a new reputation. If you get sued, then maybe you can recover costs from your vendor – but only if they did something wrong. Microsoft isn't doing anything technically wrong with MS SQL Server; they're simply releasing a terrible product and being up front about how terrible it is. The documentation admits it's terrible. It works exactly as designed; the problem is that the design is terrible. You can't sue Microsoft just because you didn't do your due diligence when you picked a database.

Even if you somehow do successfully blame the vendor, you still have a messed up job and an angry client, who won't want to hear about MS SQL Server's unfortunate treatment of UTF-16 text as UCS-2, resulting in truncation of a surrogate pair during a substring operation and subsequent failure to identify an incriminating keyword. At best they will continue to demand results (and probably a discount); at worst, they will write you off as incompetent – and who could blame them, when you trusted their job to an RDBMS whose docs unapologetically acknowledge that it might silently corrupt your data? Since the best way to minimise risk is to get the job done right, the best tool to use is the one which is most likely to let you accomplish that. In this case, that's PostgreSQL.

2.2. But what happens if the author of PostgreSQL dies?

Same thing that happens if the author of MS SQL Server dies – nothing. Also, needless to say, "the author of PostgreSQL" is as meaningless as "the author of MS SQL Server". There's no such thing. A senior individual with an IT infrastructure oversight role actually asked me this question once (about Hadoop, not PostgreSQL). There just seems to be a misconception that all open-source software is written by a loner who lives in his mum's basement. This is obviously not true. Large open source projects like PostgreSQL and Hadoop are written by teams of highly skilled developers who are often commercially sponsored. At its heart, the development model of PostgreSQL is just like the development model of MS SQL Server: a large team of programmers is paid by an organisation to write code. There is no single point of failure. There is at least one key difference, though: PostgreSQL's source code is openly available and is therefore reviewed, tweaked, contributed to, improved and understood by a huge community of skilled programmers. That's one of the reasons why it's so much better.

Crucially, because open-source software tends to be written by people who care deeply about its quality (often because they have a direct personal stake in ensuring that the software works as well as possible), it is often of the very highest standard (PostgreSQL, Linux, MySQL, XBMC, Hadoop, Android, VLC, Neo4J, Redis, 7Zip, FreeBSD, golang, PHP, Python, R, Nginx, Apache, node.js, Chrome, Firefox...). On the other hand, commercial software is often designed by committee, written in cube farms and developed without proper guidance or inspiration (Microsoft BOB, RealPlayer, Internet Explorer 6, iOS Maps, Lotus Notes, Windows ME, Windows Vista, QuickTime, SharePoint...).

2.3. But open-source software isn't secure/reliable/trustworthy/enterprise-ready/etc.

There's no kind way to say this: anyone who says such a thing is very ignorant, and you should ignore them – or, if you're feeling generous, educate them. Well, I guess I'm feeling generous.

Security: the idea that closed-source is more secure is an old misconception, for many good reasons which I will briefly summarise (but do read the links – they're excellent): secrecy isn't the same as security; an open review process is more likely to find weaknesses than a closed one; and properly reviewed open source software is difficult or impossible to build a back door into.
If you prefer anecdotal evidence to logical arguments, consider that Microsoft Internet Explorer 6, once a flagship closed-source commercial product, is widely regarded as the least secure software ever produced, and that Rijndael, the algorithm behind AES, which governments the world over use to protect top secret information, is an open standard. In any case, relational databases are not security software. In the IT world, security is a bit like "support our troops" in the USA or "think of the children" in the UK – a trump card which overrules all other considerations, including common sense and evidence. Don't fall for it.

Reliability: Windows was at one point renowned for its instability, although these days things are much better. (Supposedly, Windows 9x would spontaneously crash when its internal uptime counter, counting in milliseconds, exceeded the upper bound of an unsigned 32-bit integer, i.e. after 2^32 milliseconds or about 49.7 days. I have always wanted to try this.) Linux dominates the server space, where reliability is key, and Linux boxes routinely achieve uptimes measured in years. Internet Explorer has always failed (and still fails) to comply with web standards, causing websites to break or function improperly; the leaders in the field are the open-source browsers Chrome and Firefox. Lotus Notes is a flaky, crash-happy, evil mess; Thunderbird just works. And I have more than once seen MS SQL Server paralyse itself by letting transaction log files blow up, something PostgreSQL does not do.

Trustworthiness: unless you've been living under a rock for the past couple of years, you know who Edward Snowden is. Thanks to him, we know exactly what you cannot trust: governments and the large organisations they get their hooks into. Since Snowden went public, it is clear that NSA back doors exist in a vast array of products, both hardware and software, that individuals and organisations depend on to keep their data secure. The only defence against this is open code review. The only software that can be subjected to open code review is open source software. If you use proprietary closed-source software, you have no way of knowing what it is really doing under the hood. And thanks to Mr. Snowden, we now know that there is an excellent chance it is giving your secrets away.

Enterprise-readiness: at the time of writing, 485 of the top 500 supercomputers in the world run on Linux. As of July 2014, Nginx and Apache, two open-source web servers, power over 70% of the million busiest sites on the net. The computers on the International Space Station (the most expensive single man-made object in existence) were moved from Windows to Linux in 2013 in an attempt to improve stability and reliability. The back-end database of Skype (ironically now owned by Microsoft) is PostgreSQL. GCHQ recently reported that Ubuntu Linux is the most secure commonly-available desktop operating system. The Large Hadron Collider is the world's largest scientific experiment. Its supporting IT infrastructure, the Worldwide LHC Computing Grid, is the world's largest computing grid. It handles 30 PB of data per year and spans 36 countries and over 170 computing centres. It runs primarily on Linux. Hadoop, the current darling of many large consultancies looking to earn Big Data credentials, is open-source. And the names of the major enterprise Linux distributions speak for themselves: Red Hat Enterprise Linux, CentOS (Community Enterprise OS), SUSE Linux Enterprise Server, Oracle Linux, IBM Enterprise Linux Server and so on. The idea that open-source software is not for the enterprise is pure bullshit.
If you work in tech for an organisation which disregards open source, enjoy it while it lasts. They won't be around for long.

2.4. But MS SQL Server can use multiple CPU cores for a single query

This is an advantage for MS SQL Server whenever you're running a query which is CPU-bound and not IO-bound. In real-life data analytics this happens approximately once every three blue moons. On those very rare, very specific occasions when CPU power is truly the bottleneck, you almost certainly should be using something other than an RDBMS. RDBMSes are not for number crunching. This advantage goes away when a server has to do many things at once (as is almost always the case). PostgreSQL uses multiprocessing – different connections run in different processes, and hence on different CPU cores. The scheduler of the OS takes care of this.

Also, I suspect this query parallelism is what necessitates the merge method which MS SQL Server custom aggregate assemblies are required to implement: bits of aggregation done in different threads have to be combined with each other, MapReduce-style. I further suspect that this mechanism is what prevents MS SQL Server aggregates from accepting ORDER BY clauses. So, congratulations – you can use more than one CPU core, but you can't do a basic string roll-up.

2.5. But I have MS SQL Server skills, not PostgreSQL skills

You'd rather stick with a clumsy, awkward, unreliable system than spend the trivial amount of effort it takes to learn a slightly different dialect of a straightforward querying language? Well, just hope you never end up in a job interview with me.

2.6. But a billion Microsoft users can't all be wrong

This is a real-life quotation as well, from a senior data analyst I used to work with. I replied: "Well, there are 1.5 billion Muslims and 1.2 billion Catholics. They can't all be right." Ergo, a billion people most certainly can be wrong. (In this particular case, 2.7 billion people are wrong.)

2.7. But if it were really that good then it wouldn't be free

People actually say this too. I feel sorry for these people, because they are unable to conceive of anyone doing anything for any reason other than monetary gain. Presumably they are also unaware of the existence of charities or volunteers or unpaid bloggers or any of the other things people do purely out of a desire to contribute or to create something or simply to take on a challenge. This argument also depends on an assumption that open source development has no benefit for the developer, which is nonsense. The reason large enterprises open-source their code and then pay their teams to continue working on it is because doing so benefits them. If you open up your code and others use it, then you have just gained a completely free source of bug fixes, feature contributions, code review, product testing and publicity. If your product is good enough, it is used by enough people that it starts having an influence on standards, which means broader industry acceptance. You then have a favoured position in the market as a provider of support and deployment services for the software. Open-sourcing your code is often the most sensible course of action even if you are completely self-interested. As a case in point, here I am spending my free time writing a web page about how fabulous PostgreSQL is and then paying my own money to host it. Perhaps Teradata or Oracle are just as amazing, but they're not getting their own pages because I can't afford them, so I don't use them.

2.8. But you're biased

No, I have a preference.
2.8. But you're biased

No, I have a preference. The whole point of this document is to demonstrate, using evidence, that this preference is justified. If you read this and assume that just because I massively prefer PostgreSQL I must be biased, that means you are biased, because you have refused to seriously consider the possibility that it really is better. If you think there's actual evidence that I really am biased, let me know.

2.9. But PostgreSQL is a stupid name

This one is arguably true: it's pretty awkward. It is commonly mispronounced, very commonly misspelt and almost always incorrectly capitalised. It's a good job that stupidness of name is not something serious human beings take into account when they're choosing industrial software products. That being said, MS SQL Server is literally the most boring possible name for a SQL Server provided by MS. It has anywhere from six to eight syllables, depending on whether or not you abbreviate Microsoft and whether you say it "sequel" or "ess queue ell", which is far too many syllables for a product name. Microsoft has a thing for very long names though – possibly its greatest achievement ever is "Microsoft WinFX Software Development Kit for Microsoft Pre-Release Windows Operating System Code-Named Longhorn, Beta 1 Web Setup". I count 38 syllables. Wow.

2.10. But SSMS is better than PGAdmin

It's slicker, sure. It's prettier. It has code completion, although I always turn that off because it constantly screws things up, and for every time it helps me out with a field or table name, there's at least one occasion when it does something mental, like auto-correcting a common SQL keyword like TABLE to a Microsoft monstrosity like TABULATIONNONTRIVIALDISCOMBOBULATEDMACHIAVELLIANGANGLYONID or something. For actually executing SQL and looking at the results in a GUI, PGAdmin is fine. It's just not spectacular.

SSMS is obviously Windows-only. PGAdmin is cross-platform. This is actually quite convenient. You can run PGAdmin in Windows, where you have all your familiar stuff – Office, Outlook etc. – whilst keeping the back-end RDBMS in Linux. This gets you the best of both worlds (even an open source advocate like me admits that if you're a heavy MS Office user, there is no serious alternative). Several guys I work with do this. One point in SSMS's favour is that if you run several row-returning statements in a batch, it will give you all the results. PGAdmin returns only the last result set. This can be a drag when doing data analytics, where you often want to simultaneously query several data sets and compare the results.

There's another thing, though: psql. This is PostgreSQL's command-line SQL interface. It's really, really good. It has loads of useful catalog-querying features. It displays tabular data intelligently. It has tab completion which, unlike SSMS's code completion, is actually useful, because it is context-sensitive. So, for example, if you type DROP SCHEMA t and hit tab, it will suggest schema names starting with t (or, if there is only one, auto-fill it for you). It lets you jump around in the file system and use ultra-powerful text editors like vim inline. It automatically keeps a list of executed commands. It provides convenient, useful data import and export functionality, including the COPY TO PROGRAM feature, which makes smashing use of pipes and command-line utilities to provide another level of flexibility and control of data (see the short sketch below). It makes intelligent use of screen space. It is fast and convenient. You can use it over an SSH connection, even a slow one. Its only serious disadvantage is that it is unsuitable for people who want to be data analysts but are scared of command lines and typing on a keyboard.
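To make the import/export point a bit more concrete, here is a minimal sketch of the two features mentioned above. The file names, table names and the gzip pipeline are invented for illustration; \copy runs client-side with your own file permissions, while COPY ... TO PROGRAM runs on the server and needs the appropriate server-side privileges.

    -- Client-side CSV import with psql's \copy meta-command:
    \copy sales FROM 'sales_2014.csv' WITH (FORMAT csv, HEADER true)

    -- Server-side export, piped straight through an external program:
    COPY (SELECT * FROM sales WHERE region = 'EU')
      TO PROGRAM 'gzip > /tmp/eu_sales.csv.gz'
      WITH (FORMAT csv, HEADER true);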
2.11. But MS SQL Server can import straight from Excel

Yes. So what? Excel can output to CSV (in a rare moment of sanity, Microsoft made Excel's CSV export code work properly) and PostgreSQL can import CSV (the psql sketch at the end of 2.10 shows one way). Admittedly, it's an extra step. Is the ability to import straight from Excel a particularly important feature in an analytics platform anyway?

2.12. But PostgreSQL is slower than MS SQL Server

A more accurate rephrasing would be "MS SQL Server is slightly more forgiving if you don't know what you're doing". For certain operations, PostgreSQL is definitely slower than MS SQL Server – the easiest example is probably COUNT(*), which is (I think) always instant in MS SQL Server and in PostgreSQL requires a full table scan (this is due to the different concurrency models they use). PostgreSQL is slow out of the box because its default configuration uses only a tiny amount of system resources – but any system being used for serious work has been tuned properly, so raw out-of-the-box performance is not a worthwhile thing to argue about.

I once saw PostgreSQL criticised as slow because it was taking a long time to do some big, complex regex operations on a large table. But everyone knows that regex operations can be very computationally expensive, and in any case, what was PostgreSQL being compared to? Certainly not the MS SQL Server boxes, which couldn't do regexes.

PostgreSQL's extensive support for very clever indexes, such as range type indexes and trigram indexes, makes it orders of magnitude faster than MS SQL Server for a certain class of operations – but only if you know how to use those features properly (there's a short sketch just after 2.13 below). The immense flexibility you get from the great procedural language support and the clever data types allows PostgreSQL-based solutions to outperform MS SQL Server-based solutions by orders of magnitude. See my earlier example.

In any case, the argument about speed is never only about computer time; it is about developer time too. That's why high-level languages like PHP and Python are very popular, despite the fact that C kicks the shit out of them when it comes to execution speed. They are slower to run but much faster to use for development. Would you prefer to spend an hour writing maintainable, elegant SQL followed by an hour of runtime, or spend three days writing buggy, desperate workarounds followed by 45 minutes of runtime?

2.13. But you never mentioned such-and-such feature of MS SQL Server

As I said in the banner and the intro, I am comparing these databases from the point of view of a data analyst, because I'm a data analyst and I use them for data analysis. I know about SSRS, SSAS, in-memory column stores and so on, but I haven't mentioned them because I don't use them (or equivalent features). Yes, this means this is not a comprehensive comparison of the two databases, and I never said it would be. It also means that if you care mostly about OLTP or data warehousing, you might not find this document very helpful.
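To ground the performance points in 2.12, here is a minimal sketch of the usual workaround for slow exact counts and of a trigram index. The table and column names are invented; pg_class.reltuples and the pg_trgm extension are the real pieces.

    -- 1. When an exact count isn't needed, the planner's row estimate from
    --    the catalog is effectively instant:
    SELECT reltuples::bigint AS approx_rows
    FROM   pg_class
    WHERE  relname = 'big_table';

    -- 2. A trigram index lets LIKE/ILIKE '%substring%' searches use an index
    --    instead of a sequential scan (pg_trgm ships with PostgreSQL as an
    --    extension):
    CREATE EXTENSION IF NOT EXISTS pg_trgm;
    CREATE INDEX customers_name_trgm_idx
        ON customers USING gin (name gin_trgm_ops);

    SELECT * FROM customers WHERE name ILIKE '%smith%';

Note that reltuples is only refreshed by VACUUM and ANALYZE (including autovacuum), so treat it as an estimate rather than a guarantee.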
2.14. But Microsoft has open-sourced

Yeah, mere hours after I wrote all about how they're a vendor lock-in monster and are anti-open source. D'oh. However, let's look at this in context. Remember the almighty ruckus when the Office Open XML standard was being created? Microsoft played every dirty trick in the book to ensure that MS Office wouldn't lose its dominance. Successfully, too – the closest alternative, LibreOffice, is still not a viable option, largely because of incompatibility with document formats. The OOXML standard that was finally pushed through is immense, bloated, ambiguous, inconsistent and riddled with errors. That debacle also started with an apparent gesture toward open standards on Microsoft's part.

If that seems harsh or paranoid, let's remember that this is an organisation that has been in legal trouble with both the USA and the EU for monopolistic and anticompetitive behaviour and abuse of market power, in the latter case being fined almost half a billion Euros. Then there's the involvement in SCO's potentially Linux-killing lawsuit against IBM. When Steve Ballmer was CEO he described Linux as "a cancer" (although Ballmer also said "There's no chance that the iPhone is going to get any significant market share. No chance", so maybe he just likes to talk nonsense). Microsoft has a long-established policy of preferring conquest to cooperation.

So, if they play nice for the next few years and their magnanimous gesture ushers in a new era of interoperability, productivity and harmony, I (and millions of developers who want to get on with creating great things instead of bickering over platforms and standards) will be over the moon. For now, thinking that MS has suddenly become all warm and fuzzy would just be naive.

2.15. But you're insulting / I don't like your tone / you come across as angry / you sound like a fanboy / this is unprofessional / this is a rant

This page is unprofessional by definition – I'm not being paid to write it. That also means I get to use whatever tone I like, and I don't have to hide the way I feel about things. I hope you appreciate the technical content even if you don't like the way I write; if my tone makes this document unreadable for you, then I guess I've lost a reader and you've lost a web page. C'est la vie.
