PROJET AUTOBLOG


Mon Blog-notes à moi que j'ai

Original site: Mon Blog-notes à moi que j'ai



Compilation veille Twitter & RSS #2015-44

Friday, October 30, 2015 at 17:00

The link harvest for the week of October 26 to 30, 2015. Most of them were published on my Twitter account; here they are gathered for those who may have missed them. Happy reading!

Security & Privacy

Tor Messenger Beta: Chat over Tor, Easily
Today we are releasing a new, beta version of Tor Messenger, based on Instantbird, an instant messaging client developed in the Mozilla community.

Software Engineering

Huge Page usage in PHP 7
Memory paging is a way Operating Systems manage userland process memory. Each process memory access is virtual, and the OS together with the hardware MMU must translate that address into a physical address used to access the data in main memory (RAM).
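A quick back-of-the-envelope calculation (using the standard x86-64 page sizes; the numbers are mine, not the article's) shows why huge pages lighten that translation work:

```python
heap = 512 * 1024 * 1024          # a 512 MiB process heap
page_4k = 4 * 1024                # default x86-64 page size
page_2m = 2 * 1024 * 1024        # huge page size

# number of pages (hence page-table/TLB entries) needed to map the heap
print(heap // page_4k)  # -> 131072
print(heap // page_2m)  # -> 256
```

Fewer entries means fewer TLB misses and page-table walks for the same amount of memory.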
Our way to Go
At Dailymotion we embraced Go about a year ago. Most of our new back end projects are now using this young but still very powerful language. We like its simplicity, its performance and its static type checking.
Alternative Service Communication Using Pub/Sub
The HTTP protocol was designed for synchronous communication between two entities — for instance, a browser requesting a stylesheet or a server completing a charge with a payment processor. Those are synchronous operations where nothing can proceed without an immediate response.
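To illustrate the contrast, here is a minimal in-process publish/subscribe sketch in Python (the bus, topic name, and message are invented for illustration); the publisher fires and forgets instead of blocking on a response:

```python
from collections import defaultdict

class Bus:
    """Minimal in-process publish/subscribe bus (illustration only)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # fire-and-forget: the publisher does not wait for a reply,
        # unlike a synchronous HTTP request/response round trip
        for handler in self.subscribers[topic]:
            handler(message)

bus = Bus()
received = []
bus.subscribe("payments", received.append)
bus.publish("payments", {"order": 42, "status": "captured"})
print(received)  # -> [{'order': 42, 'status': 'captured'}]
```

A real system would put a broker (Redis, NATS, Google Pub/Sub, etc.) between publisher and subscribers, but the decoupling is the same.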
Fetching and serving billions of URLs with Aragog
Every week, we process billions of URLs to create a rich, relevant and safe experience for Pinners. The web pages, linked through Pins, contain rich signals that enable us to display useful information on Pins (like recipe ingredients, a product’s price and location data), infer better recommendations and fight spam. To take full advantage of these signals, not only do we need to fetch, store and process the page content, but we also need to serve the processed content at low latencies.

Mobile

Performance instrumentation for Android apps
Here at Facebook we always strive to make our apps faster. While we have systems like CTScan to track performance internally, the Android ecosystem is far too diverse for us to test every possibility in the lab. So we also work to complement our testing data with telemetry from real phones in the hands of real people.

System Engineering

Service Discovery in a Microservices Architecture
This is the fourth article in our series about building applications with microservices. The first article introduced the microservice architecture pattern and discussed the benefits and drawbacks of using microservices. The second and third articles in the series describe different aspects of communication within a microservices architecture. In this article, we explore the closely related problem of service discovery.
Enable Keepalive connections in Nginx Upstream proxy configurations
A very common setup to see nowadays is an Nginx SSL proxy in front of Varnish: Nginx handles all the SSL configuration while Varnish maintains the caching abilities.
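As a sketch of that kind of setup (server names, ports, and values here are illustrative, not taken from the article), an Nginx upstream with keepalive enabled looks roughly like this:

```nginx
upstream varnish_backend {
    server 127.0.0.1:6081;
    # keep idle connections open to the upstream instead of
    # re-establishing a TCP connection for every request
    keepalive 32;
}

server {
    listen 443 ssl;
    # ssl_certificate / ssl_certificate_key omitted for brevity

    location / {
        proxy_pass http://varnish_backend;
        # upstream keepalive requires HTTP/1.1 and a cleared
        # Connection header
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```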
How To Generate a /etc/passwd password hash via the Command Line on Linux
If you’re looking to generate the /etc/shadow hash for a password for a Linux user (for instance: to use in a Puppet manifest), you can easily generate one at the command line.
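One common way to do this (an illustration, not necessarily the article's exact commands) is `openssl passwd`, which emits the `$6$...` SHA-512 crypt format used in /etc/shadow:

```shell
# Generate a SHA-512 crypt hash suitable for /etc/shadow
# (requires OpenSSL >= 1.1.1). The salt is fixed here only so the
# output is reproducible; drop -salt to let openssl pick a random one.
hash=$(openssl passwd -6 -salt examplesalt 'S3cr3t!')
echo "$hash"
```

On Debian-based systems, `mkpasswd -m sha-512` (from the whois package) is a common alternative.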
Using Varnish on the live streaming delivery platform
When building the live platform, we focused our attention on the delivery stack. We searched for software capable of delivering small files (HLS/HDS manifests) as well as large files (Full-HD video fragments) efficiently.
Results of experimenting with Brotli for dynamic web content
Compression is one of the most important tools CloudFlare has to accelerate website performance. Compressed content takes less time to transfer, and consequently reduces load times. On expensive mobile data plans, compression even saves money for consumers. However, compression is not free—it comes at a price. It is one of the most compute expensive operations our servers perform, and the better the compression rate we want, the more effort we have to spend.
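Brotli is not in the Python standard library, but zlib illustrates the same ratio-versus-CPU trade-off the article measures: higher compression levels spend more effort to squeeze out smaller output.

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 2000

# higher levels trade CPU time for a better compression ratio
for level in (1, 6, 9):
    compressed = zlib.compress(data, level)
    print(f"level {level}: {len(compressed)} bytes")
```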

Virtualization & containers

rkt v0.10.0: With a New API Service and a Better Image Build Tool
rkt v0.10.0 is here and marks another important milestone on our path to creating the most secure and composable container runtime. This release includes an improved user interface and a preview of the rkt service API, making it even easier to experiment with rkt in your microservices architectures.

Monitoring

The Docker monitoring problem
You have probably heard of Docker—it is a young container technology with a ton of momentum. But if you haven’t, you can think of containers as easily-configured, lightweight VMs that start up fast, often in under one second. Containers are ideal for microservice architectures and for environments that scale rapidly or release often.
Fullerite: A New Mineral To Collect Metrics
Monitoring system metrics (e.g. CPU utilization, loadavg, memory) is not a new problem. It’s been necessary for a long time, and a few collectors already exist, most notably Diamond and collectd. At Yelp, we run thousands of different machines hosted all over the world. The sheer volume of metrics we monitor quickly starts to complicate how we collect them. After trying out a few different solutions (and breaking many things) we decided to write our own: fullerite.

Data Engineering

Stream Processing and Streaming Analytics - How It Works
Recently we started exploring the basics of Event Stream Processing (ESP) in our article Stream Processing - What Is It and Who Needs It. There we explained ESP capabilities, technologies, platforms, and business cases. There’s one more piece of information that you need to fully assess whether ESP will work for you and that’s how it works. Our discussion will be non-technical and will be an overview of what pretty much all ESP platforms offer in terms of operating capabilities and functions.
An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples
Machine Learning (ML) is coming into its own, with a growing recognition that ML can play a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. ML provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization.
How-to: Use HUE’s Notebook App with SQL and Apache Spark for Analytics
Apache Spark is getting popular and HUE contributors are working on making it accessible to even more users. Specifically, by creating a Web interface that allows anyone with a browser to type some Spark code and execute it. A Spark submission REST API was built for this purpose and can also be leveraged by the developers.
Prototyping Long Term Time Series Storage with Kafka and Parquet
Last time I tried to switch from Graphite time series storage to Cyanite/Cassandra but the attempt failed and I stayed with Whisper files. After struggling with keeping disk IOPS sane while ingesting hi-resolution performance data I ended up putting Whisper files into tmpfs and shortening data retention interval to just one day because my load tests usually don’t last more than several hours. Then I export data into R and do analysis. Large scale projects like Gorilla(pdf) and Atlas do similar things. They store recent data in RAM for dashboards and real-time analytics and then dump it to slow long term storage for offline analysis.

Database Engineering

MySQL & MariaDB

Open-sourcing Pinterest MySQL management tools
In the past, we’ve shared why you should love MySQL and how it helped Pinterest scale via sharding. At Oracle Open World today, we announced that we’re open-sourcing the vast majority of our automation that maintains our MySQL infrastructure. In this post, we’ll detail our MySQL environment, the tools used to manage it and how you can implement them to automate your MySQL infrastructure.
Increasing MySQL Fabric Resilience to Failures: Meet the Multi-Node Fabric
MySQL Performance: Yes, we can do more than 1.6M QPS (SQL) on MySQL 5.7 GA
It’s been exactly two years since we reached 500K QPS with MySQL 5.7 – it was a great milestone, and the highest result ever seen on MySQL with true SQL queries ;-)) And this was on a 32-core HT machine. (The linked article contains the whole long story of how we arrived at 250K QPS first, then 275K, 350K, 440K, and finally 500K QPS.) The main improvement here comes from the greatly redesigned transaction and transaction-list management in MySQL 5.7.
How Big Can Your Galera Transactions Be
While we should be aiming for small and fast transactions with Galera, it is always possible that at some point you will want a single large transaction, but what is involved?
Internet of Things, Messaging and MySQL
So you want to do a personal project with the Internet of Things (maybe home automation or metrics collection or something else)? In this blog post I will tell you about my recent experience with this. I will give a talk on this topic at Oracle OpenWorld 2015 (Tuesday, Oct 27, 6:15 p.m., Moscone South, 274).
How to Get a Galera Cluster Into Split Brain
“Split Brain” is the term commonly used for a cluster whose nodes have different contents, rather than the identical contents they should have. Typically, a “split brain” situation is the DBA’s nightmare, and the Galera software is designed to avoid it. Galera is very successful in that avoidance, and it takes some special steps by the DBA to achieve “split brain”. Here is how to do it - or, for most DBAs, what to avoid doing so as not to get a split-brain cluster.
Beware of Database Tuning Advisors
One of the common questions we get during sales demos is “does VividCortex give advice on my database’s configuration?” The assumption is that since our product sees lots of information about the database, operating system, and current configuration, it can “optimize” the database configuration. Or, at least, point out really obviously bad things? Surely that is not too hard to do?

Management & Organization

The what and why of product experimentation at Twitter
Experimentation is at the heart of Twitter’s product development cycle. This culture of experimentation is possible because Twitter invests heavily in tools, research, and training to ensure that feature teams can test and validate their ideas seamlessly and rigorously.

Compilation veille Twitter & RSS #2015-43

Saturday, October 24, 2015 at 11:00

The link harvest for the week of October 19 to 23, 2015. Most of them were published on my Twitter account; here they are gathered for those who may have missed them. Happy reading!

Security

Protection from Unrestricted File Upload Vulnerability
How boring would social networking websites, blogs, forums and other web applications with a social component be if they didn’t allow their users to upload rich media like photos, videos and MP3s? The answer is easy: very, very boring! Thankfully, these social sites allow end-users to upload rich media and other files, and this makes communication on the world wide web more impactful and interesting.
Clickjacking: A Common Implementation Mistake Can Put Your Websites in Danger
The X-Frame-Options HTTP response header is a common method to protect against the clickjacking vulnerability since it is easy to implement and configure, and all modern browsers support it. As awareness of clickjacking has grown in the past several years, I have seen more and more Qualys customers adopt X-Frame-Options to improve the security of their web applications.
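For reference, a minimal Nginx directive for this header (an illustrative snippet, not taken from the article) looks like:

```nginx
# Allow only same-origin framing; use DENY to forbid framing entirely.
# The ALLOW-FROM variant is deprecated and inconsistently supported,
# so prefer SAMEORIGIN or DENY.
add_header X-Frame-Options "SAMEORIGIN" always;
```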
BoringSSL
We recently switched Google’s two billion line repository over to BoringSSL, our fork of OpenSSL. This means that BoringSSL is now powering Chromium (on nearly all platforms), Android M and Google’s production services. For the first time, the majority of Google’s products are sharing a single TLS stack and making changes no longer involves several days of work juggling patch files across multiple repositories.
Sécurité : pourquoi le WAF est-il indispensable ?
Directly reachable on the internet, increasingly built on shared code, and very often handling sensitive data (banking data, customer data…), web applications have become a prime target for cybercriminals. Here is why.

System Engineering

4-day Docker and Kubernetes Training
I just delivered a 4-day deep-dive training course on Docker and Kubernetes to a customer in Atlanta. In true open-source spirit, I’d like to publish the source/slides and allow other people to benefit from it and contribute to making it better. Kubernetes is such an awesome project, and I learned a lot by doing this training. If you’re interested in hearing how awesome kubernetes is, and how we’ve made it even better with openshift, get a hold of me (@christianposta)!
Segment: Rebuilding Our Infrastructure with Docker, ECS, and Terraform
In Segment’s early days, our infrastructure was pretty hacked together. We provisioned instances through the AWS UI, had a graveyard of unused AMIs, and configuration was implemented three different ways.
As the business started taking off, we grew the size of the eng team and the complexity of our architecture. But working with production was still limited to a handful of folks who knew the arcane gotchas. We’d been improving the process incrementally, but we needed to give our infrastructure a deeper overhaul to keep moving quickly.
Installer HHVM avec fallback PHP-FPM sous Debian 8 et Nginx
After reading the article by my friend Seboss666, and more recently the one by Freddy of memo-linux, I was tempted to try HHVM as a replacement for PHP-FPM on my server. But first of all, what is HHVM? HHVM stands for “HipHop Virtual Machine”. It is open-source software developed by Facebook (and used on the social network) that can execute PHP and Hack code. HHVM is used more and more these days, thanks to its compatibility with nearly all of PHP’s functions. For example, it is fully compatible with WordPress. That’s all very nice, but why bother using it, you may ask? Well, its great strength lies in its JIT (Just-In-Time) compiler. Whereas its ancestor, HPHPc (HipHop for PHP), compiled PHP to C++, HHVM compiles it to an intermediate bytecode, HHBC (HipHop ByteCode), which is dynamically translated into x64 code, optimized, and executed natively. In short: HHVM is faster. We will see how to install it on Debian 8 with Nginx, and we will configure it so that PHP-FPM takes over if HHVM is ever down or returns an error. I therefore assume that you have already installed PHP-FPM. Keep in mind that HHVM is not 100% compatible with PHP, so you should run tests before using it in a production environment.
GRE tunnels with systemd-networkd
Switching to systemd-networkd for managing your networking interfaces makes things quite a bit simpler over standard networking scripts or NetworkManager. Aside from being easier to configure, it uses fewer resources on your system, which can be handy for smaller virtual machines or containers.
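As a sketch of what such a configuration can look like (interface names and addresses below are invented), a GRE tunnel is declared with a `.netdev` unit and addressed with a matching `.network` unit; the physical interface's own `.network` file must also reference the tunnel with `Tunnel=gre-example` in its `[Network]` section:

```ini
# /etc/systemd/network/gre-example.netdev
[NetDev]
Name=gre-example
Kind=gre

[Tunnel]
Local=198.51.100.1
Remote=203.0.113.2
```

```ini
# /etc/systemd/network/gre-example.network
[Match]
Name=gre-example

[Network]
Address=10.10.10.1/30
```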

Monitoring

Monitoring RDS MySQL performance metrics
This post is part 1 of a 3-part series about monitoring MySQL on Amazon RDS. Part 2 is about collecting metrics from both RDS and MySQL, and Part 3 details how to monitor MySQL on RDS with Datadog.
How to collect RDS MySQL metrics
As covered in Part 1 of this series, MySQL on RDS users can access RDS metrics via Amazon CloudWatch and native MySQL metrics from the database instance itself. Each metric type gives you different insights into MySQL performance; ideally both RDS and MySQL metrics should be collected for a comprehensive view. This post will explain how to collect both metric types.
The Case For Tagging In Time Series Data
A while ago I wrote a blog post about time series database requirements that has been amazingly popular. Somewhere close to a dozen companies have told me they’ve built custom in-house time series databases, and that blog post was the first draft of a design document for it.
Top ELB health and performance metrics
This post is part 1 of a 3-part series on monitoring Amazon ELB. Part 2 explains how to collect its metrics, and Part 3 shows you how Datadog can help you monitor ELB.
What is Amazon Elastic Load Balancing?
Elastic Load Balancing (ELB) is an AWS service used to dispatch incoming web traffic from your applications across your Amazon EC2 backend instances, which may be in different availability zones.
How to collect AWS ELB metrics
This post is part 2 of a 3-part series on monitoring Amazon ELB. Part 1 explores its key performance metrics, and Part 3 shows you how Datadog can help you monitor ELB.
This part of the series is about collecting ELB metrics, which are available from AWS via CloudWatch.
Outlier detection at Datadog: A look at the algorithms
It is essential to be able to spot unhealthy hosts anywhere in your infrastructure so that you can rapidly identify their causes and minimize service degradation and disruption. Discovering those misbehaving servers usually involves a careful study of how they perform normally and then setting threshold alerts on various metrics for each host.
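One simple robust technique in that family (an illustration only; the post describes Datadog's actual algorithms) is flagging points that sit too many median absolute deviations from the median:

```python
def mad_outliers(values, scale=3.0):
    # flag points whose distance from the median exceeds `scale` times
    # the median absolute deviation (a robust stand-in for std dev)
    s = sorted(values)
    n = len(s)
    median = (s[n // 2] + s[(n - 1) // 2]) / 2
    devs = sorted(abs(v - median) for v in values)
    mad = (devs[n // 2] + devs[(n - 1) // 2]) / 2 or 1e-9
    return [v for v in values if abs(v - median) > scale * mad]

# five hosts behaving normally, one misbehaving
print(mad_outliers([10, 11, 10, 12, 11, 95]))  # -> [95]
```

Unlike a mean/standard-deviation check, the median-based version is not dragged around by the outlier itself.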

Software Engineering

Building Modern Software Delivery Pipelines With Jenkins and Docker
TL;DR: This blog post outlines the key use cases enabled by the newly released Docker plugins in the Jenkins community. You can drill into more depth with an in-depth blog post for each use case. The CloudBees team has actively worked within the community to release these plugins.
5 Lessons from 5 Years of Building Instagram
Instagram has always been generous in sharing their accumulated wisdom. Just take a look at the Related Articles section of this post to see how generous.
Make an Elasticsearch-powered REST API for any data with Ramses
In this short guide, I’m going to show you how to download a dataset and create a REST API. This new API uses Elasticsearch to power the endpoints, so you can build a product around your data without having to directly expose Elasticsearch in production. This allows for proper authentication, authorization, custom logic, other databases, and even auto-generated client libraries.
How to Apply RuboCop to Your Large Rails Codebase
As your company grows, your engineering team will tend to fluctuate. With team members coming and going, the style of the code can change greatly over time. But throughout all these changes, your codebase stays; it’s not going away. So, obviously, you have to maintain it.

Data Engineering

How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code
Learn how to use OCR tools, Apache Spark, and other Apache Hadoop components to process PDF images at scale.
Optical character recognition (OCR) technologies have advanced significantly over the last 20 years. However, during that time, there has been little or no effort to marry OCR with distributed architectures such as Apache Hadoop to process large numbers of images in near-real time.
How We Use Deep Learning to Classify Business Photos at Yelp
Yelp hosts tens of millions of photos uploaded by Yelpers from all around the world. The wide variety of these photos provides a rich window into local businesses, a window we’re only just peeking through today.
Untangling Apache Hadoop YARN, Part 2
A new installment in the series about the tangled ball of thread that is YARN
In Part 1 of this series, we covered the fundamentals of YARN clusters. In Part 2, you’ll learn about other components that can run on a cluster and how they affect YARN cluster configuration.
Skyline: ETL-as-a-Service
Data plays a big role at Pinterest, in both enabling new product experiences for Pinners and providing insights into Pinner behavior. We’re always thinking of ways to reduce the friction of converting our data into actionable insight. A question often asked by our data users (primarily analysts, product managers, engineers and data scientists) is, “How can I build a reporting dashboard to track xyz metrics daily?”. As the number of employees accessing data, the volume of data and the number of queries increased, scaling and supporting emerging requirements became a big challenge. In order to tackle this, we built a new data platform called Skyline.
Spark for Data Padawans Episode 1: a look at distributed data storage
If you’ve been anywhere near data in the past year or so you must have heard about the war going on between Spark and Hadoop for total control over the management of large amounts of data.

Machine Learning

Confidence Splitting Criterions Can Improve Precision And Recall in Random Forest Classifiers
The Trust and Safety Team maintains a number of models for predicting and detecting fraudulent online and offline behaviour. A common challenge we face is attaining high confidence in the identification of fraudulent actions. Both in terms of classifying a fraudulent action as a fraudulent action (recall) and not classifying a good action as a fraudulent action (precision).
A Universal Recommender
Recommender systems are more and more prevalent in everyday life. Whether keeping up to date with friends and work colleagues, finding something you want to buy, a movie you might like to watch or, well, you know, Twitter.
Getting Started with Data Collection for Machine Learning
It is the beginning of the year; many of you are thinking about how to leverage Machine Learning to improve your products or services. Here at PredictionIO, we work with many companies in deploying their first ML system and big data infrastructure. We put together some good practices on data collection we would like to share with you. If you are thinking about adopting ML down the road, collecting the right data in the right format will reduce your data cleansing effort and wasted data.

Databases

InfluxDB

Continuous Queries in InfluxDB - Part I
Queries returning aggregate, summary, and computed data are frequently used in application development. For example, if you’re building an online dashboard application to report metrics, you probably need to show summary data. These summary queries are generally expensive to compute since they have to process large amounts of data, and running them over and over again just wouldn’t scale. Now, if you could pre-compute and store the aggregates query results so that they are ready when you need them, it would significantly speed up summary queries in your dashboard application, without overloading your database. Enter InfluxDB’s continuous queries feature!
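As a flavor of the feature, a continuous query in InfluxQL (the measurement and database names here are hypothetical) pre-computes five-minute means in the background:

```sql
-- downsample raw request metrics into 5-minute means, continuously
CREATE CONTINUOUS QUERY "cq_requests_5m" ON "metrics"
BEGIN
  SELECT mean("value") INTO "requests_5m" FROM "requests" GROUP BY time(5m)
END
```

Dashboard queries then read from the small `requests_5m` series instead of re-aggregating the raw data on every page load.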
Testing InfluxDB Storage Engines
When you decide to build a database, you set yourself a particular software engineering challenge. As infrastructure software it must work. If people are going to rely on your system for reliably storing their data, you need to be sure it does just that.

MySQL & MariaDB

Protect Your Data #2: A Row-level Security Walkthrough in MariaDB 10.0
My last row-level security blog post got a few questions, so I decided that it would be good to follow up with more detail. The last blog post described some basic information about row-level security, but row-level security policies are highly dependent on an application’s or organization’s security requirements. In this blog post, I’m going to walk through an example row-level security implementation in MariaDB 10.0 in a little more detail.
MariaDB 10.1 can do 1 million queries per second
MariaDB 10.1 not only contains tons of new features, it has also been polished to deliver top performance. The biggest improvement has been achieved for scalability on massively multithreaded hardware.
Sharding Pinterest: How we scaled our MySQL fleet
This is a technical dive into how we split our data across many MySQL servers. We finished launching this sharding approach in early 2012, and it’s still the system we use today to store our core data.
Before we discuss how to split the data, let’s be intimate with our data. Mood lighting, chocolate covered strawberries, Star Trek quotes…

Organization & Management

Five Ways to Create a Data-Driven Culture
No one should need to be convinced of the value of good data. It gives you the confidence to make decisions quickly and with less risk, it allows you to measure your success, and it lets you know when you need to adjust your course. But there’s a difference between knowing the value of data and creating a culture around it. A data-driven culture is a culture where everyone quantifies their actions as much as possible, and asks themselves how their teams are having a tangible impact on the business. It turns your entire organization into a squad of analysts. But creating a data-driven culture isn’t always easy. Here are five steps that will help you get there.

Compilation veille Twitter & RSS #2015-42

Friday, October 16, 2015 at 18:00

The link harvest for the week of October 12 to 16, 2015. Most of them were published on my Twitter account; here they are gathered for those who may have missed them. Happy reading!

System & Network Engineering

Monitoring with Bosun
Bosun is a monitoring and alerting system developed by the good folks at Stack Exchange, then open sourced for the rest of us. It’s written in Go, meaning its monitoring agents can run anywhere that Go can drop a binary… which is just about everywhere. So what exactly does it do and how does it compare to the likes of New Relic, CloudWatch, Nagios, Splunk Cloud, Server Density, and other monitoring tools?
Data Transfer With SCTP
Up to now I have reviewed message encoding and association initialisation. Now it’s time to see how SCTP does some real work: user data transfer. It is implemented via DATA and SACK chunks. Two peers can exchange user data only when the association is established, which means it should be in the ESTABLISHED, SHUTDOWN-PENDING, or SHUTDOWN-SENT state. An SCTP receiver must be able to receive at least 1500 bytes in a single packet, which means its initial a_rwnd value (in the INIT or INIT ACK chunk) must not be set to a lower value. Similar to TCP, SCTP supports fragmentation of user data when it exceeds the MTU of the link. The receiver of the segmented data will reassemble all chunks before passing the data to the user. On the other side, more than one DATA chunk can be bundled in a single message. Data fragmentation will be discussed in more detail later.
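The DATA chunk layout from RFC 4960 (type 0, U/B/E flags, length, TSN, stream identifier, stream sequence number, payload protocol identifier, then padding to a 4-byte boundary) can be sketched with Python's struct module; this builds the wire bytes only, as a format illustration rather than a working SCTP stack:

```python
import struct

def pack_data_chunk(tsn, stream_id, stream_seq, ppid, payload,
                    unordered=False, begin=True, end=True):
    # DATA chunk per RFC 4960: type=0, flags (U/B/E bits), length,
    # TSN, stream id, stream sequence number, payload protocol id
    flags = (unordered << 2) | (begin << 1) | end
    length = 16 + len(payload)            # length excludes padding
    chunk = struct.pack("!BBHIHHI", 0, flags, length,
                        tsn, stream_id, stream_seq, ppid) + payload
    chunk += b"\x00" * (-len(chunk) % 4)  # pad to a 4-byte boundary
    return chunk

chunk = pack_data_chunk(tsn=1, stream_id=0, stream_seq=0, ppid=0,
                        payload=b"hello")
print(len(chunk))  # 16-byte header + 5-byte payload, padded to 24
```

The B and E flags mark the beginning and end fragments of a segmented user message; an unfragmented message has both set.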
Save some bandwidth by turning off TCP Timestamps
Looking at https://tools.ietf.org/html/rfc1323, there is a nice title: ‘TCP Extensions for High Performance’. It’s worth taking a look at the date: May 1992. The timestamps option may appear in any data or ACK segment, adding 12 bytes to the 20-byte TCP header.
What’s new in HAProxy 1.6
Yesterday, the 13th of October, Willy announced the release of HAProxy 1.6.0, after 16 months of development!
The first piece of good news is that the release cycle has become a bit faster, and we aim to keep it that way.
A total of 1,156 commits from 59 people have landed since the release of 1.5.0, 16 months ago.
The official announcement is here: [ANNOUNCE] haproxy-1.6.0 now released!
In his mail, Willy detailed all the features that have been added to this release. The purpose of this blog article is to highlight a few of them, with their benefits and some configuration examples.
Container OS comparison
For anyone who’s been following the container community’s rise the past two years (after Solomon Hykes’s famous five-minute presentation at PyCon), you have surely seen many companies and projects spring up, offering really innovative ways of managing your applications.

Software Engineering

Inside LAN Sync
Dropbox LAN Sync is a feature that allows you to download files from other computers on your network, saving time and bandwidth compared to downloading them from Dropbox servers.
My GSoC 2015 project: push for XMPP
This is a guest post by Christian Ulrich, the author of mod_push and oshiya, an XEP-0357 compatible XMPP component. Christian is the first guest author on ProcessOne Blog. Starting with his article, we would like to open our blog to anyone who wants to share an interesting original post related to XMPP. If you are up for the challenge, please contact us.

Data Engineering

How-to: Use Apache Solr to Query Indexed Data for Analytics
If you were to ask well-informed technical people about use cases for Solr, the most likely response would be that Solr (in combination with Apache Lucene) is an open source text search engine: one can use Solr to index documents, and after indexing, these same documents can be easily searched using free-form queries in much the same way as you would query Google. Still others might add that Solr has some very capable geo-location indexing capabilities that support radius, bounded-box, and defined-area searches. And both of the above answers would be correct.

Database Engineering

Introducing FiloDB
If you are a big data analyst, or build big data solutions for fast analytical queries, you are likely familiar with columnar storage technologies. The open source Parquet file format for HDFS saves space and powers query engines from Spark to Impala and more, while cloud solutions like Amazon Redshift use columnar storage to speed up queries and minimize I/O. Being a file format, Parquet is much more challenging to work with directly for real-time data ingest. For applications like IoT, time-series, and event data analytics, many developers have turned to NoSQL databases such as Apache Cassandra, due to their combination of high write scalability and the ease of using an idempotent, primary key-based database API. Most NoSQL databases are not designed for fast, bulk analytical scans, but instead for highly concurrent key-value lookups. What is missing is a solution that combines the ease of use of a database API, the scalability of NoSQL databases, with columnar storage technology for fast analytics.

MySQL & MariaDB

Proxy Protocol and Percona XtraDB Cluster: A Quick Guide
On September 21st, we released Percona XtraDB Cluster 5.6.25. This is the first PXC release supporting the proxy protocol, which has been included in Percona Server since 5.6.25-73.0.
With this blog post, I want to promote a new feature that you may have overlooked.
Storing UUID Values in MySQL Tables
After seeing that several blogs discuss storage of UUID values into MySQL, and that this topic is recurrent on forums, I thought I would compile some sensible ideas I have seen, and also add a couple new ones.
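One widely circulated idea in this vein (a sketch, not the post's full set of recommendations) is storing UUIDs as BINARY(16) and, for time-based v1 values, moving the high timestamp bytes to the front so consecutively generated UUIDs sort and index in roughly chronological order:

```python
import uuid

def uuid_to_ordered_bytes(u: uuid.UUID) -> bytes:
    b = u.bytes
    # reorder time_hi, time_mid, time_low so sequentially generated
    # v1 UUIDs become index-friendly (monotonically increasing keys)
    return b[6:8] + b[4:6] + b[0:4] + b[8:]

def ordered_bytes_to_uuid(b: bytes) -> uuid.UUID:
    # invert the reordering to recover the original UUID
    return uuid.UUID(bytes=b[4:8] + b[2:4] + b[0:2] + b[8:])

u = uuid.uuid1()
packed = uuid_to_ordered_bytes(u)   # 16 bytes, fits BINARY(16)
assert ordered_bytes_to_uuid(packed) == u
```

Compared to storing the 36-character text form in CHAR(36), this halves-plus the row footprint and avoids the random-insert pattern that fragments InnoDB's clustered index.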
MySQL query digest with Performance Schema
Query analysis is a fantastic path in the pursuit of high performance. It’s also probably the most repeated part of a DBA’s daily adventure. For most of us, the weapon of choice is definitely pt-query-digest, which is one of the best tools for slow query analysis out there.

Security

Innovating SSO with Google For Work
The modern workforce deserves access to technology that will help them work the way they want to in this increasingly mobile world. When Netflix moved to Google Apps, employees and contractors quickly adopted the Google experience, from signing into Gmail, to saving files on Drive, to creating and sharing documents. They are now so accustomed to the Google Apps login flow, down to the two-factor authentication, that we wanted to make Google their central sign on service for all cloud apps, not just Google Apps for Work or apps in the Google Apps Marketplace.

Architecture

Making the Case for Building Scalable Stateful Services in the Modern Era
For a long time now, stateless services have been the royal road to scalability. Nearly every treatise on scalability declares statelessness to be the approved best practice for building scalable systems. A stateless architecture is easy to scale horizontally and only requires simple round-robin load balancing.
What’s not to love? Perhaps the increased latency from the roundtrips to the database. Or maybe the complexity of the caching layer required to hide database latency problems. Or even the troublesome consistency issues.
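The "simple round-robin load balancing" that statelessness buys can be sketched in a few lines (my illustration, not the talk's code): because no request-specific state lives on any backend, every backend is interchangeable and a bare rotating cursor suffices.

```python
# Minimal round-robin balancer: viable only because the backends are
# stateless -- any request can go to any backend.
import itertools

class RoundRobinBalancer:
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        # No session table, no sticky routing, no rebalancing logic.
        return next(self._cycle)

lb = RoundRobinBalancer(["app1", "app2", "app3"])
assert [lb.pick() for _ in range(6)] == ["app1", "app2", "app3"] * 2
```

A stateful design gives this up: requests must be routed to the node holding their state, which is exactly the extra machinery the talk argues can be worth building.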
L’architecture microservices sans la hype : qu’est-ce que c’est, à quoi ça sert, est-ce qu’il m’en faut ?
In 2015 we reached peak microservices: not a conference without a Netflix engineer selling you the dream, not a week without a new magic framework that does everything without asking any questions.
The result: a focus on tools and nice stories rather than on the fundamental issues.
It therefore seemed useful to us to review the architectural aspects of microservices, because choosing an architectural style for an information system has structuring consequences for the life of projects and the organisation of the company.

Compilation veille Twitter & RSS #2015-41

vendredi 9 octobre 2015 à 18:00

The harvest of links for the week of October 5 to 9, 2015. Most of them were published on my Twitter account. Here they are, gathered together for those who may have missed them. Happy reading!

Security

Two misconceptions about LUKS
A few weeks ago I had a discussion about LUKS (Linux Unified Key Setup) and the purpose of using a (openssl|gpg|whatever)-encrypted random passphrase.

System Engineering

Single RX queue kernel bypass in Netmap for high packet rate networking
In a previous post we discussed the performance limitations of the Linux kernel network stack. We detailed the available kernel bypass techniques allowing user space programs to receive packets with high throughput. Unfortunately, none of the discussed open source solutions supported our needs. To improve the situation we decided to contribute to the Netmap project. In this blog post we’ll describe our proposed changes.
Logstash configuration tuning
Logstash is a powerful beast and when it’s firing on all cylinders to crunch data, it can use a lot of resources. The goal of this blog post is to provide a methodology to optimise your configuration and allow Logstash to get the most out of your hardware.
Creating NGINX Rewrite Rules
In this blog post, we discuss how to create NGINX rewrite rules (the same methods work for both NGINX Plus and the open source NGINX software). Rewrite rules change part or all of the URL in a client request, usually for one of two purposes
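Both of the common forms can be shown in a short fragment (paths and hostnames here are hypothetical examples, not taken from the article):

```nginx
server {
    listen 80;
    server_name www.example.com;

    # "rewrite" matches a regex and changes part of the URL, here
    # redirecting an old blog path while keeping the captured slug.
    rewrite ^/blog/old/(.*)$ /blog/new/$1 permanent;

    # "return" is the simpler way to redirect a fixed prefix or a
    # whole site -- no regex engine involved.
    location /download {
        return 301 https://downloads.example.com$request_uri;
    }
}
```

As a rule of thumb, prefer `return` when no capture groups are needed; it is cheaper and easier to read.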
Improving the Linux kernel with upstream contributions
The Linux kernel is constantly changing, growing roughly 1.4 million lines of code in the last year alone. The kernel community is constantly adding new features, supporting new hardware, refining interfaces, and fixing bugs. In order to take advantage of these changes, production Linux deployments need to take on the difficult task of validating new kernels across all their workloads.

Monitoring

ElastAlert: Alerting At Scale With Elasticsearch, Part 1
Yelp’s web servers log data from the millions of sessions that our users initiate with Yelp every day. Our engineering teams can learn a lot from this data and use it to help monitor many critical systems. If you know what you’re looking for, archiving log files and retrieving them manually might be sufficient, but this process is tedious. As your infrastructure scales, so does the volume of log files, and the need for a log management system becomes apparent. Having already used it very successfully for other purposes, we decided to use Elasticsearch for indexing our logs for fast retrieval, powerful search tools and great visualizations.
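ElastAlert rules are plain YAML files; a hypothetical example (index pattern, thresholds and address are invented for illustration) that fires when an error shows up too often in a short window:

```yaml
# Alert when more than 50 HTTP 500s are indexed within 5 minutes.
name: web-error-spike
type: frequency
index: logstash-*
num_events: 50
timeframe:
  minutes: 5
filter:
- query:
    query_string:
      query: "status:500"
alert:
- email
email:
- "oncall@example.com"
```

The `type` field selects one of ElastAlert's rule types (frequency, spike, flatline, and so on), which is where most of its expressive power lives.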

Software Engineering

Testing for Truth vs Maximizing Revenue
In my career I’ve worked in two different worlds. The first is the world of science – clever people running careful experiments and seeking truth. The ultimate goal is to publish papers that reveal truths.
The second place I’ve worked is the world of finance – clever people trying to make choices in the face of uncertainty that have a high probability of making money.

Databases Engineering

How We Partitioned Airbnb’s Main Database in Two Weeks
Heading into the 2015 summer travel season, the infrastructure team at Airbnb was hard at work scaling our databases to handle the expected record summer traffic. One particularly impactful project aimed to partition certain tables by application function onto their own database, which typically would require a significant engineering investment in the form of application layer changes, data migration, and robust testing to guarantee data consistency with minimal downtime. In an attempt to save weeks of engineering time, one of our brilliant engineers proposed the intriguing idea of leveraging MySQL replication to do the hard part of guaranteeing data consistency. (This idea is independently listed as an explicit use case of Amazon RDS’s “Read Replica Promotion” functionality.) By tolerating a brief and limited downtime during the database promotion, we were able to perform this operation without writing a single line of bookkeeping or migration code. In this blog post, we will share some of our work and what we learned in the process.

InfluxDB

The new InfluxDB storage engine: a Time Structured Merge Tree
For more than a year we’ve been talking about potentially making a storage engine purpose-built for our use case of time series data. Today I’m excited to announce that we have the first version of our new storage engine available in a nightly build for testing. We’re calling it the Time Structured Merge Tree or TSM Tree for short.

Data Engineering

How-to: Build a Machine-Learning App Using Sparkling Water and Apache Spark
The Sparkling Water project is nearing its one-year anniversary, which means Michal Malohlava, our main contributor, has been very busy for the better part of this past year. The Sparkling Water project combines H2O machine-learning algorithms with the execution power of Apache Spark. This means that the project is heavily dependent on two of the fastest growing machine-learning open source projects out there. With every major release of Spark or H2O there are API changes and, less frequently, major data structure changes that affect Sparkling Water. Throw Cloudera releases into the mix, and you have a plethora of git commits dedicated to maintaining a few simple calls to move data between the different platforms.
Why Sampling Your Mobile Analytics Data is Bad for Growth
Two months ago, we shared our vision of scalable analytics and the architecture that makes it a reality. One of the most important consequences of this is that our customers can track all of their data without breaking the bank — no sampling necessary.
It’s long been the case that the costs associated with using high-quality analytics services force companies into tradeoffs of picking which events or users to track, with sampled data being the most common fate that people resign to.
Using Apache Spark and MySQL for Data Analysis
Contrary to popular belief, Spark does not require all data to fit into memory, but will use caching to speed up operations (just like MySQL). Spark can also run in standalone mode and does not require Hadoop; it can even run on a single server (or a laptop or desktop) and use all your CPU cores.
Continuous Distribution Goodness-of-Fit in MLlib: Kolmogorov-Smirnov Testing in Apache Spark
Data can come in many shapes and forms, and can be described in many ways. Statistics like the mean and standard deviation of a sample provide descriptions of some of its important qualities. Less commonly used statistics such as skewness and kurtosis provide additional perspective into the data’s profile.
However, sometimes we can provide a much neater description for data by stating that a sample comes from a given distribution, which not only tells us things like the average value that we should expect, but effectively gives us the data’s “recipe” so that we can compute all sorts of useful information from it. As part of my summer internship at Cloudera, I added implementations to Apache Spark’s MLlib library of various statistical tests that can help us draw conclusions regarding how well a distribution fits data. Specifically, the implementations pertain to the Spark JIRAs SPARK-8598 and SPARK-8884.

Architecture

Minimum Viable Cluster
In the past there was a clear distinction between high performance (HP) clustering and high availability (HA) clustering; however, the lines have been blurring for some time. People have scaled HA clusters upwards, and HP-inspired clusters have been used to provide availability through redundancy.
The trend in providing availability of late has been towards the HP model: pools of anonymous and stateless workers that can be replaced at will. A really attractive idea, but in order to pull it off they have to make assumptions that may or may not be compatible with some people's workloads.
Five Tips for Better AWS Performance
Amazon Web Services (AWS), the leading provider of cloud-based computing services, is a great resource and platform for web application development. Their customers include big names such as Acquia, Adobe, NASA, Netflix, and Zynga.
You can use AWS for prototyping, for mixed deployments alongside physical servers that you manage directly, or for AWS-only deployments. The performance you see on AWS can, however, vary widely, just as on any other public cloud – and you don’t have the same direct control of your AWS deployment that you do for servers that you buy and manage yourself. So AWS users have mastered a number of optimizations to get the most out of the environment.
10 Tips for 10x Performance
Improving web application performance is more critical than ever. The share of economic activity that’s online is growing; more than 5% of the developed world’s economy is now on the Internet. And our always-on, hyper-connected modern world means that user expectations are higher than ever. If your site does not respond instantly, or if your app does not work without delay, users quickly move on to your competitors.
Your Load Generator is Probably Lying to You - Take the Red Pill and Find Out Why
Pretty much all your load generation and monitoring tools do not work correctly. Those charts you thought were full of relevant information about how your system is performing are really just telling you a lie. Your sensory inputs are being jammed.
To find out how, listen to the Morpheus of performance monitoring, Gil Tene, CTO and co-founder at Azul Systems, makers of truly high-performance JVMs, in a mesmerizing talk on How NOT to Measure Latency.
This talk is about removing the wool from your eyes. It’s the red pill option for what you thought you were testing with load generators.
The Impact of the Observer Effect on Microservices Architecture
Application availability is not just the measure of “being up”. Many apps can claim that status. Technically they are running and responding to requests, but at a rate which users would certainly interpret as being down. That’s because excessive load times can (and will) be interpreted as “not available.” That’s why it’s important to view ensuring application availability as requiring attention to all its composite parts: scalability, performance, and security.

Compilation veille Twitter & RSS #2015-40

samedi 3 octobre 2015 à 10:00

The harvest of links for the week of September 28 to October 2, 2015. Most of them were published on my Twitter account. Here they are, gathered together for those who may have missed them. Happy reading!

Security

Container Security with SELinux and CoreOS
At CoreOS, running containers securely is a number one priority. We recently landed a number of features that are helping make CoreOS Linux a trusted and even more secure place to run containers. As of the 808.0.0 release, CoreOS Linux is tightly integrated with SELinux to enforce fine-grained permissions for applications. Building on top of these permissions, our container runtime, rkt, has gained support for SVirt in addition to a default SELinux policy. The rkt SVirt implementation is compatible with Docker’s SVirt support, keeping you secure no matter what container runtime you choose.
In bed with TLS - Part I : TLS, PFS et Logjam
In late May, researchers, including some from INRIA (cocorico), disclosed, under the name Logjam, security flaws affecting TLS. I had planned to write a series of posts on the subject, which unfortunately I had not found the time to do until now.
In bed with TLS - Part II : PKIX, DANE et HPKP
In the previous episode, we took advantage of the recent Logjam news to do a quick tour of TLS in general, and of the Diffie-Hellman key exchange in particular. We very briefly touched on the problem of certification. Here I propose to examine it in more detail, along with, more generally, the public key infrastructure.

System Engineering

Flux: A New Approach to System Intuition
On the Traffic and Chaos Teams at Netflix, our mission requires that we have a holistic understanding of our complex microservice architecture. At any given time, we may be called upon to move the request traffic of many millions of customers from one side of the planet to the other. More frequently, we want to understand in real time what effect a variable is having on a subset of request traffic during a Chaos Experiment. We require a tool that can give us this holistic understanding of traffic as it flows through our complex, distributed system.
Strategy: Taming Linux Scheduler Jitter Using CPU Isolation and Thread Affinity
When nanoseconds matter you have to pay attention to OS scheduling details. Mark Price, who works in the rarified high performance environment of high finance, shows how in his excellent article on Reducing system jitter.
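The technique has two halves: reserve cores from the general scheduler at boot (e.g. with an `isolcpus=` kernel parameter), then pin the latency-critical thread onto an isolated core so it is never migrated or preempted by unrelated tasks. A sketch of the pinning half (mine, not from the article; Linux-only, and CPU 0 is just an example):

```python
# Pin the calling process/thread to a single CPU so the scheduler
# cannot migrate it. On non-Linux platforms we simply do nothing.
import os

def pin_to_cpu(cpu: int) -> set:
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu})   # 0 = the calling process
        return os.sched_getaffinity(0)   # the allowed-CPU set, now {cpu}
    return set()

allowed = pin_to_cpu(0)
print("now allowed to run on CPUs:", allowed)
```

The same effect is available from the shell with `taskset`, which is what most deployment scripts use in practice.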
Scaling Graphite with Go Using graphite-ng or InfluxDB
If you’ve watched Coda Hale’s popular Metrics, Metrics Everywhere video, you know that one of the main takeaways of his talk is that data helps you to make better decisions, and that data must be measured to be effectively managed.
Irreversible Failures: Lessons from the DynamoDB Outage
Summary: Most server problems, once identified, can be quickly solved with a simple compensating action—for instance, rolling back the bad code you just pushed. The worst outages are those where reversing the cause doesn’t undo the effect. Fortunately, this type of issue usually generates some visible markers before developing into a crisis. In this post, I’ll talk about how you can avoid a lot of operational grief by watching for those markers.
Analysing performance problems with systemd
Now that systemd is the default init system in fresh installations of Debian GNU/Linux, it is worth highlighting some of its new features.

Monitoring

Top ELB health and performance metrics
Elastic Load Balancing (ELB) is an AWS service used to dispatch incoming web traffic from your applications across your Amazon EC2 backend instances, which may be in different availability zones.
How to collect AWS ELB metrics
This part of the series is about collecting ELB metrics, which are available from AWS via CloudWatch. They can be accessed in three different ways:
Outlier detection at Datadog: A look at the algorithms
It is essential to be able to spot unhealthy hosts anywhere in your infrastructure so that you can rapidly identify their causes and minimize service degradation and disruption. Discovering those misbehaving servers usually involves a careful study of how they perform normally and then setting threshold alerts on various metrics for each host.
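One classic algorithm for this kind of "which host is misbehaving?" question is the median absolute deviation (MAD): flag hosts whose metric sits too many MADs away from the group median. A sketch (thresholds and values are illustrative, not Datadog's):

```python
# Flag hosts whose metric deviates from the group median by more than
# `tolerance` times the median absolute deviation (MAD). MAD is robust:
# unlike the standard deviation, one huge outlier barely moves it.
import statistics

def mad_outliers(points: dict, tolerance: float = 3.0) -> list:
    values = list(points.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:  # all hosts (nearly) identical: nothing to flag
        return []
    return [host for host, v in points.items()
            if abs(v - med) / mad > tolerance]

latency_ms = {"web1": 21, "web2": 23, "web3": 22, "web4": 24, "web5": 95}
assert mad_outliers(latency_ms) == ["web5"]
```

The robustness matters: a mean-and-standard-deviation threshold gets dragged toward the outlier it is trying to detect, while the median barely moves.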

Software Engineering

Optimizing Android bytecode with Redex
As more and more people around the world start logging on to Facebook, we have an increasingly large responsibility to keep things fast. This is especially true in developing areas, where devices stay in the market longer and people have longer upgrade cycles for new devices. We want to make sure we look into possible opportunities for performance improvements across all of our major mobile platforms.
Wingify releases Bayesian A/B tester
I’ve written a number of posts here about A/B testing, and readers have probably observed that I favor the Bayesian approach. I’m very happy to announce that Wingify (my employer) has released SmartStats, a fully Bayesian A/B testing engine. I’ve always maintained that you should A/B test even if you won’t do a good job - it’s certainly better than flipping a coin.

Network Engineering

Multiplexing: TCP vs HTTP2
Can you use both? Of course you can! Here comes the (computer) science…
One of the big performance benefits of moving to HTTP/2 comes from its extensive use of multiplexing. For the uninitiated, multiplexing is the practice of reusing a single TCP connection for multiple HTTP requests and responses. See, in the old days (HTTP/1), a request/response pair required its own special TCP connection. That ultimately resulted in the TCP-connection-per-host limits imposed on browsers and, because web sites today comprise an average of 86 or more individual objects each needing its own request/response, slowed down transfers. HTTP/1.1 let us use “persistent” HTTP connections, which was the emergence of multiplexing (connections could be reused), but it was constrained by the synchronous (in-order) requirement of HTTP itself. So you’d open 6 or 7 or 8 connections and then reuse them to get those 80+ objects.
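The in-order constraint described above is head-of-line blocking, and a toy timing model (numbers invented) shows the difference: on a persistent HTTP/1.1 connection one slow object delays everything queued behind it, while HTTP/2 interleaves responses on the same connection.

```python
# Toy model: per-object transfer times on one connection, with one
# slow resource among several small ones.
objects_ms = [30, 400, 30, 30]

# HTTP/1.1 persistent connection: responses come back strictly in
# request order, so completion times accumulate.
http1_finish = []
clock = 0
for cost in objects_ms:
    clock += cost
    http1_finish.append(clock)

# HTTP/2 on one connection: frames interleave, so each response is
# (ideally) limited only by its own transfer time.
http2_finish = list(objects_ms)

print("last object, HTTP/1.1:", max(http1_finish), "ms")
print("last object, HTTP/2: ", max(http2_finish), "ms")
```

In this model the page finishes at 490 ms over HTTP/1.1 but 400 ms over HTTP/2: the small objects no longer wait behind the slow one.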

Data Engineering

Hadoop filesystem at Twitter
Twitter runs multiple large Hadoop clusters that are among the biggest in the world. Hadoop is at the core of our data platform and provides vast storage for analytics of user actions on Twitter. In this post, we will highlight our contributions to ViewFs, the client-side Hadoop filesystem view, and its versatile usage here.
Spark for Data Padawans Episode 1: a look at distributed data storage
If you’ve been anywhere near data in the past year or so you must have heard about the war going on between Spark and Hadoop for total control over the management of large amounts of data.
Spark for Data Padawans Episode 2: Spark vs Hadoop?
The cat is out of the bag: Data Science Studio now integrates with Spark! It’s the perfect moment (I know, crazy good timing, right?) for me to continue my presentation of Spark for super beginners with episode 2: the birth of Spark and how it compares to Hadoop.
Spark for Data Padawans Episode 3: Spark vs MapReduce
After learning about Hadoop and distributed data storage, and what exactly Spark is, in the previous episodes, it’s time to dig a little deeper to understand why, even if Spark is great, it isn’t necessarily a miracle solution to all your data processing issues. It’s time for Spark for super beginners episode 3!
Spark for Data Padawans Episode 4: Data Science Studio meets Spark
This is it guys, the last episode of my quadrilogy on Spark for data padawans. For this last section, I’ve tried to sum up what having a technology like Spark in DSS brings to the mix. Tell me what you want to know about next!
How Facebook Tells Your Friends You’re Safe in a Disaster in Under Five Minutes
In a disaster there’s a raw and immediate need to know your loved ones are safe. I felt this way during 9/11. I know I’ll feel this way during the next wild fire in our area. And I vividly remember feeling this way during the 1989 Loma Prieta earthquake.
Most earthquakes pass beneath notice. Not this one and everyone knew it. After ceiling tiles stopped falling like snowflakes in the computer lab, we convinced ourselves the building would not collapse, and all thoughts turned to the safety of loved ones. As it must have for everyone else. Making an outgoing call was nearly impossible, all the phone lines were busy as calls poured into the Bay Area from all over the nation. Information was stuck. Many tense hours were spent in ignorance as the TV showed a constant stream of death and destruction.
Bridging Batch and Streaming Data Ingestion with Gobblin
Less than a year ago, we introduced Gobblin, a unified ingestion framework, to the world of Big Data. Since then, we’ve shared ongoing progress through a talk at Hadoop Summit and a paper at VLDB. Today, we’re announcing the open source release of Gobblin 0.5.0, a big milestone that includes Apache Kafka integration.
Kudu: New Apache Hadoop Storage for Fast Analytics on Fast Data
This new open source complement to HDFS and Apache HBase is designed to fill gaps in Hadoop’s storage layer that have given rise to stitched-together, hybrid architectures.
The set of data storage and processing technologies that define the Apache Hadoop ecosystem are expansive and ever-improving, covering a very diverse set of customer use cases used in mission-critical enterprise applications. At Cloudera, we’re constantly pushing the boundaries of what’s possible with Hadoop—making it faster, easier to work with, and more secure.
RecordService: For Fine-Grained Security Enforcement Across the Hadoop Ecosystem
We’re thrilled to announce the beta availability of RecordService, a distributed, scalable, data access service for unified access control and enforcement in Apache Hadoop. RecordService is Apache Licensed open source that we intend to transition to the Apache Software Foundation. In this post, we’ll explain the motivation, system architecture, performance characteristics, expected use cases, and future work that RecordService enables.

Architecture

Orchestrating deployments with Jenkins Workflow and Kubernetes
In a previous series of blogs, we covered how to use Docker with Jenkins to achieve true continuous delivery and improve existing pipelines in Jenkins. While deployments of single Docker containers were supported with this initial integration, the CloudBees team and Jenkins community’s most recent work on Jenkins Workflow will also let administrators launch and configure clustered Docker containers with Kubernetes and the Google Cloud Platform.

Databases

A Few Fundamental Rules for Enlightened Database Monitoring
It’s a pretty fair assumption that if your database is big enough and complex enough to produce metrics that warrant a monitoring system, it’s also complex enough to produce tons of data that are ultimately more distracting than relevant. It’s not unusual to look at a bevy of monitoring possibilities and feel overwhelmed, uncertain about where to center your focus. Of course, every database is different, but there are some fundamental truths you should consider when you ask yourself, “What should I monitor?” Some of these ideas might seem simple, but if you don’t keep these in mind, you’d be surprised how easy it can be to lose sight of the big picture.

MySQL & MariaDB

Capture database traffic using the Performance Schema
Capturing data is a critical part of performing a query analysis, or even just to have an idea of what’s going on inside the database.

Redis

Clarifications about Redis and Memcached
If you know me, you know I’m not the kind of guy that considers competing products a bad thing. I actually love the users to have choices, so I rarely do anything like comparing Redis with other technologies.
However it is also true that in order to pick the right solution users must be correctly informed.
Lazy Redis is better Redis
Everybody knows Redis is single threaded. The best informed ones will tell you that, actually, Redis is kinda single threaded, since there are threads in order to perform certain slow operations on disk. So far threaded operations were so focused on I/O that our small library to perform asynchronous tasks on a different thread was called bio.c: Background I/O, basically.
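The "lazy" idea the post builds up to can be sketched in a few lines (my sketch, not Redis source): instead of reclaiming a huge object inline, which would stall the single event-loop thread, detach it from the keyspace immediately and hand the slow reclamation to a background thread.

```python
# Sketch of lazy freeing: deleting a key is O(1) on the event loop;
# the O(N) reclamation of a big value happens on a background thread.
import queue
import threading

keyspace = {"big:set": set(range(100_000))}
to_free = queue.Queue()
freed = threading.Event()

def background_reclaimer():
    obj = to_free.get()   # blocks until something needs freeing
    obj.clear()           # the slow part, off the main thread
    freed.set()

threading.Thread(target=background_reclaimer, daemon=True).start()

# The lazy delete: O(1) for the caller -- just unlink the value.
to_free.put(keyspace.pop("big:set"))

assert "big:set" not in keyspace   # the key is gone immediately
freed.wait(timeout=5)              # memory is reclaimed asynchronously
assert freed.is_set()
```

This is the same shape as the UNLINK command that grew out of this work: the key disappears synchronously, the memory comes back asynchronously.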