formal grammar

found an excellent introduction to grammars here:

I ran across the Wikipedia article on this subject today. While I found it fascinating and enthralling, it also took me about half an hour to properly decode the notation being used, so here I have interpreted it for the layman.

What is a formal grammar?


Begin with symbols. A symbol can be anything. One common example is an upper- or lower-case letter of the Roman alphabet.


If we allow other special characters to be symbols, then we could also have:


But our symbols could also be English words. If this were the case, then our symbols would be things like:


A symbol in isolation does not need to mean anything.


A set of symbols is called an alphabet. An alphabet must be finite.


A string is a finite series of symbols. All the symbols must be chosen from the same alphabet. If we use Roman letters as our alphabet, then some example strings would be:


If our alphabet contained special characters, then some more example strings might be:


If, instead, our alphabet consisted of English words, then a string (or sentence) would simply be a series of words, such as

“bucket July embarrass whatsoever isn’t a the”

Although every string must be finite, notice how there are infinitely many possible finite strings.

“bucket bucket”
“bucket bucket bucket”
“bucket bucket bucket bucket”

For every alphabet, there is also a single empty string:


Notice, again, how strings have no meaning ascribed to them.


Given an alphabet, a formal languageL is any set of finite strings over that alphabet.

  • A formal language may contain a finite number of strings.
    • The smallest possible language is the empty language, containing no strings at all (this is not to be confused with the language containing only the empty string).
  • Or, a language may contain infinitely many strings.
    • The largest possible language is the one containing every possible finite string.

Strings which are in L can be considered “correct”. Strings which are not in L can be considered “incorrect”. For example, if our alphabet was the set of all English words, then our language could be the set of all grammatically correct English sentences:

“The cat sat on the mat”
“Colourless green ideas sleep furiously”

For a language L to be well-defined, we only need one thing: a process which generates all of the strings in L.


A formal grammar is a finite set of rules for generating “grammatically correct” sentences.

A formal grammar works by starting with a single “unfinished sentence” or “root”. Then, different rules are applied in turn, modifying the sentence each time, until the sentence is deemed to be “finished” and the process terminates.

(By definition,) the set of sentences which can be generated by following the rules of a formal grammar comprise a formal language. The sentences which cannot be generated in this way do not fall in the formal language.

Once again, note that a grammar ascribes no meaning to the language it generates.

continue reading…

What Every Data Programmer Needs to Know about Disks

Here is an excellent presentation on how your application does IO. I have embedded the presentation from Ted Dziuba here:

From Session notes:

In this session, we will explore everything that goes on from when you call file.write() in your favorite programming language to when the data is finally written to disk. Heavily geared toward Linux programmers, this talk will cover a mix between hardware and software topics, the basics that every programmer who works with data should know. We’ll also take a look at some open source data storage packages – what they do right, and what they do horrifically wrong. We’ll also look into the persistent storage aspect of virtual machines such as Xen.

Attendees will walk away knowing some practical ways to optimize disk I/O intensive applications, through both hardware and software, as well as knowing how to work around some common traps that will rear their ugly heads at the worst possible times.

All about 64-bit programming in one place

Very comprehensive collection of information about 64-bit programming:

link to source

By Andrey Karpov,

In this post I’ve collected a lot of links on the topic of 64-bit C/C++ software development. These include my articles and articles by my colleagues in the sphere of developing safe and efficient 64-bit code; FAQ’s and a training course. There are also many reviews of third-party articles on 64-bit software development. Enjoy yourself studying the materials.

1. Articles:

2. Lessons on development of 64-bit C/C++ applications

Main page:

The course is composed of 28 lessons devoted to introduction to 64-bit systems, issues of building 64-bit applications, methods of searching errors specific to 64-bit code and code optimization. Such questions are also considered as estimate of the cost of moving to 64-bit systems and rationality of this move.

The contents of the course:

  • Lesson 01. What 64-bit systems are.
  • Lesson 02. Support of 32-bit applications.
  • Lesson 03. Porting code to 64-bit systems. The pros and cons.
  • Lesson 04. Creating the 64-bit configuration.
  • Lesson 05. Building a 64-bit application.
  • Lesson 06. Errors in 64-bit code.
  • Lesson 07. The issues of detecting 64-bit errors.
  • Lesson 08. Static analysis for detecting 64-bit errors.
  • Lesson 09. Pattern 01. Magic numbers.
  • Lesson 10. Pattern 02. Functions with variable number of arguments.
  • Lesson 11. Pattern 03. Shift operations.
  • Lesson 12. Pattern 04. Virtual functions.
  • Lesson 13. Pattern 05. Address arithmetic.
  • Lesson 14. Pattern 06. Changing an array’s type.
  • Lesson 15. Pattern 07. Pointer packing.
  • Lesson 16. Pattern 08. Memsize-types in unions.
  • Lesson 17. Pattern 09. Mixed arithmetic.
  • Lesson 18. Pattern 10. Storage of integer values in double.
  • Lesson 19. Pattern 11. Serialization and data interchange.
  • Lesson 20. Pattern 12. Exceptions.
  • Lesson 21. Pattern 13. Data alignment.
  • Lesson 22. Pattern 14. Overloaded functions.
  • Lesson 23. Pattern 15. Growth of structures’ sizes.
  • Lesson 24. Phantom errors.
  • Lesson 25. Working with patterns of 64-bit errors in practice.
  • Lesson 26. Optimization of 64-bit programs.
  • Lesson 27. Peculiarities of creating installers for a 64-bit environment.
  • Lesson 28. Estimating the cost of 64-bit migration of C/C++ applications.

You may open all the lessons in one file (the print version as well).

3. Knowledge Base

4. Articles’ Reviews

5. Blog

6. Detect 64-Bit Portability Issues

7. Terminology

8. Our 64-bit reddit


C++AMP: Accelerated Massive Parallelism

Microsoft will be launching it’s own version of HPC (High Performance Computing), along the lines khronos-OpenCL, soon. Seems like Heterogeneous computing is getting lot more interesting every day.

Here is the news brief from Dr Dobbs Journal:

By Gaston Hillar

The next version of Microsoft Visual C++ will introduce C++ Accelerated Massive Parallelism, also known as “C++ AMP.” Herb Sutter announced C++ AMP and unveiled Microsoft’s plans to provide heterogeneous platform support in Visual C++. The first targets are GPU (Graphics Processing Unit) and APU (Accelerated Processing Unit) programming but the platform would be ready for other compute accelerators.

DirectCompute has been part of the Microsoft DirectX collection of APIs since the release of DirectX 11 and it provided some support for GPGPU (General-Purpose Computing on Graphics Processing Units). You can use DirectCompute to run general-purpose code on the GPUs that provide support for this API.

However, Microsoft wants to lower the barrier to entry for heterogeneous hardware programmability. C++ AMP builds on DirectCompute, introduces one core C++ language extension and makes it part of Visual C++, and integrates it with Visual Studio. GPUs are found in discrete cards or integrated in the microprocessors, and therefore, interest in GPGPU is growing fast. For example, the Intel Sandy Bridge microarchitecture, introduced in the second-generation Intel Core processor family, integrates a GPU with a CPU on a single chip.

Parallel programming is hard and heterogeneous hardware programmability combined with parallelism is even harder. However, C++ AMP introduces an STL-like library for multidimensional data, allowing you to use your existing C++ STL knowledge. Thus, if you are a Visual C++ developer, you don’t have to learn a new language or combine C++ with C. In addition, you can use the same compiler: the Visual C++ compiler.

continue reading…

Related Resources:

Parallelism as a First Class Citizen in C and C++, the time has come

by James Reinders

It is time to make Parallelism a full First Class Citizen in C and C++.  Hardware is once again ahead of software, and we need to close the gap so that application development is better able to utilize the hardware without low level programming.

The time has come for high level constructs for task and data parallelism to be explicitly added to C and C++.  This will enable Parallel Programming in C and C++ to be fully portable, easily intelligible, and consistently decipherable by a compiler.

Language solutions are superior to library solutions. Library solutions provide fertile grounds for exploration. Intel Threading Building Blocks (TBB) has been the most popular library solution for parallelism for C++. For more than five years, TBB has grown and proven itself. It is time to take it to the next level, and move to language solutions to support what we can be confident is needed in C and C++.

We need mechanisms for parallelism that have strong space-time guarantees, simple program understanding, and serialization semantics – all things we do not have in C and C++ today.

We should tackle task and data parallelism both, and as an industry we know how.

  • Task parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to individual processor cores. Mapping should be the job of tools (including run time schedulers), not explicit programming. Hence, the goal is to shift all programming to tasks, not threads. This has many benefits that have been demonstrated often. A simple fork/join support is fundamental (spawn/sync in Cilk Plus terminology). Looping with a “parallel for” avoids looping to spawn iterations serially and thereby expresses parallelism well.
  • Data parallelism: the goal is to abstract this enough that a programmer is not explicitly mapping work to SIMD instructions vs. multiple processor cores vs. attached computing (GPUs or co-processors). Mapping should be the job of tools, not explicit programming. Hence, the goal is to shift all programming back to mathematical expressions, not intrinsics or explicitly parallel algorithm decompositions.

The solutions that show the most promise are documented in the Cilk™ Plus open specification (  They are as follows:

For data parallelism:

  • Explicit array syntax, to eliminate the need for explicit looping in a program to loop (serially) across elements to do the same operation multiple times
  • Elemental functions, to eliminate the need for authors of functions to worry about explicitly writing anything other than the simple, single wide, version of functions. This leaves a compiler to create wider versions for efficiency by coalescing operations in order to match the width of SIMD instructions.
  • Support for reduction operations in a way that makes semantic sense. For instance, this can be done via an explicit way to have private copies of data to shadow global data that needs to be used in parallel tasks that are created (the number of which is not known to the program explicitly). Perhaps this starts to overlap with task parallelism…

For task parallelism:

  • spawn/join to spawn a function, and to wait for spawns to complete
  • parallel for, to specifically avoid the need to serially spawn individual loop body instances – and make it very clear that all iterations are ready to spawn concurrently.

Like other popular programming languages, neither C nor C++ was designed as parallel programming languages. Parallelism is always hidden from a compiler and needs “discovery.” Compilers are not good at “complex discovery” – they are much better at optimizing and packaging up things that are explicit. Explicit constructs for parallelism solve this and make compiler support more likely. The constructs do not need to be numerous, just enough for other constructs to build upon… fewer is better!

For something as involved, or complex, as parallelism, incorporating parallelism semantics into the programming language improves both the expressability of the language, as well as the efficiency by which the compiler can implement  parallelism.

Years of investigation and experimentation have had some great results. Compiler writers have found they can offer substantial benefits for ease of programming, performance, debugging and portability.  These have appeared in a variety of papers and talks over the years, and could be the topic of future blogs.

continue reading…